Tue, 11/09/2010 - 17:32

I'm running OpenMx 1.0.2 on Ubuntu 10.04. OpenMx reports different degrees of freedom when I fit a model to raw data than when I fit it to the covariance matrix (and means vector). The model has one factor with 6 indicators, and 9 parameters are estimated (some parameters are constrained to equal other parameters).

Here is the output for the raw data:

observed statistics: 120
estimated parameters: 9
degrees of freedom: 111
-2 log likelihood: 894.0595
saturated -2 log likelihood: NA
number of observations: 20
chi-square: NA
p: NA
AIC (Mx): 672.0595
BIC (Mx): 280.7666
adjusted BIC:
RMSEA: NA
timestamp: 2010-11-09 16:20:59
frontend time: 0.3233688 secs
backend time: 0.03640819 secs
independent submodels time: 8.201599e-05 secs
wall clock time: 0.359859 secs
cpu time: 0.359859 secs
openmx version number: 1.0.2-1497

Note that the chi-square and RMSEA are not computed.

If instead I input the covariance matrix and means vector:

observed statistics: 27
estimated parameters: 9
degrees of freedom: 18
-2 log likelihood: 645.686
saturated -2 log likelihood: 574.1632
number of observations: 20
chi-square: 71.52271
p: 2.492757e-08
AIC (Mx): 35.52271
BIC (Mx): 8.799766
adjusted BIC:
RMSEA: 0.3855829
timestamp: 2010-11-09 16:23:32
frontend time: 0.1366770 secs
backend time: 0.01501584 secs
independent submodels time: 8.106232e-05 secs
wall clock time: 0.1517739 secs
cpu time: 0.1517739 secs
openmx version number: 1.0.2-1497

The parameter estimates are similar but not exactly the same between the two runs (I'm not sure why, since the only difference is the input format). The degrees of freedom, however, are different: 18 is what I would expect, and that is what I get when I input the covariance matrix. In that run the chi-square and RMSEA are also computed.

What am I missing?

Yes, there are different degrees of freedom for raw data and moment matrices. There's a lot more information in the raw data.

Degrees of freedom for both sets of data are defined as the number of observed statistics minus the number of free parameters (which is constant). With k variables and n rows, the observed statistics for covariance data are defined as (k*(k+1))/2, with an extra k statistics for the means. Observed statistics for raw data are found by adding up the number of non-missing observations for each variable, which will be n*k when there is no missing data. The degrees of freedom are different because the datasets are very different; the moment matrices are sufficient for any linear relationship between the variables, but there's a lot more you can do with the extra information in the raw data. This is a notable difference between OpenMx and other programs for SEM; the raw data df are consistent with a GLM approach, whereas other programs treat every model as a structural equation model.
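As a quick check, the counting rule above can be applied to the model in the question (k = 6 indicators, n = 20 rows, 9 free parameters, no missing data). This is only a sketch of the arithmetic in plain Python; the function names are made up for illustration and are not part of OpenMx.

```python
def observed_statistics_moments(k, means=True):
    """Unique covariance entries k*(k+1)/2, plus k means if a means vector is supplied."""
    return k * (k + 1) // 2 + (k if means else 0)


def observed_statistics_raw(n, k, n_missing=0):
    """Number of non-missing data points; n*k when nothing is missing."""
    return n * k - n_missing


# Values taken from the original post: 6 indicators, 20 rows, 9 free parameters.
k, n, free_params = 6, 20, 9

stats_moments = observed_statistics_moments(k)
stats_raw = observed_statistics_raw(n, k)

df_moments = stats_moments - free_params
df_raw = stats_raw - free_params

print(stats_moments, df_moments)  # 27 18
print(stats_raw, df_raw)          # 120 111
```

Both pairs agree with the "observed statistics" and "degrees of freedom" lines in the two summaries.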

You get more statistics with the moment matrices because some fit statistics (chi square and RMSEA) depend on comparison with a fully saturated model. With the covariance matrix and means vector, there's really only one version of the saturated model and it has an analytic solution. With raw data, one could define several versions of "saturated" models, including the SEM saturated model but also models that contain all possible parameters. People are free to specify whatever saturated model they want, estimate it as a new MxModel object and supply either its likelihood or the fitted MxModel to the summary function.
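To make the dependence on the saturated model concrete, here is a rough sketch that rebuilds the fit statistics from the numbers posted above. The formulas are assumptions inferred from the fact that they reproduce the posted output (chi-square as the difference in -2 log likelihoods, AIC (Mx) as chi-square minus 2*df, BIC (Mx) as half of chi-square minus df*log(N), RMSEA with N rather than N-1 in the denominator); they are not taken from the OpenMx source or documentation.

```python
import math

# Numbers copied from the covariance-matrix summary in the original post.
n_obs = 20
df = 18
minus2ll_model = 645.686
minus2ll_saturated = 574.1632

# Chi-square is the gap between the fitted model and the saturated model.
chi_square = minus2ll_model - minus2ll_saturated

# Assumed forms of the "Mx" information criteria (they match the posted values).
aic_mx = chi_square - 2 * df
bic_mx = 0.5 * (chi_square - df * math.log(n_obs))

# Assumed RMSEA form; the max() guards against a negative value under the root.
rmsea = math.sqrt(max(chi_square / df - 1.0, 0.0) / n_obs)

# With raw data there is no saturated -2LL, so the same formulas appear to be
# applied to the model's -2LL and the raw-data df instead.
raw_minus2ll, raw_df = 894.0595, 111
aic_mx_raw = raw_minus2ll - 2 * raw_df
bic_mx_raw = 0.5 * (raw_minus2ll - raw_df * math.log(n_obs))

print(chi_square, aic_mx, bic_mx, rmsea)
print(aic_mx_raw, bic_mx_raw)
```

The results agree with the posted summaries up to rounding of the printed likelihoods, which also shows why chi-square and RMSEA come back as NA for raw data: without a saturated -2 log likelihood there is nothing to subtract.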

I can understand how the amount of information differs between using the raw data as input versus just using the covariance matrix. How does this affect model identifiability? Or does it? For example, for a one-factor model with just two manifest variables as indicators, you have 3 statistics (two variances and one covariance). I always thought that this limits the number of parameters you can estimate (where we're just sticking to linear relationships). Does this change if you use the raw data?

Typically, the number of factors that one can estimate doesn't change, because this depends on the number of covariances in the model. However, there are circumstances in which the raw data provide more information, most obviously when definition variables are used to specify covariates. For example, suppose we allow the covariance between two variables in the model to be a function of age. In essence, this model is addressing the third moment (the covariance between age and a covariance), and this information is simply not available if one feeds in summary statistics in the form of a covariance matrix and means.

Conversely, one could imagine a raw dataset on two variables in which the data are:

X    Y
.5   NA
.3   NA
NA   .1
NA   .6

In this case there is no information about the covariance between X and Y. True, using summary statistics would not make a lot of sense here (except perhaps as two means and two variances).
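A small sketch of the counting for this toy dataset, in plain Python with None standing in for NA (the variable names are just for illustration): the raw-data rule still counts four observed statistics, yet no row has both X and Y present.

```python
# The four-row dataset from above; None marks a missing value (NA).
data = {
    "X": [0.5, 0.3, None, None],
    "Y": [None, None, 0.1, 0.6],
}

# Raw-data observed statistics: every non-missing value counts once.
observed_statistics = sum(
    value is not None for column in data.values() for value in column
)

# Rows where both variables are present; only these carry covariance information.
complete_pairs = sum(
    x is not None and y is not None for x, y in zip(data["X"], data["Y"])
)

print(observed_statistics)  # 4
print(complete_pairs)       # 0
```

So the number of observed statistics by itself does not say which parameters the data can inform.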

Nevertheless, I think you are advocating for counting the number of statistics based on the number of means and covariances, and for certain purposes I agree that this would be helpful - putting goodness of fit statistics on the same metric. Doing so would, however, make it possible to specify a fully identified model in which the number of degrees of freedom would be negative. This also has its problems.

I appreciate the in-depth reply. Thanks. My goal was to create a multigroup model, and I put together a simple one (with only two groups) using the RAM approach with covariance matrices as input. Yet, when I run the multigroup model, the chi-square and RMSEA are not computed. Before moving to more complex models, I want to understand how to handle a simple multigroup model and evaluate how well it fits. If I can get chi-square and RMSEA for each model separately, why can't I get them when running the multigroup model? Or maybe there is a better approach.

"If I can get chi-square and RMSEA for each model separately, why can't I get it when running the multigroup model?"

Great question. Despite the relative simplicity of each of the group models (where simplicity means "we can calculate your saturated model"), combining them makes the model too complex for OpenMx to make assumptions about your saturated model. The objective function for your multiple-group model is the mxAlgebraObjective, which depends on an MxAlgebra that sums the objectives of your individual groups. There are potential dependencies across groups and different datasets across groups. The shortish answer is that beyond the simple single-model case with moment-matrix data, there are options as to what the saturated model could be, so we don't assume one.

You can always supply your own saturated model for comparison. In a simple multiple-group, you'd do something kinda like this: