Should the degrees of freedom depend on how the data are input?

rabil's picture
Joined: 01/14/2010

I'm running OpenMx 1.0.2 on Ubuntu 10.04. OpenMx reports different degrees of freedom when I use raw data than when I use the covariance matrix (and means vector). The model has one factor with 6 indicators, and 9 parameters are estimated (some parameters are constrained to equal other parameters).

Here is the output for the raw data:

observed statistics: 120
estimated parameters: 9
degrees of freedom: 111
-2 log likelihood: 894.0595
saturated -2 log likelihood: NA
number of observations: 20
chi-square: NA
p: NA
AIC (Mx): 672.0595
BIC (Mx): 280.7666
adjusted BIC:
timestamp: 2010-11-09 16:20:59
frontend time: 0.3233688 secs
backend time: 0.03640819 secs
independent submodels time: 8.201599e-05 secs
wall clock time: 0.359859 secs
cpu time: 0.359859 secs
openmx version number: 1.0.2-1497

Note that the chi-square and RMSEA are not estimated.

If instead I input the covariance matrix and means vector:

observed statistics: 27
estimated parameters: 9
degrees of freedom: 18
-2 log likelihood: 645.686
saturated -2 log likelihood: 574.1632
number of observations: 20
chi-square: 71.52271
p: 2.492757e-08
AIC (Mx): 35.52271
BIC (Mx): 8.799766
adjusted BIC:
RMSEA: 0.3855829
timestamp: 2010-11-09 16:23:32
frontend time: 0.1366770 secs
backend time: 0.01501584 secs
independent submodels time: 8.106232e-05 secs
wall clock time: 0.1517739 secs
cpu time: 0.1517739 secs
openmx version number: 1.0.2-1497

The parameter estimates are similar but not exactly the same between the two runs (I'm not sure why, since the only difference is the input format). The degrees of freedom, however, are different (18 is what I would expect when I input the covariance matrix), and the chi-square and RMSEA are computed.

What am I missing?

Ryne's picture
Joined: 07/31/2009

Yes, there are different degrees of freedom for raw data and moment matrices. There's a lot more information in the raw data.

Degrees of freedom for both sets of data are defined as the number of observed statistics minus the number of free parameters (which is the same in both cases). With k variables and n rows, the observed statistics for covariance data are defined as (k*(k+1))/2, with an extra k statistics for the means. Observed statistics for raw data are found by adding up the number of non-missing observations for each variable, which will be n*k when there is no missing data. The degrees of freedom are different because the datasets are very different; the moment matrices are sufficient for any linear relationship between the variables, but there's a lot more you can do with the extra information in the raw data. This is a notable difference between OpenMx and other programs for SEM; the raw data df are consistent with a GLM approach, whereas other programs treat every model as a structural equation model.
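To make the counting concrete, here is a small base-R sketch of the two definitions above, applied to the model in the original post (6 indicators, 20 rows, 9 free parameters). The function names and the placeholder data are made up for illustration; this is not OpenMx's internal code.

```r
# Observed statistics for moment-matrix input:
# k*(k+1)/2 variances and covariances, plus k means if a means vector is supplied.
count_moment_stats <- function(k, with_means = TRUE) {
  n_cov <- k * (k + 1) / 2
  n_means <- if (with_means) k else 0
  n_cov + n_means
}

# Observed statistics for raw input: one per non-missing cell.
count_raw_stats <- function(data) {
  sum(!is.na(data))
}

k <- 6             # indicators
n <- 20            # rows
free_params <- 9   # estimated parameters

# Placeholder raw data with no missing values (values are irrelevant to the count).
raw <- matrix(0, nrow = n, ncol = k)

moment_stats <- count_moment_stats(k)
raw_stats <- count_raw_stats(raw)

moment_df <- moment_stats - free_params
raw_df <- raw_stats - free_params
```

With these inputs the counts agree with the two summaries in the first post: 27 statistics and 18 df for the covariance input, 120 statistics and 111 df for the raw data.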

You get more statistics with the moment matrices because some fit statistics (chi square and RMSEA) depend on comparison with a fully saturated model. With the covariance matrix and means vector, there's really only one version of the saturated model and it has an analytic solution. With raw data, one could define several versions of "saturated" models, including the SEM saturated model but also models that contain all possible parameters. People are free to specify whatever saturated model they want, estimate it as a new MxModel object and supply either its likelihood or the fitted MxModel to the summary function.
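As a rough illustration of how those fit statistics follow from the saturated likelihood, the base-R sketch below recomputes them from the numbers printed in the covariance-input summary above. The formulas are the ones that happen to reproduce that output (in particular, RMSEA using N rather than N-1, and the "Mx" forms of AIC and BIC); treat them as inferred from the printed values, not as a statement of what OpenMx does internally.

```r
# Values copied from the covariance-input summary in the first post.
minus2LL_model <- 645.686
minus2LL_saturated <- 574.1632
df <- 18
n_obs <- 20

# Chi-square: difference in -2 log likelihood between the model and the saturated model.
chisq <- minus2LL_model - minus2LL_saturated

# p-value from the chi-square distribution with df degrees of freedom.
p_value <- pchisq(chisq, df, lower.tail = FALSE)

# RMSEA (assumed form, chosen because it matches the printed value).
rmsea <- sqrt(max(chisq - df, 0) / (df * n_obs))

# "Mx" versions of AIC and BIC (assumed forms, chosen because they match the printed values).
aic_mx <- chisq - 2 * df
bic_mx <- 0.5 * (chisq - df * log(n_obs))
```

Up to rounding of the printed likelihoods, this gives the chi-square (71.52), RMSEA (0.386), AIC (35.52) and BIC (8.80) shown above. Without a saturated -2 log likelihood, as in the raw-data run, none of the chi-square-based quantities can be formed.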

rabil's picture
Joined: 01/14/2010

I can understand how the amount of information differs between using the raw data as input versus just using the covariance matrix. How does this affect model identifiability? Or does it? For example, for a one-factor model with just two manifest variables as indicators, you have 3 statistics (two variances and one covariance). I always thought that this limits the number of parameters you can estimate (where we're just sticking to linear relationships). Does this change if you use the raw data?

neale's picture
Joined: 07/31/2009

Typically, the number of factors that one can estimate doesn't change, because this depends on the number of covariances in the model. However, there are circumstances in which the raw data provide more information, most obviously when definition variables are used to specify covariates. For example, suppose we allow the covariance between two variables in the model to be a function of age. In essence, this model is addressing the third moment (the covariance between age and a covariance), and this information is simply not available if one feeds in summary statistics in the form of a covariance matrix and means.

Conversely, one could imagine a raw dataset on two variables in which the data are:
.5 NA
.3 NA
NA .1
NA .6

In this case there is no information about the covariance between X and Y. True, using summary statistics would not make a lot of sense here (except perhaps as two means and two variances).
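A quick base-R check of that example (the column names X and Y are taken from the sentence above; everything else is just for illustration):

```r
# The four-row dataset from the example: X and Y are never observed together.
d <- data.frame(
  X = c(0.5, 0.3, NA, NA),
  Y = c(NA, NA, 0.1, 0.6)
)

# Raw-data observed statistics: one per non-missing cell.
n_stats <- sum(!is.na(d))

# Rows where both X and Y are observed, i.e. rows that could inform the covariance.
n_complete_pairs <- sum(complete.cases(d))
```

Here n_stats is 4, yet n_complete_pairs is 0, so counting non-missing cells says nothing about whether a particular covariance is actually estimable.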

Nevertheless, I think you are advocating for counting the number of statistics based on the number of means and covariances, and for certain purposes I agree that this would be helpful - putting goodness of fit statistics on the same metric. Doing so would, however, make it possible to specify a fully identified model in which the number of degrees of freedom would be negative. This also has its problems.

rabil's picture
Joined: 01/14/2010

I appreciate the in-depth reply. Thanks. My goal was to create a multigroup model, and I put together a simple one (with only two groups) using the RAM approach with covariance matrices as input. Yet, when I run the multigroup model, the chi-square and RMSEA are not computed. Before moving to more complex models, I want to understand how to handle a simple multigroup model and evaluate how well it fits. If I can get chi-square and RMSEA for each model separately, why can't I get them when running the multigroup model? Or maybe there is a better approach.

Ryne's picture
Joined: 07/31/2009

"If I can get chi-square and RMSEA for each model separately, why can't I get it when running the multigroup model?"

Great question. Despite the relative simplicity of each of the group models (where simplicity means "we can calculate your saturated model"), combining them makes the model too complex for OpenMx to make assumptions about your saturated model. The objective function for your multiple-group model is the mxAlgebraObjective, which depends on an MxAlgebra that sums the objectives of your individual groups. There are potential dependencies across groups and different datasets across groups. The shortish answer is that beyond the simple single-model case with moment-matrix data, there are options as to what the saturated model could be, so we don't assume one.

You can always supply your own saturated model for comparison. In a simple multiple-group, you'd do something kinda like this:

library(OpenMx)

# Saturated model for group 1: free 2x2 covariance matrix and free means
modelA <- mxModel("Sat1",
    mxData(group1, "cov", group1mean, 60),
    mxMatrix("Symm", 2, 2, TRUE, diag(2), name="cov1"),
    mxMatrix("Full", 1, 2, TRUE, name="mean1"),
    mxMLObjective("cov1", "mean1", dimnames=dimnames(group1)[[1]]))
# Saturated model for group 2, built the same way
modelB <- mxModel("Sat2",
    mxData(group2, "cov", group2mean, 60),
    mxMatrix("Symm", 2, 2, TRUE, diag(2), name="cov2"),
    mxMatrix("Full", 1, 2, TRUE, name="mean2"),
    mxMLObjective("cov2", "mean2", dimnames=dimnames(group2)[[1]]))
# Container model: its objective is the sum of the two group objectives
satModel <- mxModel("Saturated",
    modelA, modelB,
    mxAlgebra(Sat1.objective + Sat2.objective, name="obj"),
    mxAlgebraObjective("obj"))
satRes <- mxRun(satModel)

satRes now contains a saturated model that you can compare your two-group model to. If your fitted model is in an object called 'multipleGroup', you can compare them like so:

summary(multipleGroup, SaturatedLikelihood=satRes)