fit statistics for multigroup models?

tbates's picture
Joined: 07/31/2009

Hi,
Is there any guidance on how people should get fit statistics for their models, especially multigroup models?
When I run an ACE-type model, the summary doesn't calculate any fit statistics, and it doesn't seem to know how many observations there were, so the df is just minus the number of estimated parameters.

summary(fit)
         name      matrix row col parameter estimate error estimate
1        <NA>     all.a_c   1   1       2.384409e-01     0.22324048
2        <NA>     all.a_c   2   1       4.199091e-01     0.21043198
3        <NA>     all.a_c   3   1       2.366656e-01     0.17050014
4        <NA>     all.c_c   1   1       4.794349e-01     0.16473834
5        <NA>     all.c_c   2   1       1.533741e-01     0.14155519
6        <NA>     all.c_c   3   1       1.806579e-01     0.13491689
7        <NA>     all.e_c   1   1       1.838531e-01     0.08764916
8        <NA>     all.e_c   2   1       3.122362e-01     0.12072371
9        <NA>     all.e_c   3   1       4.914389e-01     0.15620607
10       <NA>       all.a   1   1       4.452644e-01     0.20181557
11       <NA>       all.a   2   2       3.546483e-01     0.19978474
12       <NA>       all.a   3   3       4.331486e-01     0.07694248
13       <NA>       all.c   1   1       1.310573e-05     1.85878064
14       <NA>       all.c   2   2      -1.006976e-06     0.56912795
15       <NA>       all.c   3   3      -2.730591e-06     0.56363373
16       <NA>       all.e   1   1       5.441768e-01     0.05386086
17       <NA>       all.e   2   2       4.934105e-01     0.07569937
18       <NA>       all.e   3   3       4.127164e-01     0.17458587
19 Trait1mean all.expMean   1   1       2.497818e+00     0.07429617
20 Trait2mean all.expMean   1   2       2.976677e+00     0.06878513
21 Trait3mean all.expMean   1   3       2.082244e+00     0.05968714
 
Observed statistics:  0 
Estimated parameters:  21 
Degrees of freedom:  -21 
-2 log likelihood:  8259.418 
Saturated -2 log likelihood:  
Chi-Square:  
p:  
AIC (Mx):  
BIC (Mx):  
adjusted BIC: 
RMSEA:  

The submodels don't know this either:
> summary(fit@submodels$MZ)
Observed statistics:
Estimated parameters:
Degrees of freedom:
-2 log likelihood:
Saturated -2 log likelihood:
Chi-Square:
p:
AIC (Mx):
BIC (Mx):
adjusted BIC:
RMSEA:

tbates's picture
Joined: 07/31/2009

It might be worth just making this a limitation: if you are fitting mixtures and you want the fit computed correctly, don't shuffle your data columns, rows, or dataframe names when you are reusing the same data :-)

tbates's picture
Joined: 07/31/2009

bump... please :-)

mspiegel's picture
Joined: 07/31/2009

OK by the end of the day I should have finished square-bracket substitution and then I can take a look at this issue. Some questions that need to be answered: let's say I have two submodels that use a FIML objective function, and then a top model that uses an MxAlgebra objective function. Do I compute fit statistics for the submodels? Do I compute fit statistics for the top model? How do I compute fit statistics for the top model with an arbitrary algebra as the objective function (pretend you don't know it's a "+")?

neale's picture
Joined: 07/31/2009

I think it is reasonable to compute the total number of statistics being used. What is tricky is the mixture distribution case, in which the same data are used multiple times (and mentioned in each of the components of the mixture). So probably it is necessary to check that each dataset is not the same as a previous one.

How to do this check cleanly is not clear. It was pretty simple in Mx1 because the components of a mixture were always specified in one data group, which had one dataset attached to it. In OpenMx things are a good deal more flexible: the same or different datasets could be applied to different components of a mixture. Ordinarily, it would not be a mixture distribution if different datasets were being applied. Thus mixture distributions could in principle be used in a different way in OpenMx; whether this is a good or bad thing is open to question.

So it is not sufficient just to examine the dataframe, and the variables within it, being used as data for a particular model. In principle, the dataframe could be named differently and could have variables with different names, yet be exactly the same data. So I would recommend some form of is.samebloodything(dataframe1, dataframe2) function, which would first test whether the datasets are the same by dataframe name and variables. If these are different, we can then perform the more costly check of whether they are physically identical (same values down each column).

Note, however, that this check is still not sufficient, because the columns could be reordered from one frame to the next. So each column in dataframe1 needs to be compared with every column in dataframe2. In the event of a partial match (say column 2 in dataframe1 is the same as column 3 in dataframe2 but otherwise everything is unique), the statistics from the matching column should not be added to the total count a second time. Phew, quite expensive at times, this additional flexibility... Luckily such tests only have to be carried out once for each model.
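
The column-by-column check described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: countObservedStatistics is a hypothetical helper, not an OpenMx function, and it counts one observed statistic per non-missing value in raw data. It compares column values while ignoring dataframe and variable names, so renamed or reordered copies of the same data are counted once.

```r
# Hypothetical helper (not part of OpenMx): count observed statistics across
# several data frames, skipping any column whose values duplicate a column
# that has already been counted, whatever its name or position.
countObservedStatistics <- function(dataFrames) {
  seen <- list()   # columns counted so far, stored as bare vectors
  total <- 0
  for (df in dataFrames) {
    for (column in df) {          # a data frame iterates column by column
      values <- unname(column)
      alreadyCounted <- any(vapply(seen, function(s) identical(s, values),
                                   logical(1)))
      if (!alreadyCounted) {
        total <- total + sum(!is.na(values))   # non-missing values only
        seen[[length(seen) + 1]] <- values
      }
    }
  }
  total
}

# Made-up illustrative data: 'mixtureCopy' is 'mz' renamed and reordered.
mz <- data.frame(t1 = c(1.2, 0.7, NA), t2 = c(0.9, 1.1, 1.4))
mixtureCopy <- data.frame(b = mz$t2, a = mz$t1)
dz <- data.frame(t1 = c(0.3, 1.8, 2.2), t2 = c(1.0, NA, 0.5))

countObservedStatistics(list(mz, mixtureCopy))   # 5: the copy adds nothing
countObservedStatistics(list(mz, dz))            # 10: distinct data sets
```

Two caveats on this sketch: identical() is strict, so an integer column and a numeric column holding the same numbers are treated as different, and a copy with shuffled rows is not detected. It also skips duplicate columns within a single data frame, which may or may not be what is wanted.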

tbates's picture
Joined: 07/31/2009

A concrete example using the OpenMx script:
trunk/models/passing/univACEP.R
and its Mx 1 counterpart
trunk/models/passing/mx-scripts/univACE.mx

Mx 1.x allows the user to pass in a -2LL and df from a saturated model, and reports the following:

Your model has    4 estimated parameters and   1777 Observed statistics
 
 -2 times log-likelihood of data >>>  4067.663
 Degrees of freedom >>>>>>>>>>>>>>>>      1773
 
 Saturated model fit* >>>>>>>>>>>  4055.935
 Saturated model df*  >>>>>>>>>>>      1767
 Difference Chi-squared  >>>>>>>>    11.728
 Difference d.f.  >>>>>>>>>>>>>>>         6
 Probability >>>>>>>>>>>>>>>>>>>>      .068
 Akaike's Information Criterion >     -.272
 * Saturated model statistic supplied by user
 
OpenMx reports
Observed statistics:  0 
Estimated parameters:  4 
Degrees of freedom:  -4 
-2 log likelihood:  4067.663 
Saturated -2 log likelihood:  
Chi-Square:  
p:  
AIC (Mx):  
BIC (Mx):  
adjusted BIC: 
RMSEA: 

It would be good to get the observed statistics right, which would then flow through to the df.
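
For reference, the derived quantities in the Mx 1.x output can be reproduced in plain R from the two -2LL values and their degrees of freedom. This is a sketch using only base R; the variable names are made up for illustration, and the AIC line follows the chi-square minus 2 df form that matches the Mx output shown.

```r
# Values copied from the Mx 1.x output for the univariate ACE example.
minus2LL     <- 4067.663   # -2 log-likelihood of the fitted model
modelDf      <- 1777 - 4   # observed statistics minus estimated parameters
saturated2LL <- 4055.935   # saturated model fit supplied by the user
saturatedDf  <- 1767       # saturated model df supplied by the user

chiSq <- minus2LL - saturated2LL    # difference chi-squared: 11.728
chiDf <- modelDf - saturatedDf      # difference df: 6
p     <- pchisq(chiSq, df = chiDf, lower.tail = FALSE)   # about 0.068
aicMx <- chiSq - 2 * chiDf          # AIC as reported by Mx: -0.272

round(c(chiSq = chiSq, df = chiDf, p = p, AIC = aicMx), 3)
```

Everything after the -2LL depends on the observed-statistics count, which is why a count of 0 leaves the rest of the OpenMx summary blank.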

mspiegel's picture
Joined: 07/31/2009

Ah. Tim Bates' example was very helpful, along with the input from Mike Neale. There is some partial support for multigroup models checked into the subversion repository. Run summary(twinACEFit, SaturatedLikelihood=4055.935) to view the new output. Several questions remain:

  • How to calculate the number of observations? The current approach is to sum up the number of observations across all data sets. If two data sets are identical, I should probably not count them twice (TODO). However, if two data sets contain some identical columns and some non-identical columns, what to do? The number of observations may be manually specified using a 'numObs=' argument to the summary function. Note that this is a harder problem than calculating degrees of freedom (where each column can be checked independently). For calculating degrees of freedom I used Mike Neale's suggestions.
  • How to compute the F value? The current approach is to use '-2 log-likelihood' if all datasets are raw, or 'chi' if all datasets are covariance matrices, or 'NA' otherwise.
  • How do I use the saturated model degrees of freedom?
  • How do I correct the probability and AIC calculations?
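
The rule for choosing the F value in the second bullet can be written down as a small R function. This is a hedged sketch, not the code checked into the repository: fitValueType is a hypothetical name, and the "raw" and "cov" labels are assumed stand-ins for however each data set's type is recorded.

```r
# Hypothetical helper: 'types' holds one label per data set in the model,
# assumed here to be "raw" for raw data or "cov" for a covariance matrix.
fitValueType <- function(types) {
  if (length(types) > 0 && all(types == "raw")) {
    return("-2 log-likelihood")   # every data set is raw
  }
  if (length(types) > 0 && all(types == "cov")) {
    return("chi")                 # every data set is a covariance matrix
  }
  NA_character_                   # mixed or unknown types: no single F value
}

fitValueType(c("raw", "raw"))   # "-2 log-likelihood"
fitValueType(c("cov", "cov"))   # "chi"
fitValueType(c("raw", "cov"))   # NA
```
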
irebollo's picture
Joined: 09/24/2009

Hi,
I agree: at the least we should get the number of observed statistics and the df right. It should also be possible to get those out independently of the summary (using, e.g., mxEval), so that one can compute chi-square and p values using R commands.

tbates's picture
Joined: 07/31/2009

warning on: I am not a great person to ask here...

> Do I compute fit statistics for the sub-models?

That would be helpful for people wanting to see straightforwardly which parts of the supermodel were contributing to bad fit.

> Do I compute fit statistics for the top model?
That's the goal.

> How do I compute fit when the top model has an arbitrary algebra as the objective?

I think in the first instance it would be fine to assume the user knew what they were doing when specifying the objective, so its likelihood is correctly scaled.