Using Correlation Matrices vs. Raw Data as Input

5 replies [Last post]
kspoon's picture
Joined: 06/17/2011

While it's relevant to both the Behavioral Genetics and OpenMx General Help forums, the overall question is fairly general, so I figured it should be posted here.

I'm currently working on fitting a multivariate twin model with ordinal data using OpenMx. The idea is eventually to look at common and independent pathway models with 8 ordinal measures. Before scaling up to this, I started with just 5 of the variables. When I tried to fit an ACE model, it took roughly 10 minutes to run and returned a code RED. I then ran Ryne's start-value loop (THANK YOU) with a few modifications to see if I could either clear the code RED or at least find a best model over 50 iterations; every run resulted in a code RED, and the -2LL varied across runs.

I scaled it back again, to 3 variables, and found similar issues. The running time wasn't as long, but I was still getting code RED. This time, 8 of the 50 starting-value runs came up error-free (code GREEN) with the same answer, so I can at least feel confident about the estimates from that best model.

My coworker asked whether I could fit the model to the polychoric correlation matrix instead of the raw data to get around some of these issues. I generated MZ and DZ correlation matrices using hetcor() in the polycor package. You can see how they compare to the correlation matrices from the saturated model:

MZ Correlation from Saturated Model
       var1A  var2A  var3A  var1B  var2B  var3B
var1A  1.000  0.256  0.284  0.626  0.051  0.343
var2A  0.256  1.000  0.448  0.107  0.418  0.383
var3A  0.284  0.448  1.000  0.079  0.287  0.721
var1B  0.626  0.107  0.079  1.000  0.285  0.326
var2B  0.051  0.418  0.287  0.285  1.000  0.543
var3B  0.343  0.383  0.721  0.326  0.543  1.000

MZ Correlation from hetcor
       var1A  var2A  var3A  var1B  var2B  var3B
var1A  1.000  0.279  0.374  0.627  0.075  0.380
var2A  0.279  1.000  0.436  0.101  0.421  0.356
var3A  0.374  0.436  1.000  0.143  0.265  0.733
var1B  0.627  0.101  0.143  1.000  0.264  0.325
var2B  0.075  0.421  0.265  0.264  1.000  0.549
var3B  0.380  0.356  0.733  0.325  0.549  1.000

Running with the correlation matrices took a matter of seconds and didn't produce an error code. (Estimates comparing the "best" model from the series of 50 start-value runs to those from the correlation-matrix run are attached. There are differences of >.1 in some estimates...)
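For context, a hedged sketch of how such an MZ polychoric matrix might be generated with hetcor(); the data frame `mzData` and the exact call are illustrative, not taken from the original analysis:

```r
# Assumes the polycor package; hetcor() computes polychoric correlations
# between ordered factors (and polyserials/Pearsons for mixed data).
library(polycor)

mzVars <- c("var1A", "var2A", "var3A", "var1B", "var2B", "var3B")
mzCor  <- hetcor(mzData[, mzVars], ML = TRUE, std.err = TRUE)$correlations
round(mzCor, 3)
```

The `std.err = TRUE` part matters for the weighting issues discussed further down the thread, since it also returns a standard error for each correlation.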

Finally, my question: is it valid to use the correlation matrix like this even though we have the raw data? I'm a bit concerned given the difference in estimates, but I know a lot of people fit structural equation models to correlation or covariance matrices. Any insight would be appreciated.

We've already tried something similar with just 4 variables, and I couldn't get rid of the code RED or get agreement in the -2LL across the start-value loop. I can see this problem getting much, much worse as we add variables, so it would be lovely if we could spend a fraction of the time by fitting to correlation matrices instead.

Attachment: estimates.txt (1.71 KB)
mhunter's picture
Joined: 07/31/2009
There are a number of parts

There are a number of parts to your question, so I'll try to address them as best I can.

First, when you have (a) continuous data, (b) no missing values, (c) a large sample size, and (d) all variables have the same variance, then using the covariance matrix, the correlation matrix, or the raw data should not matter at all. If you don't have (a), (b), (c), and (d), then things get interesting.

If the observed variables do not all have the same variance (d), then you lose that information by using the correlation matrix instead of the covariance matrix. Using the covariance matrix should be the same as using the raw data, but not the same as using the correlation matrix. This difference can be large if the variables have very different variances.

If the sample size is not large (c) (that is, if (N-1)/N is not close enough to 1.0 for you to stop caring), then using raw data will not be quite the same as using the covariance or correlation matrix. Going from the estimation equations used with raw-data FIML to the cov/cor ML equations assumes that (N-1)/N is 1.0. In practice these estimates should be quite close: if N=50, then 49/50 is 0.98, and (N-1)/N really is usually very close to 1.0.
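A quick illustration of the (N-1)/N point in plain R, with made-up data (nothing here comes from the original models):

```r
# R's cov() uses the 'unbiased' denominator N - 1; the ML covariance
# implied by raw-data FIML uses N. Rescaling one gives the other.
set.seed(1)
N <- 50
x <- matrix(rnorm(N * 2), ncol = 2)

S_unbiased <- cov(x)                   # denominator N - 1
S_ml       <- S_unbiased * (N - 1)/N   # denominator N

# Every entry differs by the same factor 49/50, i.e. by 2%:
max(abs(S_ml - S_unbiased) / abs(S_unbiased))  # -> 0.02
```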

If you have missing values (b) and you use cov/cor matrices instead of raw data, then you are relying on either pairwise-complete data or listwise deletion. With listwise deletion you discard some of the data, throwing out information that could have helped with estimation; this may lead to different estimates when the deleted observations are valuable. With pairwise-complete data, you generate a cov/cor matrix in which each entry is based on a different sample size, so the precision of each observed cov/cor estimate differs. When that matrix is given to OpenMx, all the entries are weighted equally, even though they are not equally precise and should not be. This can produce quite different estimates from raw-data FIML when some pairs of variables have very few observations while other pairs have many.

You can start to see how combinations of the above (a), (b), (c), and (d) can pile up. For now, I'll let others chime in with specific merits of polychoric correlations and raw, ordinal data.

Ryne's picture
Joined: 07/31/2009
I'll follow up Mike's great

I'll follow up Mike's great general response with something more specific to kspoon's problem.

There are a few things that you miss with the pairwise-complete polychorics. Obviously, it's a poor model for missing data, as you're just ignoring it. Less obviously, you're completely ignoring the thresholds: in general because they're not part of your model of the polychorics, and specifically because you're not imposing any constraints on them across twins.

Looking over your expected MZ correlation matrices, it looks like your model/data are just weird. Look at variables 1 and 3: they are more strongly correlated across twins than within, at least in one direction. 1 with 3 within a twin is .284/.374 (saturated/hetcor), but 1 in twin 1 with 3 in twin 2 is .343/.380, while 1 in twin 2 with 3 in twin 1 is .079/.143. I don't know what to make of that, but it should be an issue regardless of data formatting.

kspoon's picture
Joined: 06/17/2011
More questions, as usual...

Okay, I understand the issues with using correlation matrices, but is it ever acceptable to use polychoric correlation matrices?

Missing Data: it affects only a very, very small number of subjects. I'm not sure it makes sense to spend 3 days fitting models (assuming ~30 minutes of run time per fit and the need to loop over multiple start values in the hope of actually finding the best model) for the sake of 10 twin pairs out of 400.

Thresholds: I see the issue here. For these variables in particular, we're looking at 50-60% in each group, and the thresholds were similar across twins and zygosity groups. Would it be more reasonable to use OpenMx's saturated model to obtain the MZ and DZ correlation matrices? We could test equating thresholds in the saturated models, and if that didn't result in a worse fit, use the matrices from the model with equated thresholds in our Cholesky...

Could the wonky (technical term) data be to blame for the very slow model fitting and the failure to converge? I doubt it, since I had no problem fitting models in a matter of seconds to the correlation matrices themselves, but I'm curious whether my experience is commonplace with multivariate ordinal models in OpenMx or is exacerbated by the data itself.

So, three follow up questions embedded in my post:

1.) Is it ever acceptable to use polychoric correlation matrices in SEM?
2.) Would it be more reasonable to use OpenMx's saturated model to obtain the MZ and DZ correlation matrices, as opposed to hetcor()?
3.) Could the wonky (technical term) data be to blame for the very slow model fitting and the failure to converge, or are such issues par for the course in multivariate ordinal analyses?

neale's picture
Joined: 07/31/2009
Polychorics need weights

A big problem with polychorics/polyserials is that their precision depends on where the thresholds are. The more extreme the thresholds, the less precise the polychoric, even with exactly the same sample size. Thus it is not OK to use a polychoric (or hetcor()) matrix and just plug in a value for N (even if there are no missing data at all). Even if all variables had the same (say 50:50) response frequencies, pretending that it is a correlation matrix computed from continuous data would overestimate the precision of the statistics by a lot (maybe 3x). So all comparisons of fit information (-2lnL, AIC, etc.) would be badly biased towards rejecting simpler models in favor of more complex ones, and standard errors and confidence intervals would be biased towards being too small or narrow. I'm beginning to get a sense of 'methinks he protests too much', so I'll stop ranting and offer alternatives :).

Better is to use at least a diagonal weight matrix, such as one calculated from the SEs returned by hetcor() or by the polychoricMatrix function I have posted elsewhere on these forums. Then a diagonally weighted least squares (DWLS) fit function can be used, and all model-fitting proceeds rapidly.
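A hedged sketch of the diagonal-weight idea, assuming the polycor package; the data frame `mzData` and the choice to work with the lower-triangle elements are illustrative, not from the original posts:

```r
# hetcor() with std.err = TRUE returns the polychoric correlations and
# their standard errors in matching matrices.
library(polycor)

het <- hetcor(mzData, ML = TRUE, std.err = TRUE)

r  <- het$correlations[lower.tri(het$correlations)]  # unique correlations
se <- het$std.errors[lower.tri(het$std.errors)]      # their SEs

# Diagonal weight matrix: each correlation weighted by its own
# precision (inverse sampling variance), off-diagonals assumed zero.
W <- diag(1 / se^2)
```

The point of `W` is exactly the precision argument above: correlations based on extreme thresholds get larger SEs and therefore smaller weights, instead of being treated as equally informative.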

Better still (from a statistical point of view) is to use a full weight matrix: essentially the variances of the correlations on the diagonal and the covariances between the correlation estimates off the diagonal. This weight matrix could be computed with one long FIML run (assuming there aren't too many ordinal variables). Model-fitting can then proceed as before by weighted least squares. Note that with WLS it is possible to include information about the thresholds and fit the model to them as well (important for growth curve models, for example). This method sometimes runs into difficulty when the weight matrix is not positive definite, which can happen if certain response categories are rare or the sample size is small. YMMV.

mspiegel's picture
Joined: 07/31/2009
Not sure it makes sense to

"Not sure it makes sense to waste 3 days fitting models."

You may be able to use the multiple cores on your machine to run multiple starting values in parallel; the OpenMx user guide contains a section on multicore execution. Keep in mind that if you are running FIML objective functions under Linux or with the multicore-enabled Mac binaries, then you are already using the multiple cores on your machine. It is not covered in the tutorial, but you can use the 'snowfall' package to run independent submodels in parallel on a cluster of workstations. The parallelism in OpenMx is structured so that using snowfall to run independent submodels disables multicore FIML execution, although this can be overridden.
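A hedged sketch of what the parallel start-value approach might look like with snowfall; `buildACEModel()` is a hypothetical stand-in for your own model setup (with start values jittered per seed), and the CPU count is illustrative:

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 4)   # one worker per core
sfLibrary(OpenMx)                   # load OpenMx on each worker

fitFromSeed <- function(seed) {
  set.seed(seed)
  model <- buildACEModel()          # hypothetical: your mxModel() with
  mxRun(model, silent = TRUE)       #   jittered start values
}

fits <- sfLapply(1:50, fitFromSeed)          # 50 attempts in parallel
lls  <- sapply(fits, function(f) f@output$minimum)
best <- fits[[which.min(lls)]]               # keep the best -2lnL
sfStop()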

Also, turn off the standard error and Hessian calculations to improve performance:

model <- mxOption(model, "Calculate Hessian", "No")  # skip the Hessian
model <- mxOption(model, "Standard Errors", "No")    # skip standard errors