Fri, 06/17/2011 - 04:33

Hi!

I have a script for ordinal twin data that takes about 5-6 hours to run on my laptop. I was wondering whether it would be beneficial in this particular case to parallelize the work in order to reduce the computation time. Would it be possible to adapt this script to do the job?

I have been reading the notes in the OpenMx manual on implementing parallelization with the "snowfall" package, but they seemed a bit odd to me (I don't have much experience with parallelizing in R). Any guidance would be much appreciated. I attach the script here in case anybody wishes to have a look.

Best regards,

-Alfredo

| Attachment | Size |
|---|---|
| Script.R | 6.71 KB |

Currently, OpenMx can execute pieces of a model in parallel if they are labelled as independent. Independent pieces do not share free parameters, so I don't think any part of your model can be labelled independent. There are, however, some other techniques for improving performance.
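For what it's worth, here is a minimal sketch of how independent submodels are flagged. The model, matrix, and label names are illustrative, not from the attached script, and per my reading of the manual snowfall should be initialised before OpenMx is loaded for the workers to be used:

```r
# Sketch: independent submodels (sharing no free parameters) may be run
# in parallel. Per the OpenMx manual, snowfall is initialised first.
require(snowfall)
sfInit(parallel = TRUE, cpus = 2)  # start 2 workers
require(OpenMx)

# Two hypothetical submodels with disjoint free parameters
subA <- mxModel("subA", independent = TRUE,
                mxMatrix("Full", 1, 1, free = TRUE, values = 0.5,
                         labels = "a", name = "A"))
subB <- mxModel("subB", independent = TRUE,
                mxMatrix("Full", 1, 1, free = TRUE, values = 0.5,
                         labels = "b", name = "B"))

top <- mxModel("container", subA, subB)
fit <- mxRun(top)  # independent pieces may be dispatched to the workers

sfStop()
```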

I will address the question of parallelization, but make sure you read the section on checkpointing if you are going to run a script for several hours: http://openmx.psyc.virginia.edu/docs/OpenMx/latest/File_Checkpointing.html.
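As a pointer, the relevant options look roughly like this (a sketch; `model` stands in for your actual mxModel, and the option names follow the checkpointing documentation linked above):

```r
require(OpenMx)

model <- mxModel("example")  # placeholder for your actual model

# Write a checkpoint file to the working directory every 10 minutes
model <- mxOption(model, "Checkpoint Directory", ".")
model <- mxOption(model, "Checkpoint Units", "minutes")
model <- mxOption(model, "Checkpoint Count", 10)

fit <- mxRun(model, checkpoint = TRUE)
```

If the run is interrupted, the checkpoint file lets you restart from the last saved parameter values rather than from scratch.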

Michael,

Considering the long computation time using raw data in this job, and the difficulty of parallelizing it, does it make sense to you to fit this longitudinal growth model to correlation matrices (instead of to the raw ordinal data)?

Do you think that there may be issues with accuracy using this alternative approach?

Thanks!

Hi

A few points about fitting models to matrices of polychoric or tetrachoric correlations. First, unlike covariance statistics, the precision of a tetrachoric correlation varies as a function of where the thresholds are. For example, the tetrachoric correlation between two binary variables with a 50:50 split of 0/1 responses will be more precise than one where the variables are split 90:10. So it is necessary to "tell" the model-fitting procedure how accurate the correlations are, and also how much the correlation statistics covary with each other. One approach is to use a weight matrix; Michael Browne described this method in a seminal 1982 paper on "Asymptotically Distribution Free" (ADF) estimation. At this time, however, OpenMx does not have a simple way of estimating a suitable weight matrix. It might be possible to fit a single model with every correlation (and every threshold) free, and use the calculated Hessian as a weight matrix. This procedure would need to be made more efficient (estimating the correlations in a pairwise fashion speeds it up a lot).
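The precision point can be illustrated with a small simulation outside OpenMx. This is a sketch assuming the `MASS` and `polycor` packages are available; with the default two-step estimator, `polychor(..., std.err = TRUE)` returns the estimated sampling variance of the correlation in its `var` component:

```r
library(MASS)     # mvrnorm
library(polycor)  # polychor

set.seed(1)
# Simulate a bivariate normal latent variable with correlation 0.5
latent <- mvrnorm(5000, mu = c(0, 0),
                  Sigma = matrix(c(1, 0.5, 0.5, 1), 2))

# Dichotomize at the mean (roughly 50:50) and at the 90th percentile (90:10)
even <- latent > 0
skew <- latent > qnorm(0.9)

r.even <- polychor(even[, 1], even[, 2], std.err = TRUE)
r.skew <- polychor(skew[, 1], skew[, 2], std.err = TRUE)

# The sampling variance of rho should be noticeably larger
# for the 90:10 split than for the 50:50 split
sqrt(r.even$var)  # standard error, 50:50 split
sqrt(r.skew$var)  # standard error, 90:10 split
```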

Second, an advantage of FIML is that it provides a natural framework for modeling datasets which contain missing values (and most do; be wary of people who say they have no missing data). Missing data patterns can also create variability in the precision of different correlations, and while FIML handles most types of missing data (MCAR and MAR in Little & Rubin terms) well, other methods often do less well.

Third, you are working with longitudinal growth models, which usually involve making predictions about the means (thresholds, in the ordinal case) as well as the covariances. It would therefore be important to include the thresholds in the weight matrix, so that we end up fitting the model to both the thresholds and the correlations.

In light of these issues, if you have the patience - or access to a cluster or grid - then I'd still go with the 5-hour version using FIML, and fit a limited set of models. Alternatively, please feel free to write a weight-matrix-calculating function :)

Thank you, Michael! I think I understood the issue. Let me tell you that your kind advice is very much appreciated.

Starting values can have a large impact on estimation time. Running a quick model on the correlation matrix to get good starting values could save you a bit of time in your primary FIML estimation. Similarly, if you care about comparisons to a saturated model, fit that first (overnight, likely), then use the implied covariance matrix to generate starting values just as you would for the correlation matrix approach you proposed.
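A sketch of carrying the estimates over, assuming `quickFit` is the already-run correlation-matrix (or saturated) model and `fimlModel` is the expensive FIML model, with the two sharing free-parameter labels (both names are hypothetical):

```r
require(OpenMx)

# 'quickFit' and 'fimlModel' are hypothetical: a cheap already-fitted
# model and the expensive FIML model, sharing free-parameter labels.
est <- omxGetParameters(quickFit)      # named vector of point estimates

fimlModel <- omxSetParameters(fimlModel,
                              labels = names(est),
                              values = est)

fimlFit <- mxRun(fimlModel)  # FIML run starts from the better values
```

Because `omxSetParameters` matches on labels, only parameters that appear in both models are updated; anything unique to the FIML model keeps its original starting value.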