Mon, 03/22/2010 - 14:10

From other posts on this forum, I mostly understand why fit indexes are not calculated when using FIML. It sounds like the solution is to fit a saturated model and then use this to calculate these indexes. However, when I do this, the saturated model takes an extremely long time to run, and I am wondering (a) whether I am doing something wrong and (b) if there is any way around this.
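For readers following along, the way a saturated model feeds into fit statistics can be sketched in base R. This is an illustration under assumptions, not OpenMx's own code: the function name `lr_test` and all numbers are hypothetical, and the -2 log-likelihoods and free-parameter counts would come from your two fitted models.

```r
# Likelihood-ratio test of a model of interest against its saturated counterpart.
# Inputs: -2 log-likelihood and number of free parameters for each fitted model.
lr_test <- function(m2ll_model, m2ll_saturated, k_model, k_saturated) {
  chisq <- m2ll_model - m2ll_saturated   # difference in -2LL
  df <- k_saturated - k_model            # difference in free parameters
  list(chisq = chisq, df = df,
       p = pchisq(chisq, df, lower.tail = FALSE))
}

# Hypothetical values, for illustration only.
result <- lr_test(m2ll_model = 5120.4, m2ll_saturated = 5100.1,
                  k_model = 18, k_saturated = 27)
result
```

The chi-square and degrees of freedom obtained this way are the ingredients for indexes such as RMSEA, which is why the saturated fit is needed at all.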

I had initially tested a saturated model for a dataset with 24 observed variables (three observed variables at each of eight waves) and about 600 observations. After an hour it was still going, so I stopped it to see how long a smaller model would take. My model is identical to the bivariate saturated matrix model presented in the OpenMx documentation (adjusted for the larger number of variables), and it does run. A saturated model with six variables took about six seconds to run. A model with eight variables took 20 seconds, and a model with 16 variables took about 10 minutes. As I said, I stopped the 24-variable model after an hour.
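To see why the run time climbs so quickly, it helps to count the free parameters in a saturated model: one mean per variable plus every unique variance and covariance. A minimal base R sketch (the function name `saturated_param_count` is made up for this example):

```r
# Free parameters in a saturated model for p observed variables:
# p means plus p * (p + 1) / 2 unique variances and covariances.
saturated_param_count <- function(p) {
  p + p * (p + 1) / 2
}

p <- c(6, 8, 16, 24)
counts <- data.frame(variables = p,
                     free_parameters = saturated_param_count(p))
counts
```

For the sizes mentioned above this gives 27, 44, 152, and 324 free parameters, so the 24-variable model asks the optimizer to estimate twelve times as many parameters as the six-variable one.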

Is this to be expected, and is there anything I can do to speed things up? My ultimate goal is to test this same model across many different variables from a few different data sets (with up to 25 waves and tens of thousands of observations). OpenMx and R have the advantage of being able to automate all of this, but I think this would be close to impossible if the saturated model consistently takes many hours (or longer) to run. So any advice would be appreciated.

Thanks,

Rich

Your comment is right on target. Perhaps you can see why we decided to not include the saturated model as part of the default FIML calculation. That model can be run, but it can take a very long time. At the end of the day, once you have a good working model, you may want to run your saturated model overnight.

With that said, we are currently working on changes that should speed up FIML calculations. We have not focused on speed at all yet; so far we have concentrated only on functionality. Remember, we released the first open beta only 6 months ago. However, as we approach a 1.0 release in the next 3-4 months, I expect optimization times to improve substantially.

Remembering that the saturated model only has to be fitted once per set of variables, an hour or two isn't too bad. We would expect optimization times to increase as a function of at least the square of the number of observed variables. However, as Steve says, optimization times should improve dramatically as we tune the code.
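As a rough check on that scaling, the three timings reported earlier in the thread can be put on a log-log scale, where the slope estimates the exponent k in time ~ p^k. This is only a sketch: three approximate timings from one machine are thin evidence, and the variable names are invented for the example.

```r
# Timings reported earlier in this thread (approximate, one run each).
n_vars  <- c(6, 8, 16)
seconds <- c(6, 20, 600)

# The slope of log(time) on log(variables) estimates k in time ~ p^k.
fit <- lm(log(seconds) ~ log(n_vars))
exponent <- unname(coef(fit)[2])
exponent
```

On these three points the estimated exponent falls between 4 and 5, which is consistent with "at least the square" and suggests that the current cost grows considerably faster than quadratically.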

Yes, I'm in the somewhat unusual position where the model that I am testing will stay the same, but the set of variables will change (I'm looking at the same basic model for multiwave data with different variables across multiple datasets). So I will need to test many saturated models.

I only asked because I know that Mplus runs the models pretty quickly. So I didn't know if Mplus does something differently or if I was specifying something incorrectly. But it completely makes sense that at this early stage things aren't as fast as they will ultimately be.

And by the way, I am loving OpenMx so far. I have been able to create functions that flexibly specify the correct model regardless of the number of waves or indicators, which wouldn't be possible with other programs. This should more than make up for the delay mentioned in my original post. So I look forward to seeing how development progresses.

In your case, you may also want to look into setting things up so that the many instances you wish to run are distributed over many machines (or cores). As you describe it, each job is independent, so you should be able to run them in parallel.

A starting point would be the BootstrapParallel.R example that uses the snowfall library: http://openmx.psyc.virginia.edu/2010/01/openmx-goes-multicore.

Thanks again for all the help with this. I am starting to investigate doing this in parallel. I read the script above (along with the multicore announcement), but haven't found much other documentation or any other examples. I have had some difficulty figuring out which aspects of the script I need to incorporate to run a simpler model and the corresponding saturated model in parallel.

Are there any other examples in the documentation or forums? I have searched both but have not found any. If not, would it be possible to show how to modify a very simple example (say, the CFA example from the documentation) to run both the model of interest and the saturated model in parallel? This might be broadly useful given how many people have been posting questions about getting fit statistics with FIML. Having a simple example that shows how to run a relatively simple model and the saturated model in parallel may help users build up to more complicated uses of the parallel options. But again, if there are other examples that run different models in parallel, I might be able to figure this out from them.

I'm not sure if there would be a significant speedup from concurrent execution of an unsaturated and a saturated model. If the unsaturated model takes 10 minutes to run, and the saturated model takes 60 minutes, then sequential runtime would be 70 minutes and concurrent runtime would be 60 minutes.

In general, if you have two models that you would like to run concurrently, you will need to do the following (after loading snowfall as in the URL above):
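One way this could look is sketched below; it is an illustration under assumptions, not necessarily the exact steps intended in this reply. It assumes a recent OpenMx release that provides `mxRefModels()` (not available in the early betas discussed in this thread), the `demoOneFactor` data set that ships with OpenMx, and two local cores. The model names `factorModel` and `saturatedModel` are made up for the example.

```r
library(OpenMx)
library(snowfall)

# Example data shipped with OpenMx: five indicators of one factor.
data(demoOneFactor)
manifests <- names(demoOneFactor)

# Model of interest: a one-factor model fitted to raw data (FIML).
factorModel <- mxModel("OneFactor", type = "RAM",
  manifestVars = manifests, latentVars = "G",
  mxPath(from = "G", to = manifests, values = 0.8),         # factor loadings
  mxPath(from = manifests, arrows = 2, values = 1),            # residual variances
  mxPath(from = "G", arrows = 2, free = FALSE, values = 1),    # factor variance fixed at 1
  mxPath(from = "one", to = manifests),                        # means, required for raw data
  mxData(demoOneFactor, type = "raw"))

# Build (but do not yet run) the matching saturated model.
saturatedModel <- mxRefModels(factorModel, run = FALSE)$Saturated

# Start two workers, load OpenMx on each, and run both models concurrently.
sfInit(parallel = TRUE, cpus = 2)
sfLibrary(OpenMx)
fits <- sfLapply(list(factorModel, saturatedModel), mxRun)
sfStop()

# Likelihood-ratio chi-square and degrees of freedom from the two fits.
chisq <- fits[[1]]$output$Minus2LogLikelihood -
         fits[[2]]$output$Minus2LogLikelihood
df <- length(omxGetParameters(fits[[2]])) - length(omxGetParameters(fits[[1]]))
c(chisq = chisq, df = df)
```

Because `sfLapply()` simply maps `mxRun` over a list, the same pattern should extend to a longer list of independent models, for example one model per variable set.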

Great, thanks. Good point about the benefits of doing this with a saturated model. However, I think the simple saturated model case will be useful just to make sure I know how to get this working. From there I think that I will be able to generalize to other models that could be run in parallel.

Thanks for the info; that helps. I just wanted to make sure I wasn't doing something wrong.