Fri, 11/23/2012 - 05:49

Hi fellow community members,

I'm working with quite large data sets: about 500,000 rows of 4 to 28 variables, with more than 500 covariate patterns. I'm trying to simplify the input data to OpenMx by aggregating covariate patterns, so that I can specify one row per unique pattern and then multiply the objective function by the number of observations with that pattern. My reason for doing this is to save computing time.

I currently specify a sub-model for each pattern and simply set numObs to the number of observations; however, this takes longer than just using the raw data. It seems to me that there should be a much simpler way to do this, but not being a great programmer, I don't know how to find out which functions to alter to achieve it. Preferably, I would like to input the data as "raw", but with only one row per covariate pattern, and then supply an array of values to numObs, each value giving the number of observations of the pattern in the corresponding row.
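For illustration only, the aggregation idea can be sketched outside OpenMx. The following is a minimal Python sketch, not OpenMx code: it assumes complete bivariate-normal data, and the names `neg2_loglik_row`, `raw`, `mean`, and `cov` are hypothetical. It shows that summing the -2 log-likelihood over every raw row gives the same value as evaluating it once per unique row and multiplying by that row's frequency.

```python
import math
from collections import Counter

def neg2_loglik_row(row, mean, cov):
    """-2 log-likelihood of one complete bivariate-normal row (hypothetical helper)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = row[0] - mean[0], row[1] - mean[1]
    # Quadratic form (x - mu)' * inverse(cov) * (x - mu), written out for the 2x2 case.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return 2 * math.log(2 * math.pi) + math.log(det) + quad

# Toy data with repeated rows; values are made up for the example.
raw = [(0, 1), (1, 1), (0, 1), (2, 0), (1, 1), (0, 1)]
mean = (0.5, 0.5)
cov = ((1.0, 0.3), (0.3, 1.0))

# Objective evaluated row by row on the raw data.
full = sum(neg2_loglik_row(r, mean, cov) for r in raw)

# Objective evaluated once per unique row, weighted by its frequency.
patterns = Counter(raw)
weighted = sum(n * neg2_loglik_row(r, mean, cov) for r, n in patterns.items())

print(len(raw), "raw rows,", len(patterns), "unique rows")
print(math.isclose(full, weighted))
```

The two totals agree because the log-likelihood is a sum over independent rows, so identical rows contribute identical terms. Whether this saves time in practice depends on how much duplicate-row handling the software already does.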

I have two questions:

1. Will I gain any computing time by doing this, or is it redundant because the optimizer already does this, or something similar to it?

2. How would I do it? Failing that, an explanation of which functions to alter would help (e.g. mxData, mxFIMLObjective, and something in the NPSOL optimizer?).

/Ralf Kuja-Halkola

The FIML objective function already performs some of these optimizations. We sort the raw data by pattern of missingness, skip over duplicate rows, and avoid repeating calculations that are shared within a pattern of missingness. We've tried converting FIML into a sum of ML objective functions (one per pattern of missingness), but we were unable to get a speedup on our test cases. If a large percentage of your data has a single pattern of missingness (such as no missingness), then you may be able to use ML for that pattern and FIML for the rest of the data. I'd recommend using checkpointing to track the progress of your optimization. Also recall that more threads should help with FIML optimization.
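As a rough illustration of the grouping described above, here is a minimal Python sketch. It is not the OpenMx implementation; it assumes missing values are coded as `None`, and the function `group_by_missingness` is a hypothetical name. Rows are bucketed by pattern of missingness, duplicates within a bucket are counted, and the per-pattern work (here, just finding the observed columns) is done once per pattern.

```python
from collections import Counter, defaultdict

def group_by_missingness(rows):
    """Map each missingness pattern to a Counter of the distinct rows that have it."""
    groups = defaultdict(Counter)
    for row in rows:
        pattern = tuple(value is None for value in row)  # True marks a missing value
        groups[pattern][row] += 1
    return groups

# Toy data; None marks a missing value.
rows = [(1.0, 2.0), (1.0, None), (1.0, 2.0), (None, 3.0), (1.0, None), (0.5, 2.0)]
groups = group_by_missingness(rows)

for pattern in sorted(groups):
    # Per-pattern work, done once: which columns are observed. In a likelihood
    # routine this is where the expected covariance would be subset.
    observed = [i for i, missing in enumerate(pattern) if not missing]
    for row, count in groups[pattern].items():
        # Per-distinct-row work; the count stands in for the skipped duplicates.
        print(observed, row, count)
```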

Great! I'll stop my futile attempts at improving the computational speed; it sounds like there are much more brilliant minds at work on solving these issues.

That's OK. Both of us have independently tried the initial approach, which is to convert all of the patterns of missingness into ML objective functions. There is potential for a speedup if we convert only the top N patterns of missingness into ML objective functions (for some small value of N) and use FIML for the rest of the data. Nobody has tried this yet.
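The data-partitioning step of that untried idea might look something like the following Python sketch. It only splits the rows; it does not fit any model and makes no claim about speed. Missing values are assumed to be coded as `None`, and `split_top_patterns` and its arguments are hypothetical names.

```python
from collections import Counter

def missingness_pattern(row):
    """Tuple of booleans, True where the value is missing."""
    return tuple(value is None for value in row)

def split_top_patterns(rows, n_top):
    """Split rows into one group per top-N missingness pattern plus a remainder."""
    freq = Counter(missingness_pattern(r) for r in rows)
    top = [pattern for pattern, _ in freq.most_common(n_top)]
    ml_groups = {pattern: [] for pattern in top}  # candidates for per-pattern ML
    fiml_rows = []                                # everything else stays with FIML
    for r in rows:
        pattern = missingness_pattern(r)
        if pattern in ml_groups:
            ml_groups[pattern].append(r)
        else:
            fiml_rows.append(r)
    return ml_groups, fiml_rows

# Toy data; None marks a missing value.
rows = [(1.0, 2.0), (1.0, None), (1.0, 2.0), (None, 3.0), (1.0, None), (0.5, 2.0)]
ml_groups, fiml_rows = split_top_patterns(rows, n_top=1)
print({p: len(g) for p, g in ml_groups.items()}, len(fiml_rows))
```

Each group in `ml_groups` could then be summarised by its sample means and covariance for an ML objective, while `fiml_rows` would be passed on as raw data.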