Mon, 07/26/2010 - 13:59

I have some questions about the correct approach to sort raw data. Imagine the following dataset:

A | B | C |
---|---|---|

1 | 2 | NA |

1 | 3 | NA |

1 | 2 | 2 |

2 | 3 | 3 |

2 | 2 | NA |

2 | 3 | NA |

If neither A, B, nor C are definition variables, then the table represents how OpenMx currently sorts the data. However, we've now created three patterns of missingness, instead of the optimal two patterns of missingness. In this case, I believe the correct solution is to count the # of NA values in each column. Then arrange the priority of the sorting columns according to the # of NA values, with greater NA columns sorted first. When a column is sorted, the default behavior is that all NA values appear at the top of the column.

The confusing part comes along when "A" is a definition variable. Our rules are to first sort according to definition variable columns, and then sort according to non-definition variable columns. That is probably a bad rule. I see two alternative options: (1) first sort according to the number of NA values, and then within the patterns of missingness sort according to definition variables; or (2) first sort according to the definition variables, and then within the definition variables sort according to the patterns of missingness.

Alternative #1

A | B | C |
---|---|---|

1 | 2 | NA |

1 | 3 | NA |

2 | 2 | NA |

2 | 3 | NA |

1 | 2 | 2 |

2 | 3 | 3 |

Alternative #2

A | B | C |
---|---|---|

1 | 2 | NA |

1 | 3 | NA |

1 | 2 | 2 |

2 | 2 | NA |

2 | 3 | NA |

2 | 3 | 3 |

For the non-definition variable situation, I think you're correct about the sorting--the only thing that matters is which columns are NAs, so the ones with identical NA patterns should all be grouped. I would actually recommend sorting based on

`is.na(data)`

so that the values don't matter. That way it doesn't matter what order the columns are sorted in, since identical patterns will clump together. Otherwise, your sort is likely to be thrown off if the missingest column is continuous.For the definition-variable case, performance is likely to be better if we sort by definition-variables first. The reason for this is that the model-implied covariance matrix has to be re-calculated any time a definition variable changes. It can then be reduced several times based on different patterns of missingness. Having the same pattern of missingness but a different definition variable doesn't save any computations--the model-implied covariance matrix has to be recalculated, which means we have re-reduce it as well. There's no significant speedup there. In the case Ryne is talking about with continuous definition variables, I don't know of any techniques that avoid computing a new covariance matrix for every row. (Though if anybody can point me to some, I'm eager to learn about 'em.)

The one situation where I can see a savings for the missingness-first sort is if the definition variable doesn't matter to the covariance matrix computation. That could happen if it's weighted zero for some reason or if the def. var. only matters to the means model and not the covariance model. In that case, the model-implied covariance matrix is unchanged across different rows, and there's speedup to be gained from clustering by the pattern of missingness.

So unless we know for certain that the model-implied covariance matrix doesn't depend on the definition variable, I think we're better off sorting by definition variable first.

I concur that Alternative 1 is best. Definition variables can be continuous; if every row has a different value of the definition variable, the sort becomes useless.

Two side questions, though. First, are there any optimizer based concerns? Is it possible that one or the other will run faster in certain situations, making it likely that users would want to switch the sorting scheme? Second, will one or the other be better for future versions of parallelization?

I believe your non-definition variable solution is correct, sorting on columns with the largest number of NA's, then the next largest, etc.

Per the confusing part, these solutions should be equivalent if the definition variable is being used in the model. However, suppose the definition variable is then dropped from the model (or if we are smart enough to see that the regression weight it is being multiplied by has been set to zero). There would not be a need to re-sort Alternative 1 for missingness, but Alternative 2 would benefit from that operation. Therefore Alternative #1 seems best to me.

Possibly, the efficiency-motivated sorting behavior in the more recent releases have contributed to some decrements in performance of certain scripts. Hermine has one that used to run in 9 hours but now is still spinning its wheels after over 40 hours.