Question about handling covariates in SEM: Say you suspect sex should be controlled in a twin design. How do these three models differ in what they are testing?
1. An ACE model with raw data and sex in the means model (m + b*data.sex).
2. An ACE model with raw data residualized for sex.
3. An ACE model allowing for sex-limitation but no covariate.

I am very new to twin analyses and am trying to follow the script provided on the OpenMx website http://openmx.psyc.virginia.edu/repoview/1/trunk/demo/UnivariateTwinAnal... to analyze my dataset. I would like to know what modifications I would need to make to the existing script to adjust for or control covariate effects (in my case, age and gender) in order to get unbiased estimates of A, C, and E.

In the case of continuous data, 1 and 2 are (imo) equivalent. With ordinal or binary data, 1 is very straightforward: it proceeds by expressing the thresholds as deviations from means that are predicted from the covariate values. However, method 2 is problematic for binary data. Adjusting a 0/1 phenotype defined by a threshold would seem to be possible only probabilistically; I do not know a straightforward approach to this problem.
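For the continuous case, the claimed equivalence of 1 and 2 is easy to check numerically: the residual variance left for the ACE model is the same whether sex enters the means model or the data are residualized first. A minimal sketch (illustrative Python with made-up data and effect sizes; OpenMx itself is R):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
sex = rng.integers(0, 2, n).astype(float)
y = 1.0 + 0.5 * sex + rng.normal(0.0, 1.0, n)   # made-up phenotype

# Approach 1: sex in the means model, y = m + b*sex + e
X = np.column_stack([np.ones(n), sex])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_joint = np.mean((y - X @ beta) ** 2)      # ML residual variance

# Approach 2: residualize y on sex first, then model the residuals
resid = y - X @ beta
sigma2_resid = np.mean(resid ** 2)

# The variance handed to the ACE model is identical either way.
```

The equivalence holds because, for continuous data, subtracting the fitted means leaves exactly the residuals whose covariance structure the ACE model decomposes.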

Item 3 addresses an entirely different question. While 1 and 2 examine whether the covariate (sex) influences the mean, 3 examines whether it changes variation. In an ACE model, the change in variability can be partitioned into A, C and E components. So it asks whether, for example, genetic factors have a greater influence on individual differences in a trait in men than in women. There are some limitations to this approach with binary data, as discussed in Behav Genet. 2009 Mar;39(2):220-9 by Drs. Medland, Neale (Ben), Eaves and myself.

Discussions here noted that while 1 and 2 should be the same, there are the following benefits to giving OpenMx residualised variables:

1. lm/glm/nlm/rlm can handle missing values (should OpenMx throw an error when it sees NAs in a data.var?)

2. Removing the need to model covariates can often lead to more stable solutions from the optimizer (reduced noise from covariates in the raw data)

3. Models of course also run a lot faster, which can be an issue.

Regarding categoricals, I wonder whether, if data.var covariates are used with binary (or other categorical) phenotypes, the correct thing to do would be to express the outcome as a continuous likelihood of category membership?

If the manifest or latent variables linked to the data.var covariates are categorical, should OpenMx throw an error?

I am not sure that 1 confers any benefits. The action of lm with missing covariates is not entirely clear to me, but the docs say that it depends on the na.action argument. If the dependent variable is missing, then it seems reasonable to have a residual of that variable which is also missing. If the independent variable is missing (this is the case when a definition variable is missing) then there may be options. By default, lm would omit such cases, and the residual might be set to missing. Mx1 would drop cases where there are missing values for the independent (definition) variable. In the case of OpenMx, we would likely want to na.exclude rows where definition variables are missing; this would parallel the action of Mx1.

Points 2 and 3 are taken, and there are clear computational benefits to this approach. Point 2 works better because the number of parameters being optimized is smaller.

Definition variables are of course much more general. For example they can be used to regress out the effects of covariates on latent variables, or to moderate the relationships between latent variables. But I digress.

If the dependent variables are binary or ordinal, using definition variables is fine. This is another advantage of using definition variables. So my response to your last question "If the manifest or latent variables linked to the data.var covariates are categorical, should OpenMx throw an error?" is no.

The big issue is when the definition variable is missing. I am not presently in favour of using a probability of group membership as observed data. That is, instead of a zero or one binary outcome, we would use .9 or similar, being the residual of the binary variable after regressing out the covariate. It seems to me that distributional assumptions would be violated. Please note, I am ignorant of literature in this area!

However, we could use probabilities in a different way. Assume that the regression is not affected by the missingness mechanism (i.e. data are missing at random) and we have observed a zero on a case for which the definition variable is missing. Suppose that in 3/4 of subjects the definition variable is 0 and in 1/4 it is 1. In this case we could express the likelihood of an observation of a zero as a mixture distribution, with .75 weighting of the likelihood with the covariate set to zero, and .25 weighting of the likelihood with the covariate set to 1. This, I intuit, would avoid discarding the case because its covariate was missing, but would yield maximum likelihood estimates with their usual nice properties. This approach would work in the ordinal or continuous case. It would be more difficult, however, if there are many binary or ordinal definition variables. All possible combinations of the definition variables would need to be considered, so the number of mixture components would increase exponentially.
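The mixture idea can be made concrete with a small sketch, here for a probit threshold model of a binary phenotype (the threshold, covariate effect, and covariate frequency are made-up numbers; Python stands in for OpenMx):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Threshold model: P(y = 1 | d) = Phi(b*d - t) for binary covariate d.
t, b = 0.5, 0.8    # illustrative threshold and covariate effect
p_d1 = 0.25        # P(d = 1): the covariate's frequency in the sample

def lik(y, d):
    """Likelihood of a binary observation y given covariate value d."""
    p1 = Phi(b * d - t)
    return p1 if y == 1 else 1.0 - p1

# Case with y = 0 observed but d missing: weight the two conditional
# likelihoods by the covariate's marginal distribution (MAR assumed).
y = 0
mixture_lik = (1.0 - p_d1) * lik(y, 0) + p_d1 * lik(y, 1)
```

The case contributes this weighted likelihood to the fit function instead of being dropped, which is the point of the mixture formulation.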

In the event that the definition variable is continuous, there is a similar problem; the number of possible definition variable values is infinite, so the number of mixture components is infinite, which would take a long time to compute :). One might, however, attack this problem by integrating over the distribution of the definition variable using a finite set of mixture components. This approach is similar to that described in the Schmitt et al. (2006) article http://www.vipbg.vcu.edu/vipbg/neale-articles.shtml where the integration is over the normally distributed latent common factor. In the case of the definition variable integration, we could use its empirical distribution if we weren't prepared to hazard a guess as to its distribution in the population. The number of mixture components for a single missing definition variable would be determined by the number of abscissae chosen to approximate the distribution.
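Under an assumed normal distribution for the missing definition variable, the finite set of mixture components can be placed at Gauss-Hermite abscissae. A sketch (the regression coefficient, residual variance, and node count are illustrative; numpy stands in for OpenMx):

```python
import numpy as np

# Missing continuous definition variable x, assumed x ~ N(0, 1).
# Conditional model: y | x ~ N(b*x, s2). The marginal likelihood of y
# integrates over x, approximated by a finite mixture of conditional
# likelihoods evaluated at Gauss-Hermite abscissae.
b, s2 = 0.6, 1.0
nodes, weights = np.polynomial.hermite_e.hermegauss(21)
w = weights / weights.sum()        # probability weights for N(0, 1)

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

y = 1.3
approx = np.sum(w * normal_pdf(y, b * nodes, s2))

# Analytic check: marginally, y ~ N(0, s2 + b**2), so the quadrature
# approximation should recover that density at y.
exact = normal_pdf(y, 0.0, s2 + b ** 2)
```

With an empirical rather than normal distribution for the definition variable, the nodes and weights would simply be the observed values and their relative frequencies.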

Sorry for the long involved post. Perhaps it is now clearer why Mx1 dropped cases when the definition variable was missing. And how nice it is that we can potentially do so much better with OpenMx without my having to disappear into a cave to program it.
