Thu, 08/06/2009 - 09:46

We will be implementing estimation for ordinal data soon.

We need to lock down an interface for the user to specify what is necessary for categorical estimation.

Issues I can think of right away include:

(1) Specifying data cast. That is, which columns are ordinal? If the observed data is an R dataframe, then this information is already included. Do we insist on dataframes in order to estimate ordinal columns? Or do we add another option somewhere?

(2) Specify high and low thresholds. Highest and lowest thresholds may not occur in every group. We need to give the user a way to supply these.

(3) Is there an additional model/parameter specific component, or is all user-supplied specification attached to a data object?

Thinking ahead, the next logical step following ordered probit model would be generalized linear latent variable models. To accomodate the new class of models, the front-end will need atleast two modifications:

(a) Each variable in the input data, will need to be tagged as being of certain scale type (unordered, ordered, count, interval) etc.

(b) The model will need to specify an error-distribution. There may not be one-to-one correspondance between these two. Each specific distribution will come with its own set-of additional parameters.

For example, for an ordinal variable, a logisitic distribution or even a partial-credit model may be of interest.

Having language constructs to specify both of these will prevent changes in the front-end in the future.

Paras,

Do we need both (a) and (b) for every data item? Are there cases where it matters that a certain DV is count data beyond the fact that (for example) you're interested in using a poisson distribution to model it? If not, is there a benefit to keeping track of the scale type explicitly?

I think this may be beyond the scope of the ordinal data problem. Maybe we should start a new forum topic for it?

You are probably right. (a) The scale may be needed only for count (besides, ordinal/nominal). (b) A more general notion is the error distribution or the link function.

It is beyond the scope of the ordinal data problem, so may not be relevant immediately. Perhaps we should start a separate discussion to see if we can make the model specification general enough without incurring additional costs. The advantage that I see in doing is that the probit threhsold model becomes a specific instance of the more general class of the model! Do you want me to start a thread and present a proposal?

1. Might be a good idea to enforce dataframes: then you can rely on all the type, factor level, column name etc info being present. One less option, and several fewer behind-the-scenes guesses which differ when the data are in a frame or a matrix.

That's not really a hurdle for anyone who has a matrix, a:

a = just data.frame(a);

2. Might makes sense to do that by asking the user to create columns of categorical data user as.factor etc?

Alternatively, when would a user want to specify this in OpenMx code instead of in R data re-coding?

Factor .ne. Ordinal

.............. We need to be very careful about what is ordinal and what is not. As I understand it, factors in R are arbitrary sets of classes of a variable, which do not necessarily have any particular order. FIML ordinal analysis rests on ordered classes where 1<2<3 or a

In general, how will factors be handled?

If someone puts a factor in a dataframe, we need to decide what we are going to do about it.

If it's a binary factor and a predictor it could be used as a dummy code.

If it's a binary factor and a predictor, but also is used as an interaction, it needs to be centered.

If it's a 3 level factor, then what? For now, I suspect we throw an error.

If people play by the R rules and set up their ordered factors with defined numeric equivalents, should we use them as is? I suspect the answer there is yes, if we can determine that they have given us sufficient information.

If people don't want to mess with the jiggery pokery that goes on behind the curtain of R's factors, should we provide them an alternative? Personally, I'd use the alternative if it were available.

Truth in advertising: If you haven't guessed already, I have "luke warm/hate" relationship with R's factors.

Some possible defaults:

If it's an unordered factor (regardless of number of levels), we can treat it as a dummy code or set of dummy codes. So if there's a data column called "says" with three levels, called "Nothing", "Hello", and "Goodbye", we take the first level (saysNothing, in this case) as the base, and make two dummy codes (saysHello and saysGoodbye). This is what lm(), for example, does, and people are likely used to it.

As for centering, many analysis programs center by default, optimize, and then transform back for reporting. I'd recommend this approach or (if folks don't like that one) never centering unless explicitly requested to do so; that way we're at least consistent in our approach.

I'm not quite clear on what you mean by setting up ordered factors with numeric equivalents. As I understand it, R assigns integers (indices into the list of levels()) to each factor as the underlying representation. We ultimately will have to use those as indices into the threshold matrix, but I'm not sure how else we would rely on them.

Generally, I'm inclined to encourage people to lay out their data in the way they want it to be interpreted by OpenMx. I think we should be willing to rely on R's casting if people want to just not worry about it, but for anything that requires more complexity than as.factor(), etc., let's make people do it through R.

Dealing with ordered factors:

There are a couple of issues to be wary of when dealing with ordered factors. First is that in the multiple-group context, not all levels will appear in all groups. Er, I should be saying mxModels, but you know what I mean. So automatic assignation in multiple instances just will not work unless all levels are represented in all the subsets of the data. Mx1 assumes a) the lowest ordinal level is always coded 0; and b) the user can optionally provide a Highest command to indicate the values to be considered as highest for each observed variable.

So, I would agree that it might be best to get the user to code their data in this fashion (previous Mx users will be used to it anyway) and that it would be wise to issue a warning if an ordinal variable a) doesn't have any 0's in it; b) if it exceeds what has been supplied as Highest value (should be an error); and c) if the highest is not represented in the data. Conditions a) and b) may happen by design; condition b) is definitely screwy and should throw an error. Possibly, a) and c) could be subsumed in a single even more general warning:

Due to the mxHighest command, the levels possible for variable X1 are:

0 1 2 3 4 5

but levels 0, 3 and 5 were not observed in the dataset. This spottiness of the data has been ignored, but we wanted to let you know the situation.

I'm still leaning on R here.

If we're going to force the user to recode their data anyway, we might as well make it R-friendly. Folks can do the conversion for any data they're concerned about by using ordered() and specifying a levels= argument (say, 0:5). The difficulty with assuming that it's all in the format (0,1,2,3,4,5) is that if somebody specifies a set with levels (1, 10, 100), things get crazy.

In many cases, folks will start with a single data set that they'll split into the different groups anyway, so they can get around the multiple group problem by loading the dataset as a unit, making it ordered(), and then subsetting by the group element. This would enforce consistency of levels across groups. If they're not in the same data set to begin with, it's a bit trickier, but if you're following Mx1 conventions, you can just use

`ordered(group1$column, levels=0:highest)`

on each of them and get the same result.A helper function to equate levels across groups or enforce the Mx1 leveling method might be worthwhile for the users who are used to it, but I don't know that we want to force everybody to use it.

Factors in R are usually nominal, with no distinction in R between ordinal and ratio: It is left to the user to choose appropriate statistics, as it is with departures from normality, linkage functions etc.

I wonder then if it would be better to handle this explicitly, as it more about the model (a choice about how to treat data) than the data per-se.

If we use ordered, I wonder if that might induce both some extra work in the first place (translating numeric ordinal data in ordered factors) and then errors when the ordered-ness is lost at some point in a data-manipulation?

The user could keep the data as numeric and then tell the model which are ordinal variables. Maybe:

ORDINAL = c("enjoymentOfOlderEpisodesOfTopGear") # list ordinals

For compactness, wildcard and negation would be handy

ORDINAL <- "*" # wildcard - all variables ordinal

ORDINAL <- allVars[-"height"]

I see a few issues with keeping an ordinal list in the model.

Foremost, it's another type of syntax that folks will need to learn. R users have probably seen ordered() before, and should be aware of how it works. If we're specifying a new syntax (wildcards, negations, etc.) just for ordinal variables, that's one more thing to learn.

Second, it seems like this sort of view might make it more difficult to specify levels, which come free with R ordered factors. Either way, I guess we have to build our own procedures for thresholds, but at least we know how many thresholds we need, and can be sure which data columns need 'em when the mxData object is created.

Along the same lines, what about string data or non-sequential ordinal data? Somebody could potentially have a factor with levels "Seriously Uncool", "Uncool", "Cool", and "Sub-zero". (Or, for that matter, 1, 5, 10, and 15). If we're keeping level data in the model, we'll need to implement a matching protocol, which we'd be implementing anew when R already supplies it.

We also run into potential clashes where a submodel might be treating the same data element as a thresholded ordinal variable that the overarching model treats as continuous. And depending on the way that thresholds are specified, it could potentially have, say, more levels, or a different order of levels, for the same variable in a model than in its submodels. While you could technically do this with ordered factors, you'd have to at least treat them as though they were separate data columns.

Also, it bloats the model somewhat. This isn't a terrible thing if the added value is worthwhile, but is worth noting.

It seems to me like the overhead of

`x <- ordered(x, levels=1:max(x))`

is minimal compared to the potential snarls of re-implementing what will effectively be the same thing.I would be concerned about using them if ordered factors frequently lose their ordered-ness, though. I've never seen it happen--in your experience, does it happen often enough that we need to be concerned?

Let's use R's ordered factors.

1) While a standard R factor is not necessarily ordinal, R provides the option for a user to specify an order for their factor. The user can specify the order of the variable levels, and even include levels that don't exist in the data.

This adds one more little thing that the user might forget to do that might provide frustrating errors; I don't think

`read.spss()`

, for example, will read an ordinal column as an ordered factor. Still, if we need to make them specify it somewhere (and I think we do), then we might as well use R's internal functions where we can. This doesn't require that we force folks to use data frames; if all the data columns have the same levels in the same order, they can specify a matrix or vector instead.2) People will need to be able to specify a threshold vector for each ordered factor, as well. In general, this is likely to be just a vector of free parameters, but it's possible that people will want to fix, constrain, or use the threshold values. It seems like this could be handled in the mxData object (perhaps with a threshold matrix or a list of threshold vectors). We could recommend something like:

`ordinalX <- ordered(x, levels=1:numLevels)`

`thresh <- mxMatrix(type=FULL, nrow=1, ncol=numLevels, values=thresholdStartValues, free=T)`

`ordData <- mxData(ordinalX, type="raw", thresholds=thresh)`

I like this a lot. A few points:

1) the numLevels may differ between variables in the dataframe x. So levels would need to be able to contain a list:

ordinalX <- ordered(x, levels=c(1:numLevelsx1,1:numLevelsx2,2)

2) the number of levels of an ordinal variable is one more than the number of thresholds. So perhaps

thresh <- mxMatrix(type=FULL, nrow=1, ncol=numLevels, values=thresholdStartValues, free=T)

should read

thresh <- mxMatrix(type=FULL, nrow=1, ncol=numLevels-1, values=thresholdStartValues, free=T)

3) Mx1 uses a matrix with columns for variables, and rows for thresholds, which is somewhat consistent with modeling means as a row vector, so I would prefer column vectors for thresholds:

thresh <- mxMatrix(type=FULL, ncol=1, nrow=numLevels-1, values=thresholdStartValues, free=T)

I'm not sure whether there would be a need to "glue" the thresholds for different variables together. If so, some filling out of the shorter vectors with NA's up to the maximum threshold vector length would seem appropriate.

4) It will be a common approach to fix the first two thresholds to zero and one, and to estimate mean and variance instead. This innovation, due to Paras, is very useful in the context of models that make explicit predictions about means and variances, rather than just soaking them up with free parameters or error terms.