Many people who wish to contribute code and examples to the OpenMx project are held back by data. Specifically, OpenMx users may be prevented from posting their data to a public forum, or may be unwilling to do so. This guide will walk you through some methods of creating simulated data, with the primary discussion surrounding the fakeData() function, which creates simulated data from an existing dataset.
While the ensuing sections will deal with creating new datasets meant to resemble an existing dataset, no simulated dataset will contain all of the information present in the original. For the purposes of anonymizing one's data, this is a good thing. However, this also means that the simulated dataset will yield different parameter estimates and fit statistics when fit with a model, and may yield different error messages as well. The only way to retain all of the information in an existing dataset is to use the original data. Choosing a method that balances accurate representation of the data against the researcher's data sharing plan is up to the individual researcher.
A function called fakeData() exists to assist users who wish to use an existing dataset as a template for creating a new, similar dataset. This function takes an existing dataset, calculates the means and covariances within the data using the polycor package, and samples data from the multivariate normal distribution implied by those means and covariances using the mvtnorm package. The existing data may contain any combination of numerical (continuous) variables and ordered factors: the covariances involving ordered factors are estimated through either biserial or polychoric correlations.
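The two steps described above (summarize the template, then sample from the implied multivariate normal distribution) can be sketched for the continuous-only case in base R. This is a minimal illustration rather than the fakeData() source: the template data are invented, and the Cholesky-based sampler stands in for what a package such as mvtnorm would provide.

```r
set.seed(1)

# Hypothetical template data with three correlated continuous variables.
template <- data.frame(x = rnorm(200, mean = 10, sd = 2),
                       y = rnorm(200, mean = 5, sd = 1))
template$z <- 0.5 * template$x + rnorm(200)

# Step 1: summarize the template by its means and covariance matrix.
mu    <- colMeans(template)
sigma <- cov(template)

# Step 2: draw from N(mu, sigma). If Z has independent standard normal
# columns and R = chol(sigma), so that t(R) %*% R equals sigma, then
# Z %*% R has covariance sigma. Adding mu to each column sets the means.
n <- nrow(template)
k <- ncol(template)
Z <- matrix(rnorm(n * k), nrow = n, ncol = k)
sim <- as.data.frame(sweep(Z %*% chol(sigma), 2, mu, "+"))
names(sim) <- names(template)

# Compare the simulated summaries with mu and sigma.
round(colMeans(sim), 2)
round(cov(sim), 2)
```

The simulated means and covariances will be close to, but not identical to, those of the template, which is the sampling variation discussed throughout this guide.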
The options for the fakeData() function are discussed briefly here. The only required argument is the dataset argument, which specifies the dataset to be used as a template. This must be either a matrix or data frame, and any categorical variables must be declared as ordered factors (unordered factors will be identified and will trigger a warning). If no other options are specified, then the simulated data will have the same sample size, variable names, level names (for ordered factors), pattern of missingness and frequency counts for each observed category of the ordered factors. The digits argument affects how the randomly generated data are rounded, with a default value of two digits beyond the decimal point.
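Because categorical variables must be declared as ordered factors before they are used as a template, a short base R sketch of that declaration may help. The variable and its labels are hypothetical.

```r
# Hypothetical survey item stored as plain text, with one missing value.
raw <- c("agree", "disagree", "neutral", "agree", NA, "disagree")

# Declare the category order explicitly. Without `levels`, R would sort
# the labels alphabetically, which is rarely the intended order.
item <- factor(raw,
               levels  = c("disagree", "neutral", "agree"),
               ordered = TRUE)

is.ordered(item)               # confirms the factor is ordered
levels(item)                   # lowest to highest category
table(item, useNA = "ifany")   # frequency counts, including missing
```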
Several other arguments can be used to make the simulated data differ from the original data, though all are optional. The n argument allows the user to change the sample size (i.e., the number of rows) in the simulated data. Increasing this value will generally make the means and covariances in the simulated data more closely resemble the input data, while decreasing this value will allow for greater discrepancies between the input and simulated data due to sampling variation. The naming arguments, including use.levels, specify whether the existing variable names and ordinal factor level labels will be applied to the simulated data. The use.miss argument specifies whether the existing missingness in the data should be preserved in the simulated data, or whether no missingness should be included. Additionally, arguments such as het.ML pass options to the underlying packages, including polycor, and het.suppress controls whether warnings from the hetcor function are suppressed. Those warnings can be useful for diagnosing potential problems, while suppressing them cleans up the output.
fakeData() was originally designed to assist OpenMx users in diagnosing errors by allowing them to share data that replicates their error without sharing their actual data. As such, this function favors speed over precision. The means and variances of the generated data are based on the univariate distributions of the input data, and the covariances are based on bivariate relationships that ignore missing data, which essentially assumes that data are missing completely at random (MCAR). When data are missing at random (MAR), estimating full covariance matrices in OpenMx will give more accurate answers. When data are missing not at random (MNAR), both methods will give biased answers.
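The univariate and bivariate treatment of missing data described above corresponds to what base R calls pairwise-complete estimation. The sketch below uses invented data and base R's cov() to illustrate the idea; it is not taken from the fakeData() source.

```r
set.seed(20)

# Hypothetical data with values removed at random (MCAR by construction).
dat <- data.frame(a = rnorm(100), b = rnorm(100))
dat$c <- dat$a + rnorm(100)
dat$a[sample(100, 15)] <- NA
dat$b[sample(100, 10)] <- NA

# Means use every observed value of each variable.
mu <- colMeans(dat, na.rm = TRUE)

# Each covariance uses every row where both variables are observed.
sigma_pairwise <- cov(dat, use = "pairwise.complete.obs")

# For comparison: listwise deletion drops any row with a missing value.
sigma_listwise <- cov(dat, use = "complete.obs")

round(sigma_pairwise, 2)
round(sigma_listwise, 2)
```

One known cost of the pairwise approach is that the resulting matrix is not guaranteed to be positive definite, because each element can be based on a different subset of rows.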
It should be noted that the use.miss, use.levels and n arguments are all somewhat interdependent. When n is specified to be a value different from the sample size of the input dataset, both the distribution of the ordinal factors and the pattern of missingness in the simulated data are sampled from the input data, and thus won't exactly mirror the input. Setting use.miss to FALSE will also change the number of non-missing values for ordered factors. In both of these cases, it is possible that the simulated ordered factors will have fewer categories than the original data. When this occurs, the use.levels argument will be ignored and a message will be issued. The likelihood of this will increase with low-frequency categories and large reductions in sample size. Likewise, the proportion of missing data will vary slightly when a value of n other than the observed sample size is used.
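The loss of a low-frequency category under a reduced sample size can be illustrated with a small base R sketch. The factor below is hypothetical, and simple resampling stands in for whatever sampling scheme fakeData() uses internally.

```r
set.seed(26)

# Hypothetical ordered factor in which "high" is observed only twice
# in 100 cases.
orig <- factor(c(rep("low", 60), rep("mid", 38), rep("high", 2)),
               levels = c("low", "mid", "high"), ordered = TRUE)

# Sample a much smaller dataset from the observed distribution.
small <- sample(orig, size = 10, replace = TRUE)
table(small)

# Categories that were never drawn disappear once unused levels are
# dropped, so the original level labels may no longer line up.
observed <- droplevels(small)
nlevels(observed)
nlevels(orig)
```

With only ten draws, the rare category is frequently absent, which is the situation in which the original level labels cannot be applied.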
Generating data when ordered factors are present depends on the estimation of a heterogeneous correlation matrix, which allows for estimation of correlations between all combinations of numeric variables and factors. As the number of variables and the number of categories in the ordered factors increase, this estimation grows more complex and computational time increases. This estimation is responsible for the bulk of the computation in the fakeData() function when ordinal data are present, and may lead to excessively long processing times. If repeated datasets are desired from the same input dataset, manually executing the individual lines of the function or other user-specified coding will prevent repeated estimation of the heterogeneous correlation matrix.
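The estimate-once, sample-many pattern can be sketched as follows. In this illustration the inexpensive cov() call on continuous data stands in for the costly heterogeneous correlation estimation, and draw_dataset() is a hypothetical helper, not part of fakeData().

```r
set.seed(28)

# Hypothetical template data.
template <- data.frame(x = rnorm(150), y = rnorm(150))
template$z <- template$x - template$y + rnorm(150)

# One-time estimation step. With ordered factors, this is where the
# expensive correlation estimation would happen.
mu <- colMeans(template)
R  <- chol(cov(template))

# Hypothetical helper that reuses the stored estimates for each new draw.
draw_dataset <- function(n) {
  Z   <- matrix(rnorm(n * length(mu)), nrow = n)
  out <- as.data.frame(sweep(Z %*% R, 2, mu, "+"))
  names(out) <- names(mu)
  out
}

# Five simulated datasets, with no re-estimation between them.
sims <- replicate(5, draw_dataset(100), simplify = FALSE)
sapply(sims, nrow)
```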
There are several instances when the fakeData() function is not appropriate. You should not use this function when your data contains any of the following:
Generating data from the first four conditions requires more complex simulation structures than are provided by the fakeData() function. The fifth condition can be easily corrected using R's factor() function. While there are undoubtedly other ways to simulate data, the fakeData() function provides a relatively easy method.
While there are many ways to simulate data, the general process of simulating data can be thought of in three steps:
Selecting a structure is often the most difficult part of simulating data. When all relationships can be expressed as linear relationships, a package like mvtnorm can be used to sample data from an assumed multivariate normal distribution. Model-like structures can be used as well, allowing for a variety of more complex types of data simulation. Any model that can be expressed as a series of equations can be used to simulate data, though recursive models are somewhat easier in this regard.
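As an illustration of simulating from a model written as a series of equations, the sketch below generates data from a small recursive model in which each variable depends only on variables generated before it. The variable names and coefficient values are invented for the example.

```r
set.seed(37)
n <- 500

# Recursive structure: x -> m -> y, plus a direct path from x to y.
# Each equation uses only variables that already exist, so the data can
# be generated one equation at a time.
x <- rnorm(n)
m <- 0.6 * x + rnorm(n, sd = 0.8)
y <- 0.5 * m + 0.2 * x + rnorm(n, sd = 0.7)

sim <- data.frame(x, m, y)

# Fitting the generating equations should recover values near the
# coefficients used above, up to sampling error.
coef(lm(m ~ x, data = sim))
coef(lm(y ~ m + x, data = sim))
```

Non-recursive models, in which variables influence each other in a loop, cannot be generated one equation at a time in this way, which is why recursive models are the easier case.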
Generating data from the assumed structure is the next part of data simulation. Packages like mvtnorm can be used to sample from multivariate distributions, but R also includes random number generation from a wide variety of non-normal distributions. Packages like sampling can be used for resampling rows from existing datasets, which is more typically used for techniques like bootstrapping. It is important to select distributional forms for your data that fit with theory, model and intended purpose.
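A few of the base R generators mentioned above, together with a simple row resample, are sketched here. The distributions and parameter values are arbitrary choices for illustration, and base R's sample() is used in place of a dedicated resampling package.

```r
set.seed(40)
n <- 300

# Non-normal generators that ship with base R.
counts <- rpois(n, lambda = 3)               # count data
waits  <- rexp(n, rate = 0.5)                # positive, right-skewed data
binary <- rbinom(n, size = 1, prob = 0.3)    # 0/1 indicator

# Resampling rows with replacement from a hypothetical existing dataset,
# as done in bootstrapping.
existing    <- data.frame(id = 1:20, score = rnorm(20))
boot_rows   <- sample(nrow(existing), replace = TRUE)
boot_sample <- existing[boot_rows, ]

summary(counts)
head(boot_sample)
```

Note that resampled rows are real observations, so this approach does not anonymize data in the way that sampling from an estimated distribution does.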
Data formatting is the final step in the process. Simulated data generally include levels of precision far beyond what is commonly found in empirical research. Rounding the simulated values to a realistic precision, converting continuous values to ordinal categories where appropriate, and making similar corrections are all important for creating representative simulated data.
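The two formatting corrections most often needed, rounding and conversion to ordinal categories, can be sketched in base R. The variables, thresholds and labels below are hypothetical.

```r
set.seed(41)

# Hypothetical raw simulated values with far more precision than an
# empirical dataset would have.
raw <- data.frame(height = rnorm(100, mean = 170, sd = 10),
                  latent = rnorm(100))

# Round the continuous variable to one decimal place.
sim <- data.frame(height = round(raw$height, 1))

# Cut the latent continuous variable at fixed thresholds to create an
# ordered factor with three categories.
sim$rating <- cut(raw$latent,
                  breaks = c(-Inf, -0.5, 0.5, Inf),
                  labels = c("low", "medium", "high"),
                  ordered_result = TRUE)

head(sim)
table(sim$rating)
```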
Sharing your data with other researchers is an important part of the scientific process. It is also important to the OpenMx project, as empirical examples provide both realistic tests of the software and interesting examples for other users to use and learn from.
National Institutes of Health: http://grants.nih.gov/grants/policy/data_sharing/
National Science Foundation: http://www.nsf.gov/pubs/2001/gc101/gc101rev1.pdf
Wikipedia Entry (includes summaries of above NIH and NSF statements): http://en.wikipedia.org/wiki/Data_sharing