Published on *OpenMx* (http://openmx.psyc.virginia.edu)

Many people who wish to contribute code and examples to the OpenMx project are held back by data. Specifically, OpenMx users may be prevented from posting their data to a public forum, or may be unwilling to do so. This guide will walk you through some methods of creating simulated data, with the primary discussion surrounding the `fakeData()` function, which creates simulated data from an existing dataset.

While the ensuing sections deal with creating new datasets meant to resemble an existing dataset, no simulated dataset will contain all of the information present in the original. For the purposes of anonymizing one's data, this is a good thing. However, it also means that the simulated dataset will yield different parameter estimates and fit statistics when fit with a model, and may yield different error messages as well. The only way to retain all of the information in an existing dataset is to use the original data. Selecting a method that balances accurate representation of the data against the data sharing plan is up to the individual researcher.

A function called `fakeData()` exists to assist users who wish to use an existing dataset as a template for creating a new, similar dataset. This function takes an existing dataset, calculates the means and covariances within the data using the `polycor` package, and samples data from the multivariate normal distribution implied by those means and covariances using the `mvtnorm` package. The existing data may contain any combination of numerical (continuous) variables and ordered factors: the covariances involving ordered factors are estimated through either polyserial or polychoric correlations using the `polycor` package.
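For continuous variables only, the core idea described above (estimate means and covariances, then sample from the implied multivariate normal distribution) can be sketched in a few lines. This is a minimal illustration, not the `fakeData()` source: the helper name `simulateLike()` is hypothetical, and base R's `chol()` and `rnorm()` stand in for `mvtnorm::rmvnorm()` so that the example has no package dependencies. Ordered factors are left out entirely.

```r
# Hypothetical helper (not part of OpenMx or FakeData.R): simulate continuous
# data that resemble a template dataset.
simulateLike <- function(dataset, n = nrow(dataset), digits = 2) {
  dataset <- as.matrix(dataset)
  mu    <- colMeans(dataset)          # template means
  sigma <- cov(dataset)               # template covariance matrix
  # chol() returns an upper-triangular R with t(R) %*% R == sigma, so rows of
  # Z %*% R have covariance sigma when Z holds independent standard normals.
  R <- chol(sigma)
  Z <- matrix(rnorm(n * ncol(dataset)), nrow = n)
  sim <- sweep(Z %*% R, 2, mu, "+")   # add the means back, column by column
  colnames(sim) <- colnames(dataset)
  as.data.frame(round(sim, digits))
}

set.seed(1)
template <- data.frame(x = rnorm(200, mean = 10, sd = 2))
template$y <- 0.5 * template$x + rnorm(200)

fake <- simulateLike(template)
round(colMeans(template) - colMeans(fake), 2)  # differences reflect sampling variation
```

The simulated means and covariances will not match the template exactly; they differ by sampling variation, which shrinks as `n` grows.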

The options for the `fakeData()` function are discussed briefly here. The only required argument is the `dataset` argument, which specifies the dataset to be used as a template. This must be either a matrix or data frame, and any categorical variables must be declared as ordered factors (unordered factors will be identified and trigger a warning). If no other options are specified, then the simulated data will have the same sample size, variable names, level names (for ordered factors), pattern of missingness, and frequency counts for each observed category of the ordered factors. The `digits` argument affects how the randomly generated data is rounded, with a default value of two digits beyond the decimal point.

Several other arguments can be used to make the simulated data differ from the original data, though all are optional. The `n` argument allows the user to change the sample size (i.e., the number of rows) in the simulated data. Increasing this value will generally make the means and covariances in the simulated data more closely resemble those of the input data, while decreasing this value will allow for greater discrepancies between the input and simulated data due to sampling variation. The `use.names` and `use.levels` arguments specify whether the existing variable names and ordinal factor level labels will be applied to the simulated data. The `use.miss` argument specifies whether the existing missingness in the data should be preserved in the simulated data, or whether no missingness should be included. Additionally, the `mvt.method` and `het.ML` arguments pass options to the `mvtnorm` and `polycor` packages, and `het.suppress` controls the suppression of warnings from `polycor`'s `hetcor` function: the warnings can be useful for diagnosing potential problems, while suppressing them cleans up the output.

`fakeData()` was originally designed to assist OpenMx users in diagnosing errors by allowing them to share data that replicates their error without sharing their actual data. As such, this function favors speed over precision. The means and variances of the generated data are based on the univariate distributions of the input data, and the covariances are based on bivariate relationships ignoring missing data, essentially assuming that data are missing completely at random (MCAR). When data are missing at random (MAR), estimating full covariance matrices in OpenMx will give more accurate answers. When data are missing not at random (MNAR), both methods will give biased answers.
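The moment calculations described above, which ignore missing values one variable or one pair at a time, correspond to base R options along the following lines. This is a sketch of the general approach under the MCAR assumption, not the function's actual code; the toy data and variable names are invented for illustration.

```r
set.seed(2)
n <- 100
dat <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
dat$b <- dat$b + 0.6 * dat$a
dat$a[sample(n, 15)] <- NA            # impose some missingness
dat$c[sample(n, 10)] <- NA

mu    <- colMeans(dat, na.rm = TRUE)              # univariate: each column's observed values
sigma <- cov(dat, use = "pairwise.complete.obs")  # bivariate: each pair's jointly observed rows

# Pairwise-deletion matrices are not guaranteed to be positive definite;
# chol() stops with an error if this one is not.
Z   <- matrix(rnorm(n * ncol(dat)), nrow = n)
sim <- as.data.frame(sweep(Z %*% chol(sigma), 2, mu, "+"))
names(sim) <- names(dat)

# Re-impose the original pattern of missingness on the same-sized simulated data
sim[is.na(dat)] <- NA
colSums(is.na(sim))   # same counts as colSums(is.na(dat))
```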

It should be noted that the `use.levels`, `use.miss` and `n` arguments are all somewhat interdependent. When `n` is set to a value different from the sample size of the input dataset, both the distribution of the ordinal factors and the pattern of missingness in the simulated data are sampled from the input data, and thus won't exactly mirror the input. Setting `use.miss` to `FALSE` will also change the number of non-missing values for ordered factors. In both of these cases, it is possible that the simulated ordered factors will have fewer categories than the original data. When this occurs, the `use.levels` argument will be ignored and a message will be issued. The likelihood of this increases with low-frequency categories and large reductions in sample size. Likewise, the proportion of missing data will vary slightly when a value of `n` other than the observed sample size is used.
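Why a rare category can vanish is easiest to see with a small sketch. The code below assumes one common way of producing ordinal data from a normal variate: cutting it at thresholds implied by the observed category proportions. Whether `fakeData()` does exactly this internally is an assumption, and the proportions and level names are invented.

```r
set.seed(3)
lev  <- c("never", "rarely", "often", "always")
prop <- c(0.60, 0.30, 0.09, 0.01)             # "always" is a low-frequency category

# Thresholds on the standard normal scale that reproduce these proportions
thresholds <- qnorm(cumsum(prop)[-length(prop)])

drawOrdinal <- function(n) {
  z <- rnorm(n)
  cut(z, breaks = c(-Inf, thresholds, Inf), labels = lev, ordered_result = TRUE)
}

table(drawOrdinal(5000))              # large n: all four categories are very likely observed
small <- droplevels(drawOrdinal(20))  # small n: the rarest category is usually absent
nlevels(small)                        # often fewer than length(lev)
```

With an expected count of 0.2 observations in the rarest category at `n = 20`, most small samples will contain only three observed levels, so the original level labels can no longer be applied one-to-one.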

Generating data when ordered factors are present depends on the estimation of a heterogeneous correlation matrix, which allows for estimation of correlations between all combinations of numeric variables and factors. As the number of variables and the number of categories in the ordered factors increase, this estimation grows more complex and computational time increases. This estimation is responsible for the bulk of the computation in the `fakeData()` function when ordinal data is present, and may lead to excessively long processing times. If repeated datasets are desired from the same input dataset, manually executing the individual lines of the function, or using other user-written code, will prevent repeated estimation of the heterogeneous correlation matrix.
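One way to avoid the repeated estimation is to compute the moments once and reuse them for every draw. The sketch below shows the pattern with continuous data and base R only; with ordered factors, the expensive one-time step would instead be the heterogeneous correlation estimate (for example from `polycor`'s `hetcor()`). The object and function names are illustrative.

```r
set.seed(4)
template <- data.frame(x = rnorm(150), y = rnorm(150))
template$y <- template$y + 0.4 * template$x

# Expensive step, done once
mu <- colMeans(template)
R  <- chol(cov(template))

# Cheap step, repeated: each call only draws random numbers
drawOne <- function(n) {
  Z <- matrix(rnorm(n * length(mu)), nrow = n)
  out <- sweep(Z %*% R, 2, mu, "+")
  colnames(out) <- names(mu)
  as.data.frame(out)
}

datasets <- replicate(10, drawOne(150), simplify = FALSE)
length(datasets)   # 10 simulated datasets from one set of estimated moments
```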

There are several instances when the `fakeData()` function is not appropriate. You should not use this function when your data contains any of the following:

- Clustered or otherwise non-iid observations, such that the rows of the existing data are not independent.
- Non-linear relationships, specifically those that are crucial to your ensuing model, including moderation and interaction terms.
- Nominal or otherwise non-ordinal categorical data (excluding binary variables declared as ordered factors).
- Missing data assumed to be governed by the MAR or MNAR mechanism, unless accurate recovery of the underlying sample moments is not required (e.g., to replicate errors).
- Categorical data that is not declared as an ordered factor.

Generating data from the first four conditions requires more complex simulation structures than are provided by the `fakeData()` function. The fifth condition can be easily corrected using R's `factor()` function. While there are undoubtedly other ways to simulate data, the `fakeData()` function provides a relatively easy method.
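For the fifth condition, the declaration might look like the following. The variable and its level labels are made up for the example; the point is that the intended order must be given explicitly through `levels`, because otherwise R sorts the labels alphabetically.

```r
response <- c("low", "high", "medium", "low", NA, "high")

# Declare the variable as an ordered factor with an explicit level order
response <- factor(response, levels = c("low", "medium", "high"), ordered = TRUE)

is.ordered(response)   # TRUE
levels(response)       # "low" "medium" "high"
```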

While there are many ways to simulate data, the general process of simulating data can be thought of in three steps:

- Select a structure to underlie the data.
- Use random number generation to generate a sample from the assumed structure.
- Format the simulated data in whatever way is appropriate.

Selecting a structure is often the most difficult part of simulating data. When all relationships can be expressed as linear relationships, then a package like `mvtnorm` can be used to sample data from an assumed multivariate normal distribution. Model-like structures can be used as well, allowing for a variety of more complex types of data simulation. Any model that can be expressed as a series of equations can be used to simulate data, though recursive models are somewhat easier in this regard.
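As a sketch of simulating from a model written as a series of equations, consider a hypothetical recursive model in which `x` predicts `m` and both predict `y`. The path coefficients and residual standard deviations below are arbitrary illustration values. Because the model is recursive, each variable can be generated in turn from those already simulated.

```r
set.seed(5)
n <- 1000

x <- rnorm(n)                                  # exogenous variable
m <- 0.5 * x + rnorm(n, sd = 0.8)              # m = 0.5*x + e_m
y <- 0.3 * x + 0.4 * m + rnorm(n, sd = 0.7)    # y = 0.3*x + 0.4*m + e_y

simData <- data.frame(x, m, y)
est <- coef(lm(y ~ x + m, data = simData))
est   # slopes should be near 0.3 (x) and 0.4 (m), up to sampling variation
```

A non-recursive model, with feedback loops among the variables, cannot be generated one equation at a time in this way, which is why recursive models are the easier case.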

Generating data from the assumed structure is the next part of data simulation. Packages like `mvtnorm` can be used to sample from multivariate distributions, but R also includes random number generation from a wide variety of non-normal distributions. Packages like `boot` and `sampling` can be used for resampling rows from existing datasets, which is more typically used for techniques like bootstrapping. It is important to select distributional forms for your data that fit with theory, model and intended purpose.
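Base R alone covers both approaches mentioned here: the `r*` family of functions draws from many distributions, and `sample()` can resample rows without the `boot` or `sampling` packages. The distributions and parameter values below are arbitrary choices for illustration.

```r
set.seed(6)
n <- 500

simData <- data.frame(
  waitTime = rexp(n, rate = 1 / 3),            # skewed, positive (mean 3)
  events   = rpois(n, lambda = 2),             # counts
  treated  = rbinom(n, size = 1, prob = 0.4),  # binary indicator
  score    = rt(n, df = 5)                     # heavier tails than the normal
)

# Resampling rows with replacement from an existing dataset (one bootstrap sample)
bootRows   <- sample(nrow(simData), replace = TRUE)
bootSample <- simData[bootRows, ]
dim(bootSample)   # same dimensions as simData
```

Note that resampled rows are copies of real rows, so resampling an empirical dataset does not anonymize it.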

Data formatting is the final step in the process. Simulated data generally includes levels of precision far beyond what is commonly found in empirical research. Applying appropriate corrections to the simulated data, such as rounding to curtail inappropriate precision or converting variables to ordinal data, is important for creating representative simulated data.
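Reducing precision and creating ordinal data might look like this in base R. The cut points and labels are arbitrary and would normally come from the measure being imitated.

```r
set.seed(7)
raw <- rnorm(300, mean = 50, sd = 10)   # simulated scores at full double precision

# Curtail precision to what an instrument would plausibly record
scoreRounded <- round(raw, 1)

# Convert a continuous variate into a five-category ordered factor
likert <- cut(raw,
              breaks = c(-Inf, 35, 45, 55, 65, Inf),
              labels = c("1", "2", "3", "4", "5"),
              ordered_result = TRUE)

head(scoreRounded)
table(likert)
```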

Sharing your data with other researchers is an important part of the scientific process. It is also important to the OpenMx project, as empirical examples provide both realistic tests of the software and interesting examples for other users to use and learn from.

National Institutes of Health: http://grants.nih.gov/grants/policy/data_sharing/

National Science Foundation: http://www.nsf.gov/pubs/2001/gc101/gc101rev1.pdf

Wikipedia entry (includes summaries of the above NIH and NSF statements): http://en.wikipedia.org/wiki/Data_sharing

Attachment | Size |
---|---|
FakeData.R (http://openmx.psyc.virginia.edu/sites/default/files/FakeData_1.R) | 4.9 KB |