select sample

asayin
Joined: 11/24/2013

I have a real data set consisting of 6148 students. There are six observed variables and two latent variables. I calculated the covariance matrix from the full data set of 6148 students. I want to select the data of 100 students whose covariance matrix is the same as that of the whole population. I would be very grateful for any help with this.

mhunter
Joined: 07/31/2009

Any randomly selected sub-sample will have a covariance matrix that is "pretty close" to the population covariance matrix. They will only differ due to sampling variability.

What do I mean by "pretty close"? I mean that the sampling variability of the covariance matrix is not large. A popular result from undergraduate statistics is that the sampling distribution of the mean has a mean of mu and a variance of v/N, where mu is the population mean, v is the population variance, and N is the sample size. Similarly, for normally distributed data, the sampling distribution of the variance has a mean of v and a variance of 2*v*v/(N-1). If 2*v/(N-1) is less than 1.0, then the variance of the sampling distribution of the variance is smaller than the population variance. There are similar results for multiple variables.
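
A quick simulation can make these two formulas concrete. This is only a sketch: it assumes normally distributed scores (the 2*v*v/(N-1) result depends on that), and the values v = 4, N = 100 and 2000 replications are made up for illustration, not taken from the data in this thread.

```python
import random
import statistics

def sample_variances(v, n, reps, seed=1):
    """Draw `reps` normal samples of size `n` with variance `v`
    and return the sample variance (N-1 denominator) of each one."""
    rng = random.Random(seed)
    sd = v ** 0.5
    return [
        statistics.variance([rng.gauss(0.0, sd) for _ in range(n)])
        for _ in range(reps)
    ]

# Illustrative values only (assumptions, not from the original data).
v, n, reps = 4.0, 100, 2000
s2 = sample_variances(v, n, reps)

# The mean of the sample variances should sit near v,
# and their variance near 2*v*v/(N-1).
print("mean of sample variances:    ", statistics.mean(s2), " theory:", v)
print("variance of sample variances:", statistics.variance(s2),
      " theory:", 2 * v * v / (n - 1))
```

With N = 100 the theoretical variance of the sample variance is 2*16/99, roughly 0.32, which is much smaller than the population variance of 4.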

The take-home message is that the sample covariance matrix from a random sample is almost always close enough to the population covariance matrix. You have a population covariance matrix, so take any random sample and the covariance should be sufficiently close to the population covariance.

asayin
Joined: 11/24/2013

Thanks for your answer.

neale
Joined: 07/31/2009
Not any random sample, but why?

I agree with the statistical remarks made by mhunter. However, not all random samples of 100 will have a covariance matrix close to that of the whole sample. Some will be quite different. One could in principle keep randomly sampling, measure each sample's distance from the whole-sample covariance matrix, and keep track of the best set found so far. There are a great many possible subsets (on the order of 10^220 ways to choose 100 students from 6148), so this could take some time if systematically attempted.
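
The "keep sampling and keep the best set" idea can be sketched as below. Everything specific here is an assumption for illustration: the data are synthetic stand-ins for the 6148 students, the distance is the Frobenius norm of the difference between the two covariance matrices (other distance measures would work too), and the function names and the 200 draws are arbitrary.

```python
import random

def cov_matrix(rows):
    """Sample covariance matrix (N-1 denominator) of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

def distance(s1, s2):
    """Frobenius distance between two matrices of the same shape."""
    return sum((x - y) ** 2 for r1, r2 in zip(s1, s2) for x, y in zip(r1, r2)) ** 0.5

def best_subsample(data, size, draws, seed=0):
    """Try `draws` random subsamples; return (row indices, distance) of the closest."""
    rng = random.Random(seed)
    target = cov_matrix(data)
    best_idx, best_dist = None, float("inf")
    for _ in range(draws):
        idx = rng.sample(range(len(data)), size)
        d = distance(cov_matrix([data[i] for i in idx]), target)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist

# Synthetic stand-in data: 6148 rows of six correlated scores (not real data).
rng = random.Random(42)
data = []
for _ in range(6148):
    f = rng.gauss(0, 1)  # a shared factor induces correlations among the six scores
    data.append([0.7 * f + rng.gauss(0, 0.7) for _ in range(6)])

idx, dist = best_subsample(data, size=100, draws=200)
print(len(idx), "rows selected; distance from full-sample covariance:", round(dist, 3))
```

More draws can only improve the best distance found, but a search like this covers a vanishingly small share of the possible subsets, so an exact match should not be expected.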

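It would be possible to simulate data consistent with the population covariance matrix, using mvrnorm() from the R package MASS with empirical=TRUE. With that argument the simulated sample reproduces the supplied covariance matrix exactly rather than only in expectation, although the resulting rows are simulated values, not the records of actual students.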
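
To show what "exactly matching" simulated data involves, here is a rough Python analogue. It is not the MASS code: it is a sketch that whitens centered random normal draws and then recolors them using a Cholesky factor, and the 3x3 matrix sigma is a made-up example rather than the covariance matrix from this thread. It needs more rows than variables so that the draws' own covariance matrix is positive definite.

```python
import random

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive-definite matrix."""
    p = len(a)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s ** 0.5 if i == j else s / L[j][j]
    return L

def cov_matrix(rows):
    """Sample covariance matrix (N-1 denominator) of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

def simulate_exact(sigma, n, seed=0):
    """Return n zero-mean rows whose sample covariance equals sigma up to rounding."""
    p = len(sigma)
    rng = random.Random(seed)
    z = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    means = [sum(r[j] for r in z) / n for j in range(p)]
    z = [[r[j] - means[j] for j in range(p)] for r in z]
    Lz = cholesky(cov_matrix(z))
    # Whiten: solve Lz * w = row by forward substitution, so cov(w) is the identity.
    white = []
    for r in z:
        w = []
        for i in range(p):
            w.append((r[i] - sum(Lz[i][k] * w[k] for k in range(i))) / Lz[i][i])
        white.append(w)
    # Recolor: x = L * w gives cov(x) = L * I * L^T = sigma.
    L = cholesky(sigma)
    return [[sum(L[i][k] * w[k] for k in range(i + 1)) for i in range(p)] for w in white]

# Made-up positive-definite target matrix (an assumption for illustration).
sigma = [[4.0, 2.0, 1.0],
         [2.0, 3.0, 0.5],
         [1.0, 0.5, 2.0]]
x = simulate_exact(sigma, n=100)
s = cov_matrix(x)
print("largest absolute deviation from sigma:",
      max(abs(s[i][j] - sigma[i][j]) for i in range(3) for j in range(3)))
```

The printed deviation should be at the level of floating-point rounding, which is the sense in which such a sample "has the same covariance matrix" as the target.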

However, such efforts raise the question of why one would want to do this. At the end of the day, for covariance matrix model fitting, simply fitting the model to the whole-sample covariance matrix and changing the stated sample size to 100 would do exactly what is desired.