Standard Error, CLT

Let's turn to statistics and one of the most amazing theorems in the field, namely the Central Limit Theorem. This is a theorem about what happens as things go to infinity, so we can use simulation to test the ideas and aid our understanding.

The Standard Error of the Mean

Consider a very simple distribution, like a six-sided die. The probability of a 1 is 1/6. Suppose we roll it four times and take the mean. What's the probability of getting a 1 as the mean? Much smaller than 1/6. (What is it actually?) If we roll it eight times instead, the chance that the mean will be 1 is even smaller. The more rolls we include in the mean, the smaller the chance of getting a 1.

Using 1 for the sample mean made the idea easier, but we could just as easily ask what the probability of having a sample mean of 2 is. Again, the probability of a sample mean equalling 2 for samples of size four is less than the probability of rolling a 2 on a single roll. And the probability of that sample mean is even less if the sample size is eight.
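If you'd like to check these claims exactly rather than take them on faith, here is a minimal Python sketch. It assumes a fair six-sided die and counts the ways each total can occur; the function names sum_counts and prob_mean_equals are made up for this illustration.

```python
from fractions import Fraction


def sum_counts(n_rolls, sides=6):
    """Count the ways each total can occur when rolling n_rolls fair dice."""
    counts = {0: 1}
    for _ in range(n_rolls):
        new_counts = {}
        for total, ways in counts.items():
            for face in range(1, sides + 1):
                new_counts[total + face] = new_counts.get(total + face, 0) + ways
        counts = new_counts
    return counts


def prob_mean_equals(target, n_rolls, sides=6):
    """Exact probability that the mean of n_rolls fair dice equals target."""
    counts = sum_counts(n_rolls, sides)
    # A mean of `target` over n_rolls rolls is a total of target * n_rolls.
    return Fraction(counts.get(target * n_rolls, 0), sides ** n_rolls)


for n in (1, 4, 8):
    for target in (1, 2):
        p = prob_mean_equals(target, n)
        print(f"rolls={n}  P(mean={target}) = {p}  (about {float(p):.6f})")
```

For each target, the printed probabilities should shrink as the number of rolls grows, which is the point of the two paragraphs above.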

The general point is that

The distribution of the sample mean is less variable than the distribution you're sampling from. The more values in the sample, the less variable the distribution of the mean.

Sampling Distribution

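There's an important concept and related terminology that can be very confusing until you get the hang of it. We are distinguishing between two distributions: the original distribution that we draw samples from, and the distribution of a statistic (such as the mean) computed on those samples.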

The latter distribution is called the sampling distribution: the distribution of some statistic computed on samples of size N. Here, we're using the word "statistic" in the sense of a function of some set of data, such as the mean, median, or mode. Today, we're pretty much always talking about the mean of the sample, but in principle we could talk about the sampling distribution of the median.

The idea of a sampling distribution is illustrated by the following picture:

The sampling distribution is a distribution made of statistics on samples from some other distribution.

A Quantitative Statement

We observed that the probability of certain values for the mean depends on the sample size. As it turns out, we can be more precise. We can quantify how the variance (of the sampling distribution of the mean) decreases with increasing sample size. Let N be the sample size. The variance of the sampling distribution of the mean of samples of size N is

VAR(mean) = VAR(orig)/N

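To find out how the standard deviation changes, just take the square root: the standard deviation of the mean is the original standard deviation divided by the square root of N.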
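We can test this formula by simulation. The following is a rough sketch, not a proof: it assumes the original distribution is a fair six-sided die (whose variance is exactly 35/12), and the function name simulate_mean_variance, the number of repetitions, and the seed are arbitrary choices made for this illustration.

```python
import random
import statistics

DIE_VARIANCE = 35 / 12  # exact variance of one roll of a fair six-sided die


def simulate_mean_variance(sample_size, n_samples=20000, seed=0):
    """Estimate the variance of the mean of sample_size fair-die rolls."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.randint(1, 6) for _ in range(sample_size))
        for _ in range(n_samples)
    ]
    return statistics.pvariance(means)


for n in (1, 4, 8, 25):
    simulated = simulate_mean_variance(n)
    predicted = DIE_VARIANCE / n  # VAR(mean) = VAR(orig)/N
    print(f"N={n:2d}  simulated={simulated:.4f}  predicted={predicted:.4f}")
```

The simulated and predicted columns should agree to within a few percent; the match gets tighter if you raise n_samples.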

Example

Suppose we measure the heights of 25 Wellesley women, in inches rounded off, and we get the following numbers:

60 60 60 61 61 62 62 62 62 62 62 63 63 63 63 63 64 64 65 65 66 66 67 68 70

The mean of this data is 63.36. We could consider that number, 63.36, as a sample from another distribution: a distribution of means of samples of size 25. How variable is that distribution? If we knew the variance of the real population, we could just divide its variance by 25. Alas, in real life we typically don't know it. (Just between us, I used a standard deviation of 3 in generating this data.)

We do know the variance of the sample, though, and that is an estimate of the variance of the population. The variance of this sample is about 6.5 (s.d. = 2.55). Therefore, we can estimate the variance of the sampling distribution as 6.5/25 or 0.26. The standard deviation is 2.55/5 or 0.51.

The standard deviation of the distribution of the mean goes by a special name: the standard error. It's called that because it lets us know that, if we ran this experiment again, we'd be very likely to get a value for the mean that is close to 63.36. More precisely, the "real" mean is probably 63.36 ± 0.51. (We'll firm this up in a later lecture.)
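The arithmetic in this example can be reproduced in a few lines of Python. This is just a sketch using the standard statistics module; note that statistics.variance divides by n - 1 (the usual sample variance), which is what gives the figure of roughly 6.5 quoted above.

```python
import math
import statistics

heights = [60, 60, 60, 61, 61, 62, 62, 62, 62, 62, 62, 63, 63,
           63, 63, 63, 64, 64, 65, 65, 66, 66, 67, 68, 70]

n = len(heights)
mean = statistics.mean(heights)
sample_variance = statistics.variance(heights)  # divides by n - 1
sample_sd = math.sqrt(sample_variance)

# Estimated variance and standard deviation of the sampling distribution.
variance_of_mean = sample_variance / n
standard_error = sample_sd / math.sqrt(n)

print(f"n = {n}, mean = {mean:.2f}")
print(f"sample variance = {sample_variance:.2f}, s.d. = {sample_sd:.2f}")
print(f"variance of the mean = {variance_of_mean:.2f}")
print(f"standard error = {standard_error:.2f}")
```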

A Model

We can see the standard error idea at work using the following simulation model:

CLT2

Try the following

The Central Limit Theorem

As we look at sampling distributions of means, we keep seeing the same shape: it usually looks like a Gaussian distribution. This is not an accident. There is a bizarre and wonderful theorem called the Central Limit Theorem that is pivotal in the history of statistics.

The statement of the theorem is pretty simple:

Consider the sum, S, of N independent numbers drawn from any distribution or set of distributions (strictly speaking, the distributions need finite variance). As N goes to infinity, the distribution of S becomes a Gaussian distribution. Furthermore, the distribution of the mean S/N also becomes a Gaussian.

In even simpler language, the CLT means that the distribution of sums and means is roughly Gaussian, as long as the sample is "large enough."

This is amazing, but true, and it helps to explain why so many things in the real world are Gaussian: they are the net result of many small, independent factors whose effects add up.

The CLT in Action

We can use the same model that we used for the standard error to see the CLT at work. Just change the original distribution to anything you want.

  1. Does the sampling distribution of the mean still look Gaussian?
  2. Try small sample sizes
  3. Try large sample sizes
  4. When does it start to look reasonably Gaussian? How good is the rule of thumb?
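If you don't have the model handy, the same experiment can be approximated in Python. This is a rough sketch under assumptions of my own: the original distribution is an exponential (very skewed, nothing like a Gaussian), and "looks Gaussian" is judged by two crude checks, the skewness (0 for a Gaussian) and the fraction of values within one standard deviation of the mean (about 0.68 for a Gaussian). The helper names are made up for this illustration.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples=10000, seed=1):
    """Draw n_samples samples of the given size and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Sample skewness: 0 for a symmetric shape such as a Gaussian."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sd) ** 3 for v in values)


def fraction_within_one_sd(values):
    """Fraction of values within one s.d. of the mean (Gaussian: about 0.68)."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mu) <= sd for v in values) / len(values)


def draw_exponential(rng):
    """One draw from the original distribution; swap in anything you like."""
    return rng.expovariate(1.0)


for n in (1, 5, 30, 100):
    means = sample_means(draw_exponential, n)
    print(f"N={n:3d}  skewness={skewness(means):.2f}  "
          f"within one s.d.={fraction_within_one_sd(means):.3f}")
```

To try a different original distribution, replace draw_exponential with any function that takes a random.Random object and returns one number.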

Key Caveats

It's easy to get carried away with the CLT. It often applies, but not always. Keep the following in mind:

Independence: First of all, the numbers must be independent. The reason for this is that an important part of the theorem has to do with how the different sums can be achieved. Think of our old two-dice example. A sum of 7 is more likely than a sum of 2 because there are more ways to produce a 7. But in order to get a 7, it has to be possible to match up a 1 with a 6, a 2 with a 5, and so forth. If the dice aren't independent (as an extreme example, suppose they're built so that they always have the same value), then those matchings aren't possible.
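The two-dice argument can be checked by simply listing the possibilities. The sketch below assumes fair dice and compares two independent dice with the extreme "locked" pair that always shows the same value; the variable names are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

# Independent dice: all 36 (a, b) pairs are equally likely.
independent = Counter(a + b for a in faces for b in faces)
# Locked dice: the second die always copies the first, so only 6 outcomes.
locked = Counter(a + a for a in faces)


def to_probabilities(counts):
    """Turn counts of equally likely outcomes into exact probabilities."""
    total = sum(counts.values())
    return {s: Fraction(ways, total) for s, ways in sorted(counts.items())}


p_independent = to_probabilities(independent)
p_locked = to_probabilities(locked)

print("sum  independent  locked")
for s in range(2, 13):
    print(f"{s:3d}  {str(p_independent.get(s, 0)):>11}  "
          f"{str(p_locked.get(s, 0)):>6}")
```

With independent dice the probabilities pile up in the middle, around 7. With locked dice a 7 is impossible and every even sum is equally likely, so there is no pile-up at all.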

Additivity: You'll notice I emphasize that these different factors must have an additive effect. Some things are not just additive, they are synergistic. For example:

The CLT is based on the factors being additive. Many things are additive, or mostly additive, but not everything.

Continuing on the theme of additivity, the CLT is about sampling distributions of sums and averages of samples drawn from populations, not about populations themselves, even if the populations are sums or averages. Suppose I ask each Wellesley student for the average number of shoes she buys in a year. That's 2200 data points, each of which is an average, but the CLT doesn't apply, because we're not summing or averaging these data, we're just writing them down. If some "clothes horse" buys 50 or 100 pairs of shoes a year, there's no point at which this large number gets summed or averaged with other numbers so that it "averages out" to a more middling value. It's still sticking out there as an outlier. However, if we call up N students, ask each how many shoes she buys in a year, and compute the average of that sample, *then* the CLT applies (as long as N is reasonably big).

Infinity: The theorem only says that the limit as we go to infinity is the Gaussian. Anything short of an infinitely large sample (in other words, anything in real life) is an approximation. If life and death depend on some probability and you are estimating the probability by assuming the CLT applies and that the distribution is Gaussian, keep this in mind. For most purposes, though, an approximation is good enough. (Besides, there's no good alternative. Even the bootstrap, a statistical technique that we will discuss soon, doesn't reveal absolute truth.)

Counterexamples

Here are some distributions that aren't Gaussian.

(FYI, these are all situations where small numbers are common and big numbers are rare. In fact, they fit another distribution called the "Power Law," "Zipf's law," or "Pareto distribution": if you double the number, you cut the probability by a fixed factor (in half, in the simplest case). For more information, see Zipf, Power-law, Pareto. There's also a wonderful book I read recently about this called Ubiquity by Mark Buchanan.)

Importance of the CLT

Because of the CLT, we can use the Z test even if the things we are sampling (SAT scores, queue waiting times, whatever) aren't Gaussian. As long as we take a large enough sample, and we are comparing means, the CLT gives us confidence that the sampling distribution will be roughly Gaussian.

How big is big enough? A rule of thumb quoted in many statistics books is 30.

This work is licensed under a Creative Commons License.