Today, we'll look at how a sample can be used to estimate parameters of a population and how much confidence we can put in that number. We'll be able to understand what CBS News means when they say
47 percent in [battleground] states say they will vote for Bush, and 45 percent for Kerry, well within the poll's margin of error.
What do they mean by the "margin of error"? Isn't 47 bigger than 45? Why not just say that Bush is in the lead?
When we run a simulation or conduct a poll of public opinion, we are sampling. We might be calculating the mean of our sample, or the median, or the percentage. What we do with it is, often, to use it to say something about the population. This is often called a point estimate, since we're estimating a point, such as the mean or the percentage.
Point estimates are, by themselves, useful. Suppose you are thinking of buying a house, and you're wondering how much you'll have to spend on various repairs and so forth. If you're particularly geeky, you might build a model like this:
(There are some interesting new twists in this model that are worth looking at. For example, notice that we can get the sum of all the costs so far without using a holding tank. How?)
Each time you run this model, you get a sample of the mean repair cost over the course of 50 years. Of course, you get a different number each time. So, of course, we would run it many times and take the average. But what do we know about that average?
Intuitively, the more times we run the simulation, the more faith we would be willing to put in the average that we calculate. (Assuming we believe the model, which is a different question.) That's because it seems less likely that there was something unusual about the sample (the runs we collected). Even if a few of them were "fluky," the other runs should wash out the unusual runs.
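Since the model itself is shown only as a diagram, here is a minimal Python sketch of the same idea. The cost process (routine monthly upkeep plus occasional large repairs, with made-up numbers) is entirely hypothetical; the point is just that each run yields one sample of the mean monthly cost:

```python
import random

def run_model(months=600, seed=None):
    """One run of a hypothetical repair-cost model: return the mean
    monthly repair cost over 50 years (600 months)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(months):
        cost = max(rng.gauss(50, 15), 0)   # routine upkeep (hypothetical numbers)
        if rng.random() < 0.02:            # occasional major repair
            cost += rng.uniform(500, 2000)
        total += cost
    return total / months

# Each run is one sample; we average many runs for a point estimate.
samples = [run_model(seed=i) for i in range(12)]
estimate = sum(samples) / len(samples)
print(estimate)
```

Running it a dozen times and averaging, as above, is exactly the "run it many times and take the average" strategy; the question of how much to trust that average is what follows.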
We can make this a little more precise, which is the purpose of confidence intervals.
Here's an old joke:
A guy notices a bunch of targets scattered over a barn wall, and in the center of each, in the "bulls-eye," is a bullet hole. "Wow," he says to the farmer, "that's pretty good shooting. How'd you do it?" "Oh," says the farmer, "it was easy. I painted the targets after I shot the holes."
Confidence intervals are a little like that. After we make a point estimate (a bullet hole), we draw a target (an interval) around the point and state the probability that the real objective lies inside it (in the joke, of course, there was no real objective at all). The wider the target, the greater the probability, as you'd expect. Also, the more accurate the shooting, the greater the probability.
To avoid misleading you, we should re-phrase the joke. Suppose there's only one target painted on the barn, and all of the bullet holes are within it, even though they're not all in the center. The center is defined by the mean of the bullet holes. The size of the target is defined by the variance of the collection of bullet holes. It's still pretty good shooting, and the better the shooting, the smaller the target needs to be to encompass them all. We can then estimate the probability that the real objective is in the target area.
For the reasons we discussed in the last section, the accuracy of the shooting depends on how many samples we took and on how variable they are.
Let's be concrete. Suppose we run the home repair cost model a dozen times and get the following estimates:
55.51, 57.17, 59.19, 58.82, 56.00, 56.90, 57.41, 61.13, 58.88, 55.75, 56.79, 55.94
The summary stats are as follows:
mean = 57.46
s.d. = 1.71
n = 12
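These summary statistics are easy to reproduce; in Python, for example:

```python
import statistics

# The dozen runs of the home repair cost model.
runs = [55.51, 57.17, 59.19, 58.82, 56.00, 56.90,
        57.41, 61.13, 58.88, 55.75, 56.79, 55.94]

mean = statistics.mean(runs)   # x-bar
sd = statistics.stdev(runs)    # sample s.d., with n-1 in the denominator
n = len(runs)

print(round(mean, 2), round(sd, 2), n)   # 57.46 1.71 12
```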
The mean of our sample, 57.46, is an estimate of the real monthly cost of home repair in this model. To put this in our statistical language,
the sample average is an estimate of the population average, μ
What's the probability of the population average being within, say, fifty cents of the sample average (x¯)? In general, we can ask for
the probability that μ is in the interval [x¯ - s/sqrt(n), x¯ + s/sqrt(n)], that is, within one standard error of x¯.
Let's take some time to understand this last statement:
If we remember the CLT, we know that the sampling distribution of the mean is Gaussian with a standard deviation that we can estimate from our sample. What this means is that the mean of the sampling distribution, also known as the expected value of x¯, is μ. That's good. Also, the standard deviation of the sampling distribution is:
σx¯ = σ/sqrt(n)
σx¯ "=" s/sqrt(n) (the quotation marks around the equals sign remind us that we are estimating the unknown σ with the sample standard deviation s)
In our case, the s.d. of the sampling distribution is 1.71/sqrt(12) = 0.49.
Invoking the CLT, that means that:
Pr(μ is in 57.46 ± 0.49) = 0.68
Pr(μ is in [56.97,57.95]) = 0.68
Traditionally, point estimates are given with, say, a 95 percent confidence interval. Note that this confidence interval is a two-tailed situation! So, we need to find the appropriate number of standard deviations so that 2.5 percent is in the upper tail and 2.5 percent is in the lower tail.
Pr( abs(z) < 1.96 ) = 0.95, where z is a standard Gaussian (mean 0, s.d. 1)
Pr(μ is in 57.46 ± 0.49*1.96) = 0.95
Pr(μ is in 57.46 ± 0.96) = 0.95
Pr(μ is in [56.50,58.42]) = 0.95
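The whole calculation can be checked in a few lines; the inverse of the standard Gaussian CDF at 0.975 is where the 1.96 comes from. (Carrying the unrounded standard error through gives [56.49, 58.43] rather than the hand-rounded [56.50, 58.42].)

```python
from math import sqrt
from statistics import NormalDist

mean, s, n = 57.46, 1.71, 12
se = s / sqrt(n)                     # standard error of the mean, ~0.49

z = NormalDist().inv_cdf(0.975)      # ~1.96: leaves 2.5% in each tail
half = z * se

print(round(mean - half, 2), round(mean + half, 2))   # 56.49 58.43
```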
To demonstrate what this means, we need to know the true mean. Let's go back to our CLT model and use it to get some insight into the meaning of a confidence interval.
This involves some subtlety, so after you've had a chance to think about it for a bit, let's look at a solution together:
Note that we can improve the efficiency by just having the equation block calculated at the end of the run.
Because the CLT only applies to sums and means, when statisticians wanted to put a confidence interval on a median, they were out of luck. That is, they were out of luck until the bootstrap was invented. Let's see how this works.
Suppose you had sampled from a distribution you knew was skewed (like incomes or house prices or something like that) and so you are trying to estimate the median of that population. You sample the data and you take the median of the sample as an estimate of the median of the population. You'd like to put a confidence interval around that estimate, but how?
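The bootstrap answer is simple enough to sketch: resample the data with replacement many times, recompute the median of each resample, and read the confidence interval off the percentiles of those medians. The data below are hypothetical, drawn from a skewed (lognormal) distribution to stand in for something like house prices:

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed sample (say, house prices in thousands of dollars).
data = [rng.lognormvariate(5, 0.5) for _ in range(100)]

# Percentile bootstrap: 2000 resamples (with replacement), a median for each.
medians = sorted(
    statistics.median(rng.choices(data, k=len(data)))
    for _ in range(2000)
)

lo, hi = medians[50], medians[1949]      # the middle 95% of the medians
print(round(lo, 1), round(statistics.median(data), 1), round(hi, 1))
```

No CLT required: the spread of the resampled medians plays the role that s/sqrt(n) played for the mean.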
This is the real reason that statisticians are excited by the bootstrap. They can now give confidence intervals on all kinds of things that they couldn't before, such as the median.
Proportions, such as the polling data we began with, are essentially binomial processes. As "n" gets large for a binomial distribution, it looks increasingly like a Gaussian distribution. (Test it!) More precisely, it looks like a Gaussian distribution with:
μ = np
σ^2 = np(1-p)
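The "(Test it!)" invitation is easy to take up by simulation, here with the poll's n = 1100 and p = 0.47 (each binomial draw is built by summing coin flips):

```python
import random
import statistics
from math import sqrt

rng = random.Random(1)
n, p = 1100, 0.47

# 500 binomial(n, p) draws, each the sum of n Bernoulli trials.
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(500)]

print(statistics.mean(draws))     # near np = 517
print(statistics.stdev(draws))    # near sqrt(np(1-p)), about 16.6
```

A histogram of the draws would look convincingly bell-shaped as well.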
This means that if we count "x" of "n" things satisfying some property, the following expression has an approximately Gaussian distribution:
z = (x - np)/sqrt[np(1-p)]
Thus, a confidence interval for a proportion is:
x/n ± z_(α/2) * sqrt[p(1-p)/n], where z_(α/2) is the Gaussian critical value (1.96 for 95 percent confidence)
We can use these values to estimate the error of the results of a poll. For example, suppose CBS polled 1100 people and 517 said they supported Bush and 495 said they supported Kerry. With 95% confidence, we can say the proportion supporting Bush is in the following confidence interval:
517/1100 ± 1.96 * sqrt[p(1-p)/n]
We estimate p using the sample proportion, x/n, namely 517/1100 = 0.47.
0.47 ± 1.96 * sqrt[0.47*0.53/1100]
0.47 ± 1.96 * 0.015
0.47 ± 0.0294
0.47 ± 0.03
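The same arithmetic in a couple of lines:

```python
from math import sqrt

n, x = 1100, 517
p_hat = x / n                                   # 0.47, the sample proportion
moe = 1.96 * sqrt(p_hat * (1 - p_hat) / n)      # the margin of error

print(round(p_hat, 2), "+/-", round(moe, 2))    # 0.47 +/- 0.03
```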
This is why they almost always say that the margin of error is about 3 percentage points, and why they almost always poll about 1100 people.
So, when you read that a poll says the margin of error on a percentage is 3 percentage points, you know that it's some kind of confidence interval on that estimate.
This work is licensed under a Creative Commons License