Today, we'll look at how a sample can be used to estimate parameters of a population and how much confidence we can put in that number. We'll be able to understand what CBS News means when they say

47 percent in [battleground] states say they will vote for Bush, and 45 percent for Kerry, well within the poll's margin of error.

What do they mean by the "margin of error"? Isn't 47 bigger than 45? Why not just say that Bush is in the lead?

When we run a simulation or conduct a poll of public opinion, we are
sampling. We might be calculating the mean of our sample, or the median,
or the percentage. What we do with it is, often, to use it to say
something about the *population*. This is often called a **point
estimate**, since we're estimating a point, such as the mean or the
percentage.

Point estimates are, by themselves, useful. Suppose you are thinking of buying a house, and you're wondering how much you'll have to spend on various repairs and so forth. If you're particularly geeky, you might build a model like this:

(There are some interesting new twists in this model that are worth looking at. For example, notice that we can get the sum of all the costs so far without using a holding tank. How?)

Each time you run this model, you get a sample of the mean repair cost over the course of 50 years. Of course, you get a different number each time. So, of course, we would run it many times and take the average. But what do we know about that average?
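The model itself isn't shown here, but the idea can be sketched in Python. Everything below is an assumption for illustration: the yearly repair costs are drawn from a skewed (lognormal) distribution, which is *not* necessarily the distribution in the original model.

```python
import random

def mean_repair_cost(years=50, seed=None):
    """One run of a hypothetical repair-cost model.

    Each year's cost is drawn from a skewed distribution (most years
    cheap, a few expensive). The lognormal parameters are illustrative
    assumptions, not taken from the text's model.
    """
    rng = random.Random(seed)
    costs = [rng.lognormvariate(4.0, 0.3) for _ in range(years)]
    return sum(costs) / years  # mean yearly cost over this run

# Each call is one "run" of the model; repeated runs give a sample of means.
sample = [mean_repair_cost(seed=i) for i in range(12)]
```

Running it a dozen times gives twelve different means, which is exactly the situation described next.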

Intuitively, the more times we run the simulation, the more faith we would be willing to put in the average that we calculate. (Assuming we believe the model, which is a different question.) That's because it seems less likely that there was something unusual about the sample (the runs we collected). Even if a few of them were "fluky," the other runs should wash out the unusual runs.

We can make this a little more precise, which is the purpose of confidence intervals.

Here's an old joke:

A guy notices a bunch of targets scattered over a barn wall, and in the center of each, in the "bulls-eye," is a bullet hole. "Wow," he says to the farmer, "that's pretty good shooting. How'd you do it?" "Oh," says the farmer, "it was easy. I painted the targets after I shot the holes."

**Confidence intervals** are a little like that. After we make a
point estimate (a bullet hole), we are going to draw a target (an
interval) around the point and state the probability that the real
objective (the real objective was missing in the joke) is in the target
area. The wider the target, the greater the probability, as you'd expect.
Also, the more accurate the shooting, the greater the probability.

To avoid misleading you, we should re-phrase the joke. Suppose there's
only one target painted on the barn, and all of the bullet holes are
within it, even though they're not all in the center. The center is
defined by the *mean* of the bullet holes. The size of the target
is defined by the *variance* of the collection of bullet holes.
It's still pretty good shooting, and the better the shooting, the smaller
the target needs to be to encompass them all. We can then estimate the
probability that the *real objective* is in the target area.

For the reasons we discussed in the last section, the accuracy of the
shooting depends on how many samples we took and on how *variable*
they are:

- If they're all over the place, there's no reason to be very confident in their mean.
- If they're concentrated in one area, there's good reason to think that the real objective is in that area.

Let's be concrete. Suppose we run the home repair cost model a dozen times and get the following estimates:

55.51, 57.17, 59.19, 58.82, 56.00, 56.90, 57.41, 61.13, 58.88, 55.75, 56.79, 55.94

The summary stats are as follows:

mean = 57.46

s.d. = 1.71

n = 12
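These summary statistics can be reproduced directly from the twelve estimates, using the standard library's `statistics` module (note that `stdev` is the *sample* standard deviation, with `n - 1` in the denominator):

```python
import statistics

data = [55.51, 57.17, 59.19, 58.82, 56.00, 56.90,
        57.41, 61.13, 58.88, 55.75, 56.79, 55.94]

mean = statistics.mean(data)   # approximately 57.46
sd = statistics.stdev(data)    # sample s.d., approximately 1.71
n = len(data)                  # 12
```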

The mean of our sample, 57.46, is an **estimate** of the real
monthly cost of home repair in this model. To put this in our statistical
language,

the sample average, x¯, is an estimate of the population average, μ.

What's the probability of the population average being within, say, fifty cents of the sample average (x¯)? In general, we can ask for

the probability that μ is in the interval [x¯ - σ_{x¯}, x¯ + σ_{x¯}],

where σ_{x¯} is the standard deviation of the sampling distribution of the mean (the "standard error").

Let's take some time to understand this last statement:

- In what sense is this a probability? Either μ is in that range or it's not!
- What does the statement really mean? Let's draw some pictures to help us understand.

If we remember the CLT, we know that the sampling distribution of the mean is Gaussian with a standard deviation that we can estimate from our sample. What this means is that the mean of the sampling distribution, also known as the expected value of x¯, is μ. That's good. Also, the standard deviation of the sampling distribution is:

σ_{x¯} = σ/sqrt(n)

σ_{x¯} ≈ s/sqrt(n)

In our case, the s.d. of the sampling distribution is 1.71/sqrt(12) = 0.49.

Invoking the CLT, that means that:

Pr(μ is in 57.46 ± 0.49) = 0.68

Pr(μ is in [56.97,57.95]) = 0.68

Traditionally, point estimates are given with, say, a 95 percent
confidence interval. Note that this confidence interval is a
**two-tailed** situation! So, we need to find the appropriate number
of standard deviations so that 2.5 percent is in the upper tail and 2.5
percent is in the lower tail.

Pr(abs(z) < 1.96) = 0.95

Therefore,

Pr(μ is in 57.46 ± 0.49*1.96) = 0.95

Pr(μ is in 57.46 ± 0.96) = 0.95

Pr(μ is in [56.50,58.42]) = 0.95
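The 95 percent interval above can be computed in a few lines. (The endpoints come out at about [56.49, 58.43] when computed at full precision; the text gets [56.50, 58.42] because it rounds the standard error to 0.49 before multiplying.)

```python
import math

mean, sd, n = 57.46, 1.71, 12
se = sd / math.sqrt(n)          # standard error, ~0.49

# 95% confidence interval: mean ± 1.96 standard errors
lo = mean - 1.96 * se
hi = mean + 1.96 * se
```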

To demonstrate what this means, we need to know the true mean. Let's go back to our CLT model and use it to get some insight into the meaning of a confidence interval.

- Add blocks to the simulation to calculate a 90 percent confidence interval (use z = 1.645, but why?) and determine whether the mean is within that confidence interval.
- Run the simulation 1000 times. What percentage of the runs have the mean within the confidence interval?

This involves some subtlety, so after you've had a chance to think about it for a bit, let's look at a solution together:

Note that we can improve the efficiency by just having the equation block calculated at the end of the run.
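The same experiment can be sketched without the simulation tool. The "true" mean and standard deviation below are invented for illustration; the point is the coverage count. (With n as small as 12 and z = 1.645 instead of the t critical value, the observed coverage runs a little below 90 percent, which is part of the subtlety.)

```python
import math
import random

rng = random.Random(0)
true_mu, sigma, n = 57.5, 1.7, 12   # assumed "true" model parameters
z = 1.645                            # 90% two-tailed critical value
runs = 1000

hits = 0
for _ in range(runs):
    sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    half = z * s / math.sqrt(n)      # half-width of the 90% CI
    if xbar - half <= true_mu <= xbar + half:
        hits += 1

coverage = hits / runs               # close to 0.90, typically a bit under
```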

Because the CLT only applies to sums and means, when statisticians wanted to put a confidence interval on a median, they were out of luck. That is, they were out of luck until the bootstrap was invented. Let's see how this works.

Suppose you had sampled from a distribution you knew was skewed (like
incomes or house prices or something like that) and so you are trying to
estimate the **median** of that population. You sample the data and you
take the median of the sample as an estimate of the median of the
population. You'd like to put a confidence interval around that estimate,
but how?

- Take your data as the "estimate" of the population.
- Draw thousands of resamples of equal size from the data.
- Compute the median of each.
- The distribution of medians is an estimate of the sampling distribution of the median on your population.
- Use the appropriate interval from your distribution. For example, for a 90 percent confidence interval, use the 5th and 95th percentiles from your distribution.
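The steps above can be sketched as a percentile bootstrap. The data values here are hypothetical "income-like" numbers, chosen only to be skewed:

```python
import random
import statistics

def bootstrap_median_ci(data, level=0.90, resamples=5000, seed=0):
    """Percentile-bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    # Draw resamples (with replacement) of the same size as the data,
    # and collect the median of each.
    medians = sorted(
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(resamples)
    )
    alpha = (1 - level) / 2
    lo = medians[int(alpha * resamples)]
    hi = medians[int((1 - alpha) * resamples) - 1]
    return lo, hi

# A skewed, hypothetical sample:
data = [22, 25, 27, 30, 31, 35, 38, 44, 52, 61, 75, 120]
lo, hi = bootstrap_median_ci(data)
```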

This is the real reason that statisticians are excited by the bootstrap. They can now give confidence intervals on all kinds of things that they couldn't before, such as

- medians
- modes
- isobar charts (such as weather charts)
- robust correlations

Proportions, such as the polling data we began with, are essentially binomial processes. As "n" gets large for a binomial distribution, it looks increasingly like a Gaussian distribution. (Test it!) More precisely, it looks like a Gaussian distribution with:

μ = np

σ^{2}= np(1-p)
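Taking up the "Test it!" suggestion: the approximation can be checked by simulating binomial counts, standardizing them with the μ and σ above, and seeing whether the z-scores behave like a standard Gaussian. The n and p below are illustrative choices:

```python
import math
import random

rng = random.Random(1)
n, p = 1100, 0.47
mu = n * p                          # np
sigma = math.sqrt(n * p * (1 - p))  # sqrt[np(1-p)]

def binomial_count(n, p):
    """One binomial draw: count successes in n independent trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

zs = [(binomial_count(n, p) - mu) / sigma for _ in range(2000)]

# For a standard Gaussian, about 68% of values fall in [-1, 1].
frac = sum(1 for z in zs if -1 <= z <= 1) / len(zs)
```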

This means that if we count "x" of "n" things satisfying some property, the following expression has an approximately Gaussian distribution:

z = (x - np)/sqrt[np(1-p)]

Thus, a confidence interval for a proportion is:

x/n ± z_{α/2} * sqrt[p(1-p)/n]

We can use these values to estimate the error in the results of a poll. For example, suppose CBS polled 1100 people, and 517 said they supported Bush while 495 said they supported Kerry. With 95% confidence, we can say the proportion supporting Bush is in the following confidence interval:

517/1100 ± 1.96 * sqrt[p(1-p)/n]

We estimate p using the sample proportion, x/n, namely 517/1100 = 0.47.

0.47 ± 1.96 * sqrt[0.47*0.53/1100]

0.47 ± 1.96 * 0.015

0.47 ± 0.0294

0.47 ± 0.03
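The arithmetic above is easy to package up; the z = 1.96 default corresponds to the usual 95 percent confidence level:

```python
import math

def margin_of_error(x, n, z=1.96):
    """Half-width of the confidence interval for a proportion x/n,
    estimating p by the sample proportion."""
    p = x / n
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(517, 1100)   # ~0.0295, i.e. about 3 percentage points
```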

This is why they almost always say that the margin of error is about 3 percentage points, and why they almost always poll 1100 people.

So, when you read that a poll says the margin of error on a percentage is 3 percentage points, you know that it's some kind of confidence interval on that estimate.

This work is licensed under a Creative Commons License.