Two Sample Tests

Real Differences versus Chance Differences

Suppose you toss a coin 10 times and you get 6 heads. Is it a fair coin? Sure, probably. The chance of getting 6 heads is high enough that you're not willing to reject the assumption that it's a fair coin.

But if you got 10 heads, you might. Your reasoning would be as follows:
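A minimal sketch of that reasoning in Python, assuming that "as extreme" means at least that many heads in 10 tosses of a fair coin (the function name prob_at_least is made up for this illustration):

```python
from math import comb

def prob_at_least(k, n=10, p=0.5):
    """Probability of k or more heads in n tosses when P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If the coin is fair, 6 or more heads is an everyday event...
print(prob_at_least(6))   # 386/1024, about 0.38
# ...but 10 heads out of 10 is rare.
print(prob_at_least(10))  # 1/1024, about 0.001
```

So the argument runs roughly like this: if the coin were fair, 10 heads in a row would turn up only about once in a thousand experiments. Either something very unlikely just happened, or the coin is not fair, and we prefer the second explanation.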

For many people, this is a strange kind of reasoning. But this is the central reasoning of all statistical tests. Here's the standard outline:

  1. Determine the null hypothesis (usually the status quo or "no difference", denoted H0) and the alternative hypothesis (often denoted H1).
  2. Run your experiment and obtain a measurement or statistic. Let's call it M.
  3. Determine the probability, p, of getting a result as extreme as M or more, under the assumption that the null hypothesis is true.
  4. If that probability is sufficiently small, you reject the null hypothesis in favor of the alternative hypothesis.

In your scientific paper, you then say something like:

Our experiment yielded a value of M, which is statistically significant (p < 0.05).

Type I and Type II Errors

But wait, you say, what if we're wrong? In fact, if you think about it for a while, you can see that there are two ways to be wrong:

  1. We could reject H0 even though H0 is correct. (Type I error)
  2. We could accept H0 even though H1 is correct. (Type II error)

It can be helpful to see these via the following table:

                What we choose
                H0              H1
Truth   H0      correct         type I error
        H1      type II error   correct

We'll talk more about these two errors in a later lecture, but it's important to keep them in the back of our minds. For now, know that:

The cutoff we compare the "p" value against (such as 0.05) is the largest probability of a type I error that we are willing to accept.
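To make the type I error concrete, here is a small sketch using the coin example from the top of the page. It assumes a two-sided test and a 0.05 cutoff (both are choices made for this illustration, as are the function names), and it works out how often a fair coin would be wrongly rejected:

```python
from math import comb

N = 10          # tosses per experiment
ALPHA = 0.05    # cutoff for "sufficiently small" (an assumed convention)

def prob_heads(k):
    """P(exactly k heads in N tosses) for a fair coin."""
    return comb(N, k) / 2**N

def two_sided_p(k):
    """P(a result at least as far from N/2 as k heads), under H0."""
    dist = abs(k - N / 2)
    return sum(prob_heads(j) for j in range(N + 1) if abs(j - N / 2) >= dist)

# Outcomes that would make us reject H0 (fair coin).
reject = [k for k in range(N + 1) if two_sided_p(k) < ALPHA]
# Chance of landing in that region when H0 is actually true.
type_i_rate = sum(prob_heads(k) for k in reject)

print(reject)       # [0, 1, 9, 10]
print(type_i_rate)  # 22/1024, about 0.021
```

The rate comes out below 0.05 rather than exactly at it because only whole numbers of heads are possible.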

Two Sample Tests

We now turn to a statistical test that uses this kind of reasoning. I think the simplest practical statistical test is a two-sample test. Most textbooks, including ours, start with one-sample tests, which are like the coin example at the top of this page. I'm skipping those because they don't arise often in real life.

Let's first give some real-world examples and then a simpler artificial example.

Test Differences

Suppose you give the SAT math test to 40 students. The boys' scores are as follows:

275 309 355 378 396 461 464 468 472 474 475 480 490 523 555 582 632 683 741 766

(summary stats: Q1=432 Q2=474.5 Q3=586 mean = 498, sd = 132)

The girls' scores are as follows:

298 319 409 423 436 447 447 457 454 459 488 502 519 532 533 547 551 557 569 588

(summary stats: Q1=441 Q2=473.5 Q3=540 mean = 476, sd = 78)

Question: Is there any statistically significant difference between boys and girls? In other words, does this data give any evidence that one sex is better than the other, or can we attribute the differences to chance?

Treatment Differences

Suppose we want to test an experimental pain reliever. We induce pain in each of the test subjects. (This happened to me!) We then give the experimental pain reliever to one group of people and ibuprofen to the other (the control group). After one hour, we ask each person to report their pain level (on a 1-10 scale).

Test: 4 3 4 5 7 3 4 4 2 4
Control: 4 4 5 3 3 4 2 5 6 4

Question: Can we compare the two data sets to see whether the experimental pain reliever is better (or worse) than ibuprofen? Or must we attribute any difference to chance?

Balls from an Urn

Suppose we draw 10 balls from an urn of red and white balls and we get

8 red and 2 white

We leave the room and come back some time later. We again draw 10 balls and we get:

5 red and 5 white

Question: Has someone tampered with the urn and changed the ratio of red to white balls? Or must we attribute this to chance?

Testing using the Bootstrap

Let's start with the urns. What are our two hypotheses?

What is our statistic? We could use lots of different statistics. For example, we could use the absolute value of the difference in the two red-to-white ratios. In this case, the statistic is |8/2 - 5/5| = 3. But a statistic that is easier to calculate is just the difference in the number of reds, which is |8 - 5|, which also happens to be 3. (Just by coincidence!)

Under the null hypothesis, what is the probability that we would get a difference as big as 3 by chance? Ummm.

Let's make the computer do it. Let's draw thousands of samples from a virtual urn and see how unusual a 3 is. How do we make a virtual urn and, more importantly, what's in it?

In this case, under the null hypothesis, we could assume that the real proportion of red to white is 13:7, because we've gotten 13 red balls and 7 white balls in 20 draws.

We can construct a spreadsheet that does this. Here's one:

bootstrap.xls

The major elements of this spreadsheet are:

Let's also build an Extend model of this. What is the probability of a type I error?
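The same virtual urn can also be sketched in a few lines of Python. This is only an illustration, not a copy of the spreadsheet: the function name, the number of trials, and the random seed are all assumptions, and the statistic is the difference in the number of reds.

```python
import random

def urn_bootstrap(n_trials=10_000, seed=0):
    """Estimate P(|difference in reds| >= 3) when both draws share one urn."""
    rng = random.Random(seed)
    urn = ["red"] * 13 + ["white"] * 7   # pooled draws: 13 red, 7 white
    observed = abs(8 - 5)
    hits = 0
    for _ in range(n_trials):
        # Draw two samples of 10, with replacement, from the virtual urn.
        first = rng.choices(urn, k=10).count("red")
        second = rng.choices(urn, k=10).count("red")
        if abs(first - second) >= observed:
            hits += 1
    return hits / n_trials

p_value = urn_bootstrap()
print(p_value)
```

The estimate should land somewhere around 0.24. A difference of 3 or more reds turns up about a quarter of the time even when nobody has touched the urn, so this data gives us no grounds for claiming tampering.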

The Logic of the Bootstrap

The logic of the bootstrap is as follows:

  1. Draw the two real samples.
  2. Calculate the test statistic.
  3. Under the null hypothesis, the two samples are in fact drawn from the same population.
  4. Pool the two samples.
  5. Use the pooled sample as an estimate of the population we are sampling from.
  6. Draw zillions of pairs of samples, the same sizes as the real ones, from the estimated population (the pooled sample) and calculate the statistic on each pair. Throw each statistic into a bucket.
  7. The distribution of values in the bucket is called the sampling distribution of the statistic.
  8. See how unusual the test statistic is in the sampling distribution.
  9. Reject the null hypothesis if the test statistic is sufficiently extreme.
  10. The p value is the part of the sampling distribution that is as extreme as our test statistic or more.
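The steps above can be sketched as one short Python function. This is a sketch under assumptions, not a definitive implementation: the statistic is the absolute difference of the means, the resampling is done with replacement, and the function name, trial count, and seed are invented for the example. It is applied here to the SAT scores from earlier.

```python
import random

def avg(values):
    return sum(values) / len(values)

def bootstrap_p_value(a, b, n_trials=5_000, seed=0):
    """Two-sample bootstrap test using |difference of means| as the statistic."""
    rng = random.Random(seed)
    observed = abs(avg(a) - avg(b))            # steps 1-2
    pooled = list(a) + list(b)                 # steps 3-5
    bucket = []
    for _ in range(n_trials):                  # step 6
        fake_a = rng.choices(pooled, k=len(a))
        fake_b = rng.choices(pooled, k=len(b))
        bucket.append(abs(avg(fake_a) - avg(fake_b)))
    # Steps 7-10: fraction of the bucket at least as extreme as what we saw.
    return sum(stat >= observed for stat in bucket) / n_trials

boys = [275, 309, 355, 378, 396, 461, 464, 468, 472, 474,
        475, 480, 490, 523, 555, 582, 632, 683, 741, 766]
girls = [298, 319, 409, 423, 436, 447, 447, 457, 454, 459,
         488, 502, 519, 532, 533, 547, 551, 557, 569, 588]

sat_p = bootstrap_p_value(boys, girls)
print(sat_p)
```

For these scores the estimate should come out somewhere near 0.5. In other words, a gap this size between two groups of 20 is the kind of thing chance alone produces about half the time.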

Testing using the z-test

Let's turn to another statistical test; this time a classic called the z-test. Let's use it to test our SAT data. First, we assume that the scores are normally distributed. That is, the population they are drawn from is Gaussian.

What will our test statistic be? Here, the choice is not quite so arbitrary, for reasons that we'll discuss later, but among the first things you'd think of is the difference of the means, and that's just fine.

test statistic: difference of means or 498-476 = 22

Under the null hypothesis, the scores are from the same Gaussian distribution. Our statistic is the difference between the means of two samples.

What is the sampling distribution of the difference in means of two samples from a Gaussian? Call our two samples A and B. It turns out that the sampling distribution is also Gaussian with parameters:

mean(A-B): 0
var(A-B): var(A)/N(A)+var(B)/N(B)

What the heck does the variance formula mean? The formula means the variance of the first sample divided by the size of the first sample, plus the variance of the second sample divided by its size.

In our case, taking A to be the boys and B the girls, that means:

VAR(A) = 132^2 = 17424
N(A) = 20
VAR(B) = 78^2 = 6084
N(B) = 20
VAR(A-B) = 17424/20 + 6084/20 = 1175.4

So, if we standardize our statistic, we can use the standard normal as our sampling distribution and look up the probability in a table.

z = (22-0)/sqrt(1175.4) = 22/34.3 = 0.64

What's the probability of a value greater than 0.64? It's not even a whole standard deviation, so it's clearly pretty large. We can find the exact value using Excel if we want. However, we will definitely not be rejecting the null hypothesis.
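If Excel is not handy, the same lookup can be done with Python's standard library. The sketch below uses the rounded summary statistics quoted earlier (means 498 and 476, standard deviations 132 and 78, 20 students each); the variable names are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the text.
mean_boys, sd_boys, n_boys = 498, 132, 20
mean_girls, sd_girls, n_girls = 476, 78, 20

# Variance of the difference in means, then the standardized statistic.
var_diff = sd_boys**2 / n_boys + sd_girls**2 / n_girls
z = (mean_boys - mean_girls - 0) / sqrt(var_diff)

std_normal = NormalDist()                 # mean 0, standard deviation 1
p_one_sided = 1 - std_normal.cdf(z)       # P(Z > z)
p_two_sided = 2 * p_one_sided             # P(|Z| > z)

print(round(var_diff, 1), round(z, 2))
print(round(p_one_sided, 2), round(p_two_sided, 2))
```

The one-sided probability comes out at roughly 0.26, and about twice that if we count differences in either direction. Either way it is nowhere near small enough to reject the null hypothesis.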

The Logic of the z-test

The logic of the z-test is similar to the bootstrap, but there are some important differences, which I marked with asterisks.

  1. Draw the two samples.
  2. *Choose an appropriate statistic, such as the difference of the means.
  3. Calculate the test statistic.
  4. Under the null hypothesis, the two samples are in fact drawn from the same population.
  5. Pool the two samples.
  6. *Assume that the two samples are drawn from a Gaussian distribution.
  7. *Use the pooled sample as an estimate of the parameters of the sampling distribution. (The sampling distribution may also be Gaussian, but not the same Gaussian.)
  8. See how unusual the test statistic is in the sampling distribution.
  9. Reject the null hypothesis if the test statistic is sufficiently extreme.
  10. The p value is the part of the sampling distribution that is as extreme as our test statistic or more.

Differences Between the Bootstrap and the z-test

There's a really important step in the z-test that we didn't dwell on earlier, namely that the distribution of differences in means of samples from Gaussians is also Gaussian.

Let's say that again. You have a population P that is Gaussian. A procedure draws two samples from it, A and B, one of size N(A) and the other of size N(B) and calculates D, the difference in the means of the two samples. The distribution of D is also Gaussian.

Moreover, the mathematical statisticians of the last century were able to calculate the parameters of D as a function of the variance of P and the sizes of the two samples. Amazing, complicated stuff, but outside the scope of this class.
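We can't reproduce the mathematics here, but we can check the result by simulation. The sketch below assumes a Gaussian population with mean 500 and standard deviation 100 (numbers picked only for illustration, as are the function name and seed), draws A and B from it over and over, and compares the variance of D with the formula var(P)/N(A) + var(P)/N(B).

```python
import random
from statistics import mean, variance

def simulate_differences(mu=500, sigma=100, n_a=20, n_b=20,
                         n_trials=20_000, seed=0):
    """Draw A and B from one Gaussian many times; collect D = mean(A) - mean(B)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        a = [rng.gauss(mu, sigma) for _ in range(n_a)]
        b = [rng.gauss(mu, sigma) for _ in range(n_b)]
        diffs.append(sum(a) / n_a - sum(b) / n_b)
    return diffs

diffs = simulate_differences()
predicted_var = 100**2 / 20 + 100**2 / 20   # var(P)/N(A) + var(P)/N(B) = 1000

print(mean(diffs))       # should be close to 0
print(variance(diffs))   # should be close to predicted_var
print(predicted_var)
```

A histogram of diffs should also look bell-shaped, which is the other half of the claim.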

We use the sampling distribution to determine whether to reject the null hypothesis.

So, who cares?

But if you can't meet the two key requirements, you're sunk, unless you use the bootstrap. The bootstrap is a major innovation in statistics that has come about thanks to the advent of cheap computing. It lets us:

Comparing Grocery Models

Let's compare the two grocery models. Here are two variants using timers for customers:

Inspect the differences, and then run them. Let's at least eyeball the statistics to see whether a z-test is likely to reject H0. Let's see if we can explain the result.

This work is licensed under a Creative Commons License.