Suppose you toss a coin 10 times and you get 6 heads. Is it a fair coin? Sure, probably. The chance of getting 6 heads is high enough that you're not willing to reject the assumption that it's a fair coin.
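To put rough numbers on "high enough", here is a small Python sketch of the binomial calculation for a fair coin. Python and the helper name prob_heads are my choices for illustration, not part of the course materials.

```python
from math import comb

def prob_heads(k, n=10, p=0.5):
    """Probability of exactly k heads in n tosses of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chance of exactly 6 heads in 10 tosses of a fair coin.
p_exactly_6 = prob_heads(6)
# Chance of 6 or more heads (a result at least this lopsided toward heads).
p_at_least_6 = sum(prob_heads(k) for k in range(6, 11))
# Chance of all 10 heads.
p_all_10 = prob_heads(10)

print(f"P(exactly 6 heads) = {p_exactly_6:.4f}")   # 210/1024, about 0.205
print(f"P(6 or more heads) = {p_at_least_6:.4f}")  # 386/1024, about 0.377
print(f"P(10 heads)        = {p_all_10:.6f}")      # 1/1024, about 0.001
```

Six heads is an unremarkable result for a fair coin, while ten heads in a row would happen only about once in a thousand tries.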
But if you got 10 heads, you might. Your reasoning would be as follows:
For many people, this is a strange kind of reasoning. But this is the central reasoning of all statistical tests. Here's the standard outline:
In your scientific paper, you then say something like:
Our experiment yielded a value of M, which is statistically significant (p < 0.05).
But wait, you say, what if we're wrong? In fact, if you think about it for a while, you can see that there are two ways to be wrong:
It can be helpful to see these via the following table:
                 What we choose
                 H0               H1
Truth    H0      correct          type I error
         H1      type II error    correct
We'll talk more about these two errors in a later lecture, but it's important to keep them in the back of our minds. For now, know that:
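The "p" value of a statistical test is the probability, assuming H0 is true, of getting a result at least as extreme as the one we observed. The cutoff we compare it to (0.05 above) is the probability of a type I error that we are willing to accept.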
We now turn to a statistical test that uses this kind of reasoning. I think the simplest practical statistical test is a two-sample test. Most textbooks, including ours, start with one-sample tests, which are like the coin example at the top of this page. I'm skipping those because they don't arise often in real life.
Let's first give some real-world examples and then a simpler artificial example.
Suppose you give the SAT math test to 40 students. The boys' scores are as follows:
275 309 355 378 396 461 464 468 472 474 475 480 490 523 555 582 632 683 741 766
(summary stats: Q1=432 Q2=474.5 Q3=586 mean = 498, sd = 132)
The girls' scores are as follows:
298 319 409 423 436 447 447 457 454 459 488 502 519 532 533 547 551 557 569 588
(summary stats: Q1=441 Q2=473.5 Q3=540 mean = 476, sd = 78)
Question: Is there any statistically significant difference between boys and girls? In other words, does this data give any evidence that one sex is better than the other, or can we attribute the differences to chance?
Suppose we want to test an experimental pain reliever. We induce pain in each of the test subjects. (This happened to me!) We then give the experimental pain reliever to one group of people and ibuprofen to the other (the control group). After one hour, we ask each person to report their pain level (on a 1-10 scale).
Test: 4 3 4 5 7 3 4 4 2 4
Control: 4 4 5 3 3 4 2 5 6 4
Question: Can we compare the two data sets to see if the pain reliever is better than ibuprofen? Is the experimental one better (or worse)? Or must we attribute this difference to chance?
Suppose we draw 10 balls from an urn of red and white balls and we get
8 red and 2 white
We leave the room and come back some time later. We again draw 10 balls and we get:
5 red and 5 white
Question: Has someone tampered with the urn and changed the ratio of red to white balls? Or must we attribute this to chance?
Let's start with the urns. What are our two hypotheses?
What is our statistic? We could use lots of different statistics. For example, we could use the absolute value of the difference in the two red-to-white ratios. In this case, the statistic is |8/2 - 5/5| = 3. But a statistic that is easier to calculate is just the difference in the number of reds, which is |8 - 5|, which also happens to be 3. (Just by coincidence!)
Under the null hypothesis, what is the probability that we would get a difference as big as 3 by chance? Ummm.
Let's make the computer do it. Let's draw thousands of samples from a virtual urn and see how unusual a 3 is. How do we make a virtual urn and, more importantly, what's in it?
In this case, under the null hypothesis, we could assume that the real proportion of red to white is 13:7, because we've gotten 13 red balls and 7 white balls in 20 draws.
We can construct a spreadsheet that does this. Here's one:
The major elements of this spreadsheet are:
Let's also build an Extend model of this. What is the probability of a type I error?
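For readers without a spreadsheet or Extend handy, here is a minimal Python sketch of the same idea. It is not the spreadsheet or the Extend model; it assumes a virtual urn in which each draw is red with probability 13/20 (drawing with replacement, since we don't know how many balls the real urn holds). The function names, the number of trials, and the seed are all made up for illustration.

```python
import random

def count_reds(rng, n_draws=10, p_red=13 / 20):
    """Draw n_draws balls from the virtual urn and count the reds."""
    return sum(rng.random() < p_red for _ in range(n_draws))

def simulate_urn(n_trials=20000, observed_diff=3, seed=1):
    """Estimate how often two samples of 10 differ by at least observed_diff reds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Two independent samples of 10 from the SAME urn (the null hypothesis).
        diff = abs(count_reds(rng) - count_reds(rng))
        if diff >= observed_diff:
            hits += 1
    return hits / n_trials

p_estimate = simulate_urn()
print(f"Estimated P(difference in reds >= 3 | same urn) = {p_estimate:.3f}")
```

If the estimate comes out well above 0.05, a difference of 3 is not unusual under the null hypothesis, and we have no grounds to claim the urn was tampered with.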
The logic of the bootstrap is as follows:
Let's turn to another statistical test; this time a classic called the z-test. Let's use it to test our SAT data. First, we assume that the scores are normally distributed. That is, the population they are drawn from is Gaussian.
What will our test statistic be? Here, the choice is not quite so arbitrary, for reasons that we'll discuss later, but among the first things you'd think of is the difference of the means, and that's just fine.
test statistic: difference of means, or 498 - 476 = 22
Under the null hypothesis, the scores are from the same Gaussian distribution. Our statistic is the difference between the means of two samples.
What is the sampling distribution of the difference in means of two samples from a Gaussian? Call our two samples A and B. It turns out that the sampling distribution is also Gaussian with parameters:
mean(A-B): 0
var(A-B): var(A)/N(A)+var(B)/N(B)
What the heck does the variance formula mean? The formula means the variance of the first sample divided by the size of the first sample, plus the variance of the second sample divided by its size.
In our case, that means:
(Here A is the boys' sample and B is the girls' sample.)
VAR(A) = 132^2 = 17424
N(A) = 20
VAR(B) = 78^2 = 6084
N(B) = 20
VAR(A-B) = 17424/20 + 6084/20 = 1175.4
So, if we standardize our statistic, we can use the standard normal as our sampling distribution and look up the probability in a table.
z = (22-0)/sqrt(1175.4) = 22/34.3 = 0.64
What's the probability of a value greater than 0.64? It's not even a whole standard deviation, so it's clearly pretty large. We can find the exact value using Excel if we want. However, we will definitely not be rejecting the null hypothesis.
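If you would rather not use Excel, the same arithmetic can be sketched in a few lines of Python. This uses only the summary statistics quoted earlier (means 498 and 476, standard deviations 132 and 78, 20 students in each group). The variable names are mine, and the tail probability comes from the standard library's complementary error function rather than a printed table.

```python
from math import erfc, sqrt

# Summary statistics from the SAT example (A = boys, B = girls).
mean_a, sd_a, n_a = 498, 132, 20
mean_b, sd_b, n_b = 476, 78, 20

diff = mean_a - mean_b                      # test statistic: difference of means
var_diff = sd_a**2 / n_a + sd_b**2 / n_b    # var(A)/N(A) + var(B)/N(B)
z = (diff - 0) / sqrt(var_diff)             # standardize against a mean of 0

# Upper-tail probability of the standard normal: P(Z > z).
p_one_sided = 0.5 * erfc(z / sqrt(2))
# Probability of a difference this large in either direction.
p_two_sided = erfc(abs(z) / sqrt(2))

print(f"var(A-B) = {var_diff:.1f}, z = {z:.2f}")
print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```

Either way you count it, the probability is far above 0.05, which matches the conclusion that we do not reject the null hypothesis.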
The logic of the z-test is similar to the bootstrap, but there are some important differences, which I marked with asterisks.
There's a really important step in the z-test that we didn't dwell on earlier, namely that the distribution of differences in means of samples from Gaussians is also Gaussian.
Let's say that again. You have a population P that is Gaussian. A procedure draws two samples from it, A and B, one of size N(A) and the other of size N(B) and calculates D, the difference in the means of the two samples. The distribution of D is also Gaussian.
Moreover, the mathematical statisticians of the last century were able to calculate the parameters of D as a function of the variance of P and the sizes of the two samples. Amazing, complicated stuff, but outside the scope of this class.
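You don't have to take this on faith. Here is a small Python sketch that checks the claim by brute force: it draws many pairs of samples from one Gaussian population, records D each time, and compares the simulated mean and variance of D with the formula var(A)/N(A) + var(B)/N(B). The population mean of 500 and standard deviation of 100 are invented for illustration, as are the function name, the number of trials, and the seed.

```python
import random
import statistics

def diff_of_means(rng, mu, sigma, n_a, n_b):
    """Draw samples A and B from the same Gaussian and return mean(A) - mean(B)."""
    a = [rng.gauss(mu, sigma) for _ in range(n_a)]
    b = [rng.gauss(mu, sigma) for _ in range(n_b)]
    return statistics.fmean(a) - statistics.fmean(b)

MU, SIGMA = 500.0, 100.0   # hypothetical population P
N_A, N_B = 20, 20          # sample sizes

rng = random.Random(42)
diffs = [diff_of_means(rng, MU, SIGMA, N_A, N_B) for _ in range(20000)]

sim_mean = statistics.fmean(diffs)
sim_var = statistics.pvariance(diffs)
theory_var = SIGMA**2 / N_A + SIGMA**2 / N_B   # formula from the z-test

print(f"simulated mean of D     = {sim_mean:.2f} (theory: 0)")
print(f"simulated variance of D = {sim_var:.1f} (theory: {theory_var:.1f})")
```

A histogram of diffs should also look bell-shaped, though this sketch only checks the mean and the variance.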
We use the sampling distribution to determine whether to reject the null hypothesis.
So, who cares?
But if you can't meet the two key requirements, you're sunk, unless you use the bootstrap. The bootstrap is a major innovation in statistics that has come about thanks to the advent of cheap computing. It lets us:
Let's compare the two grocery models. Here are two variants using timers for customers:
Inspect the differences, and then run them. Let's at least eyeball the statistics to see whether a z-test is likely to reject H0. Let's see if we can explain the result.