In 1908, a statistician named William S. Gosset at Guinness Brewing derived the sampling distribution of the mean when the sample is small (so the CLT doesn't apply). To do so, he had to assume that the samples were from a Gaussian distribution.
He used the following statistic (compare this with the z statistic: it's really the same, except that the sample standard deviation S stands in for σ)
t = (x-bar - μ)/(S/sqrt(n))
and calculated its distribution, which is now known as the "t" distribution.
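As a minimal sketch of the calculation, here is the one-sample t statistic in Python, assuming the usual form in which the standard error is S/sqrt(n). The function name `one_sample_t` and the measurements are made up for illustration.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic for H0: population mean == mu0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)              # sample standard deviation S
    return (x_bar - mu0) / (s / math.sqrt(n))

# Hypothetical measurements, made up for illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
t = one_sample_t(sample, mu0=5.0)
print(f"t = {t:.3f}")
```

The only difference from a z statistic is that `s` is computed from the sample itself rather than being known in advance.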
The t distribution is just like the Gaussian (AKA the z distribution) except that it has slightly heavier tails. In fact the heaviness of the tails depends on how big the sample is. The smaller the sample, the heavier the tails, because a small sample is more likely to be unusual, and in particular to give a poor estimate of the standard deviation. Here's a nice visualization of the t distribution.
Therefore, there is a family of t distributions, indexed by something called the degrees of freedom. Note that, as the number of degrees of freedom increases, the t distribution approaches the z distribution.
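To see the heavier tails and the convergence numerically, the sketch below evaluates the standard density formula for the t distribution at a point far out in the tail and compares it with the Gaussian density. The helper names `t_pdf` and `z_pdf`, the evaluation point, and the particular degrees of freedom are arbitrary choices for illustration.

```python
import math

def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    # Work with log-gamma so that large dof values do not overflow.
    log_coef = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_coef - (dof + 1) / 2 * math.log1p(x * x / dof))

def z_pdf(x):
    """Density of the standard Gaussian (z) distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

x = 3.0  # a point well out in the tail
for dof in (2, 5, 30, 1000):
    print(f"dof={dof:5d}  t density at {x}: {t_pdf(x, dof):.5f}")
print(f"            z density at {x}: {z_pdf(x):.5f}")
```

The printed densities shrink toward the z value as the degrees of freedom grow, which is the sense in which the family converges to the Gaussian.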
The degrees of freedom (dof) is the number of independent numbers contributing to the calculation of t.
In real life, the number of degrees of freedom is equal to one less than the number of observations in the sample, that is, n-1.
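Why one less? Because we use the sample to estimate the standard deviation of the population, and that estimate, S, is built from deviations around the sample mean x-bar. Those n deviations always sum to zero, so only n-1 of them are free to vary. If an angel comes down from heaven and tells us the population standard deviation, σ, so that we don't have to estimate it from the sample, we can compute:
z = (x-bar - μ)/(σ/sqrt(n))
which follows the z distribution exactly, so no degrees of freedom are needed at all.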
The same loss of one degree of freedom, by the way, is why we use n-1 instead of n when computing the variance of the sample.
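A quick simulation makes the n-1 point concrete. This is only a sketch: the Gaussian population with σ = 2, the sample size of 5, the seed, and the number of trials are all assumptions chosen for illustration.

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable
TRUE_VARIANCE = 4.0           # assumed population: Gaussian with sigma = 2
n, trials = 5, 20000

total_n, total_n_minus_1 = 0.0, 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)   # squared deviations from x-bar
    total_n += ss / n                            # divide by n
    total_n_minus_1 += ss / (n - 1)              # divide by n - 1

avg_n = total_n / trials
avg_n_minus_1 = total_n_minus_1 / trials
print("true variance:       ", TRUE_VARIANCE)
print("average using n:     ", round(avg_n, 3))
print("average using n - 1: ", round(avg_n_minus_1, 3))
```

Averaged over many samples, dividing by n comes out too small, while dividing by n-1 lands close to the true variance.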
This statistician published his work under the pseudonym "Student" and, for that reason, this is often called "Student's t" in his honor.
To use the t test, you first calculate something that is distributed like t under the null hypothesis. There's a different t statistic for each usage of the t distribution, but they're all pretty similar. The one that we will use for the difference between two samples is nearly the same as the z statistic. The difference is in computing the standard deviation of the sampling distribution (the t distribution).
Here's the awful truth (and yes, it is awful):
s_pooled^2 = ((N1-1)*s1^2 + (N2-1)*s2^2) / (N1+N2-2)
The variance of the difference between the two sample means, whose square root goes in the denominator of t, is:
s_pooled^2 * (1/N1 + 1/N2)
Why such a horrendous calculation? We'll discuss this as a class, but it rests on the following:
Beyond that, it's just some algebra.
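The pooled calculation above can be sketched in a few lines of Python. The function name `pooled_t` and the two samples are hypothetical. The numerator (the difference between the two sample means) and the N1+N2-2 degrees of freedom are the standard choices for this test rather than something spelled out in these notes.

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Two-sample t statistic using the pooled variance; returns (t, dof)."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)   # s1^2, divides by N1 - 1
    var2 = statistics.variance(sample2)   # s2^2, divides by N2 - 1
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / se_diff
    return t, n1 + n2 - 2

# Hypothetical data, made up for illustration.
t, dof = pooled_t([1, 2, 3], [2, 3, 4, 5, 6])
print(f"t = {t:.3f} on {dof} degrees of freedom")
```

Pooling weights each sample's variance by its own degrees of freedom, so the larger sample has more say in the combined estimate.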
We have talked about alpha, which is the probability of a type I error: the probability that we reject H0 even though it's true. We choose alpha when we set up our statistical test. We then look at how extreme the observed value is within the sampling distribution, and if its probability (the p-value) is below alpha, we reject H0. So, we know immediately what alpha is.
What about the probability of a type II error: the probability that we fail to reject H0 even though H1 is true? That probability is, unsurprisingly, called beta.
Usually, beta is harder to come by. First of all, there are a great many alternatives to H0, and each one has its own beta. For example, suppose H0 is that μ=5; that is, that the mean of some population is 5. One alternative might be that μ=6. Another might be that μ=5.1. Depending on the standard deviation of the population, it might be very hard to reject H0 in favor of either of these alternative hypotheses.
To gain some intuition about this, I've built the following model. Download it and spend a little time looking at it.
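For a rough numerical feel, the sketch below computes beta for the two alternatives in the example (μ=6 and μ=5.1 against H0: μ=5). To keep the arithmetic simple it assumes a two-sided z test with a known population standard deviation. The values σ=2, n=25, and alpha=0.05, and the function name `beta_two_sided_z`, are assumptions for illustration only.

```python
import math
from statistics import NormalDist

def beta_two_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Probability of a type II error when the true mean is mu1."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)              # rejection cutoff under H0
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))   # true mean in standard-error units
    # We fail to reject when the z statistic lands between -z_crit and +z_crit.
    return z.cdf(z_crit - shift) - z.cdf(-z_crit - shift)

for mu1 in (6.0, 5.1):
    beta = beta_two_sided_z(mu0=5.0, mu1=mu1, sigma=2.0, n=25)
    print(f"true mean {mu1}: beta = {beta:.3f}, 1 - beta = {1 - beta:.3f}")
```

Under these assumed numbers, the alternative close to H0 (μ=5.1) is missed far more often than the distant one (μ=6). Increasing `sigma` or shrinking `n` pushes beta up for both.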
The power of a test is the probability that it rejects H0 when H1 is true, that is, 1 - beta. Power is important because, if you have a choice between tests, you should use the one with the greatest power, since you'll have the best chance of rejecting a false null hypothesis, and thereby having a significant result.
Generally speaking, the more assumptions that a test makes, the greater its power, because, in a sense, it "knows" more about the population than tests that make fewer assumptions.
One reason that statisticians like the bootstrap is that it makes the fewest assumptions. Therefore, it is very general and applies in all sorts of situations. However, that also means it typically has the least power. That is, for a given significance level, the t-test is most likely to reject a false null hypothesis, then the z-test, and finally the bootstrap.
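To show how few assumptions the bootstrap needs, here is a sketch of one common variant for comparing two sample means: pool the data (as H0 says both samples come from the same population), resample with replacement, and count how often the resampled difference is at least as large as the observed one. This is not necessarily the exact procedure used in class. The function name `bootstrap_p_value`, the number of resamples, the seed, and the data are all hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def bootstrap_p_value(sample1, sample2, n_resamples=5000, seed=0):
    """Estimate a two-sided p-value for the difference between two sample means."""
    rng = random.Random(seed)                # fixed seed so the run is repeatable
    observed = abs(mean(sample1) - mean(sample2))
    pooled = list(sample1) + list(sample2)   # under H0 both share one population
    n1, n2 = len(sample1), len(sample2)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        fake1 = rng.choices(pooled, k=n1)    # resample with replacement
        fake2 = rng.choices(pooled, k=n2)
        if abs(mean(fake1) - mean(fake2)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples

# Hypothetical data, made up for illustration.
p = bootstrap_p_value([1, 2, 3, 4, 5, 6], [101, 102, 103, 104, 105, 106])
print("estimated p-value:", p)
```

Notice that no Gaussian assumption and no formula for the sampling distribution appears anywhere: the data stand in for the population.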