These notes go into more detail on several topics that we glossed over in class.
One of the fundamental purposes of statistics is to make use of information (data) that is intrinsically variable. One important aspect of that information processing is knowing how much information you actually have: the more information the better, but knowing how much you have is critical. Thus, we count the number of independent samples.
Now, consider estimating a particular value, namely the variance of the population. The obvious way to do that is to use the average squared distance from the population mean:
(1) σ^2 ≈ (1/N) Σ (x_i - μ)^2
However, we rarely know the population mean in real life. So, we estimate that, too, using our sample. Letting M be the sample mean:
(2) σ^2 ≈ (1/N) Σ (x_i - M)^2
However, that results in a biased estimate of the variance: the latter calculation is too small, because the data will be "too close" to M. That is, they are closer to M than they are to μ. (Indeed, you can prove that M minimizes the sum of squared distances, so no other center, μ included, gives a smaller value.)
One way to think about this is to take a sample of just two data values. Equation (1) then uses 2 data values and 1 known parameter, while equation (2) uses the 2 data values alone. The result is that equation (2) is equivalent to using equation (1) with only one independent sample.
Why? Because if I tell you either of the two data values and M, you can tell me the other data value. The other data value has no new information in it.
Statisticians say that the second case only has one degree of freedom. They count each independent data item (each new piece of information) as a degree of freedom, and each time something is estimated from the sample, they deduct one degree of freedom.
Because the sample variance is estimated from the data values and M, it's being estimated from one less degree of freedom, and therefore, we divide by N-1 instead of by N. Therefore, we estimate the population variance as:
(3) σ^2 ≈ (1/(N-1)) Σ (x_i - M)^2
As a demonstration of the three equations we have here, I implemented the following Extend model. In this model, we are just taking two data values, which is the smallest possible sample. Not only does that simplify the Extend model, but it maximizes the difference between equation (2) and equation (3).
degrees-of-freedom.mox
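For readers without Extend, here is a minimal Python sketch of the same kind of demonstration. It is not the Extend model itself: the function names, the standard normal population (μ = 0, σ = 1), the number of trials, and the seed are all assumptions chosen for illustration. It draws many samples of size 2 and averages the three estimates from equations (1), (2), and (3).

```python
import random

def variance_estimates(sample, mu):
    """Return the estimates from equations (1), (2), and (3) for one sample."""
    n = len(sample)
    m = sum(sample) / n                                  # sample mean M
    eq1 = sum((x - mu) ** 2 for x in sample) / n         # known population mean
    eq2 = sum((x - m) ** 2 for x in sample) / n          # sample mean, divide by N
    eq3 = sum((x - m) ** 2 for x in sample) / (n - 1)    # sample mean, divide by N-1
    return eq1, eq2, eq3

def simulate(trials=100_000, n=2, mu=0.0, sigma=1.0, seed=1):
    """Average each estimate over many independent samples of size n."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        for i, estimate in enumerate(variance_estimates(sample, mu)):
            totals[i] += estimate
    return [total / trials for total in totals]

# The assumed population variance is 1.0.
avg1, avg2, avg3 = simulate()
print(f"eq (1): {avg1:.3f}   eq (2): {avg2:.3f}   eq (3): {avg3:.3f}")
```

With samples of size 2, the averages for equations (1) and (3) should land near the true variance of 1, while equation (2) should land near one half of it, which is the bias that dividing by N-1 corrects.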
First of all, dividing by N-1 instead of N only matters if we are trying to estimate the variance of the population. If we just want a measure of the variance of the sample, either one is fine, and there are functions in Excel to calculate it either way. Since the N-1 form is the default, you should explain that you're dividing by N should you choose to do so. Personally, I suggest you stick with what's standard, because it doesn't really matter.
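Python's standard library offers the same choice as Excel. As a small sketch, using a made-up sample that is not from these notes, `statistics.pvariance` divides by N and `statistics.variance` divides by N-1:

```python
import statistics

# Hypothetical sample for illustration only; its mean is 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_n = statistics.pvariance(data)    # divides by N: variance of this sample itself
var_n1 = statistics.variance(data)    # divides by N-1: estimate of the population variance

print(f"divide by N:   {var_n:.3f}")
print(f"divide by N-1: {var_n1:.3f}")
print(f"ratio:         {var_n / var_n1:.3f}")   # equals (N-1)/N
```

The ratio of the two is always (N-1)/N, so the gap shrinks quickly as the sample grows.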
Why doesn't it matter? In the extreme case of having only 2 samples, it clearly matters a lot whether you divide by 1 or 2, but you would never do any scientific or other important work based on just 2 samples. Cost may limit the amount of data you can get, but if you couldn't afford more than two samples, you would probably abandon the effort.
Furthermore, suppose it does matter: suppose that dividing by N just barely gets you a significant result (p<0.05) but dividing by N-1 doesn't quite (p≥0.05). That means, in a sense, that your result depends on one data item (one degree of freedom). The next researcher to try to replicate your result could well end up on the other side and fail to confirm your result.
In general, while a field typically chooses a conventional standard (such as p<0.05), you'd feel better if there were some distance between your result and the critical value.
The one sample t test rarely arises in real life, but it's a nice warmup for the later, more typical cases. Imagine that we have a hypothesis, perhaps from earlier work in the field, that some measure has a particular value. That is, H0 will be a statement about μ for some population. You'll collect some data, calculate a statistic (a t value), and calculate the probability of that statistic given H0. You'll reject H0 if the statistic is sufficiently unlikely.
For concreteness, suppose that your hypotheses are about the white blood cell count (WBC, which is a rough measure of infection or the health of the immune system) for healthy rats. In particular, your hypotheses are
H0: WBC for a healthy rat is 20,000 per microliter
H1: WBC for a healthy rat is not 20,000 per microliter
Thus, you'll have a two-tailed test.
You have 10 lab rats, and you draw a small sample of blood from each, run a WBC test, and make the following measurements:
19,500
18,800
21,600
22,300
19,800
22,900
20,700
23,200
19,100
18,500
n = 10
x-bar = 20,640
s = 1754
stderr = s/sqrt(n) = 1754/sqrt(10) = 555
(4) t = (x-bar - μ)/stderr = (20,640-20,000)/555 = 640/555 = 1.15
With n = 10, we have n-1 = 9 degrees of freedom, so the relevant line of the t table is:
| df | t0.050 | t0.025 | t0.010 | t0.005 |
|---|---|---|---|---|
| 9 | 1.833 | 2.262 | 2.821 | 3.250 |
Since t is less than 1.833, we know that a value at least this large would occur by chance more than 5 percent of the time. And that's just the upper tail. Since we decided this was a two-tailed test, we double this to know that the probability that we would see a difference as great as this is more than 10 percent.
If we want the exact p-value (there's no reason we should, but just because we can), we can use the following Excel formula:
=tdist(tvalue,degreesOfFreedom,tails)
=tdist(1.15,9,2)
which evaluates to 0.28. Thus, there's a 28 percent probability that a t value this far from zero would happen by chance if H0 were true.
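The same one-sample calculation can be sketched in Python from the ten measurements listed above, using only the standard library. The helper names `t_density` and `t_tail` are my own, and `t_tail` is only a rough stand-in for Excel's TDIST: it integrates the t density numerically with Simpson's rule rather than reproducing Excel's implementation.

```python
import math
import statistics

wbc = [19500, 18800, 21600, 22300, 19800, 22900, 20700, 23200, 19100, 18500]
mu0 = 20000                      # value of mu under H0

n = len(wbc)
xbar = statistics.mean(wbc)
s = statistics.stdev(wbc)        # sample standard deviation (divides by n-1)
stderr = s / math.sqrt(n)
t = (xbar - mu0) / stderr        # equation (4)
df = n - 1                       # one degree of freedom spent on the mean

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, tails=2, steps=2000):
    """Probability beyond |t| in one or two tails (Simpson's rule on [0, |t|])."""
    b = abs(t)
    h = b / steps                # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    one_tail = 0.5 - total * h / 3
    return tails * one_tail

p = t_tail(t, df, tails=2)
print(f"n={n} xbar={xbar:.0f} s={s:.0f} stderr={stderr:.0f} df={df} t={t:.2f} p={p:.2f}")
```

Because the density is symmetric about zero, the area from 0 to |t| subtracted from one half gives one tail, and doubling it gives the two-tailed probability.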
Suppose we have a hypothesis that glucocorticoids (a kind of steroid that is released when an organism is under stress) will reduce immune function, and therefore WBC count. We decide to treat all the rats to a round of glucocorticoids and measure their WBC count after the treatment.
Our hypotheses end up being about two means: the mean of the treatment group and the mean of the control group, or, equivalently, the mean after treatment and the mean before. Let's denote these:
μtreatment or μ1, and
μcontrol or μ0
Thus, our hypotheses are:
H0: the treatment has no effect, so μ1=μ0.
H1: the treatment has an effect, so μ1<μ0
Equivalently, our hypotheses are:
H0: the treatment has no effect, so μ1-μ0=0
H1: the treatment has an effect, so μ1-μ0<0
Note that this is a one-tailed test. Note also that our test is about the differences of means. This is crucial.
Here is the data we get after treating the rats. Unfortunately, one of the rats escaped, so we only have 9.
17,900
19,000
21,500
19,200
21,900
18,800
20,100
19,100
17,500
We first compute the summary statistics. We've repeated the summary stats for the original values from the earlier measurement.
| | Control Group | Treatment Group |
|---|---|---|
| n | 10 | 9 |
| x-bar | 20,640 | 19,444 |
| s | 1754 | 1485 |
Now we have to compute the standard error, which is the standard deviation of the sampling distribution. Here, we have to think a bit. What is the sampling distribution? It is a distribution of differences of means. Can we determine its standard deviation (that is, the standard error of our test statistic)? Well, the variance of a difference of two independent quantities is the sum of their variances, so it turns out that we can determine this. Mathematical statisticians have worked through the math, and the formula for this standard error is:
dof = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2
s_pooled^2 = [(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2] / dof
stderr^2 = s_pooled^2 (1/n_1 + 1/n_2)
The pooled variance combines the squared distances of one group's values around its own mean with those of the other group around its own mean. What it really is is the average squared distance of each data value from the mean of its group, averaged over the degrees of freedom. This is because one of the assumptions of the two-sample t test is that the two populations have the same variance, even though one may be shifted relative to the other. Thus, we want to pool all our data to best estimate that common variance.
If you let n1=n2 and algebraically simplify the expression, you'll see that this turns out to be very similar to our usual standard error calculation.
For our example, the calculations are:
dof = (10-1)+(9-1) = 10+9-2 = 17
s_pooled^2 = [9(1754^2) + 8(1485^2)]/17
s_pooled = 1633
stderr^2 = 1633^2 (1/10 + 1/9) = 1633^2 (0.2111)
stderr = 1633 * sqrt(0.2111) = 750
Notice that the pooled variance is between the variance for each sample: it's really a kind of weighted average. Notice also that the standard error is smaller than the pooled standard deviation, because a mean varies less than the individual values of the underlying population.
You saw that the degrees of freedom is two less than the number of data. That's because we've estimated two parameters from our data: the mean of the test group and the mean of the control group. Thus, we have to reduce our degrees of freedom by two.
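The pooled standard error formulas can be sketched in Python directly from the summary statistics (n and s for each group). The function name `pooled_stderr` is my own. The last few lines check the equal-sample-size simplification mentioned above, using a hypothetical case in which both groups have 10 values.

```python
import math

def pooled_stderr(n1, s1, n2, s2):
    """Return (dof, pooled standard deviation, standard error of the difference)."""
    dof = (n1 - 1) + (n2 - 1)
    var_pooled = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / dof
    stderr = math.sqrt(var_pooled * (1 / n1 + 1 / n2))
    return dof, math.sqrt(var_pooled), stderr

# Summary statistics from the notes: control (n=10, s=1754), treatment (n=9, s=1485).
dof, s_pooled, stderr = pooled_stderr(10, 1754, 9, 1485)
print(f"dof={dof} s_pooled={s_pooled:.0f} stderr={stderr:.0f}")

# Hypothetical equal-size case: with n1 = n2 = n the formula collapses to
# stderr^2 = s1^2/n + s2^2/n, the sum of the two usual squared standard errors.
n = 10
_, _, se_equal = pooled_stderr(n, 1754, n, 1485)
print(math.isclose(se_equal ** 2, 1754 ** 2 / n + 1485 ** 2 / n))
```

Working from n and s rather than the raw data mirrors the hand calculation, so small rounding differences from the raw-data result are expected.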
Now that we've grappled with the standard error, let's compute our t statistic and p value. The t statistic is pretty much what you'd think, taking our previous formula, namely equation (4), and substituting the differences of the sample means for the sample mean, the difference of the population means for the population mean, and the new standard error calculation:
(5) t = [(xbar1-xbar2) - (μ1-μ2)]/stderr
(6) t = (xbar1-xbar2)/stderr
The more general formula is (5), where we allow for null hypotheses like
The mean of the control population is 20,000 and the mean of the treatment population is 19,000. Therefore, we expect that the difference of the sample means will equal 1000.
In most cases, the null hypothesis is that the population means, whatever they are, are equal, so the expected difference of the sample means is zero. In that case, we get equation (6), which is pretty straightforward.
Using formula (6) on our data, taking the control mean minus the treatment mean (so that a positive t supports H1), we get:
t = (20,640-19,444)/750 = 1196/750 = 1.59
Hmm. Not a whopping big value of t. Consulting our table for 17 degrees of freedom, we see:
| df | t0.050 | t0.025 | t0.010 | t0.005 |
|---|---|---|---|---|
| 17 | 1.740 | 2.110 | 2.567 | 2.898 |
So, it doesn't quite make the 0.05 significance level for this one-tailed test. To find the exact p value, we can use Excel:
=tdist(tvalue,degreesOfFreedom,tails)
=tdist(1.59,17,1)
which evaluates to 0.06. Thus, there's a 6 percent probability that something this large would happen by chance.
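The two-sample test can also be run from the raw data. This is a minimal sketch, with the function name `two_sample_t` and the parameter `null_diff` being my own labels: `null_diff` is the difference of population means claimed by H0, so the default of 0 gives equation (6) and any other value gives equation (5).

```python
import math
import statistics

def two_sample_t(x1, x2, null_diff=0.0):
    """Pooled two-sample t statistic, equation (5); returns (t, dof)."""
    n1, n2 = len(x1), len(x2)
    dof = n1 + n2 - 2
    var_pooled = ((n1 - 1) * statistics.variance(x1)
                  + (n2 - 1) * statistics.variance(x2)) / dof
    stderr = math.sqrt(var_pooled * (1 / n1 + 1 / n2))
    diff = statistics.mean(x1) - statistics.mean(x2)
    return (diff - null_diff) / stderr, dof

control = [19500, 18800, 21600, 22300, 19800, 22900, 20700, 23200, 19100, 18500]
treatment = [17900, 19000, 21500, 19200, 21900, 18800, 20100, 19100, 17500]

t, dof = two_sample_t(control, treatment)   # control minus treatment
critical = 1.740                            # t0.050 for 17 df, from the table above
print(f"t={t:.2f} dof={dof} beyond the one-tailed 0.05 cutoff: {t > critical}")

# Equation (5) with the example null hypothesis of a 1000-point difference.
t_shifted, _ = two_sample_t(control, treatment, null_diff=1000)
print(f"t against a null difference of 1000: {t_shifted:.2f}")
```

Passing the control group first keeps the sign convention of the hand calculation, where a positive t means the treated rats had lower counts.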
Because our data were so close, we consult a statistician friend of ours. After explaining that we should have consulted her first, when we were designing the experiment, she agrees to help.
She points out that there are a few problems with our experiment. First of all, if we had gotten a positive result, we would be susceptible to the following criticism:
The experiment itself, all that poking with needles for days on end, stressed out the poor little rodents all by itself. In other words, it might not have been the glucocorticoids, but the procedure that suppressed their immune systems.
Because of this, it would have been better to take our 10 rats, randomly assign half of them to the test group and the other half to the control group, and then inject the control group with saline (or other placebo) while injecting the test group with glucocorticoids.
Okay, we concede, that would be better, but we give up a lot of degrees of freedom! We'd probably end up having to buy more rats, so that there would be 10 in each group, and that would be twice as many injections every day, and so forth. It's too expensive. Besides, we really think (based on our experience with testing rats), that the placebo is a waste of time in this case.
Fine, she says; there's a better way. Notice that some of the rats had high WBC and some had low, just as you'd expect. You'd expect that the steroids might bring the high WBC down to medium and the low or medium WBC to low. In other words, let's look at the effect on each rat as an individual. This is called a matched pairs design. The idea is to pair up the experimental subjects and look only at the differences between the pairmates. In our case, our pairs are the rat before the treatment and the rat after the treatment.
So, we go back to the lab, let the poor rats recover, and paint numbers on their backs. Then we run the experiment again and collect the data:
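| Rat | WBC Before | WBC After | Diff |
|---|---|---|---|
| 1 | 19500 | 17900 | -1600 |
| 2 | 18800 | 19000 | 200 |
| 3 | 21600 | 21500 | -100 |
| 4 | 22300 | 19200 | -3100 |
| 5 | 19800 | 21900 | 2100 |
| 6 | 22900 | 18800 | -4100 |
| 7 | 20700 | 20100 | -600 |
| 8 | 23200 | 19100 | -4100 |
| 9 | 19100 | 17500 | -1600 |
| 10 | 18500 | 18300 | -200 |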
Notice that most of the numbers are the same as before.
We should state our hypotheses up front, before collecting data, but it's clearer to state them now. Essentially, we have hypotheses about this column of differences between the pairmates.
H0: the mean difference is zero: μ=0
H1: the mean difference is less than zero: μ<0
That is, with paired data, we're back to a one-sample t test!
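The calculations are relatively straightforward, working from the Diff column:
n = 10
d-bar = -1310
s = 2001
stderr = s/sqrt(n) = 2001/sqrt(10) = 633
t = d-bar/stderr = -1310/633 = -2.07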
With a t value of -2.07, we feel much better, even though the degrees of freedom is so much less. Indeed, here is the relevant line from Table III:
| df | t0.050 | t0.025 | t0.010 | t0.005 |
|---|---|---|---|---|
| 9 | 1.833 | 2.262 | 2.821 | 3.250 |
So, the data are significant at the level of 0.05. To find the exact p value, we use Excel:
=tdist(tvalue,degreesOfFreedom,tails)
=tdist(abs(-2.07),9,1)
which evaluates to 0.03. Thus, there's only a 3 percent probability that something this large would happen by chance.
Therefore, we reject H0, and we conclude that the glucocorticoids do knock out the immune system as evidenced by WBC. We rush to publish!
Generally, a matched-pairs experiment is more sensitive (more powerful) than a two sample experiment design, but it's not always possible.
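The matched-pairs calculation is just a one-sample t test on the column of differences. A minimal Python sketch using the before and after values from the table above (the variable names are my own):

```python
import math
import statistics

before = [19500, 18800, 21600, 22300, 19800, 22900, 20700, 23200, 19100, 18500]
after = [17900, 19000, 21500, 19200, 21900, 18800, 20100, 19100, 17500, 18300]

# One difference per rat: after minus before, so a drop in WBC is negative.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
dbar = statistics.mean(diffs)
s = statistics.stdev(diffs)          # divides by n-1
stderr = s / math.sqrt(n)
t = (dbar - 0) / stderr              # H0: the mean difference is zero
df = n - 1

critical = 1.833                     # t0.050 for 9 df, from the table above
print(f"dbar={dbar:.0f} s={s:.0f} stderr={stderr:.0f} df={df} t={t:.2f}")
print(f"beyond the one-tailed 0.05 cutoff: {abs(t) > critical}")
```

Pairing removes the rat-to-rat variation from the comparison, which is why the standard error here is smaller than in the two-sample version even though there are fewer degrees of freedom.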
The following Excel document shows not only the calculations we did above, but how to use the built-in Excel TTEST function, that will do this very easily, as well as introducing heteroscedasticity, which is an utterly terrifying word.
The ttest(array1,array2,tails,type) function in Excel will do several kinds of t test:
Really, a very convenient function, though maybe too powerful for its own good.
For readability, I've glossed over some important criteria for using the t test: