More on the t Test

These notes cover in more detail several topics that we glossed over in class.

Degrees of Freedom

One of the fundamental purposes of statistics is to make use of information (data) that is intrinsically variable. One important aspect of that information processing is knowing how much information you actually have: the more information the better, but knowing how much you have is critical. Thus, we count the number of independent samples.

Now, consider estimating a particular value, namely the variance of the population. The obvious way to do that is to use the average squared distance from the population mean:

(1) $\sigma^2 \approx \frac{1}{N} \sum_{i} (x_i - \mu)^2$

However, we rarely know the population mean in real life. So, we estimate that, too, using our sample. Letting M be the sample mean:

(2) $\sigma^2 \approx \frac{1}{N} \sum_{i} (x_i - M)^2$

However, that gives a biased estimate of the variance: the result of equation (2) tends to be too small, because the data will be "too close" to M. That is, they are closer to M than they are to μ. (Indeed, you can prove that M minimizes the sum of squared distances from the data.)

One way to think about this is to suppose the sample has just two data values. Then equation (1) uses 2 data values and 1 known parameter, while equation (2) uses 2 data values alone. The result is that equation (2) is equivalent to using equation (1) with only one independent sample.

Why? Because if I tell you either of the two data values and M, you can tell me the other data value. The other data value has no new information in it.

Statisticians say that the second case only has one degree of freedom. They count each independent data item (each new piece of information) as a degree of freedom, and each time something is estimated from the sample, they deduct one degree of freedom.

Because the sample variance is estimated from the data values and M, it's being estimated with one fewer degree of freedom, so we divide by N-1 instead of by N. We therefore estimate the population variance as:

(3) $\sigma^2 \approx \frac{1}{N-1} \sum_{i} (x_i - M)^2$

Demonstration

As a demonstration of the three equations we have here, I implemented the following Extend model. In this model, we are just taking two data values, which is the smallest possible sample. Not only does that simplify the Extend model, but it maximizes the difference between equation (2) and equation (3).

degrees-of-freedom.mox
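For readers without Extend, the same experiment can be sketched in Python. This is only a rough stand-in, not the Extend model itself: the function name `compare_variance_estimators`, the normal population with μ = 0 and σ = 1, the number of trials, and the seed are all assumptions chosen for illustration.

```python
import random


def compare_variance_estimators(trials=100_000, mu=0.0, sigma=1.0, n=2, seed=1):
    """Average equations (1), (2) and (3) over many samples of size n.

    The population (normal, mean mu, standard deviation sigma) is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)
    total_eq1 = total_eq2 = total_eq3 = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(xs) / n                          # sample mean M
        ss_mu = sum((x - mu) ** 2 for x in xs)   # squared distances from mu
        ss_m = sum((x - m) ** 2 for x in xs)     # squared distances from M
        total_eq1 += ss_mu / n                   # equation (1)
        total_eq2 += ss_m / n                    # equation (2)
        total_eq3 += ss_m / (n - 1)              # equation (3)
    return total_eq1 / trials, total_eq2 / trials, total_eq3 / trials


eq1, eq2, eq3 = compare_variance_estimators()
print(f"eq (1): {eq1:.3f}   eq (2): {eq2:.3f}   eq (3): {eq3:.3f}")
```

With samples of size 2 and a true variance of 1, the averages for equations (1) and (3) should land near 1, while equation (2) should land near one half.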

Does It Matter?

First of all, dividing by N-1 instead of N only matters if we are trying to estimate the variance of the population. If we just want a measure of the variance of the sample, either one is fine, and there are functions in Excel to calculate it either way. Since the N-1 form is the default, you should explain that you're dividing by N should you choose to do so. Personally, I suggest you stick with what's standard, because it doesn't really matter.

Why doesn't it matter? In the extreme case of having only 2 samples, it clearly matters a lot whether you divide by 1 or 2, but you would never do any scientific or other important work based on just 2 samples. Cost may limit the amount of data you can get, but if you couldn't afford more than two samples, you would probably abandon the effort.

Furthermore, suppose it does matter: suppose that dividing by N just barely gets you a significant result (p<0.05) but dividing by N-1 doesn't quite (p≥0.05). That means, in a sense, that your result depends on one data item (one degree of freedom). The next researcher to try to replicate your result could well end up on the other side and fail to confirm your result.

In general, while a field typically chooses a conventional standard (such as p<0.05), you'd feel better if there were some distance between your result and the critical value.

The One Sample t Test

The one sample t test rarely arises in real life, but it's a nice warmup for the later, more typical cases. Imagine that we have a hypothesis, perhaps from earlier work in the field, that some measure has a particular value. That is, H0 will be a statement about μ for some population. You'll collect some data, calculate a statistic (a t value), and calculate the probability of that statistic given H0. You'll reject H0 if the statistic is sufficiently unlikely.

Example

For concreteness, suppose that your hypotheses are about the white blood cell count (WBC, which is a rough measure of infection or the health of the immune system) for healthy rats. In particular, your hypotheses are

H0: WBC for a healthy rat is 20,000 per microliter

H1: WBC for a healthy rat is not 20,000 per microliter

Thus, you'll have a two-tailed test.

You have 10 lab rats, and you draw a small sample of blood from each, run a WBC test, and make the following measurements:

19,500
18,800
21,600
22,300
19,800
22,900
20,700
23,200
19,100
18,500

Calculations

  1. We first compute the summary statistics:

    n = 10

    x-bar = 20,640

    s = 1754

  2. Compute the standard error, which is the standard deviation of the sampling distribution:

    stderr = s/sqrt(n) = 1754/sqrt(10) = 555

  3. We compute the value of t as a standardized value within the sampling distribution.

    (4) t = (x-bar - μ)/stderr = (20,640-20,000)/555 = 640/555 = 1.15

  4. We can eyeball this and know that H0 will not be rejected, but let's go on and compute the significance (the p value). From table III inside the back cover of your Freund book, in the row for n-1 = 9 degrees of freedom, we see the following critical values:

    df  t0.050  t0.025  t0.010  t0.005
    9   1.833   2.262   2.821   3.250

    Since t is less than 1.833, we know that a value this large or larger would occur by chance more than 5 percent of the time if H0 were true. And that's just the upper tail. Since we decided this was a two-tailed test, we double this to know that the probability that we would see a difference as great as this is more than 10 percent.

    If we want the exact p-value (there's no reason we should, but just because we can), we can use the following Excel formula:

    =tdist(tvalue,degreesOfFreedom,tails)

    =tdist(1.15,9,2)

    which evaluates to about 0.28. Thus, if H0 were true, there's roughly a 28 percent probability that a difference this large would happen by chance.

  5. Therefore, we fail to reject H0.
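The one-sample calculation can be checked directly from the ten measurements. The following is a minimal sketch using only Python's standard library; the variable names are illustrative, and the critical value 2.262 is the tabulated t0.025 for n - 1 = 9 degrees of freedom, which is the cutoff for a two-tailed test at the 5 percent level.

```python
import math
import statistics

# WBC counts (per microliter) for the 10 rats
wbc = [19500, 18800, 21600, 22300, 19800, 22900, 20700, 23200, 19100, 18500]
mu0 = 20000                      # value of the mean claimed by H0

n = len(wbc)
xbar = statistics.mean(wbc)      # sample mean
s = statistics.stdev(wbc)        # sample standard deviation (divides by n-1)
stderr = s / math.sqrt(n)        # standard error of the mean
t = (xbar - mu0) / stderr        # equation (4)
df = n - 1                       # one parameter (the mean) was estimated

critical = 2.262                 # t0.025 for 9 df, copied from the table
reject = abs(t) > critical       # two-tailed test at the 0.05 level

print(f"n={n}  mean={xbar}  s={s:.0f}  stderr={stderr:.0f}  t={t:.2f}  df={df}")
print("reject H0" if reject else "fail to reject H0")
```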

Two Sample Tests

Suppose we have a hypothesis that glucocorticoids (a kind of steroid that is released when an organism is under stress) will reduce immune function, and therefore WBC count. We decide to treat all the rats to a round of glucocorticoids and measure their WBC count after the treatment.

Our hypotheses end up being about two means: the mean of the treatment group and the mean of the control group, or, equivalently, the mean after treatment and the mean before. Let's denote these:

$\mu_{treatment}$ or $\mu_1$, and

$\mu_{control}$ or $\mu_0$

Thus, our hypotheses are:

H0: the treatment has no effect, so $\mu_1 = \mu_0$

H1: the treatment has an effect, so $\mu_1 < \mu_0$

Equivalently, our hypotheses are:

H0: the treatment has no effect, so $\mu_1 - \mu_0 = 0$

H1: the treatment has an effect, so $\mu_1 - \mu_0 < 0$

Note that this is a one-tailed test. Note also that our test is about the differences of means. This is crucial.

Data

Here is the data we get after treating the rats. Unfortunately, one of the rats escaped, so we only have 9.

17,900
19,000
21,500
19,200
21,900
18,800
20,100
19,100
17,500

Summary Statistics

We first compute the summary statistics. We've repeated the summary stats for the original values from the earlier measurement.

Control Group      Treatment Group
n0 = 10            n1 = 9
x-bar0 = 20,640    x-bar1 = 19,444
s0 = 1754          s1 = 1485

Standard Error

Now we have to compute the standard error, which is the standard deviation of the sampling distribution. Here, we have to think a bit. What is the sampling distribution? It is a distribution of differences of means. Can we determine its standard deviation (that is, the standard error of our test statistic)? Well, the variance of a difference of two independent quantities is the sum of their variances, so it turns out that we can. Mathematical statisticians have worked through the math, and the formula for this standard error is:

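$\mathrm{dof} = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$

$s_{pooled}^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{\mathrm{dof}}$

$\mathrm{stderr}^2 = s_{pooled}^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)$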

The pooled variance combines the variance of one group around its mean with the variance of the other group around its mean, weighting each by its degrees of freedom. In effect, it is the average squared distance of each data value from the mean of its own group, where the average divides by the degrees of freedom rather than by the number of data values. We do this because one of the assumptions of the two-sample t test is that the two populations have the same variance, though one may be shifted relative to the other. Thus, we want to pool all our data to best estimate that common variance:

[Figure: the pooled variance estimates the common variance of two distributions]

If you let $n_1 = n_2$ and algebraically simplify the expression, you'll see that this turns out to be very similar to our usual standard error calculation.

For our example, the calculations are:

$\mathrm{dof} = (10-1) + (9-1) = 10 + 9 - 2 = 17$
$s_{pooled}^2 = [9(1754^2) + 8(1485^2)]/17$
$s_{pooled} = 1633$

$\mathrm{stderr}^2 = 1633^2 (1/10 + 1/9) = 1633^2 (0.2111)$
$\mathrm{stderr} = 1633 \sqrt{0.2111} = 750$

Notice that the pooled variance lies between the variances of the two samples: it's really a kind of weighted average. Notice also that the standard error is smaller than the pooled standard deviation, because a mean varies less than the individual values of the underlying population.
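The pooled standard error formulas translate almost line for line into code. Here is a minimal sketch that works from the summary statistics alone; the function name `pooled_standard_error` and its argument order are assumptions made for illustration, not part of any library.

```python
import math


def pooled_standard_error(n1, s1, n2, s2):
    """Return (dof, pooled standard deviation, standard error of the
    difference of means) for two groups assumed to share one variance."""
    dof = (n1 - 1) + (n2 - 1)
    # Weighted average of the two sample variances
    var_pooled = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / dof
    # Variance of a difference of independent means: sum of their variances
    stderr = math.sqrt(var_pooled * (1 / n1 + 1 / n2))
    return dof, math.sqrt(var_pooled), stderr


# Control group: n = 10, s = 1754.  Treatment group: n = 9, s = 1485.
dof, s_pooled, stderr = pooled_standard_error(10, 1754, 9, 1485)
print(f"dof={dof}  s_pooled={s_pooled:.0f}  stderr={stderr:.0f}")
```

The pooled standard deviation returned here falls between the two sample standard deviations, as a weighted average should.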

Degrees of Freedom

You saw that the number of degrees of freedom is two fewer than the number of data values. That's because we've estimated two parameters from our data: the mean of the test group and the mean of the control group. Thus, we have to reduce our degrees of freedom by two.

Calculations

Now that we've grappled with the standard error, let's compute our t statistic and p value. The t statistic is pretty much what you'd think, taking our previous formula, namely equation (4), and substituting the differences of the sample means for the sample mean, the difference of the population means for the population mean, and the new standard error calculation:

(5) $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\mathrm{stderr}}$

(6) $t = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{stderr}}$

The more general formula is (5), where we allow for null hypotheses like

The mean of the control population is 20,000 and the mean of the treatment population is 19,000. Therefore, we expect that the difference of the sample means will equal 1000.

In most cases, the null hypothesis is that the means, whatever they are, are equal, so that the null hypothesis is that the difference of the population means is zero. In that case, we get equation (6), which is pretty straightforward.

Using formula (6) on our data, we get:

t = (20,640-19,444)/750 = 1196/750 = 1.59

Hmm. Not a whopping big value of t. Consulting our table for 17 degrees of freedom, we see:

df t0.050 t0.025 t0.010 t0.005
17 1.740 2.110 2.567 2.898

So, it doesn't quite make the 0.05 significance level for this one-tailed test. To find the exact p value, we can use Excel:

=tdist(tvalue,degreesOfFreedom,tails)

=tdist(1.59,17,1)

which evaluates to 0.06. Thus, if H0 were true, there's a 6 percent probability that a difference this large would happen by chance.

Therefore, we fail to reject H0: these data don't let us conclude that the glucocorticoids knock out the immune system as evidenced by WBC.
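The whole two-sample test can be reproduced from the raw data. The sketch below uses only Python's standard library. Because the standard library has no t distribution, the helper `t_upper_tail` is a hypothetical stand-in for Excel's one-tailed TDIST: it integrates the Student t density numerically with Simpson's rule and assumes t is not negative.

```python
import math
import statistics


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 by Simpson's rule (steps must be even)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # Half the probability lies above 0; subtract the area between 0 and t
    return 0.5 - total * h / 3


control = [19500, 18800, 21600, 22300, 19800, 22900, 20700, 23200, 19100, 18500]
treatment = [17900, 19000, 21500, 19200, 21900, 18800, 20100, 19100, 17500]

n0, n1 = len(control), len(treatment)
df = n0 + n1 - 2
var_pooled = ((n0 - 1) * statistics.variance(control)
              + (n1 - 1) * statistics.variance(treatment)) / df
stderr = math.sqrt(var_pooled * (1 / n0 + 1 / n1))

# Equation (6), taken as control minus treatment so that t comes out positive
t = (statistics.mean(control) - statistics.mean(treatment)) / stderr
p_one_tailed = t_upper_tail(t, df)

print(f"df={df}  stderr={stderr:.0f}  t={t:.2f}  one-tailed p={p_one_tailed:.3f}")
```

The p value should fall between 0.05 and 0.10, consistent with t lying below the tabulated 1.740 for 17 degrees of freedom.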

    Paired Sample Tests

    Because our data were so close, we consult a statistician friend of ours. After explaining that we should have consulted her first, when we were designing the experiment, she agrees to help.

    She points out that there are a few problems with our experiment. First of all, if we had gotten a positive result, we would be susceptible to the following criticism:

    The experiment itself, all that poking with needles for days on end, stressed out the poor little rodents all by itself. In other words, it might not have been the glucocorticoids, but the procedure that suppressed their immune systems.

    Because of this, it would have been better to take our 10 rats, randomly assign half of them to the test group and the other half to the control group, and then inject the control group with saline (or another placebo) while injecting the test group with glucocorticoids.

    Okay, we concede, that would be better, but we give up a lot of degrees of freedom! We'd probably end up having to buy more rats, so that there would be 10 in each group, and that would be twice as many injections every day, and so forth. It's too expensive. Besides, we really think (based on our experience with testing rats), that the placebo is a waste of time in this case.

    Fine, she says; there's a better way. Notice that some of the rats had high WBC and some had low, just as you'd expect. You'd expect that the steroids might bring the high WBC down to medium and the low or medium WBC to low. In other words, let's look at the effect on each rat as an individual. This is called a matched pairs design. The idea is to pair up the experimental subjects and look only at the differences between the pairmates. In our case, our pairs are the rat before the treatment and the rat after the treatment.

    Data

    So, we go back to the lab, let the poor rats recover, and paint numbers on their backs. Then we rerun the experiment and collect the data:

    Rat  WBC Before  WBC After   Diff
      1       19500      17900  -1600
      2       18800      19000    200
      3       21600      21500   -100
      4       22300      19200  -3100
      5       19800      21900   2100
      6       22900      18800  -4100
      7       20700      20100   -600
      8       23200      19100  -4100
      9       19100      17500  -1600
     10       18500      18300   -200

    Notice that most of the numbers are the same as before.

    Hypotheses

    We should state our hypotheses up front, before collecting data, but it's clearer to state them now. Essentially, we have hypotheses about this column of differences between the pairmates.

    H0: the mean difference is zero: μ=0
    H1: the mean difference is less than zero: μ<0

    That is, with paired data, we're back to a one-sample t test!

    Calculations

    The calculations are relatively straightforward:

    1. n = 10
    2. mean difference = -1310
    3. standard deviation of the column of differences: 2001
    4. standard error = 2001/sqrt(10) = 633
    5. t = -1310/633 = -2.07
    6. degrees of freedom = 9

    With a t value of -2.07, we feel much better, even though the degrees of freedom is so much less. Indeed, here is the relevant line from Table III:

    df t0.050 t0.025 t0.010 t0.005
    9 1.833 2.262 2.821 3.250

    So, the data are significant at the level of 0.05. To find the exact p value, we use Excel:

    =tdist(tvalue,degreesOfFreedom,tails)

    =tdist(abs(-2.07),9,1)

    which evaluates to 0.03. Thus, there's only a 3 percent probability that something this large would happen by chance.

    Therefore, we reject H0, and we conclude that the glucocorticoids do knock out the immune system as evidenced by WBC. We rush to publish!
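The paired calculation reduces to a one-sample test on the column of differences, which the following minimal sketch makes explicit. It uses only Python's standard library; the variable names are illustrative, and the critical value 1.833 is the tabulated t0.050 for 9 degrees of freedom.

```python
import math
import statistics

before = [19500, 18800, 21600, 22300, 19800, 22900, 20700, 23200, 19100, 18500]
after = [17900, 19000, 21500, 19200, 21900, 18800, 20100, 19100, 17500, 18300]

# One difference per rat: after minus before
diffs = [a - b for b, a in zip(before, after)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)    # divides by n-1
stderr = sd_diff / math.sqrt(n)
t = mean_diff / stderr               # H0 says the mean difference is 0
df = n - 1

critical = 1.833                     # t0.050 for 9 df, copied from the table
significant = t < -critical          # one-tailed test: H1 says the mean is below 0

print(f"mean diff={mean_diff}  sd={sd_diff:.0f}  stderr={stderr:.0f}  t={t:.2f}  df={df}")
print("reject H0" if significant else "fail to reject H0")
```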

    Generally, a matched-pairs experiment is more sensitive (more powerful) than a two sample experiment design, but it's not always possible.

    TTEST in Excel

    The following Excel document shows not only the calculations we did above, but also how to use the built-in Excel TTEST function, which does all of this very easily. It also introduces heteroscedasticity, which is an utterly terrifying word.

    ttest.xls

    The ttest(array1,array2,tails,type) function in Excel will do several kinds of t test:

    Really, a very convenient function, though maybe too powerful for its own good.

    Caveats and Conditions

    For readability, I've glossed over some important criteria for using the t test:

    This work is licensed under a Creative Commons License.