Correlation and Regression
Many lay people understand the word "correlation." We hear it in the
popular press all the time:
- There's a correlation between smoking and lung cancer
- There's a negative correlation between eduation and abortion
- There's a correlation between SAT scores and grades
- There's a negative correlation between SAT scores and being admitted
Early Decision
- There's a positive correlation between drinking diet soda and being
overweight
- There's a positive correlation between drinking sugary soft drinks and being
obese.
- There's a positive correlation between stretching and injuries in running.
- zillions of others
Today, we'll get a more detailed understanding of what is meant by
correlation and the closely related statistical technique of
regression. (Specifically, linear regression; there are
many other kinds of regression that we won't be looking at.)
Curve Fitting
The idea underlying this terminology is using some data to guess a
"curve" that "fits" the data.
Once we have the curve, we can use the curve to predict values
of y for other values of x. The equation of the curve can also be very
helpful for understanding phenomena: if water flow is increased by x
percent, how much will phosphate decrease?
You'll notice that, in general, the curve doesn't go through the data
points. In may not even go through any of them. Instead, we just
want something that is "closest" to them, in a way we'll define soon.
Before we fit the curve, we have to first pick the kind of curve we
want to fit. Today, we'll only talk about fitting a line: a
function of the form:
y = a+bx
So, what we're really doing in practice is taking a bunch of (x,y)
points and finding a and b.
The Method of Least Squares
We're trying to find the curve that best fits the data, but what does
"best" mean? There are lots of ways that you can define it, but the one
that is most commonly used is:
The best curve minimizes the sum of the squares of the vertical distance
from a point to the curve. The vertical distances are usually called
"error" (because the curve differs from the data by that amount) and so
this method minimizes the "squared error."
Here's a simplification that helps me: Imagine there's no y value (or
no x value), so you just have a set of numbers.
- The mean minimizes the squared error.
- The median minimizes the absolute error.
So, the method of least squares is the 2D equivalent of the mean!
Amazing!
Mathematically, the error is the difference between the data point
(x,y) and what the curve predicts, y=f(x). Usually, the predicted y is
called y-hat, and is notated with that decoration, but I can't do that
in HTML, so I'll use capital letters for the data points: (X,Y)
Σ (y-Yi)2
Σ (f(Xi)-Yi)2
Σ ([a+bXi]-Yi)2
Finding the values of a and b that minimize that sum
isn't what this course is about. (You can do it with some relatively
straightforward calculus; try it on the mean, first. Incidentally,
minimizing this is why the mathematical statisticians used the squared
error, because it's much easier to take the deriviative of a squared term
than an absolute value.) Instead, we'll let Excel do this. If you have
to do it by hand, though, the equations are:
Let, SS stand for "sum of squares" and compute:
SS(xy)= Σ (Xi-Xbar)(Yi-Ybar)
SS(xx) = Σ (Xi-Xbar)2
Then, from these, we can compute:
b = SS(xy)/SS(xx)
a = Ybar - b*Xbar
Let's take a moment to think about what these things mean:
- SS(xx) is essentially the variance. The only thing missing is the
division by n to make it an average. Instead, it's the total
variation in the Xs.
- SS(xy) is the 2D equivalent of SS(xx). (It is similarly related to
the concept called covariance, literally "vary together," which you
can compute in Excel using the covar(range,range) function.)
SS(xy) is the total covariation of the points
(Xi,Yi) around the middle point (Xbar,Ybar).
- You get a positive term when the point is above and to the right
of the middle.
- You get a positive term when the point is below and to the left of the
middle.
- You get a negative term when the point is below and to the right of
the middle.
- You get a negative term when the point is above and to the left of
the middle.
So, this means:
- If the points are scattered randomly, SS(xy) will have lots of
positive and negative terms and will mostly cancel out, so the total will
be near zero. This indicates zero correlation
- If the points are lower left and upper right, the result will be a
fairly large positive number, indicating positive correlation.
- If the points are upper left and lower right, the result will be a
fairly large negative number, indicating negative correlation.
- b is the slope of the correlation line. For most
calculations, the slope is the important quantity. It indicates how much
the average value of y changes for a given change in the value of
x.
- a is the intercept. It indicates where the line
crosses the y axis.
Regression in Excel
Excel can do regressions. Download the following spreadsheet so we can
play with it:
regression.xls
- Some explanatory stuff:
- The mean of the X and Y are computed
- There's a column of the XY deviations. Check that it makes sense
- There's a column of just X deviations. Check that it makes sense
- There's a column of just Y deviations. Check that it makes sense
- There're the sums of the XY, X and Y deviations.
- The slope and intercept of the regression line are computed using the
formulas above.
- There are two points computed for the regression line. Note that I
used a scatter plot, but then told Excel to use a line for this data:
right click (or control-click on a Mac) on the data, and choose "Format
Data Series."
- There's a chart of the data, the mean point, and the regression
points. Note that the regression line always goes through the mean point.
- If you go to Tools / Data Analysis, and choose "Regression," Excel
will do the regression for you. (If you want less output, you can choose
"Correlation," but we'll look at regression.) Just select the X range and
Y range. I like to click on the "Line fit plots." My results for this
are on the "regression" sheet, but feel free to make your own.
Let's talk about some of the things on the regression sheet.
- The first thing to look at is the "Significance F" in the ANOVA
table. That's the "p-value" for the whole model. If it's not small, you
can probably ignore everything.
- The second things to look at are near the top: "R Square" and
"Adjusted R Square." These are interpretable as the percentage of
variation that the model accounts for. More is better. If your model
doesn't explain much, it's no good.
The adjustment is to discount "R Square" by the number of predictor
variables. If you add more predictor variables, the "R Square"
always goes up, but the "Adjusted R Square" may go down if the
predictor isn't very good.
- The third thing to look at is the "Standard Error." This is the
standard deviation of our residuals, which helps to tell how
predictive our model can be. This quantity is also called the
"Root Mean Squared Error" or "RMS Error." The mathematical model we are
creating is:
Y = a X + b + ε
where all the epsilons are from a single Gaussian distribution with
zero mean. The standard deviation of the Gaussian is given by the
"Standard Error" result.
- The third table gives the coefficients for the regression line. Since
these amount to point estimates, the other things on that row give:
- The standard error: the variation around that point estimate
- The t statistic: essentially, the regression includes a t-test as to
whether the coefficient is significantly different from zero. If the
coefficient of X is not different from zero, there is no correlation
between Y and that X. (In general, there may be multiple Xs: consider a
regression of SAT scores on high-school grades, family income, and hours
studied. There are three different X's used to predict each Y.)
- The p-value that goes with that t-statistic. Here, we can say that X
matters (p way less than 0.01), but that the y-intercept probably doesn't.
- The rest of that line is the 95 percent confidence interval around
each point estimate. Notice how the interval around the intercept also
includes zero.
- The fourth table gives residuals: the errors that the
regression minimizes. Statisticians say that this is the stuff that is
not explained by the regression line. They are the epsilons of our
mathematical model: if they have some sort of pattern, they aren't random
samples from a Gaussian.
- In fact, statisticians say that the total "Sum of Squares" (squared
error) breaks down into the explained part and the
unexplained part. Look at table 2. Notice that the SS column
has 582.1 for the SS explained by the regression, 78.5 for the part
attributed to the residuals, for a total of 660.6. (Notice that the 660.6
is the same as the "sum Y-dev" that we computed on the first worksheet.)
That means that 582.1/660.6 or 88.1 percent of the variance is explained
by the regression model. (This is very good.)
- See picture at end, and we'll discuss this idea of breaking down the
sum of squares.
- In the first table, we see "R square" which is 0.881, which is the
percent of variation explained by the model.
- Again, in the first table, the first entry is "Multiple R," which is
the correlation coefficient. In this case, it is 0.9387. Often, a
scientist will say "Y is correlated with X with a correlation coefficient
of R." This is the R for our data. As it happens, R2 is
0.881, the percent of variation in Y explained by X.
- What is the equation?
Correlation Coefficient
Sometimes, we don't need all the rigamarole of a regression, but we do
want the correlation coefficient (the "r" value). Why?
- r is always between -1 and 1
- r = 0 means no correlation
- r = 1 is perfect positive correlation
- r = -1 is perfect negative positive correlation
- r2 is the percent variation explained by the linear correlation
The Excel regression gives us the correlation coefficient as the very
first number. If we don't have Excel handy, we can compute r as follows:
r = SS(xy)/sqrt(SS(xx)*SS(yy))
In other words, the covariance divided by the square root of the
product of two variances.
This calculation appears on the Excel spreadsheet as well.
The following web site has a nice "quiz" in which you can guess
correlation coefficients. Can you get 100%?
Fundamental Assumptions of Regression
- Linearity: the real function is linear
- Equal Variance: (Homoscedasticity) That the variation around the
line is always the same
- Normality of Errors: That the distribution of errors is Gaussian
- Good Sample: We always have to assume that the data are not biased in
some way.
Problems with Regression
-
This work is licensed under a Creative Commons
License
-