Correlation and Regression

Many lay people understand the word "correlation." We hear it in the popular press all the time:

Today, we'll get a more detailed understanding of what is meant by correlation and the closely related statistical technique of regression. (Specifically, linear regression; there are many other kinds of regression that we won't be looking at.)

Curve Fitting

The idea underlying this terminology is using some data to guess a "curve" that "fits" the data.

[Figures: some data points and a line that fits; some data points and a curve that fits]

Once we have the curve, we can use the curve to predict values of y for other values of x. The equation of the curve can also be very helpful for understanding phenomena: if water flow is increased by x percent, how much will phosphate decrease?

You'll notice that, in general, the curve doesn't go through the data points. It may not even go through any of them. Instead, we just want something that is "closest" to them, in a way we'll define soon.

Before we fit the curve, we first have to pick the kind of curve we want to fit. Today, we'll only talk about fitting a line, that is, a function of the form:

y = a+bx

So, what we're really doing in practice is taking a bunch of (x,y) points and finding a and b.
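To make the role of a and b concrete, here is a minimal Python sketch (Python is not part of the original notes, which use Excel). The function name `predict` and the values of a and b are made up for illustration; they are not fitted to any real data.

```python
def predict(a, b, x):
    """Return the y value that the line y = a + b*x predicts for x."""
    return a + b * x


# Hypothetical intercept and slope, chosen only for illustration.
a = 2.0
b = 0.5

# Once a and b are known, predicting y for any x is one line of arithmetic.
for x in [0.0, 4.0, 10.0]:
    print(x, predict(a, b, x))
```

The hard part, covered next, is choosing a and b from the data.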

The Method of Least Squares

We're trying to find the curve that best fits the data, but what does "best" mean? There are lots of ways that you can define it, but the one that is most commonly used is:

The best curve minimizes the sum of the squares of the vertical distances from the points to the curve. The vertical distances are usually called "errors" (because the curve differs from the data by that amount), and so this method minimizes the "squared error."

[Figure: some data points, a line, and the errors]

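Here's a simplification that helps me: Imagine there's no y value (or no x value), so you just have a set of numbers. The single number that is "closest" to all of them, in the sense of minimizing the sum of squared distances, is their mean.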

So, the method of least squares is the 2D equivalent of the mean! Amazing!
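You can check the one-dimensional version numerically. The sketch below, in Python with made-up numbers, compares the sum of squared distances at the mean against a few nearby candidate values. It only checks a handful of candidates, so it is a sanity check rather than a proof.

```python
def sum_squared_distance(numbers, center):
    """Sum of squared distances from each number to a candidate center."""
    return sum((value - center) ** 2 for value in numbers)


# Made-up data, for illustration only.
numbers = [2.0, 4.0, 9.0]
mean = sum(numbers) / len(numbers)

# Try the mean and a few values on either side of it.
candidates = [mean - 1.0, mean - 0.1, mean, mean + 0.1, mean + 1.0]
best = min(candidates, key=lambda c: sum_squared_distance(numbers, c))

print("mean:", mean)
print("best candidate:", best)
```

Among these candidates, the mean gives the smallest sum; moving away from it in either direction makes the sum larger.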

Mathematically, the error is the difference between the data point (x,y) and what the curve predicts, y=f(x). Usually, the predicted y is called y-hat, and is notated with that decoration, but I can't do that in HTML, so I'll use capital letters for the data points: (X,Y)

Σ (y-Yi)^2
Σ (f(Xi)-Yi)^2
Σ ([a+bXi]-Yi)^2
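The last form of the sum translates directly into code. This is a minimal Python sketch; the function name `squared_error` and the data points are assumptions made up for illustration.

```python
def squared_error(a, b, xs, ys):
    """Sum over all data points of ([a + b*Xi] - Yi) squared."""
    return sum(((a + b * x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up data points (X, Y), for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

# Two candidate lines: the one with the smaller squared error fits better.
print(squared_error(0.0, 2.0, xs, ys))  # line y = 0 + 2x
print(squared_error(1.0, 1.0, xs, ys))  # line y = 1 + 1x
```

Least squares means searching over all possible a and b for the pair that makes this number as small as possible.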

Finding the values of a and b that minimize that sum isn't what this course is about. (You can do it with some relatively straightforward calculus; try it on the mean, first. Incidentally, ease of minimization is why the mathematical statisticians used the squared error: it's much easier to take the derivative of a squared term than of an absolute value.) Instead, we'll let Excel do this. If you have to do it by hand, though, the equations are:

Let SS stand for "sum of squares" and compute:
SS(xy) = Σ (Xi-Xbar)(Yi-Ybar)
SS(xx) = Σ (Xi-Xbar)^2

Then, from these, we can compute:
b = SS(xy)/SS(xx)
a = Ybar - b*Xbar
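If you'd rather not do this by hand and don't have Excel, the same arithmetic can be sketched in a few lines of Python. The function name `fit_line` and the data are made up for illustration; the formulas are the ones given above.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # SS(xy) and SS(xx), as defined in the notes.
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)

    b = ss_xy / ss_xx      # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


# Made-up data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

Note that the sketch divides by SS(xx), so it fails if every X value is the same; in that case no single best-fitting line of this form exists.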

Let's take a moment to think about what these things mean:

Regression in Excel

Excel can do regressions. Download the following spreadsheet so we can play with it:

regression.xls

Let's talk about some of the things on the regression sheet.

Correlation Coefficient

Sometimes, we don't need all the rigamarole of a regression, but we do want the correlation coefficient (the "r" value). Why?

The Excel regression gives us the correlation coefficient as the very first number. If we don't have Excel handy, we can compute r as follows:

r = SS(xy)/sqrt(SS(xx)*SS(yy))
where SS(yy) = Σ (Yi-Ybar)^2, defined just like SS(xx) but using the Y values.

In other words, the covariance divided by the square root of the product of two variances.
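Here is the same calculation as a minimal Python sketch. The function name `correlation` and the data are made up for illustration, and SS(yy) is taken to be the sum of squared deviations of the Y values from their mean.

```python
import math


def correlation(xs, ys):
    """Return the correlation coefficient r = SS(xy)/sqrt(SS(xx)*SS(yy))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)

    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Made-up data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

print("r =", correlation(xs, ys))
```

Points that lie exactly on an upward-sloping line give an r of 1, and points exactly on a downward-sloping line give -1. The sketch divides by zero if all the X values or all the Y values are identical.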

This calculation appears on the Excel spreadsheet as well.

[Figure: line fit with SS regression]

The following web site has a nice "quiz" in which you can guess correlation coefficients. Can you get 100%?

Fundamental Assumptions of Regression

  1. Linearity: the real function is linear.
  2. Equal Variance (Homoscedasticity): the variation around the line is the same everywhere.
  3. Normality of Errors: the distribution of the errors is Gaussian.
  4. Good Sample: we always have to assume that the data are not biased in some way.
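One informal way to start checking the first three assumptions is to look at the errors themselves (often called residuals). The Python sketch below is only an illustration under stated assumptions: the data are made up, and the values of a and b are what the hand formulas give for that made-up data. It does not perform any formal statistical test.

```python
def residuals(a, b, xs, ys):
    """Return the error Yi - (a + b*Xi) for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

# Least-squares intercept and slope for the made-up data above.
a = 0.0
b = 1.9

errors = residuals(a, b, xs, ys)
for x, e in zip(xs, errors):
    print(x, round(e, 3))
```

If the errors show a clear pattern as x increases (for example, steadily growing in size, or all positive at the ends and negative in the middle), that is a hint that the equal-variance or linearity assumption may not hold. With a least-squares line the errors always sum to zero, so their sum tells you nothing about whether the assumptions hold.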

Problems with Regression

This work is licensed under a Creative Commons License.