CS 234: Linear Regression, Multiple Linear Regression

Eni Mustafaraj
November 28, 2017

This notebook shows how to perform linear and multiple linear regression with two different Python packages: statsmodels and scikit-learn. The two packages have different properties:

  • statsmodels has a better coverage of statistics methods and knowledge
  • scikit-learn has a better coverage of machine learning algorithms

Because linear regression is used more by statisticians than by computer scientists, statsmodels is more suitable for performing and understanding regression. However, if you are interested in doing cross-validation, scikit-learn has better built-in support for it.

Table of Contents

  1. CS 234 data: Height vs. Weight - model with linear regression
  2. The Advertising dataset
  3. Linear regression: sales dependent on TV
  4. Exercise: Regress sales on radio and sales on newspaper
  5. Multiple Linear Regression
  6. Regression using scikit-learn

Credit: Some of this material is based on the excellent book "An Introduction to Statistical Learning with Applications in R". While the book is aimed at graduate-level courses, those of you who have taken or are taking MATH/STAT 220 should find it accessible.

1. CS 234 data: Height vs. Weight - model with linear regression

Early in the semester we collected data from every student, including, among other things, height and weight. The first thing to do before building a linear regression model is to make sure that the two variables show some degree of correlation when we look at the scatterplots.

Below I'm showing a pairplot of the data, which contains histograms and scatterplots. I'm using the Python library seaborn to do this. If you have never installed it, do so below:

In [ ]:
!pip install seaborn
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

We'll create a dataframe because it's easier to work with when doing plots and linear regression:

In [2]:
heights = [155, 166, 162, 164, 164, 163, 152, 165, 170, 167, 162, 180, 
           160, 162, 157, 170, 168, 164, 171, 165, 168, 168, 155, 168, 156, 170]

weights = [50.7, 52.6, 46.0, 79.0, 50.4, 56.7, 49.9, 63.5, 52.2, 58.0, 
           59.0, 75.0, 56.7, 53.5, 50.0, 68.0, 56.0, 62.5, 61.2, 57.0, 
           54.4, 61.2, 52.1, 84.8, 55.3, 61.0]

df = pd.DataFrame({'height': heights, 'weight': weights})

Pairplot with seaborn:

In [3]:
sns.pairplot(df, size=4, plot_kws={"s": 50})
Out[3]:
<seaborn.axisgrid.PairGrid at 0x104586d50>

As we had established in class a while ago, a few students made some calculation errors in converting from pounds to kg, and we have a few outlier points in the data.

Nevertheless, one can see the tendency of height and weight to grow together.

We can also calculate their correlation:

In [4]:
df.corr()
Out[4]:
height weight
height 1.000000 0.531631
weight 0.531631 1.000000

We get r = 0.53, which indicates a moderate positive correlation. If we are also interested in the p-value for the coefficient, we can use the library scipy instead:

In [5]:
import scipy
scipy.stats.pearsonr(df.height, df.weight)
Out[5]:
(0.53163135576628351, 0.0051893083659324503)

The second value in the tuple is the p-value; as we see, it is quite small (0.005), so the correlation is statistically significant.

Linear Regression with statsmodels

We'll use the ols function from statsmodels to learn the linear regression model. This function uses an R-like formula syntax to express the relationship between the dependent and independent variables:

In [6]:
import statsmodels.formula.api as smf
lm = smf.ols(formula='weight ~ height', data=df).fit()
lm.params
Out[6]:
Intercept   -73.465688
height        0.804496
dtype: float64

Based on these two parameters, we can write the linear equation for the relationship between weight and height:

$weight = -73.47 + 0.80 * height$
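
As a quick sanity check, we can ask the fitted model for a prediction at a new height, say 170 cm (a hypothetical value, used only for illustration):

In [ ]:
# predict the weight for a hypothetical height of 170 cm;
# with the formula API we can pass new data as a dataframe with the same column name
new_data = pd.DataFrame({'height': [170]})
lm.predict(new_data)  # roughly -73.47 + 0.80*170, i.e. about 63 kg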

We can get all details about the model by calling the summary function:

In [7]:
lm.summary()
Out[7]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.283
Model:                            OLS   Adj. R-squared:                  0.253
Method:                 Least Squares   F-statistic:                     9.456
Date:                Tue, 28 Nov 2017   Prob (F-statistic):            0.00519
Time:                        09:05:00   Log-Likelihood:                -89.952
No. Observations:                  26   AIC:                             183.9
Df Residuals:                      24   BIC:                             186.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -73.4657     43.016     -1.708      0.101    -162.246      15.314
height         0.8045      0.262      3.075      0.005       0.265       1.344
==============================================================================
Omnibus:                       15.044   Durbin-Watson:                   2.328
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               15.914
Skew:                           1.437   Prob(JB):                     0.000350
Kurtosis:                       5.535   Cond. No.                     4.50e+03
==============================================================================

Plotting linear regression with seaborn

In [8]:
ax = sns.lmplot('height', 'weight', data=df, fit_reg=True, 
                size=6, scatter_kws={'s': 50})

# set labels that are different from the variable names in the dataframe
ax.set(xlabel='height [cm]', ylabel='weight [kg]')
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x10ac1ef50>

The plot also shows a shaded area containing the 95% confidence interval for the regression estimate. A narrow area means there is less uncertainty about the estimate; this usually corresponds to having more data for the estimation.
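
Related to this, the 95% confidence intervals for the model coefficients themselves (the same intervals listed in the summary table) can be retrieved directly from the fitted statsmodels model; a minimal sketch:

In [ ]:
# 95% confidence intervals for the intercept and the height coefficient
lm.conf_int(alpha=0.05)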

2. The Advertising dataset

To learn about multiple linear regression we will use a dataset known as Advertising (from the ISLR book).

The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

TV, radio, and newspaper are the independent variables, while sales is the dependent variable. All values are recorded in thousands (advertising budgets in thousands of dollars, sales in thousands of units), which keeps the numbers small and easy to work with.

In [9]:
df = pd.read_csv('Advertising.csv', index_col=0)
df.head(10)
Out[9]:
TV radio newspaper sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75.0 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1.0 4.8
10 199.8 2.6 21.2 10.6

Concretely, the first market spent $230,100 on TV ads, $37,800 on radio ads, and $69,200 on newspaper ads, and achieved total sales of 22,100 units.

We can again use seaborn to see the relationship between each independent variable and the dependent variable. Passing kind="reg" draws the linear regression line directly on each scatterplot:

In [10]:
sns.pairplot(df, x_vars=["TV", "radio", "newspaper"], y_vars=["sales"],
             size=5, aspect=.8, kind="reg");

We can also calculate the correlation coefficients among all pairs of variables:

In [11]:
df.corr()
Out[11]:
TV radio newspaper sales
TV 1.000000 0.054809 0.056648 0.782224
radio 0.054809 1.000000 0.354104 0.576223
newspaper 0.056648 0.354104 1.000000 0.228299
sales 0.782224 0.576223 0.228299 1.000000

Notice the high correlation between TV ads and sales volume, which was evident on the scatter plots too.
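
If we prefer a visual summary of this correlation matrix, seaborn can render it as an annotated heatmap; a small optional sketch:

In [ ]:
# draw the correlation matrix as a heatmap, with the coefficients printed in each cell
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')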

3. Linear regression: sales dependent on TV

This is going to be very similar to the example we saw with weight dependent on height:

In [12]:
lm = smf.ols(formula='sales ~ TV', data=df).fit()
lm.params
Out[12]:
Intercept    7.032594
TV           0.047537
dtype: float64

We see that the intercept is about 7 and the slope is about 0.047. Because all values are in thousands, this is what these two numbers mean:

  • if there were no advertising at all, we would expect about 7,032 units to be sold (the intercept)
  • for every additional $1,000 spent on TV ads, we can expect about 47 additional units sold (0.047*1000); see the worked example below
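
Here is that worked example as a small sketch, using a hypothetical TV budget of $50,000 (TV = 50 in the units of the dataset):

In [ ]:
# predicted sales for a hypothetical $50,000 TV budget;
# the prediction is in thousands of units: about 7.03 + 0.047*50 ≈ 9.4, i.e. ~9,400 units
lm.predict(pd.DataFrame({'TV': [50]}))
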
In [13]:
lm.summary()
Out[13]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     312.1
Date:                Tue, 28 Nov 2017   Prob (F-statistic):           1.47e-42
Time:                        09:05:27   Log-Likelihood:                -519.05
No. Observations:                 200   AIC:                             1042.
Df Residuals:                     198   BIC:                             1049.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0326      0.458     15.360      0.000       6.130       7.935
TV             0.0475      0.003     17.668      0.000       0.042       0.053
==============================================================================
Omnibus:                        0.531   Durbin-Watson:                   1.935
Prob(Omnibus):                  0.767   Jarque-Bera (JB):                0.669
Skew:                          -0.089   Prob(JB):                        0.716
Kurtosis:                       2.779   Cond. No.                         338.
==============================================================================

There is too much information in this table; we only need to focus on:

  • R-squared (the goodness of fit: how well the regression line approximates the data points)
  • Adjusted R-squared (an adjustment of R-squared that takes into account the complexity of the model)
  • The second table, on the coefficients: their values, standard errors, the t-test statistics, the p-values for the null hypothesis (which states that the coefficients are 0), and the 95% confidence intervals (ranges likely to contain the true parameter values).
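
Instead of reading these values off the summary table, we can also pull them out of the fitted results object directly; a minimal sketch:

In [ ]:
# the goodness-of-fit numbers are attributes of the fitted statsmodels results object
print(lm.rsquared)      # R-squared
print(lm.rsquared_adj)  # adjusted R-squared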

4. Exercise: Regress sales on radio and sales on newspaper

The linear regression model based on TV ads looked pretty good; what about models that take into account only radio ads or only newspaper ads? Build those two models below and look at their statistics:

In [ ]:
# Fit the model for "sales ~ radio"
In [ ]:
# print the summary statistics

Repeat the same steps, but now for sales on newspaper:

In [ ]:
# Fit the model for "sales ~ newspaper"
In [ ]:
# print the summary statistics

Compare all three models: look up the R-squared values for the three models that we created.
Which one has the highest R-squared?

YOUR ANSWER:

5. Multiple Linear Regression

Multiple linear regression is an extension of linear regression to include more than one independent variable:

$y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_3$

Since we have three independent variables for the Advertising dataset, we can create a relationship like this:

$sales = \beta_0 + \beta_1*TV + \beta_2*radio + \beta_3*newspaper$

In [14]:
# build and fit the model with three independent variables
lm = smf.ols(formula='sales ~ TV + radio + newspaper', data=df).fit()
lm.params
Out[14]:
Intercept    2.938889
TV           0.045765
radio        0.188530
newspaper   -0.001037
dtype: float64
In [15]:
lm.summary()
Out[15]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Tue, 28 Nov 2017   Prob (F-statistic):           1.58e-96
Time:                        09:05:36   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
radio          0.1885      0.009     21.893      0.000       0.172       0.206
newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
==============================================================================
Omnibus:                       60.414   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.241
Skew:                          -1.327   Prob(JB):                     1.44e-33
Kurtosis:                       6.332   Cond. No.                         454.
==============================================================================

There are a few things to notice in the statistics this time:

  • R-squared is higher than for the models with only one variable, which shows why multiple linear regression is useful
  • The estimated coefficients for the intercept, TV, and radio are statistically significant, but the one for newspaper is not. That means TV and radio are associated with sales (we rejected the null hypothesis that they are not), but we cannot reject the null hypothesis for "newspaper".
  • TV and radio are positively correlated with sales (see the positive values of the coefficients)

If we want to see exact p-values for all coefficients, we can write:

In [16]:
lm.pvalues
Out[16]:
Intercept    1.267295e-17
TV           1.509960e-81
radio        1.505339e-54
newspaper    8.599151e-01
dtype: float64
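
Since the newspaper coefficient is not statistically significant, a natural follow-up (not part of the original analysis, just a sketch) is to refit the model without it and compare the goodness of fit:

In [ ]:
# refit without the newspaper term and compare R-squared with the full model
lm_no_news = smf.ols(formula='sales ~ TV + radio', data=df).fit()
print(lm_no_news.rsquared, lm.rsquared)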

6. Regression using scikit-learn

scikit-learn is a Python library that implements a wide range of machine learning algorithms, along with tools for fitting them to data and validating the results. If scikit-learn is not yet installed, do so below:

In [ ]:
!pip install scikit-learn

scikit-learn expects the data as a feature matrix X and a target vector y, so we first select the appropriate columns from our dataframe:

In [17]:
# create X and y
variables = ['TV', 'radio', 'newspaper']
X = df[variables]
y = df.sales
print(X.shape)
print(y.shape)
(200, 3)
(200,)

Now we can build and fit the model similarly to statsmodels:

In [19]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
2.93888936946
[ 0.04576465  0.18853002 -0.00103749]

There is no nice printout summary for the model in scikit-learn, but we can access what we need through various attributes:

In [20]:
# pair each variable name with its estimated coefficient
list(zip(variables, lm.coef_))
Out[20]:
[('TV', 0.045764645455397601),
 ('radio', 0.18853001691820448),
 ('newspaper', -0.0010374930424762578)]
In [21]:
lm.score(X, y)  # the R-squared of the model on the given data
Out[21]:
0.89721063817895208

We can easily predict y values based on x values:

In [22]:
# provide an input as a list of three values for TV, radio, newspaper
# however, we also need to convert it into the expected "row" of a matrix
observation = np.array([100, 25, 25]).reshape(1, -1)
print "observation:", observation
print "prediction:", lm.predict(observation)
observation: [[100  25  25]]
prediction: [ 12.20266701]
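
Finally, as mentioned in the introduction, scikit-learn has good built-in support for cross-validation. A minimal sketch (assuming a scikit-learn version that provides the sklearn.model_selection module) that estimates out-of-sample R-squared over five folds:

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the linear regression model, scored with R-squared;
# this gives a less optimistic estimate of fit than scoring on the training data
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print(scores.mean())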