Eni Mustafaraj
November 28, 2017
This notebook shows how to perform simple and multiple linear regression with two different Python packages: statsmodels and scikit-learn. The two packages have different strengths:

- statsmodels has better coverage of statistical methods and their interpretation
- scikit-learn has better coverage of machine learning algorithms

Because linear regression is used more by statisticians than by computer scientists, statsmodels is the more suitable package for performing and understanding regression. However, if you are interested in doing cross-validation, scikit-learn has better built-in support for it.
Credit: Some of this material is based on the excellent book "An Introduction to Statistical Learning with Applications in R". While the book is used in graduate-level courses, those of you who have taken or are taking MATH/STAT 220 should find it accessible.
Early in the semester we collected data from every student, among other things their height and weight. The first thing to do before building a linear regression model is to make sure that the two variables have some degree of correlation, which we can check with scatterplots.
Below I'm showing a pairplot of the data, which contains histograms and scatterplots. I'm using the Python library seaborn to do this. If you have never installed it, do that below:
!pip install seaborn
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
We'll create a dataframe because it's easier to work with when doing plots and linear regression:
heights = [155, 166, 162, 164, 164, 163, 152, 165, 170, 167, 162, 180,
160, 162, 157, 170, 168, 164, 171, 165, 168, 168, 155, 168, 156, 170]
weights = [50.7, 52.6, 46.0, 79.0, 50.4, 56.7, 49.9, 63.5, 52.2, 58.0,
59.0, 75.0, 56.7, 53.5, 50.0, 68.0, 56.0, 62.5, 61.2, 57.0,
54.4, 61.2, 52.1, 84.8, 55.3, 61.0]
df = pd.DataFrame({'height': heights, 'weight': weights})
Pairplot with seaborn:
sns.pairplot(df, size=4, plot_kws={"s": 50})
As we established in class a while ago, a few students made calculation errors when converting from pounds to kilograms, so we have a few outlier points in the data.
Nevertheless, one can see the tendency of height and weight to grow together.
We can also calculate their correlation:
df.corr()
We get r = 0.531, which indicates a moderate positive correlation. If we are also interested in the p-value for the coefficient, we can use the library scipy instead:
import scipy.stats
scipy.stats.pearsonr(df.height, df.weight)
The second value in the tuple is the p-value; as we can see, it is quite small (0.005), so the correlation is statistically significant.
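As a sanity check, the same coefficient can also be computed with numpy, which we have already imported. This is just a sketch: np.corrcoef returns the full 2x2 correlation matrix, so we take one of its off-diagonal entries.

# np.corrcoef returns a 2x2 correlation matrix for two variables;
# the off-diagonal entry is Pearson's r
r = np.corrcoef(df.height, df.weight)[0, 1]
print(r)  # should match the 0.531 reported by df.corr()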
statsmodels

We'll use the ols function from statsmodels to learn the linear regression model. This function uses an R-like syntax to express the relationship between dependent and independent variables:
import statsmodels.formula.api as smf
lm = smf.ols(formula='weight ~ height', data=df).fit()
lm.params
Based on these two parameters, we can write the linear equation for the relationship between weight and height:
$weight = -73.4 + 0.8*height$
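For example, for a student who is 170 cm tall, the equation predicts a weight of about -73.4 + 0.8 * 170 ≈ 62.6 kg. The fitted model can compute this for us; here is a sketch using the predict method of the formula-based model, which takes a dataframe with the independent variable:

# predict the weight for a student who is 170 cm tall
lm.predict(pd.DataFrame({'height': [170]}))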
We can get all details about the model by calling the summary function:
lm.summary()
seaborn

ax = sns.lmplot('height', 'weight', data=df, fit_reg=True,
                size=6, scatter_kws={'s': 50})
# set labels that are different from the variable names in the dataframe
ax.set(xlabel='height [cm]', ylabel='weight [kg]')
The plot also shows a shaded area containing the 95% confidence interval for the regression estimate. A narrow area shows that there is less uncertainty about the estimate; this usually corresponds to having more data for the estimation.
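The same confidence intervals are also available numerically from the fitted statsmodels object; a minimal sketch using its conf_int method (95% intervals by default):

# 95% confidence intervals for the intercept and the height coefficient
lm.conf_int()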
To learn about multiple linear regression we will use a dataset known as Advertising (from the ISLR book).
The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
TV, radio, and newspaper are the independent variables, while sales is the dependent variable. All values are expressed in thousands (budgets in thousands of dollars, sales in thousands of units), which keeps the numbers small and easy to work with.
df = pd.read_csv('Advertising.csv', index_col=0)
df.head(10)
Concretely, the first market spent a budget of \$230,100 on TV ads, \$37,800 on radio ads, and \$69,200 on newspaper ads, and achieved total sales of 22,100 units.
We can again use seaborn to see the relationships between the independent and dependent variables. We can directly show the linear regression line with the parameter kind:
sns.pairplot(df, x_vars=["TV", "radio", "newspaper"], y_vars=["sales"],
             size=5, aspect=.8, kind="reg");
We can also calculate the correlation coefficients among all pairs of variables:
df.corr()
Notice the high correlation between TV ads and sales volume, which was evident on the scatter plots too.
This is going to be very similar to the example we saw with weight dependent on height:
lm = smf.ols(formula='sales ~ TV', data=df).fit()
lm.params
We see that the intercept is about 7.03 and the slope is 0.047. Because all values are in thousands, this is what these two numbers mean:

- with no TV advertising at all, the model predicts sales of about 7,030 units (the intercept)
- each additional \$1,000 spent on TV ads is associated with about 47 additional units sold (the slope)
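To see the model in action, we can predict the sales for a hypothetical TV budget; the value of \$50,000 below is just an example:

# predict sales (in thousands of units) for a market with a $50,000 TV budget
# 7.03 + 0.047 * 50 ≈ 9.4, i.e., about 9,400 units
lm.predict(pd.DataFrame({'TV': [50]}))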
lm.summary()
There is too much information in this table; we only need to focus on:

- the R-squared value, which tells us how much of the variation in sales is explained by the TV budget
- the coefficient estimates and their p-values
The linear regression model based on TV ads looked pretty good. What about models that take into account only radio ads, or only newspaper ads? Build those two models below and look at their statistics:
# Fit the model for "sales ~ radio"
lm_radio = smf.ols(formula='sales ~ radio', data=df).fit()

# print the summary statistics
lm_radio.summary()
Repeat the same steps, but now for sales on newspaper ads:
# Fit the model for "sales ~ newspaper"
lm_news = smf.ols(formula='sales ~ newspaper', data=df).fit()

# print the summary statistics
lm_news.summary()
Compare all three models: Look up the R-squared values for the three models that we created.
Which one has the highest R-squared?
YOUR ANSWER:
Multiple linear regression is an extension of linear regression to include more than one independent variable:
$y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_3$
Since we have three independent variables for the Advertising dataset, we can create a relationship like this:
$sales = \beta_0 + \beta_1*TV + \beta_2*radio + \beta_3*newspaper$
# build and fit the model with three independent variables
lm = smf.ols(formula='sales ~ TV + radio + newspaper', data=df).fit()
lm.params
lm.summary()
There are a few things to notice in the statistics this time:

- the R-squared value (0.897) is higher than in any of the single-variable models
- the coefficient for newspaper is close to zero, and its p-value is large, so newspaper ads have no statistically significant effect on sales once TV and radio ads are taken into account
If we want to see exact p-values for all coefficients, we can write:
lm.pvalues
scikit-learn

Python offers a library that contains implementations of many machine learning algorithms, along with the setup for using them to fit data and validate results. If scikit-learn is not yet installed, do so below:
!pip install scikit-learn
scikit-learn expects the data as vectors or matrices, so we have to convert our dataframe first:
# create X and y
variables = ['TV', 'radio', 'newspaper']
X = df[variables]
y = df.sales
print(X.shape)
print(y.shape)
Now we can build and fit the model similarly to statsmodels:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
There is no nicely formatted summary for the model in scikit-learn, but we can access what we need through various attributes and methods:
# pair each variable name with its coefficient
list(zip(variables, lm.coef_))
# the score method returns the R-squared of the fit
lm.score(X, y)
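As mentioned in the introduction, a strength of scikit-learn is its built-in support for cross-validation. Here is a minimal sketch (the choice of 5 folds is just an example); cross_val_score fits the model on each training split and reports the R-squared on each held-out fold:

from sklearn.model_selection import cross_val_score

# evaluate the model with 5-fold cross-validation;
# each score is the R-squared on one held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print(scores.mean())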
We can easily predict y values based on x values:
# provide an input as a list of three values for TV, radio, newspaper
# however, we also need to convert it into the expected "row" of a matrix
observation = np.array([100, 25, 25]).reshape(1, -1)
print "observation:", observation
print "prediction:", lm.predict(observation)