Eni Mustafaraj
November 28, 2017
This notebook shows how to perform simple and multiple linear regression with two different Python packages: statsmodels and scikit-learn. The two packages have different strengths:

- statsmodels has better coverage of statistical methods and their interpretation
- scikit-learn has better coverage of machine learning algorithms

Because linear regression is used more by statisticians than by computer scientists, statsmodels is the more suitable package for performing and understanding regression. However, if you are interested in doing cross-validation, scikit-learn has better built-in support for it.
Credit: Some of this material is based on the excellent book "An Introduction to Statistical Learning with Applications in R". While the book is used in graduate-level courses, those of you who have taken or are taking MATH/STAT 220 should find it accessible.
Early in the semester we collected data from every student, among other things their height and weight. The first thing to do before building a linear regression model is to make sure that the two variables have some degree of correlation, which we can check with scatterplots.
Below I'm showing a pairplot of the data, which contains histograms and scatterplots. I'm using the Python library seaborn to do this. If you have never installed it, do that below:
!pip install seaborn
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
We'll create a dataframe because it's easier to work with when doing plots and linear regression:
heights = [155, 166, 162, 164, 164, 163, 152, 165, 170, 167, 162, 180,
160, 162, 157, 170, 168, 164, 171, 165, 168, 168, 155, 168, 156, 170]
weights = [50.7, 52.6, 46.0, 79.0, 50.4, 56.7, 49.9, 63.5, 52.2, 58.0,
59.0, 75.0, 56.7, 53.5, 50.0, 68.0, 56.0, 62.5, 61.2, 57.0,
54.4, 61.2, 52.1, 84.8, 55.3, 61.0]
df = pd.DataFrame({'height': heights, 'weight': weights})
Pairplot with seaborn:
sns.pairplot(df, size=4, plot_kws={"s": 50})
As we established in class a while ago, a few students made calculation errors when converting from pounds to kilograms, so we have a few outlier points in the data.
Nevertheless, one can see the tendency of height and weight to grow together.
We can also calculate their correlation:
df.corr()
We get r = 0.531, which indicates a moderate positive correlation. If we are also interested in the p-value for the coefficient, we can use the library scipy instead:
import scipy.stats
scipy.stats.pearsonr(df.height, df.weight)
The second value in the tuple is the p-value; as we can see, it is quite small (0.005), so the correlation is statistically significant.
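As a sanity check, the same coefficient can also be computed with numpy, which we have already imported. This is just a sketch: np.corrcoef returns the full 2x2 correlation matrix, so we take one of its off-diagonal entries.

# np.corrcoef returns a 2x2 correlation matrix for two variables;
# the off-diagonal entry is Pearson's r
r = np.corrcoef(df.height, df.weight)[0, 1]
print(r)  # should match the 0.531 reported by df.corr()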
statsmodels

We'll use the ols function from statsmodels to learn the linear regression model. This function uses an R-like syntax to express the relationship between dependent and independent variables:
import statsmodels.formula.api as smf
lm = smf.ols(formula='weight ~ height', data=df).fit()
lm.params
Based on these two parameters, we can write the linear equation for the relationship between weight and height:
$weight = -73.4 + 0.8*height$
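For example, for a student who is 170 cm tall, the equation predicts a weight of about -73.4 + 0.8 * 170 ≈ 62.6 kg. The fitted model can compute this for us; here is a sketch using the predict method of the formula-based model, which takes a dataframe with the independent variable:

# predict the weight for a student who is 170 cm tall
lm.predict(pd.DataFrame({'height': [170]}))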
We can get all details about the model by calling the summary function:
lm.summary()
seaborn

ax = sns.lmplot('height', 'weight', data=df, fit_reg=True,
                size=6, scatter_kws={'s': 50})
# set labels that are different from the variable names in the dataframe
ax.set(xlabel='height [cm]', ylabel='weight [kg]')
The plot also shows a shaded area containing the 95% confidence interval for the regression estimate. A narrow area shows that there is less uncertainty about the estimate; this usually corresponds to having more data for the estimation.
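The same confidence intervals are also available numerically from the fitted statsmodels object; a minimal sketch using its conf_int method (95% intervals by default):

# 95% confidence intervals for the intercept and the height coefficient
lm.conf_int()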
To learn about multiple linear regression we will use a dataset known as Advertising (from the ISLR book).
The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
TV, radio, and newspaper are the independent variables, while sales is the dependent variable. All values are expressed in thousands (budgets in thousands of dollars, sales in thousands of units), which keeps the numbers small and easy to work with.
df = pd.read_csv('Advertising.csv', index_col=0)
df.head(10)
Concretely, the first market spent a budget of \$230,100 on TV ads, \$37,800 on radio ads, and \$69,200 on newspaper ads, and achieved total sales of 22,100 units.
We can again use seaborn to see the relationships between the independent and dependent variables. We can directly show the linear regression line with the parameter kind:
sns.pairplot(df, x_vars=["TV", "radio", "newspaper"], y_vars=["sales"],
             size=5, aspect=.8, kind="reg");
We can also calculate the correlation coefficients among all pairs of variables:
df.corr()
Notice the high correlation between TV ads and sales volume, which was evident on the scatter plots too.
This is going to be very similar to the example we saw with weight dependent on height:
lm = smf.ols(formula='sales ~ TV', data=df).fit()
lm.params
We see that the intercept is about 7.03 and the slope is 0.047. Because all values are in thousands, this is what these two numbers mean:

- with no TV advertising at all, the model predicts sales of about 7,030 units (the intercept)
- each additional \$1,000 spent on TV ads is associated with about 47 additional units sold (the slope)
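To see the model in action, we can predict the sales for a hypothetical TV budget; the value of \$50,000 below is just an example:

# predict sales (in thousands of units) for a market with a $50,000 TV budget
# 7.03 + 0.047 * 50 ≈ 9.4, i.e., about 9,400 units
lm.predict(pd.DataFrame({'TV': [50]}))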
lm.summary()
There is too much information in this table; we only need to focus on:

- the R-squared value, which tells us how much of the variation in sales is explained by the TV budget
- the coefficient estimates and their p-values
The linear regression model based on TV ads looked pretty good. What about models that take into account only radio ads, or only newspaper ads? Build those two models below and look at their statistics:
# Fit the model for "sales ~ radio"
lm_radio = smf.ols(formula='sales ~ radio', data=df).fit()

# print the summary statistics
lm_radio.summary()
Repeat the same steps, but now for sales on newspaper ads:
# Fit the model for "sales ~ newspaper"
lm_news = smf.ols(formula='sales ~ newspaper', data=df).fit()

# print the summary statistics
lm_news.summary()
Compare all three models: Look up the R-squared values for the three models that we created.
Which one has the highest R-squared?
YOUR ANSWER:
Multiple linear regression is an extension of linear regression to include more than one independent variable:
$y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_3$
Since we have three independent variables for the Advertising dataset, we can create a relationship like this:
$sales = \beta_0 + \beta_1*TV + \beta_2*radio + \beta_3*newspaper$
# build and fit the model with three independent variables
lm = smf.ols(formula='sales ~ TV + radio + newspaper', data=df).fit()
lm.params
lm.summary()
There are a few things to notice in the statistics this time:

- the R-squared value (0.897) is higher than in any of the single-variable models
- the coefficient for newspaper is close to zero, and its p-value is large, so newspaper ads have no statistically significant effect on sales once TV and radio ads are taken into account
If we want to see exact p-values for all coefficients, we can write:
lm.pvalues
scikit-learn

Python offers a library that contains implementations of many machine learning algorithms, along with the setup for using them to fit data and validate results. If scikit-learn is not yet installed, do so below:
!pip install scikit-learn
scikit-learn expects the data as vectors or matrices, so we have to convert our dataframe first:
# create X and y
variables = ['TV', 'radio', 'newspaper']
X = df[variables]
y = df.sales
print(X.shape)
print(y.shape)
Now we can build and fit the model similarly to statsmodels:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
There is no nicely formatted summary for the model in scikit-learn, but we can access what we need through various attributes and methods:
# pair each variable name with its coefficient
list(zip(variables, lm.coef_))
# the score method returns the R-squared of the fit
lm.score(X, y)
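As mentioned in the introduction, a strength of scikit-learn is its built-in support for cross-validation. Here is a minimal sketch (the choice of 5 folds is just an example); cross_val_score fits the model on each training split and reports the R-squared on each held-out fold:

from sklearn.model_selection import cross_val_score

# evaluate the model with 5-fold cross-validation;
# each score is the R-squared on one held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print(scores.mean())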
We can easily predict y values based on x values:
# provide an input as a list of three values for TV, radio, newspaper
# however, we also need to convert it into the expected "row" of a matrix
observation = np.array([100, 25, 25]).reshape(1, -1)
print "observation:", observation
print "prediction:", lm.predict(observation)