Week 14: Review Hypothesis Testing

This notebook reviews concepts of hypothesis testing in the context of a concrete problem.
It looks at the question: what can we say about the ethnic diversity of a group
of people based on their last names, concretely, based on the lengths of those names?

To do this analysis we need to find data sources that contain last names from
different countries, so that we can compare the average lengths. We will then
take a class of Wellesley students and test the hypothesis that Wellesley students
are ethnically more diverse than the American population.

Table of Contents

  1. Get lastname data from the US Census
  2. Explore the data
  3. Sidenote: The sampling distribution
  4. Hypothesis testing: test for the mean
  5. Visualization with Plotly
  6. Experimenting with scatter plots

NOTE: You can see this notebook as an example of a project for grade C. It is intentionally
not about "digital natives", so that it doesn't influence your work.

1. Get lastname data from US Census

While this data might be available in CSV format somewhere, to show that we usually need to write code to get data, I'm scraping it from a website: Top surnames in the United States. As we can see, the data is in a table in the HTML file, so we need to extract it with BeautifulSoup.

Steps:

  1. Send request to get the HTML page
  2. Use BeautifulSoup to create a searchable tree
  3. Inspect HTML source code to find the elements or class names that contain the values
  4. Write code to extract the data
  5. Store the data with pandas

Step 1: Get the HTML page

In [1]:
# Step 1: get the HTML page
import requests
response = requests.get("https://surnames.behindthename.com/top/lists/united-states/1990")
if response.status_code == 200:
    htmlpage = response.content
    print htmlpage[:300]
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<html>
<head>
<meta name="author" content="Mike Campbell">
<meta name="description" content="Common Surnames in the United States 1990 (top 1000)">
<meta name="keywords" content="meaning, baby nam

The HTML page is here; it contains the top 1000 last names from the 1990 Census.

Step 2: Use BeautifulSoup to create a searchable tree

In [2]:
# Step 2: Use BeautifulSoup
from bs4 import BeautifulSoup
htmltree = BeautifulSoup(htmlpage, 'lxml')

htmltree.find('h1')
Out[2]:
<h1>Common Surnames in the United States</h1>

The tree is created; we can now search it for the elements we need.

Step 3: Inspect HTML

At this point, it's better to use "View Source" in the browser to look at the different elements that might help us extract the names.

It turns out this is very old-style HTML: it has a class for each odd/even row to alternate colors.
The two classes are r0 and r1.

Step 4: Write code to extract data

We will extract the values from both kinds of rows.

In [3]:
# Step 4: Extract data 
odd = htmltree.find_all(attrs={'class': 'r1'})
even = htmltree.find_all(attrs={'class': 'r0'})
allRows = odd + even
len(allRows)
Out[3]:
1000

We got 1000 rows, as expected. Let's see what each row contains:

In [4]:
allRows[0]
Out[4]:
<tr class="r1"><td>1.</td><td nowrap=""><div style="float:right">\xa0\xa0<a href="/name/smith/top/united-states/"><img src="/images/chart.gif"/></a></div><a href="/name/smith">Smith</a></td>\n<td align="right">1.01%</td>\n</tr>

It's a lot of HTML "junk" that is not useful to us. Luckily, BeautifulSoup has a method, get_text, that extracts the text from each HTML element.

In [5]:
allRows[0].get_text()
Out[5]:
u'1.\xa0\xa0Smith\n1.01%\n'

This still has some junk characters; let's split on whitespace:

In [6]:
allRows[0].get_text().split()
Out[6]:
[u'1.', u'Smith', u'1.01%']

Much better. Finally, let's create two separate lists: one for the names and one for the percentages.

The percentage value is a string; we need to remove the '%' character and divide by 100, which converts it to a probability value.

Also, given that we cannot be sure the HTML doesn't contain errors, it's safer to use try ... except when extracting the values.

In [7]:
names = [] # to store the name strings
probs = [] # to store the probability values

for row in allRows:
    try:
        index, name, percent = row.get_text().split()
        names.append(str(name))
        probs.append(float(percent.replace('%', ''))/100)
    except:
        print row

We didn't get any errors. Now we can put this data into a pandas dataframe.

Step 5: Store the data for future analysis

In order not to have to repeat the above steps, we will store the data in a dataframe and also save
it as a CSV copy. That way, we can start an entirely new notebook just for exploring the data.

In [8]:
import pandas as pd
dfnames = pd.DataFrame({'names': names, 'probs': probs})
dfnames.head()
Out[8]:
names probs
0 Smith 0.0101
1 Williams 0.0070
2 Brown 0.0062
3 Miller 0.0042
4 Moore 0.0031

Now let's store it as a CSV for the future.

In [9]:
dfnames.to_csv("top1000names.csv", encoding='utf-8')

2. Explore the data

Since the data is now stored as a CSV, we can load it back to a pandas dataframe.
If the process of getting the data from one or more sources and combining it into a dataframe
is laborious, create separate notebooks for different steps of the data science cycle.

In [10]:
dfnames = pd.read_csv("top1000names.csv")
dfnames.head()
Out[10]:
Unnamed: 0 names probs
0 0 Smith 0.0101
1 1 Williams 0.0070
2 2 Brown 0.0062
3 3 Miller 0.0042
4 4 Moore 0.0031

I don't like the Unnamed column, so I'll get rid of it when I read the file, by using the keyword parameter usecols, which specifies which columns we want to keep in the dataframe:

In [11]:
dfnames = pd.read_csv("top1000names.csv", usecols=[1,2])
dfnames.head()
Out[11]:
names probs
0 Smith 0.0101
1 Williams 0.0070
2 Brown 0.0062
3 Miller 0.0042
4 Moore 0.0031
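
As a side note (a small sketch, assuming the CSV was written with the default index as above), an alternative is to tell pandas to use that first column as the dataframe index with the index_col parameter:

# Alternative sketch: treat the first column of the CSV as the dataframe index
dfnames_alt = pd.read_csv("top1000names.csv", index_col=0)
dfnames_alt.head()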

Let's create a new column for the length of the names:

In [12]:
dfnames['namelengths'] = dfnames.names.apply(len)
dfnames.head()
Out[12]:
names probs namelengths
0 Smith 0.0101 5
1 Williams 0.0070 8
2 Brown 0.0062 5
3 Miller 0.0042 6
4 Moore 0.0031 5

Let's look at some descriptive statistics:

In [13]:
dfnames.namelengths.describe()
Out[13]:
count    1000.000000
mean        6.125000
std         1.502542
min         2.000000
25%         5.000000
50%         6.000000
75%         7.000000
max        11.000000
Name: namelengths, dtype: float64

We notice that the median and the mean are really close to one another, 6 and 6.125, which is usually a good indication of a roughly symmetric, normal-like distribution.
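
As a quick complementary check (a small sketch, not part of the analysis above), pandas can also compute the skewness of the column; a value near zero is consistent with a roughly symmetric distribution:

# Sketch: skewness as a rough symmetry check (values near 0 suggest symmetry)
print(dfnames.namelengths.skew())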

Let's create a histogram. We need first to import matplotlib:

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

pandas has its own version of plotting that can be called as a method on every column:

In [15]:
dfnames.namelengths.plot(kind='hist')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b91e810>

We can also try the visualization with Seaborn, which also shows the line form:

In [16]:
sns.distplot(dfnames.namelengths)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bb42110>

Oh, this looks ugly. The name lengths are integers with only a few distinct values, so the default bins don't line up well with them; let's ask for a specific number of bins.

In [17]:
sns.distplot(dfnames.namelengths, bins=8)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10baeac50>

Much better. Now let's treat these 1000 name lengths as our population and see how the means of samples drawn from it behave.

3. The sampling distribution and the Central Limit Theorem

Hypothesis testing is closely related to the sampling distribution: the distribution of a certain statistic (mean, proportion, standard deviation) across repeated samples drawn from a population. As we will see below, the sampling distribution of the mean is approximately normal.

We will get several samples of different sizes to see what happens with the sampling distribution.

10 samples

Let's find the means of 10 samples of 100 items each.

In [18]:
import random
means = []
for i in range(10):
    vals = random.sample(dfnames.namelengths, 100)
    means.append(pd.Series(vals).mean())

plt.hist(means)
Out[18]:
(array([ 2.,  0.,  1.,  1.,  2.,  1.,  1.,  1.,  0.,  1.]),
 array([ 5.85,  5.9 ,  5.95,  6.  ,  6.05,  6.1 ,  6.15,  6.2 ,  6.25,
         6.3 ,  6.35]),
 <a list of 10 Patch objects>)

This shows no clear pattern yet; with only 10 means the histogram looks more like a uniform scatter than a normal curve.

100 samples

Let's create 100 samples of 100 names each:

In [19]:
means = []
for i in range(100):
    vals = random.sample(dfnames.namelengths, 100)
    means.append(pd.Series(vals).mean())

plt.hist(means)
Out[19]:
(array([  6.,   7.,  14.,  17.,  18.,  14.,  13.,   6.,   2.,   3.]),
 array([ 5.87,  5.93,  5.99,  6.05,  6.11,  6.17,  6.23,  6.29,  6.35,
         6.41,  6.47]),
 <a list of 10 Patch objects>)

Somewhat better, but not yet close to a normal distribution.

1000 samples

In [20]:
means = []
for i in range(1000):
    vals = random.sample(dfnames.namelengths, 100)
    means.append(pd.Series(vals).mean())

plt.hist(means)
Out[20]:
(array([   6.,   29.,   71.,  154.,  191.,  247.,  184.,   76.,   32.,   10.]),
 array([ 5.71 ,  5.792,  5.874,  5.956,  6.038,  6.12 ,  6.202,  6.284,
         6.366,  6.448,  6.53 ]),
 <a list of 10 Patch objects>)

This is the closest result to the normal distribution we have encountered so far.

Does the result depend on the sample size?

We'll look at 3000 samples of size 50:

In [21]:
means = []
for i in range(3000):
    vals = random.sample(dfnames.namelengths, 50)
    means.append(pd.Series(vals).mean())

plt.hist(means)
Out[21]:
(array([  14.,   43.,  176.,  479.,  720.,  847.,  460.,  190.,   57.,   14.]),
 array([ 5.4  ,  5.544,  5.688,  5.832,  5.976,  6.12 ,  6.264,  6.408,
         6.552,  6.696,  6.84 ]),
 <a list of 10 Patch objects>)

The Central Limit Theorem: as the sample size grows, the distribution of the sample means is approximately normal, regardless of the shape of the population distribution.

In both cases where we created a large number of samples, the distribution looked close to the normal distribution, but the interval of values for the means changed when the sample sizes were smaller. That is, for samples that are small, we can expect more variability than for samples that are bigger.

Thus, let's run two final experiments: one with a big sample size and one with a small sample size:

In [22]:
random.seed(0)
means = []
for i in range(5000):
    vals = random.sample(dfnames.namelengths, 300)
    means.append(pd.Series(vals).mean())

plt.hist(means)
Out[22]:
(array([    4.,    32.,   181.,   494.,  1097.,  1442.,  1047.,   520.,
          161.,    22.]),
 array([ 5.84      ,  5.89233333,  5.94466667,  5.997     ,  6.04933333,
         6.10166667,  6.154     ,  6.20633333,  6.25866667,  6.311     ,
         6.36333333]),
 <a list of 10 Patch objects>)
In [23]:
random.seed(0)
means = []
for i in range(5000):
    vals = random.sample(dfnames.namelengths, 40)
    means.append(pd.Series(vals).mean())

plt.hist(means)
Out[23]:
(array([   23.,   151.,   618.,  1025.,  1448.,  1062.,   459.,   163.,
           44.,     7.]),
 array([ 5.375 ,  5.5425,  5.71  ,  5.8775,  6.045 ,  6.2125,  6.38  ,
         6.5475,  6.715 ,  6.8825,  7.05  ]),
 <a list of 10 Patch objects>)

These two charts show clearly that when the sample size was large (n=300), the interval of possible values for the mean was roughly 5.8 to 6.4. However, for the small sample size (n=40), the interval of possible values widened to roughly 5.4 to 7.0. This makes clear that, just by chance, if the sample size is small we can get many mean values that seem far away from the mean of the population; these divergences are purely random and one shouldn't attach meaning to them.
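
The Central Limit Theorem also predicts how wide these intervals should be: the standard deviation of the sample means (the standard error) is approximately σ/√n, where σ is the standard deviation of the population. Below is a small sketch that checks this against our simulation (it reuses dfnames, random, and pd from above; the exact numbers vary with the random state, and because random.sample draws without replacement from only 1000 names, the empirical spread can come out slightly narrower than σ/√n, especially for the larger n):

import math

popStd = dfnames.namelengths.std()           # sigma of the population of name lengths
for n in [40, 300]:
    sampleMeans = []
    for i in range(5000):
        vals = random.sample(list(dfnames.namelengths), n)
        sampleMeans.append(pd.Series(vals).mean())
    empirical = pd.Series(sampleMeans).std() # spread of the simulated sample means
    predicted = popStd / math.sqrt(n)        # CLT prediction: sigma / sqrt(n)
    print("n=%d  empirical: %.3f  predicted: %.3f" % (n, empirical, predicted))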

4. Hypothesis Testing

We hypothesized that Wellesley students are ethnically more diverse than the American population. The way we decided to test that is by taking the average length of their last names and comparing it to the average for the American population. The null hypothesis is that the class of Wellesley students has the same average last-name length as the US population (they don't differ). The alternative hypothesis (in which we are interested) is that the mean length is significantly different from that of the American population.

Here is the process we will follow:

  1. Get the Wellesley data
  2. Generate the list of lengths of all names (this is our sample)
  3. Apply a one-sample hypothesis test, where we compare the sample's mean with the known mean of the population (the lengths of American last names)

Step 1: Load data

In [24]:
# Step 1: Load the Wellesley data

with open('lastnames.txt', 'r') as inFile:
    lastnames = [name.strip('\n') for name in inFile]
    
print len(lastnames)
95

Let's look at some of the names:

In [25]:
lastnames[:5]
Out[25]:
['Albach', 'Alius', 'Bryant', 'Chan', 'Chavez']

Step 2: Generate sample

In [26]:
lengthNames = [len(name) for name in lastnames]

Let's explore this data:

In [27]:
lengthsSeries = pd.Series(lengthNames)
lengthsSeries.describe()
Out[27]:
count    95.000000
mean      5.642105
std       2.701126
min       2.000000
25%       4.000000
50%       5.000000
75%       7.000000
max      14.000000
dtype: float64

The mean for this sample is 5.64, which is smaller than the mean of 6.125 for the U.S. population.

Is this difference statistically significant?

Let us also see the distribution of the lengths:

In [28]:
sns.distplot(lengthsSeries)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cc57710>

NOTE: This is clearly a skewed distribution, where most values are small lengths. But any sample we draw
from the population might be similarly skewed (in one direction or the other).

We can test this below; let's get a sample of 95 names, the same size as our group of Wellesley students:

In [29]:
random.seed(0)
onesample = random.sample(dfnames.namelengths, 95)
sns.distplot(onesample)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ccae990>

As we can see, this random sample's distribution doesn't look particularly normal either.

Step 3: One-sample hypothesis testing

Because we have one sample and we want to compare it against the US population, we will use the one-sample t-test with the known mean, which we calculated above, 6.125.

In [30]:
from scipy import stats
onesampleRes = stats.ttest_1samp(lengthsSeries, 6.125)
print onesampleRes 
Ttest_1sampResult(statistic=-1.7424866614482968, pvalue=0.08469382105428197)

Because the p-value is 0.08, which is greater than 0.05, we cannot reject the null hypothesis: the data do not provide evidence that the last-name lengths of the Wellesley group differ from those of the American population. (Note that failing to reject the null is not proof that the two groups are the same.)
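
As a sanity check (a small sketch recomputing the same quantities by hand, not part of the original analysis), the one-sample t statistic is t = (x̄ - μ0) / (s / √n), and the two-sided p-value comes from a t distribution with n - 1 degrees of freedom:

import math
from scipy import stats

n = len(lengthsSeries)
xbar = lengthsSeries.mean()        # sample mean
s = lengthsSeries.std()            # sample standard deviation (ddof=1)
mu0 = 6.125                        # known mean of the Census name lengths

tStat = (xbar - mu0) / (s / math.sqrt(n))
pValue = 2 * stats.t.sf(abs(tStat), n - 1)   # two-sided p-value
print((tStat, pValue))

This should reproduce the statistic and p-value reported by ttest_1samp above.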

5. Visualization with Plotly

It is possible to create OFFLINE visualizations with Plotly, which don't need to be
posted to the Plotly servers (the online service limits how many charts you can create).

Notice below that we're using the plotly.offline module.

In [31]:
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as FF
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode()

Plotly can be used to draw tables:

In [32]:
matrix_onesample = [
    ['', 'Test Statistic', 'p-value'],
    ['Sample Data', onesampleRes[0], onesampleRes[1]]
]

onesample_table = FF.create_table(matrix_onesample, index=True)
iplot(onesample_table)

6. Experimenting with scatter plots

There are many ways of representing our data as a scatterplot. We'll go over a few of them here:

6.1 All students are a dot in the plot

We can treat each student's last name as a single measurement and represent the x-value as its index in the list and the y-value as the length of the name.

In [33]:
trace = go.Scatter(
    x=list(range(len(lengthsSeries))),
    y=lengthsSeries,
    mode='markers'
)

data = [trace]
iplot(data)

Although we can see some patterns here (for example, many students with last names of length 3 or 4 appear close together because the names are sorted alphabetically), it's hard to say anything meaningful about the data.

6.2 Bins of first letters

Another plot we can create shows the relationship between the first letter and the number of last names starting with that letter. Because later we might also be interested in the lengths, we'll build a dictionary that groups the names by their first letter:

In [34]:
from collections import defaultdict

firstLetterDict = defaultdict(list)

for name in lastnames:
    firstLetterDict[name[0]].append(name)
    
print firstLetterDict.keys()
['A', 'C', 'B', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'Z']
In [35]:
for key in firstLetterDict.keys()[:5]:
    print key, len(firstLetterDict[key])
A 8
C 15
B 2
D 5
G 3

Because the data might be missing some letters, to create a complete chart we need to go over all the letters of the alphabet:

In [36]:
from string import uppercase
print uppercase
ABCDEFGHIJKLMNOPQRSTUVWXYZ

The following dict will accumulate for each letter the number of last names that start with that letter:

In [37]:
countsDict = {letter: len(firstLetterDict.get(letter, [])) for letter in uppercase}
countsDict
Out[37]:
{'A': 8,
 'B': 2,
 'C': 15,
 'D': 5,
 'E': 0,
 'F': 1,
 'G': 3,
 'H': 7,
 'I': 2,
 'J': 1,
 'K': 5,
 'L': 10,
 'M': 8,
 'N': 0,
 'O': 1,
 'P': 3,
 'Q': 1,
 'R': 2,
 'S': 6,
 'T': 1,
 'U': 0,
 'V': 2,
 'W': 1,
 'X': 5,
 'Y': 2,
 'Z': 4}
In [38]:
trace = go.Scatter(
    x=list(range(len(uppercase))),
    y=[countsDict[letter] for letter in uppercase],
    mode='markers'
)

layout = go.Layout(
    title="Distribution of lastnames by first letter",
    xaxis=dict(tickvals=range(len(uppercase)),
               ticktext=list(uppercase),
               title="First letter of last names"),
    yaxis=dict(title="Number of firstnames")
    )

data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

This plot shows that a few letters, such as C, L, A, and M, have more names than the other letters.
However, this visualization doesn't tell us much about the lengths of the last names.

6.3 Scatterplot with different marker sizes

In this plot we will change the size of each marker to represent the number of students whose last name has a certain length.
To do this, we need a different data structure, one that counts students by the length of their last name:

In [39]:
from collections import Counter
lengthsCount = Counter()

for name in lastnames:
    lengthsCount[len(name)] += 1
    
print lengthsCount
Counter({4: 19, 3: 15, 6: 13, 5: 11, 7: 8, 8: 8, 2: 7, 9: 5, 11: 3, 12: 3, 10: 2, 14: 1})

Let's sort the (length, count) pairs and unpack them into two tuples:

In [40]:
sortedPairs = sorted(lengthsCount.items())
lengths, sizes = zip(*sortedPairs)
lengths
Out[40]:
(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14)

Let's plot this data:

In [41]:
trace = go.Scatter(
    x=lengths,
    y=sizes,
    mode='markers',
    marker=dict(size=sizes)
)

layout = go.Layout(
    title="Distribution of lastnames by length",
    xaxis=dict(title="Length of lastnames"),
    yaxis=dict(title="Count of lastnames")                
    )

data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

The points are too small for certain values, so we'll add a minimum size to each bubble:

In [42]:
trace = go.Scatter(
    x=lengths,
    y=sizes,
    mode='markers',
    marker=dict(size=[el+10 for el in sizes]) # add 10 to each size
)

layout = go.Layout(
    title="Distribution of lastnames by length",
    xaxis=dict(title="Length of lastnames"),
    yaxis=dict(title="Count of lastnames")                
    )

data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

We can also use a color scale so that the color intensity corresponds to the count for each length:

In [43]:
trace = go.Scatter(
    x=lengths,
    y=sizes,
    mode='markers',
    marker=dict(size=[el+10 for el in sizes], # add 10 to each size
                color=[100+el*10 for el in sizes],
               ) 
)

layout = go.Layout(
    title="Distribution of lastnames by length",
    xaxis=dict(title="Length of lastnames"),
    yaxis=dict(title="Count of lastnames"),
    )

data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

What we can see from this graph is that most last names have a length of 4 or 3, followed by 6 and 5.
Names of longer length are fewer.