This notebook reviews concepts of hypothesis testing in the context of a problem.
It looks at the question: what can we say about the ethnic diversity of a group
of people based on their last names, concretely, the lengths of the last names?
To do this analysis we need to find data sources that contain last names from
different countries, so that we can compare the average lengths. We will then
take a class of Wellesley students and test the hypothesis that Wellesley students
are ethnically more diverse than the American population.
NOTE: You can see this notebook as an example of a project for grade C, although
it is on purpose not about "digital natives", so that it doesn't influence your work.
While this data might be available in CSV format somewhere, to show that we usually need to write code to get data, I'm looking at data from a website: Top surnames in the United States. As we can see, the data is in a table in the HTML file, so we need to extract it with BeautifulSoup.
Steps:
Step 1: Get the HTML page
# Step 1: get the HTML page
import requests
response = requests.get("https://surnames.behindthename.com/top/lists/united-states/1990")
if response.status_code == 200:
    htmlpage = response.content  # the raw bytes of the page
print(htmlpage[:300])
The HTML page is here; it contains the top 1000 last names from the 1990 Census.
Step 2: Use BeautifulSoup to create a searchable tree
# Step 2: Use BeautifulSoup
from bs4 import BeautifulSoup
htmltree = BeautifulSoup(htmlpage, 'lxml')
htmltree.find('h1')
The tree is created; we can now search it for the elements we need.
Step 3: Inspect HTML
At this point, it's better to use the "View Source" on the browser to look at the different elements that might help us to extract the names.
It turns out this is very old-style HTML: it uses a class on each odd/even row to alternate the row colors.
The two classes are r0 and r1.
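Before extracting everything, we can peek at the first row of each class to confirm the structure (a quick check I'm adding; find works like find_all but returns only the first match):
print(htmltree.find(attrs={'class': 'r0'}))  # first row with class r0
print(htmltree.find(attrs={'class': 'r1'}))  # first row with class r1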
Step 4: Write code to extract data
We will extract the values from both kinds of rows.
# Step 4: Extract data
odd = htmltree.find_all(attrs={'class': 'r1'})   # all rows with class r1
even = htmltree.find_all(attrs={'class': 'r0'})  # all rows with class r0
allRows = odd + even
len(allRows)
We got 1000 rows, as expected. Let's see what each row contains:
allRows[0]
It contains a lot of HTML "junk" that is not useful to us. Luckily, BeautifulSoup has a method get_text()
that extracts the text from each HTML element.
allRows[0].get_text()
This still has some junk characters; let's split at the whitespace:
allRows[0].get_text().split()
Much better. Finally, let's create two separate lists: one for the names and one for the percentages.
The percentage value is a string; we will need to remove the '%' character and divide by 100 to convert it to a probability value.
Also, given that we cannot be sure the HTML doesn't contain errors, it's better to use try ... except
when extracting the values.
names = []  # to store the name strings
probs = []  # to store the probability values
for row in allRows:
    try:
        index, name, percent = row.get_text().split()
        names.append(str(name))
        probs.append(float(percent.replace('%', '')) / 100)
    except ValueError:
        print(row)  # show any row that doesn't parse cleanly
We didn't get any errors. Now we can put this data into a pandas dataframe.
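Before storing it, a quick sanity check (a suggested addition, not in the original notebook) confirms that both lists have the expected 1000 entries:
print(len(names), len(probs))  # both should be 1000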
Step 5: Store the data for future analysis
In order not to repeat the above steps again, we will store the data in a dataframe and also save
it as a CSV copy. That way, we can start an entirely new notebook just for the exploration of the data.
import pandas as pd
dfnames = pd.DataFrame({'names': names, 'probs': probs})
dfnames.head()
Now let's store it as a CSV for the future.
dfnames.to_csv("top1000names.csv", encoding='utf-8')
Since the data is now stored as a CSV, we can load it back into a pandas dataframe.
If the process of getting the data from one or more sources and combining it into a dataframe
is laborious, create separate notebooks for the different steps of the data science cycle.
dfnames = pd.read_csv("top1000names.csv")
dfnames.head()
I don't like the Unnamed column, so I'll get rid of it when I read the file by using the named parameter usecols,
which specifies which columns we want to keep in the dataframe:
dfnames = pd.read_csv("top1000names.csv", usecols=[1,2])
dfnames.head()
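Alternatively (a suggested variation, not one of the original steps), we could avoid the Unnamed column altogether by not writing the row index when saving:
dfnames.to_csv("top1000names.csv", index=False)  # skip the row index column
dfnames = pd.read_csv("top1000names.csv")        # no usecols needed now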
Let's create a new column for the length of the names:
dfnames['namelengths'] = dfnames.names.apply(len)
dfnames.head()
Let's look at some descriptive statistics:
dfnames.namelengths.describe()
We notice that the mean and the median are really close to one another, 6.12 and 6, which is usually a good indication of a normal-like distribution.
Let's create a histogram. We need first to import matplotlib:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
pandas has its own version of plotting that can be called as a method on every column:
dfnames.namelengths.plot(kind='hist')
We can also try the visualization with Seaborn, which also shows the line form:
sns.distplot(dfnames.namelengths)
Oh, this looks ugly. This is because the data takes only a few distinct integer values, which don't line up with the default bins; let's ask for a specific number of bins.
sns.distplot(dfnames.namelengths, bins=8)
Much better. Now, let's find the mean of the population of name lengths.
Hypothesis testing is closely related to the sampling distribution: the distribution of a certain statistic (mean, proportion, standard deviation) for all samples of a population. As we will see below the sampling distribution of the means is a normal distribution.
We will get several samples of different sizes to see what happens with the sampling distribution.
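Since we will run the same sampling loop several times below, here is a small helper that wraps it (a refactoring sketch; the function name sample_means is mine, not from the original notebook):
import random
import pandas as pd

def sample_means(values, n_samples, sample_size):
    """Return the means of n_samples random samples, each of sample_size values."""
    means = []
    for _ in range(n_samples):
        vals = random.sample(list(values), sample_size)  # sample without replacement
        means.append(pd.Series(vals).mean())
    return means
For clarity, the experiments below spell out the loop each time; each one is equivalent to a call like sample_means(dfnames.namelengths, 10, 100).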
10 samples
Let's find the means of 10 samples of 100 items each.
import random
means = []
for i in range(10):
    vals = random.sample(list(dfnames.namelengths), 100)  # 100 lengths, without replacement
    means.append(pd.Series(vals).mean())
plt.hist(means)
This shows no clear pattern yet; we get something closer to a uniform random spread.
100 samples
Let's create 100 samples of 100 names each:
means = []
for i in range(100):
    vals = random.sample(list(dfnames.namelengths), 100)
    means.append(pd.Series(vals).mean())
plt.hist(means)
Somewhat better, but not yet close to a normal distribution.
1000 samples
means = []
for i in range(1000):
    vals = random.sample(list(dfnames.namelengths), 100)
    means.append(pd.Series(vals).mean())
plt.hist(means)
This is the closest result to the normal distribution we have encountered so far.
Does the result depend on the sample size?
We'll look at 3000 samples of size 50:
means = []
for i in range(3000):
    vals = random.sample(list(dfnames.namelengths), 50)
    means.append(pd.Series(vals).mean())
plt.hist(means)
The Central Limit Theorem: the distribution of the sample means approaches a normal distribution, regardless of the shape of the population distribution.
In both cases where we created a large number of samples, the distribution looked close to the normal distribution, but the interval of values for the means widened when the sample sizes were smaller. That is, for samples that are small, we can expect more variability than for samples that are bigger.
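The theorem also predicts how much the spread shrinks: the standard deviation of the sample means (the standard error) is approximately σ/√n, where σ is the standard deviation of the population. Here is a quick numerical check (an addition of mine; it just prints the predicted spreads for the two sample sizes used below):
import math
sigma = dfnames.namelengths.std()   # spread of the population of name lengths
for n in [40, 300]:
    print(n, sigma / math.sqrt(n))  # predicted standard error for each sample size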
Thus, let's run two final experiments: one with a big sample size and one with a small sample size:
random.seed(0)
means = []
for i in range(5000):
    vals = random.sample(list(dfnames.namelengths), 300)
    means.append(pd.Series(vals).mean())
plt.hist(means)
random.seed(0)
means = []
for i in range(5000):
    vals = random.sample(list(dfnames.namelengths), 40)
    means.append(pd.Series(vals).mean())
plt.hist(means)
These two charts show clearly that when the sample size was large (n=300), the interval of possible values for the mean was about 5.9-6.4. However, for the small sample size (n=40), the interval widened to about 5.5-7.0. This makes clear that, just by chance, when the sample size is small we can get many mean values that seem far away from the mean of the population, but these divergences are purely random and one shouldn't attach meaning to them.
We hypothesized that Wellesley students are ethnically more diverse than the American population. We decided to test this by taking the average length of their last names and comparing it to the average of the American population. The null hypothesis is that the class of Wellesley students has the same average last-name length as the US population (they don't differ). The alternative hypothesis (the one we are interested in) is that the mean length is significantly different from that of the American population.
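Formally, writing $\mu$ for the mean last-name length of the Wellesley students and using the US population mean of 6.125 computed above:
$$H_0: \mu = 6.125 \qquad H_1: \mu \neq 6.125$$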
Here is the process we will follow:
Step 1: Load data
# Step 1: Load the Wellesley data
with open('lastnames.txt', 'r') as inFile:
    lastnames = [name.strip('\n') for name in inFile]
print(len(lastnames))
Let's look at some of the names:
lastnames[:5]
Step 2: Generate sample
lengthNames = [len(name) for name in lastnames]
Let's explore this data:
lengthsSeries = pd.Series(lengthNames)
lengthsSeries.describe()
The mean for this sample is 5.64, which is smaller than the mean of 6.125 for the U.S. population.
Is this difference statistically significant?
Let us also see the distribution of the lengths:
sns.distplot(lengthsSeries)
NOTE: This is clearly a skewed distribution, where most values are small. But any sample we draw
from the population might be similarly skewed (in one direction or the other).
We can test this below; let's get a sample of 95, the same size as our group of Wellesley students:
random.seed(0)
onesample = random.sample(list(dfnames.namelengths), 95)
sns.distplot(onesample)
As we can notice, this distribution doesn't look as normal as the population either.
Step 3: One-sample hypothesis testing
Because we have one sample and we want to compare it against the US population, we will use the one-sample t-test with the known mean, which we calculated above: 6.125.
from scipy import stats  # import the stats submodule explicitly
onesampleRes = stats.ttest_1samp(lengthsSeries, 6.125)
print(onesampleRes)
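For reference, the t statistic that scipy reports follows the textbook formula $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$; we can verify it by hand (a check I'm adding, not part of the original notebook):
import math
n = len(lengthsSeries)
tstat = (lengthsSeries.mean() - 6.125) / (lengthsSeries.std(ddof=1) / math.sqrt(n))
print(tstat)  # should match onesampleRes[0]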
Because the p-value is 0.08, which is greater than 0.05, we cannot reject the null hypothesis. The lengths of the Wellesley last names show no evidence of a "diversity" different from that of the American population.
It is possible to create OFFLINE visualizations with Plotly, which don't need to be
posted online (posting online has a limit on how many charts you can create).
Notice below that we're using the plotly.offline module.
import plotly.graph_objs as go
import plotly.figure_factory as FF
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode()
Plotly can be used to draw tables:
matrix_onesample = [
['', 'Test Statistic', 'p-value'],
['Sample Data', onesampleRes[0], onesampleRes[1]]
]
onesample_table = FF.create_table(matrix_onesample, index=True)
iplot(onesample_table)
There are many ways to represent our data as a scatterplot. We'll go over a few of them here.
We can treat each student's name as a single measurement, using its index in the list as the x-value and the length of the last name as the y-value.
trace = go.Scatter(
    x=list(range(len(lengthsSeries))),  # the index of each student
    y=lengthsSeries,
    mode='markers'
)
data = [trace]
iplot(data)
Although we can see some patterns here, for example, many students whose last names have the same length of 3 or 4 and are also alphabetically close (because the names are sorted alphabetically), it's hard to say anything meaningful about the data.
Another plot we can create shows the relation between the first letter and the number of last names starting with that letter. Because later we might also be interested in the lengths, we'll create a data structure that lets us transform the data as needed:
from collections import defaultdict
firstLetterDict = defaultdict(list)
for name in lastnames:
    firstLetterDict[name[0]].append(name)
print(firstLetterDict.keys())
for key in list(firstLetterDict.keys())[:5]:
    print(key, len(firstLetterDict[key]))
Because the data might be missing some letters, to create a good chart we need to go over all the letters of the alphabet:
from string import ascii_uppercase  # this was string.uppercase in Python 2
print(ascii_uppercase)
The following dict will accumulate for each letter the number of last names that start with that letter:
countsDict = {letter: len(firstLetterDict.get(letter, [])) for letter in ascii_uppercase}
countsDict
trace = go.Scatter(
    x=list(range(len(ascii_uppercase))),  # one position per letter
    y=[countsDict[letter] for letter in ascii_uppercase],
    mode='markers'
)
layout = go.Layout(
    title="Distribution of lastnames by first letter",
    xaxis=dict(tickvals=list(range(len(ascii_uppercase))),
               ticktext=list(ascii_uppercase),
               title="First letter of last names"),
    yaxis=dict(title="Number of lastnames")
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)
This plot shows that a few letters, like C, L, A, and M, have more names than the other letters.
However, this visualization doesn't show us much about the lengths of the last names.
In this plot we will change the size of each marker to represent the number of students whose last name has a certain length.
To do this, we need to create a different data structure, one that groups students by the length of their last name:
from collections import Counter
lengthsCount = Counter()
for name in lastnames:
    lengthsCount[len(name)] += 1
print(lengthsCount)
Let's sort this dict:
sortedPairs = sorted(lengthsCount.items())
lengths, sizes = zip(*sortedPairs)  # unzip the (length, count) pairs into two tuples
lengths
Let's plot this data:
trace = go.Scatter(
x=lengths,
y=sizes,
mode='markers',
marker=dict(size=sizes)
)
layout = go.Layout(
title="Distribution of lastnames by length",
xaxis=dict(title="Length of lastnames"),
yaxis=dict(title="Count of lastnames")
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)
The points are too small for certain values; we'll add a minimum size to each bubble:
trace = go.Scatter(
x=lengths,
y=sizes,
mode='markers',
marker=dict(size=[el+10 for el in sizes]) # add 10 to each size
)
layout = go.Layout(
title="Distribution of lastnames by length",
xaxis=dict(title="Length of lastnames"),
yaxis=dict(title="Count of lastnames")
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)
We can use a color scale to have the intensity of color correspond to lengths with greater count:
trace = go.Scatter(
x=lengths,
y=sizes,
mode='markers',
marker=dict(size=[el+10 for el in sizes], # add 10 to each size
color=[100+el*10 for el in sizes],
)
)
layout = go.Layout(
title="Distribution of lastnames by length",
xaxis=dict(title="Length of lastnames"),
yaxis=dict(title="Count of lastnames"),
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)
What we can see from this graph is that most last names have a length of 4 or 3, followed by 6 and 5.
Names of longer length are fewer.