Author: Eni Mustafaraj
Notes: Originally created for CS 315 Data and Text Mining for the Web (Spring 2017)
In this notebook, we'll see how to use the Python libraries sklearn and scipy to perform the k-means and hierarchical clustering that we discussed in lecture.
Before starting, make sure that you have these packages installed. If not, install them and then continue.
To check:
import sklearn
import scipy
The examples below also use the visualization package seaborn.
# check if packages are installed
import sklearn
import scipy
import seaborn
We will generate a synthetic data set that has natural clusters and then will use clustering functions to see how well they work.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
Here is the function that will create the synthetic data:
# in recent versions of sklearn, make_blobs is imported directly from sklearn.datasets
from sklearn.datasets import make_blobs
help(make_blobs)
Now that we know how to call make_blobs, we can generate a dataset.
X, y_true = make_blobs(n_samples=120, centers=4,
cluster_std=0.40, random_state=0)
print(X.shape)
print(y_true.shape)
We just created the matrix X, which has 120 rows (one for each sample) and two columns (meaning two features), and the vector y_true, which has 120 values indicating which group each data sample belongs to.
y_true
X is a two-dimensional array: every point has two feature values (so that we can plot it on a 2D plane):
X[:10]
We can plot the data samples in X to see the clusters.
plt.figure(figsize=(9,6))
plt.scatter(X[:, 0], X[:, 1], s=50); # s is the size of the dots in the plot
K-means clustering divides the data into k clusters, each with its own center. The algorithm works iteratively to group together data points that are spatially closer to one another.
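To make the iteration concrete, here is a minimal NumPy sketch of the k-means loop (Lloyd's algorithm). The function name and the toy data are our own illustration, not sklearn's actual implementation:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal sketch of the k-means iteration (Lloyd's algorithm)."""
    rng = np.random.RandomState(seed)
    # start from k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        # (if a cluster is empty, keep its old center)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels

# two tiny, well-separated groups
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.],
                  [10., 10.], [10., 11.], [11., 10.]])
centers, labels = kmeans_sketch(X_toy, 2)
print(labels)
```

The real KMeans in sklearn adds smarter initialization (k-means++) and multiple restarts, but the two alternating steps are the same.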
We will use the KMeans algorithm that is implemented in the sklearn package. All sklearn algorithms are used through the same API:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Let's plot the results in order to see whether the clustering worked:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
# The variable centers stores the points that are the centroids of the clusters
centers
We can play with the parameters of our data generation process to create different-looking clusters; for example clusters that are not well separated:
X, y_true = make_blobs(n_samples=300, centers=5,
cluster_std=0.8, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50); # s is the size of the dots in the plot
Let's see how good the clustering will look this time:
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
It worked pretty well.
Let's see what happens when we ask it to find 2 clusters, instead of the 5 natural clusters that are in the data:
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
Let's ask about 3 clusters:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
The algorithm does something reasonable even when we ask it to find fewer clusters than the data naturally contains; thus, it's up to us to decide how many we want.
In k-means clustering, we provide the number of clusters and the algorithm partitions the data accordingly. In agglomerative clustering, the data are grouped together based on distance, and we can decide how many clusters we want once we see how the data have been grouped.
from scipy.cluster.hierarchy import linkage, dendrogram
# linkage is the function that performs the clustering
Z = linkage(X, method='single', metric='euclidean')
# create string labels for the data samples, to show in the dendrogram
ticks = ['el_{}_group={}'.format(i, el) for i, el in enumerate(y_true)]
Draw the dendrogram:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering')
plt.xlabel('samples')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
labels = ticks
)
plt.show()
This makes clear that visualizing the dendrogram is useful, but not for big datasets. The dataset in the chart has 300 points, and it's currently hard to see the labels of the points.
Let's create a smaller dataset that also is a bit more separated.
X, y_true = make_blobs(n_samples=80, centers=3,
cluster_std=0.4, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
Z = linkage(X, method='single', metric='euclidean')
ticks = ['el_{}_group={}'.format(i, el) for i, el in enumerate(y_true)]
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering')
plt.xlabel('samples')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
labels = ticks
)
plt.show()
For this dataset, even the labels are visible. In fact, we can see how the original groups (notice the second part of each name) are clustered together.
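Besides reading clusters off the dendrogram by eye, scipy's fcluster function can cut the tree produced by linkage into a chosen number of flat clusters. A small self-contained sketch, on toy data of our own rather than the blobs above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two obvious groups
pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Z = linkage(pts, method='single', metric='euclidean')
# cut the tree so that we get at most 2 flat clusters
flat = fcluster(Z, t=2, criterion='maxclust')
print(flat)
```

With criterion='maxclust', the threshold t is the maximum number of clusters; with criterion='distance', t would instead be a height at which to cut the dendrogram.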
In scipy we can calculate the pairwise distance between every two data samples with the function pdist:
from scipy.spatial.distance import pdist
xdist = pdist(X, 'euclidean')
xdist.shape
Question: Why is the shape of the distance vector 3160? The dataset X has 80 points. What does xdist contain?
Your answer:
# lookup the first 10 values
xdist[:10]
The first value in xdist is the distance between points X[0] and X[1]. That distance is small because, as you can see below, the values of X[0] and X[1] are close to each other.
X[0], X[1]
However, the second value, 3.97438162, means that X[0] and X[2] are further apart, see values below:
X[0], X[2]
NOTE: the function pdist was called with the argument 'euclidean', so it calculates the Euclidean distance between every pair of points.
The k-NN (k-Nearest Neighbors) algorithm finds, for each point, its k nearest neighbors (the points with the smallest distance to it) and assigns it the majority label among those neighbors.
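As a quick illustration with made-up toy data, sklearn's KNeighborsClassifier implements this idea:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy labeled points: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# each query point gets the majority label of its 3 nearest neighbors
pred = knn.predict([[0.5, 0.5], [5., 5.5]])
print(pred)   # → [0 1]
```

Note that k-NN is a supervised algorithm (it needs labels), unlike the clustering methods in this notebook.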
We'll try clustering with a famous text dataset called 20newsgroups. Read a bit about its content on this page.
Because this is a famous dataset, sklearn already knows how to read its content from files. If you are curious about the files, download the ZIP archive from the website. Careful, it's a big file.
First, sklearn will get the dataset:
import sklearn.datasets
all_data = sklearn.datasets.fetch_20newsgroups(subset='all')
print(len(all_data.filenames))
It contains 18846 files in total, so, it's a big dataset.
Documents are grouped in classes (a class is known as a target). Let's see these classes:
print(all_data.target_names)
Because this dataset is usually used for text classification, it is divided in two parts: "train" data and "test" data. Both groups have labeled examples (the category where a piece of news belongs), but the algorithm learns its model from the train data and is then tested for accuracy (how well it is doing) on the test data.
train_data = sklearn.datasets.fetch_20newsgroups(subset='train')
print(len(train_data.filenames))
test_data = sklearn.datasets.fetch_20newsgroups(subset='test')
print(len(test_data.filenames))
In this dataset, 60% of files are assigned to training and 40% are assigned to testing.
We don't need to work with all 20 groups at once; we can choose to focus on a subset:
groups = ['comp.graphics', 'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
'comp.windows.x', 'sci.space']
train_data = sklearn.datasets.fetch_20newsgroups(subset='train', categories=groups)
print(len(train_data.filenames))
Clustering works with vectors of numbers; thus, all documents need to be converted into such vectors. sklearn already contains classes that perform this transformation. Here is an example of creating one that stems words (reducing a word to its stem, for example: "working" --> "work") and then calculates tf*idf (the product of term frequency and inverse document frequency).
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # reuse the standard analyzer, then stem every token it produces
        analyzer = super().build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
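To see what the stemmer itself does, a quick check (assuming nltk is installed, as above):

```python
import nltk.stem

stemmer = nltk.stem.SnowballStemmer('english')
# the stemmer strips inflections, mapping related word forms to one stem
print(stemmer.stem('working'))   # → work
print(stemmer.stem('kittens'))
```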
Choices
In addition to removing stop-words, stemming, and ignoring unicode characters that cannot be decoded, we also drop all words that appear fewer than 10 times (this removes spelling mistakes and unusual acronyms), as well as words that appear in more than 50% of the documents. All these efforts serve the goal of reducing the dimensionality of the problem (by having a smaller number of features).
vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(train_data.data)
vectorized
from sklearn.cluster import KMeans
num_clusters = 50
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1, random_state=3)
km
# fit the data to this "model"
km.fit(vectorized)
The variable labels_ indicates, for every document, the number of the cluster to which the document has been assigned.
km.labels_
There is a label assigned to every document:
len(km.labels_)
We can look at all centers; they are large vectors:
km.cluster_centers_
km.cluster_centers_.shape
That is, we have 50 centers, each a vector of 4712 dimensions.
Let's print out one centroid; it should be a long array of numbers (tf-idf values) for different words that are relevant to the cluster:
print(list(km.cluster_centers_[0])[:200])
Let's see how many documents are in cluster 42:
indices = [i for i in range(len(km.labels_)) if km.labels_[i] == 42]
len(indices)
print(indices)
indices in this case is a list of document IDs; each document is assigned a number, and we can access its data through that index:
train_data.data[indices[0]]
TASK FOR YOU: Find the smallest cluster, then get the text of some documents from it. Does it make sense that those documents are grouped together?