Author: Eni Mustafaraj
Notes: Originally created for CS 315 Data and Text Mining for the Web (Spring 2017)
In this notebook, we'll see how to use the Python libraries sklearn and scipy to perform the k-means and hierarchical clustering that we discussed in lecture.
Before starting, make sure that you have these packages installed. If not, install them and then continue.
To check:
import sklearn
import scipy
The examples below also use the visualization package seaborn.
# check if packages are installed
import sklearn
import scipy
import seaborn
We will generate a synthetic data set that has natural clusters and then will use clustering functions to see how well they work.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
Here is the function that will create the synthetic data:
# in recent versions of sklearn, make_blobs is imported directly from sklearn.datasets
from sklearn.datasets import make_blobs
help(make_blobs)
Now that we know how to call make_blobs, we can generate a dataset.
X, y_true = make_blobs(n_samples=120, centers=4,
cluster_std=0.40, random_state=0)
print(X.shape)
print(y_true.shape)
We just created the matrix X, which has 120 rows (one for each sample) and two columns (meaning two features), and the vector y_true, which has 120 values indicating which group each data sample belongs to.
y_true
X is a two-dimensional array: every point has two feature values (so that we can plot it on a 2D plane):
X[:10]
We can plot the data samples in X to see the clusters.
plt.figure(figsize=(9,6))
plt.scatter(X[:, 0], X[:, 1], s=50); # s is the size of the dots in the plot
K-means clustering divides the data into k clusters, each with its own center. The algorithm works iteratively to group together data points that are spatially closer to one another.
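To make the iteration concrete, here is a minimal NumPy sketch of the k-means loop (Lloyd's algorithm). The function name and the toy data are our own illustration, not sklearn's actual implementation:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal sketch of the k-means iteration (Lloyd's algorithm)."""
    rng = np.random.RandomState(seed)
    # start from k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        # (if a cluster is empty, keep its old center)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels

# two tiny, well-separated groups
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.],
                  [10., 10.], [10., 11.], [11., 10.]])
centers, labels = kmeans_sketch(X_toy, 2)
print(labels)
```

The real KMeans in sklearn adds smarter initialization (k-means++) and multiple restarts, but the two alternating steps are the same.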
We will use the KMeans algorithm that is implemented in the sklearn package. All sklearn algorithms are used through the same API:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Let's plot the results in order to see whether the clustering worked:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
# The variable centers stores the points that are the centroids of the clusters
centers
We can play with the parameters of our data generation process to create different-looking clusters; for example clusters that are not well separated:
X, y_true = make_blobs(n_samples=300, centers=5,
cluster_std=0.8, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50); # s is the size of the dots in the plot
Let's see how good the clustering will look this time:
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
It worked pretty well.
Let's see what happens when we ask it to find 2 clusters, instead of the 5 natural clusters that are in the data:
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
Let's ask about 3 clusters:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
The algorithm does something reasonable even when we ask it to find fewer clusters than the data naturally contains; thus, it's up to us to decide how many we want.
In k-means clustering, we provide the number of clusters and the algorithm partitions the data accordingly. In agglomerative clustering, the data are grouped together based on distance, and we can decide how many clusters we want once we see how the data have been grouped.
from scipy.cluster.hierarchy import linkage, dendrogram
# linkage is the function that performs the clustering
Z = linkage(X, method='single', metric='euclidean')
# create string labels for the data samples, to show in the dendrogram
ticks = ['el_{}_group={}'.format(i, el) for i, el in enumerate(y_true)]
Draw the dendrogram:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering')
plt.xlabel('samples')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
labels = ticks
)
plt.show()
This makes clear that visualizing the dendrogram is useful, but not for big datasets. The dataset in the chart has 300 points, and it's currently hard to see the labels of the points.
Let's create a smaller dataset that also is a bit more separated.
X, y_true = make_blobs(n_samples=80, centers=3,
cluster_std=0.4, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
Z = linkage(X, method='single', metric='euclidean')
ticks = ['el_{}_group={}'.format(i, el) for i, el in enumerate(y_true)]
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering')
plt.xlabel('samples')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
labels = ticks
)
plt.show()
For this dataset, even the labels are visible. In fact, we can see how the original groups (notice the second part of each name) are clustered together.
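Besides reading clusters off the dendrogram by eye, scipy's fcluster function can cut the tree produced by linkage into a chosen number of flat clusters. A small self-contained sketch, on toy data of our own rather than the blobs above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two obvious groups
pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Z = linkage(pts, method='single', metric='euclidean')
# cut the tree so that we get at most 2 flat clusters
flat = fcluster(Z, t=2, criterion='maxclust')
print(flat)
```

With criterion='maxclust', the threshold t is the maximum number of clusters; with criterion='distance', t would instead be a height at which to cut the dendrogram.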
In scipy we can calculate the pairwise distance between every two data samples with the function pdist:
from scipy.spatial.distance import pdist
xdist = pdist(X, 'euclidean')
xdist.shape
Question: Why is the shape of the distance vector 3160? The dataset X has 80 points. What does xdist contain?
Your answer:
# lookup the first 10 values
xdist[:10]
The first value in xdist is the distance between points X[0] and X[1]. That distance is small because, as you can see below, the values of X[0] and X[1] are close to each other.
X[0], X[1]
However, the second value, 3.97438162, means that X[0] and X[2] are further apart, see values below:
X[0], X[2]
NOTE: the function pdist was called with the argument 'euclidean', so it calculates the Euclidean distance between every pair of points.
The k-NN (k-Nearest Neighbors) algorithm finds, for each point, its k nearest neighbors (the points with the smallest distance to it) and assigns it the majority label among those neighbors.
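As a quick illustration with made-up toy data, sklearn's KNeighborsClassifier implements this idea:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy labeled points: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# each query point gets the majority label of its 3 nearest neighbors
pred = knn.predict([[0.5, 0.5], [5., 5.5]])
print(pred)   # → [0 1]
```

Note that k-NN is a supervised algorithm (it needs labels), unlike the clustering methods in this notebook.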
We'll try clustering with a famous text dataset called 20newsgroups. Read a bit about its content on this page.
Because this is a famous dataset, sklearn already knows how to read its content from files. If you are curious about the files, download the ZIP archive from the website. Careful, it's a big file.
First, sklearn will get the dataset:
import sklearn.datasets
all_data = sklearn.datasets.fetch_20newsgroups(subset='all')
print(len(all_data.filenames))
It contains 18846 files in total, so, it's a big dataset.
Documents are grouped in classes (a class is known as a target). Let's see these classes:
print(all_data.target_names)
Because this dataset is usually used for text classification, it is divided in two parts: "train" data and "test" data. Both groups have labeled examples (the category where a piece of news belongs), but the algorithm learns its model from the train data and is then tested for accuracy (how well it is doing) on the test data.
train_data = sklearn.datasets.fetch_20newsgroups(subset='train')
print(len(train_data.filenames))
test_data = sklearn.datasets.fetch_20newsgroups(subset='test')
print(len(test_data.filenames))
In this dataset, 60% of files are assigned to training and 40% are assigned to testing.
We don't need to work with all 20 groups at once; we can choose to focus on a subset:
groups = ['comp.graphics', 'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
'comp.windows.x', 'sci.space']
train_data = sklearn.datasets.fetch_20newsgroups(subset='train', categories=groups)
print(len(train_data.filenames))
Clustering works with vectors of numbers; thus, all documents need to be converted into such vectors. sklearn already contains classes that perform this transformation. Here is an example of creating one that stems words (reducing a word to its stem, for example: "working" --> "work") and then calculates tf*idf (the product of term frequency and inverse document frequency).
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # reuse the standard analyzer, then stem every token it produces
        analyzer = super().build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
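To see what the stemmer itself does, a quick check (assuming nltk is installed, as above):

```python
import nltk.stem

stemmer = nltk.stem.SnowballStemmer('english')
# the stemmer strips inflections, mapping related word forms to one stem
print(stemmer.stem('working'))   # → work
print(stemmer.stem('kittens'))
```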
Choices
In addition to removing stop-words, stemming, and ignoring unicode characters that cannot be decoded, we also drop all words that appear fewer than 10 times (this removes spelling mistakes and unusual acronyms), as well as words that appear in more than 50% of the documents. All these efforts serve the goal of reducing the dimensionality of the problem (by having a smaller number of features).
vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(train_data.data)
vectorized
from sklearn.cluster import KMeans
num_clusters = 50
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1, random_state=3)
km
# fit the data to this "model"
km.fit(vectorized)
The variable labels_ indicates, for every document, the number of the cluster to which the document has been assigned.
km.labels_
There is a label assigned to every document:
len(km.labels_)
We can look at all centers; they are large vectors:
km.cluster_centers_
km.cluster_centers_.shape
That is, we have 50 centers, each a vector of 4712 dimensions.
Let's print out one centroid; it should be a long array of numbers (tf-idf values) for different words that are relevant to the cluster:
print(list(km.cluster_centers_[0])[:200])
Let's see how many documents are in cluster 42:
indices = [i for i in range(len(km.labels_)) if km.labels_[i] == 42]
len(indices)
print(indices)
indices in this case is a list of document IDs; each document is assigned a number, and we can access its data through that index:
train_data.data[indices[0]]
TASK FOR YOU: Find the smallest cluster, then get the text of some documents from it. Does it make sense that those documents are grouped together?