Week 14: Text Classification

Author: Eni Mustafaraj
12/05/2017

Table of Contents

  1. Vectorizing text data
  2. Naive Bayes for text classification
    2.1 Create classifier instance
    2.2 Fit the classifier
    2.3 Predict new cases
  3. New task: classify emails by their title
  4. YOUR TASK: classify some of your emails
  5. EXTRA: Cross-validation

1. Vectorizing text data

We'll take the simple example from the slides with Chinese and Japanese labels.

In [1]:
X = ["Chinese Beijing Chinese", 
     "Chinese Chinese Shangai", 
     "Chinese Macao",
     "Tokyo Japan Chinese"]

y = ['c', 'c', 'c', 'j']

(X, y) is the training set with four documents and four labels.

Vectorizing means converting every document into a vector of numbers.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
In [3]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X)
X_train
Out[3]:
<4x6 sparse matrix of type '<type 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

Notice that the created matrix has 4 rows and 6 columns. There are 4 rows because there are 4 documents, but why are there 6 columns? Because that is the size of the vocabulary of this corpus.
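We can confirm this directly from the matrix's shape (a quick check, using the X_train object created above):

# the shape is (number of documents, vocabulary size)
print(X_train.shape)    # (4, 6)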

Since the internal representation of X_train is in compressed form, let's convert it into an array to see its content.

In [4]:
X_train.toarray()
Out[4]:
array([[1, 2, 0, 0, 0, 0],
       [0, 2, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 1]])

To see the corresponding features for each column, we refer to the vectorizer object:

In [5]:
vectorizer.get_feature_names()
Out[5]:
[u'beijing', u'chinese', u'japan', u'macao', u'shangai', u'tokyo']

If we look at document 4:

"Tokyo Japan Chinese"

it's represented by this vector:

[0, 1, 1, 0, 0, 1]

which has a 1 in the columns for the features 'chinese', 'japan', and 'tokyo': the 2nd, 3rd, and 6th.
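We can also look up the column index of each word in the vectorizer's vocabulary_ dictionary (a quick check; vocabulary_ maps each feature to its column index):

# the mapping matches the ordering of get_feature_names(), e.g. u'tokyo' -> 5
print(vectorizer.vocabulary_)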

fit_transform vs. transform

To transform the X corpus, we used the method fit_transform, because the vectorizer first had to learn the vocabulary from the corpus (the fitting process).

When we get new data (the testing data), we will use the method transform instead, to create the representation using the vocabulary learned from the training corpus.

In [6]:
test = ['Chinese Chinese Chinese Tokyo Shangai']
testVector = vectorizer.transform(test)
testVector
Out[6]:
<1x6 sparse matrix of type '<type 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>
In [7]:
testVector.toarray()
Out[7]:
array([[0, 3, 0, 0, 1, 1]])
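Note that transform simply ignores words that were not seen during fitting. A quick sketch (the word 'Paris' is hypothetical and not part of the training vocabulary):

# 'Paris' is unknown to the vectorizer, so only the 'chinese' column gets a count
print(vectorizer.transform(['Chinese Paris']).toarray())    # [[0 1 0 0 0 0]]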

2. Naive Bayes for text classification

Now that our data is in a vector form, we can train the classifier on it.

We will use the Multinomial Naive Bayes classifier from sklearn.

Step 1: Create a classifier instance

In [8]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier
Out[8]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

alpha=1.0 is the Laplace smoothing parameter. It is usually 1, but there are cases when another value works better. We'll keep it at 1.0.

Step 2: Fit the classifier

In [9]:
classifier.fit(X_train, y)
Out[9]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In order to see what is happening inside the classifier, we can inspect some of its attributes. To find out what it has to offer, we can use the built-in Python function dir:

In [10]:
print dir(classifier)
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_count', '_estimator_type', '_get_coef', '_get_intercept', '_get_param_names', '_joint_log_likelihood', '_update_class_log_prior', '_update_feature_log_prob', 'alpha', 'class_count_', 'class_log_prior_', 'class_prior', 'classes_', 'coef_', 'feature_count_', 'feature_log_prob_', 'fit', 'fit_prior', 'get_params', 'intercept_', 'partial_fit', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params']
In [11]:
# show the class names that it has learned

classifier.classes_
Out[11]:
array(['c', 'j'],
      dtype='|S1')
In [12]:
# show how many instances for each class

classifier.class_count_
Out[12]:
array([ 3.,  1.])
In [13]:
# show total number of features for each class

classifier.feature_count_
Out[13]:
array([[ 1.,  5.,  0.,  1.,  1.,  0.],
       [ 0.,  1.,  1.,  0.,  0.,  1.]])

As you remember, the second feature, "chinese", showed up 5 times in class "c", which we see here too.
Now, let's look at the log probabilities. Because the probabilities themselves can be very small, the classifier calculates and stores log probabilities:

In [14]:
classifier.feature_log_prob_
Out[14]:
array([[-1.94591015, -0.84729786, -2.63905733, -1.94591015, -1.94591015,
        -2.63905733],
       [-2.19722458, -1.5040774 , -1.5040774 , -2.19722458, -2.19722458,
        -1.5040774 ]])

What do the parameters mean?

The array above is P(w|c) for both classes "c" and "j" (Chinese or Japanese documents).
In the first row, we notice three distinct values: -1.946, -0.847, and -2.639. These correspond to the counts 1, 5, and 0 that we find in the feature count array.

As we discussed in lecture, the probabilities are ratios of counts, smoothed to deal with 0 counts.
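With the default $\alpha = 1$ (Laplace smoothing), the general estimate for a word $w$ in class $c$ is $(count(w, c)+1)/(count(words\ in\ c)+|V|)$, where $count(words\ in\ c)$ is the total number of word occurrences in the documents of class $c$ and $|V|$ is the vocabulary size.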

For a word that appears 0 times, the probability estimate is $1/(count(words)+|V|)$. In our case, that is $1/(8+6) = 1/14$.

The log value that is calculated is the natural logarithm, with base $e$. We can compare the values to see that we're getting the expected values:

In [15]:
print "probability for 0-count features:", 1/14.
import numpy as np
print "take the exponential of the log value:", np.e**-2.63905733
probability for 0-count features: 0.0714285714286
take the exponential of the log value: 0.0714285714011
In [16]:
1/14. - np.e**-2.63905733
Out[16]:
2.7481517062000194e-11

As we can see, these numbers are identical (up to the 10th digit after the decimal point).

Larger values (e.g., -0.847) will correspond to higher probabilities, and smaller values (e.g., -2.639) will correspond to lower probabilities.
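We can run the same check for -0.847, which is the log probability of 'chinese' in class 'c'. Its count is 5, so the smoothed estimate should be $(5+1)/(8+6) = 6/14$ (a quick sketch, reusing the value stored in the array above):

import numpy as np
print(6/14.)                 # expected smoothed probability for 'chinese' in class 'c'
print(np.e**-0.84729786)     # exponential of the stored log probability: (almost) the same value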

Step 3: Predict new cases

Now that we have a classifier, we can use it to predict new instances.

We already have a test case that we transformed into a vector.

In [17]:
classifier.predict(testVector)
Out[17]:
array(['c'],
      dtype='|S1')

The result means that the classifier predicted the label "c" for this document. To see how confident the classifier is, we can look up the probabilities that it assigned to each label:

In [18]:
classifier.predict_proba(testVector)
Out[18]:
array([[ 0.89892033,  0.10107967]])

This shows that the classifier was pretty confident in the label "c" (remember, the labels are "c" and "j").

Summary:

The entire process had four steps:

  1. Vectorize the training set (to turn documents into vectors of numbers)
  2. Create a classifier instance
  3. Fit the classifier (and inspect what it has learned)
  4. Make predictions on new data (which are also vectorized)
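These four steps can also be chained together with sklearn's Pipeline. Here is a minimal sketch using the small corpus from Section 1 (the names docs, labels, and model are illustrative, not part of the cells above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["Chinese Beijing Chinese",
        "Chinese Chinese Shangai",
        "Chinese Macao",
        "Tokyo Japan Chinese"]
labels = ['c', 'c', 'c', 'j']

# the pipeline bundles the vectorizer (step 1) and the classifier (step 2)
model = Pipeline([('vect', CountVectorizer()),
                  ('nb', MultinomialNB())])

# fitting the pipeline fits the vectorizer, then the classifier (step 3)
model.fit(docs, labels)

# predict takes raw strings; the pipeline vectorizes them for us (step 4)
print(model.predict(['Chinese Chinese Chinese Tokyo Shangai']))    # ['c']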

3. New task: classify emails by their titles

In this task, we'll try to classify emails into "CS111" or "CS234" by their title.

The training set is the list of titles of the last 15 emails in each of these groups, as they appear in Eni's inbox.

NOTE: Notice the two strings below that use triple quotes to span multiple lines (this was a Python quiz question at the start of the semester).

In [19]:
dataCS111 = """THQuiz4 posted
Take-home Quiz 2 grades are now available
Reminder: PS09 reflection due tonight, quiz in class tomorrow
PS09 Solution + Reflection posted, due Thu, 11/30 at 23:59
PS10 has been posted
*TOMORROW*: Tissenbaum lunchtime talk on Computational Thinking in App Inventor
PS09 - task 2
How to fix wxNSApplication error in ps09 drawLs.py turtle problem
Drop in hours cancelled today
PS09 Otter Inspect
Ticks issues in thquiz3 Subtask b
Re: OtterInspect error
OtterInspector for THQ3
Re: OtterInspect: Task C
Re: Opacity of Legend for Subtask Two"""
In [20]:
dataCS234 = """Help room 7-9pm in SCI160A!
Would it be helpful to have help room on Thursday evening?
help room this evening
Datasets for machine learning tasks
Week 14 in CS 234
Check-ins for the second semester half: Sat & Sun noon-3PM
Cross-validation with sklearn
Eni's office hours today (Thu, 11/30/17): 12:30-2:30
Lmk if it would be helpful to have help room tonight from 9-10:30PM!
Follow-up on yesterday's class
Help room S160A! 7-9pm
This week in CS 234: class discussion on Tue -- please come prepared.
Code to work with the timestamps of Chrome browser
I'm walking from Clapp, be there soon [EOM]
Help room now until 9pm in SCI160A!"""

Let's split these strings to create lists of documents:

In [21]:
dataCS111Docs = [line.strip() for line in dataCS111.split("\n")]
print len(dataCS111Docs)

dataCS234Docs = [line.strip() for line in dataCS234.split("\n")]
print len(dataCS234Docs)
15
15

Then the corpus is:

In [22]:
X = dataCS111Docs + dataCS234Docs
X
Out[22]:
['THQuiz4 posted',
 'Take-home Quiz 2 grades are now available',
 'Reminder: PS09 reflection due tonight, quiz in class tomorrow',
 'PS09 Solution + Reflection posted, due Thu, 11/30 at 23:59',
 'PS10 has been posted',
 '*TOMORROW*: Tissenbaum lunchtime talk on Computational Thinking in App Inventor',
 'PS09 - task 2',
 'How to fix wxNSApplication error in ps09 drawLs.py turtle problem',
 'Drop in hours cancelled today',
 'PS09 Otter Inspect',
 'Ticks issues in thquiz3 Subtask b',
 'Re: OtterInspect error',
 'OtterInspector for THQ3',
 'Re: OtterInspect: Task C',
 'Re: Opacity of Legend for Subtask Two',
 'Help room 7-9pm in SCI160A!',
 'Would it be helpful to have help room on Thursday evening?',
 'help room this evening',
 'Datasets for machine learning tasks',
 'Week 14 in CS 234',
 'Check-ins for the second semester half: Sat & Sun noon-3PM',
 'Cross-validation with sklearn',
 "Eni's office hours today (Thu, 11/30/17): 12:30-2:30",
 'Lmk if it would be helpful to have help room tonight from 9-10:30PM!',
 "Follow-up on yesterday's class",
 'Help room S160A! 7-9pm',
 'This week in CS 234: class discussion on Tue -- please come prepared.',
 'Code to work with the timestamps of Chrome browser',
 "I'm walking from Clapp, be there soon [EOM]",
 'Help room now until 9pm in SCI160A!']

Let's assign the labels too:

In [23]:
y = ["cs111"]*15 + ["cs234"]*15
print y
['cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs111', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234', 'cs234']

a. Vectorize the data

In [24]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X)
X_train
Out[24]:
<30x127 sparse matrix of type '<type 'numpy.int64'>'
	with 197 stored elements in Compressed Sparse Row format>
In [25]:
print vectorizer.get_feature_names()
[u'10', u'11', u'12', u'14', u'17', u'23', u'234', u'30', u'30pm', u'3pm', u'59', u'9pm', u'app', u'are', u'at', u'available', u'be', u'been', u'browser', u'cancelled', u'check', u'chrome', u'clapp', u'class', u'code', u'come', u'computational', u'cross', u'cs', u'datasets', u'discussion', u'drawls', u'drop', u'due', u'eni', u'eom', u'error', u'evening', u'fix', u'follow', u'for', u'from', u'grades', u'half', u'has', u'have', u'help', u'helpful', u'home', u'hours', u'how', u'if', u'in', u'ins', u'inspect', u'inventor', u'issues', u'it', u'learning', u'legend', u'lmk', u'lunchtime', u'machine', u'noon', u'now', u'of', u'office', u'on', u'opacity', u'otter', u'otterinspect', u'otterinspector', u'please', u'posted', u'prepared', u'problem', u'ps09', u'ps10', u'py', u'quiz', u're', u'reflection', u'reminder', u'room', u's160a', u'sat', u'sci160a', u'second', u'semester', u'sklearn', u'solution', u'soon', u'subtask', u'sun', u'take', u'talk', u'task', u'tasks', u'the', u'there', u'thinking', u'this', u'thq3', u'thquiz3', u'thquiz4', u'thu', u'thursday', u'ticks', u'timestamps', u'tissenbaum', u'to', u'today', u'tomorrow', u'tonight', u'tue', u'turtle', u'two', u'until', u'up', u'validation', u'walking', u'week', u'with', u'work', u'would', u'wxnsapplication', u'yesterday']

There are 127 features (distinct tokens) in this case. Notice how the CountVectorizer does the cleaning up of the data (lowercasing, dropping punctuation) by itself.

In [26]:
vectorizer
Out[26]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

It uses a regular expression to decide what counts as a word: token_pattern=u'(?u)\\b\\w\\w+\\b' (two or more word characters).
If we want different tokens, we can pass a different regular expression to the token_pattern parameter.
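For example, a pattern that also keeps one-letter tokens might look like this (a sketch; vectorizer2 is just an illustrative name, and X is the corpus of 30 email titles from above):

from sklearn.feature_extraction.text import CountVectorizer

# \w+ keeps tokens of one or more word characters (the default \w\w+ requires two or more)
vectorizer2 = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vectorizer2.fit(X)
# the vocabulary should now also include single-character tokens such as u'b', u'c', and u'2'
print(len(vectorizer2.vocabulary_))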

b. Train the classifier

In [27]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y)
Out[27]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [28]:
classifier.classes_
Out[28]:
array(['cs111', 'cs234'],
      dtype='|S5')
In [29]:
classifier.feature_log_prob_.shape
Out[29]:
(2, 127)
In [30]:
classifier.feature_log_prob_[0][:10]
Out[30]:
array([-5.35658627, -4.66343909, -5.35658627, -5.35658627, -5.35658627,
       -4.66343909, -5.35658627, -4.66343909, -5.35658627, -5.35658627])

These are the first 10 parameters of the probability distribution P(w|c='cs111') for the classifier.
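To see which words have the highest probability under each class, we can sort these parameters. A sketch, assuming the classifier and vectorizer objects fitted above:

import numpy as np

feature_names = np.array(vectorizer.get_feature_names())
for i, label in enumerate(classifier.classes_):
    # indices of the 5 features with the highest P(w|c) for this class
    top5 = np.argsort(classifier.feature_log_prob_[i])[-5:][::-1]
    print(label + ": " + ", ".join(feature_names[top5]))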

c. Predict new cases

In [31]:
test = ["PS09 will be posted later today",
        "Moving Wednesday's drop-in hours to Tuesday",
        "No office hours this evening; I'm attending the talk at 5PM in PND Atrium",
        "Help room tonight! Claflin living room. 6-8PM"
       ]
In [32]:
testVector = vectorizer.transform(test)
testVector
Out[32]:
<4x127 sparse matrix of type '<type 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>
In [33]:
# Let's predict the classes for the four new email titles
for case in testVector:
    print classifier.predict(case)[0]
cs111
cs111
cs234
cs234
In [34]:
for case in testVector:
    print classifier.predict(case)[0], classifier.predict_proba(case)
cs111 [[ 0.90925747  0.09074253]]
cs111 [[ 0.66711474  0.33288526]]
cs234 [[ 0.19866122  0.80133878]]
cs234 [[ 0.00484531  0.99515469]]

The classifier correctly predicted these new cases.

4. YOUR TASK: classify some of your emails

In this task you'll create the dataset yourself, apply labels to it, vectorize it, train the classifier, and make predictions on some unseen data.

Ideas for classification:

  1. create a dataset that contains emails from two or three different professors (remove their names) and predict the author of each email
  2. create a dataset with some spam emails and good emails and predict the spam/ham labels
  3. create a dataset with emails from your parents (or friends) and from an organization (or some other source) and classify between the two.

You might want to put the data into TEXT files, because the text might be too long to include directly in the notebook.
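If you do put each group of titles into its own text file (one title per line), loading them might look like this (a sketch; the filenames and label names below are placeholders):

# the filenames are placeholders; use your own files, one email title per line
with open('group1_titles.txt') as f1:
    docs1 = [line.strip() for line in f1 if line.strip()]
with open('group2_titles.txt') as f2:
    docs2 = [line.strip() for line in f2 if line.strip()]

X = docs1 + docs2
y = ['group1'] * len(docs1) + ['group2'] * len(docs2)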

5. EXTRA: Cross-validation

This is only for students striving toward excellence in the class.

For the classifier you trained, perform cross-validation to estimate its accuracy.
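Here is a minimal sketch of how this could look with cross_val_score, assuming the vectorized email data (X_train and y) from Section 3 is available (in older versions of sklearn the import is from sklearn.cross_validation instead):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 5-fold cross-validation: train on 4/5 of the data, test on the remaining 1/5, five times
scores = cross_val_score(MultinomialNB(), X_train, y, cv=5)
print(scores)           # accuracy on each of the 5 folds
print(scores.mean())    # average accuracy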