Author: Eni Mustafaraj
12/05/2017
We'll take the simple example from the slides with Chinese and Japanese labels.
X = ["Chinese Beijing Chinese",
"Chinese Chinese Shangai",
"Chinese Macao",
"Tokyo Japan Chinese"]
y = ['c', 'c', 'c', 'j']
(X, y) is the training set with four documents and four labels.
Vectorizing means converting every document into a vector of numbers.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X)
X_train
Notice that the created matrix has 4 rows and 6 columns. There are 4 rows because there are 4 documents, but why are there 6 columns? Because that is the size of the vocabulary for this corpus.
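Another way to check this is the vectorizer's vocabulary_ attribute, a dictionary that maps each token to its column index (a quick sanity check, not needed for the rest of the notebook):
# the learned vocabulary: a dictionary from token to column index
print vectorizer.vocabulary_
print len(vectorizer.vocabulary_)   # should be 6, the number of columns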
Since the internal representation of X_train is in a compressed (sparse) form, let's convert it into an array to see its content.
X_train.toarray()
To see the corresponding features for each column, we refer to the vectorizer object:
vectorizer.get_feature_names()
If we look at document 4:
"Tokyo Japan Chinese"
it's represented by this vector:
[0, 1, 1, 0, 0, 1]
which has a 1 in the columns for the features 'chinese', 'japan', and 'tokyo' (indices 1, 2, and 5).
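If it helps to see the mapping column by column, here is a small sketch that pairs each feature name with its count in that document (row index 3):
# pair each feature name with its count in document 4 (row index 3)
for name, count in zip(vectorizer.get_feature_names(), X_train.toarray()[3]):
    print name, count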
fit_transform vs. transform
To transform the X corpus, we used the method fit_transform, because the vectorizer had first to learn the vocabulary from the corpus (the fitting process).
When we get new data (the testing data), we will use the method transform
instead, to create the representation with what is known from the training corpus.
test = ['Chinese Chinese Chinese Tokyo Shanghai']
testVector = vectorizer.transform(test)
testVector
testVector.toarray()
Now that our data is in a vector form, we can train the classifier on it.
We will use the Multinomial Naive Bayes classifier from sklearn.
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier
alpha=1.0 refers to the Laplace smoothing coefficient. It is usually 1, but there are cases when another value might work better. We'll just keep it at 1.0.
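Just as an illustration (we will not use it below), a classifier with a different smoothing coefficient would be created like this; the variable name is hypothetical:
# hypothetical example: a classifier with a smaller smoothing coefficient
classifierSmallAlpha = MultinomialNB(alpha=0.5)
classifierSmallAlpha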
classifier.fit(X_train, y)
In order to know what is happening inside the classifier, we have to call some of its methods. To find out what it has to offer, we can use the built-in Python function dir:
print dir(classifier)
# show the class names that it has learned
classifier.classes_
# show how many instances for each class
classifier.class_count_
# show total number of features for each class
classifier.feature_count_
As you remember, the second feature, "chinese", showed up 5 times with class "c", which we see here too.
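A quick sketch to pair the feature names with the counts for class "c" (the first row of feature_count_):
# feature counts for the first class, 'c'
for name, count in zip(vectorizer.get_feature_names(), classifier.feature_count_[0]):
    print name, count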
Now, let's look at the log probabilities. Because the probabilities themselves can be really small, the classifier calculates and stores log probabilities:
classifier.feature_log_prob_
The array above is P(w|c) for both classes "c" and "j" (Chinese or Japanese documents).
In the first array, we can notice three distinct values: -1.946, -0.847, and -2.639. These correspond to the counts 1, 5, and 0 that we can find in the feature count array.
As we discussed in lecture, the probabilities are ratios of word counts, smoothed to deal with 0 counts.
For a word that appears 0 times, the probability estimate is: $1/(count(words)+|V|)$. In our case, it should be $1/(8+6) = 1/14$.
The log value that is calculated is the natural logarithm, with base $e$. We can compare the values to see that we're getting the expected values:
print "probability for 0-count features:", 1/14.
import numpy as np
print "take the exponential of the log value:", np.e**-2.63905733
1/14. - np.e**-2.63905733
As we can see, these numbers are identical (up to the 10th digit after the decimal point).
Larger values (e.g., -0.847) will correspond to higher probabilities, and smaller values (e.g., -2.639) will correspond to lower probabilities.
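We can reproduce the entire first row of feature_log_prob_ from the counts (a sketch, assuming Laplace smoothing with alpha = 1 and |V| = 6, as above):
import numpy as np
counts_c = classifier.feature_count_[0]                      # word counts for class 'c'
expected = np.log((counts_c + 1) / (counts_c.sum() + 6.0))   # (count + 1) / (8 + |V|)
print expected
print classifier.feature_log_prob_[0]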
Now that we have a classifier, we can use it to predict new instances.
We already have a test case that we have transformed into a vector.
classifier.predict(testVector)
The result means that the classifier predicted the label "c" for this document. To see how confident the classifier is, we can look up the probabilities that it assigned to each label:
classifier.predict_proba(testVector)
This shows that the classifier was pretty confident in the label "c" (remember, the labels are "c" and "j").
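To make this output easier to read, we can pair each probability with its class label (a small sketch):
# pair the class labels with their predicted probabilities for the test document
for label, prob in zip(classifier.classes_, classifier.predict_proba(testVector)[0]):
    print label, prob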
The entire process had four steps:
1. prepare the labeled training data (the corpus X and the labels y);
2. vectorize the corpus with fit_transform;
3. train the classifier with fit;
4. transform new documents and predict their labels.
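In code, the whole pipeline amounts to just a few lines (a recap of the cells above, nothing new):
vectorizer = CountVectorizer()                  # 1. create a vectorizer
X_train = vectorizer.fit_transform(X)           # 2. learn the vocabulary and vectorize the training corpus
classifier = MultinomialNB().fit(X_train, y)    # 3. train the classifier
classifier.predict(vectorizer.transform(test))  # 4. vectorize and predict new documents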
In this task, we'll try to classify emails into "CS111" or "CS234" by their title.
The training set is the list of titles of the last 15 emails in each of these groups, as they appear in Eni's inbox.
NOTE: Notice the two strings below that use triple quotes to span multiple lines (this was a Python quiz question at the start of the semester).
dataCS111 = """THQuiz4 posted
Take-home Quiz 2 grades are now available
Reminder: PS09 reflection due tonight, quiz in class tomorrow
PS09 Solution + Reflection posted, due Thu, 11/30 at 23:59
PS10 has been posted
*TOMORROW*: Tissenbaum lunchtime talk on Computational Thinking in App Inventor
PS09 - task 2
How to fix wxNSApplication error in ps09 drawLs.py turtle problem
Drop in hours cancelled today
PS09 Otter Inspect
Ticks issues in thquiz3 Subtask b
Re: OtterInspect error
OtterInspector for THQ3
Re: OtterInspect: Task C
Re: Opacity of Legend for Subtask Two"""
dataCS234 = """Help room 7-9pm in SCI160A!
Would it be helpful to have help room on Thursday evening?
help room this evening
Datasets for machine learning tasks
Week 14 in CS 234
Check-ins for the second semester half: Sat & Sun noon-3PM
Cross-validation with sklearn
Eni's office hours today (Thu, 11/30/17): 12:30-2:30
Lmk if it would be helpful to have help room tonight from 9-10:30PM!
Follow-up on yesterday's class
Help room S160A! 7-9pm
This week in CS 234: class discussion on Tue -- please come prepared.
Code to work with the timestamps of Chrome browser
I'm walking from Clapp, be there soon [EOM]
Help room now until 9pm in SCI160A!"""
Let's split these strings to create a list of documents:
dataCS111Docs = [line.strip() for line in dataCS111.split("\n")]
print len(dataCS111Docs)
dataCS234Docs = [line.strip() for line in dataCS234.split("\n")]
print len(dataCS234Docs)
Then the corpus is:
X = dataCS111Docs + dataCS234Docs
X
Let's assign the labels too:
y = ["cs111"]*15 + ["cs234"]*15
print y
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X)
X_train
print vectorizer.get_feature_names()
There are 127 features (distinct tokens) created in this case. Notice how the CountVectorizer does some of the cleaning up of the data (lowercasing, tokenization) by itself.
vectorizer
It uses regular expressions to decide what a word is: token_pattern=u'(?u)\\b\\w\\w+\\b'
If we want a different setup for the tokens, we can pass a different regular expression to the token_pattern parameter.
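For example, the default pattern only keeps tokens that are at least two word characters long; if we wanted to keep single-character tokens too, we could pass a different pattern (a sketch, not used below):
# hypothetical: a vectorizer that also keeps one-character tokens
vectorizerSingle = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
vectorizerSingle.fit_transform(X)
print len(vectorizerSingle.get_feature_names())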
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y)
classifier.classes_
classifier.feature_log_prob_.shape
classifier.feature_log_prob_[0][:10]
These are the first 10 parameters of the probability distribution P(w|c='cs111') for the classifier.
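To make these parameters easier to interpret, we can look up which tokens have the highest probability in each class (a sketch using numpy's argsort):
import numpy as np
featureNames = vectorizer.get_feature_names()
for i, label in enumerate(classifier.classes_):
    # indices of the 5 most probable tokens for this class
    topIndices = np.argsort(classifier.feature_log_prob_[i])[-5:]
    print label, [featureNames[j] for j in topIndices]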
test = ["PS09 will be posted later today",
"Moving Wednesday's drop-in hours to Tuesday",
"No office hours this evening; I'm attending the talk at 5PM in PND Atrium",
"Help room tonight! Claflin living room. 6-8PM"
]
testVector = vectorizer.transform(test)
testVector
# Let's predict the classes for the four new email titles
for case in testVector:
    print classifier.predict(case)[0]
for case in testVector:
    print classifier.predict(case)[0], classifier.predict_proba(case)
The classifier predicted these new cases correctly.
In this task you'll create the dataset yourself, apply labels to it, vectorize it, train the classifier, and make predictions on some unseen data.
Ideas for classification:
You might want to put the data into TEXT files, because the text might be too long to include directly in the notebook.
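For example, if each class's documents were saved one per line in a text file (the file name below is hypothetical), loading could look like this:
# read one document per line from a text file (hypothetical file name)
with open('cs111_titles.txt') as inputFile:
    dataClass1Docs = [line.strip() for line in inputFile if line.strip()]
print len(dataClass1Docs)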
This is only for students striving toward excellence in the class.
For the classifier you trained, perform cross-validation to find the accuracy.
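A possible sketch with sklearn's cross_val_score, assuming X_train and y are the vectorized email titles and their labels from above:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 5-fold cross-validation accuracy for the Naive Bayes classifier
scores = cross_val_score(MultinomialNB(), X_train, y, cv=5)
print scores
print "mean accuracy:", scores.mean()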