Topic: Google Searches
Goal: Study Google searches in two ways: the searches that users perform (by
analyzing the Chrome history events) and the ones that Google suggests (by
automatically capturing and analyzing its search result pages).
Learning Objectives: Practice the steps of the data science cycle, with a focus
on the modeling step through supervised classification. Concretely, you will
engage in the following steps:
- Ask interesting questions about the problem we are studying: Google searches.
As in the previous project, you are encouraged to come up with interesting questions of your
own that can be answered with the kind of data we will learn to collect (browser history and
search page results). In addition to pursuing your curiosity, this project also has
a common question for everyone: what proportion of a Wellesley student's search queries
are informational, and what proportion are navigational? To answer this question, we will
learn how to build a supervised classifier, which you can then apply to some search history
data to predict the labels. Of course, the results will not be 100% accurate, but this will
give you a chance to learn about the modeling step in the data science cycle.
- Get the data from different sources: a) from an SQLite database file, using
SQL queries; b) from the browser, using automated searches through Selenium. As you get
data from the database, you decide which data to extract, since not all of it is relevant.
An example was extracting a 20-minute portion of your browser history for the class
activity. You will also look at your browser history to decide whether it's safe to share
with others (think about privacy and anonymity).
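As a starting point for step (a), here is a minimal sketch of pulling rows out of a copy of Chrome's History file with Python's sqlite3 module. The table and column names (`urls`, `url`, `title`, `visit_count`) follow Chrome's schema, but treat them as assumptions and inspect your own file (e.g., with a `SELECT name FROM sqlite_master` query) before relying on them:

```python
import sqlite3

def top_visited(db_path, limit=10):
    """Return the `limit` most-visited (url, title, visit_count) rows
    from the `urls` table of a Chrome History database copy."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            """SELECT url, title, visit_count
               FROM urls
               ORDER BY visit_count DESC
               LIMIT ?""",
            (limit,),
        ).fetchall()
    finally:
        con.close()
    return rows
```

Always query a *copy* of the History file: Chrome locks the live database while the browser is running.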
- Explore the data through descriptive statistics and visualizations. From
the browser history, you can generate time series of your browsing activity, find
which websites you visit most often, calculate statistics about your daily behavior on
school days and weekends, find prolonged query sessions, or display word clouds of
your search queries. Students who are unable to access their
histories (and others, too) have several options: you can analyze the 600-row file
containing the query sessions we did in class about "visualizing text data", you can
ask peers at the College or elsewhere to share their browser history, or
you can explore through visualizations and statistics the JSON file of Wellesley College
searches and/or the HTML archive of SERPs (search-engine result pages). These files
can be found in this shared Google folder (accessible only to our class).
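One wrinkle worth knowing before building time series: Chrome stores visit times as microseconds since 1601-01-01 (the Windows epoch), not the Unix epoch. A plain-Python sketch of converting those timestamps and counting visits per day (pandas' `resample` would do the same more idiomatically); verify the epoch against a visit whose date you know:

```python
from datetime import datetime, timedelta

# Chrome timestamps count microseconds since 1601-01-01 (an assumption
# to verify against your own history file).
CHROME_EPOCH = datetime(1601, 1, 1)

def chrome_time_to_datetime(micros):
    """Convert a Chrome visit timestamp to a Python datetime."""
    return CHROME_EPOCH + timedelta(microseconds=micros)

def visits_per_day(timestamps):
    """Count visits per calendar date from a list of Chrome timestamps."""
    counts = {}
    for t in timestamps:
        day = chrome_time_to_datetime(t).date()
        counts[day] = counts.get(day, 0) + 1
    return counts
```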
- Model the data by building a classifier to detect informational vs. navigational
search queries. In this step, we will practice the entire process of building a supervised
classification task with machine learning. We will label a dataset of Wellesley College
Google searches as a class, with three independent labelers providing labels for each
input; calculate Fleiss' kappa to measure inter-rater reliability; extract features for
learning by parsing HTML pages related to the search queries; train one or more machine
learning classifiers on our training dataset; evaluate the classifier's results on the
same dataset; apply the learned classifier to data from your browser history and analyze
how well it performs; and perform error analysis to improve the features.
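To make the inter-rater reliability step concrete, here is a from-scratch sketch of Fleiss' kappa (statsmodels also ships an implementation you could use instead). The input is a table where `ratings[i][j]` counts how many of the raters put item `i` into category `j`:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings table; every row must sum to the
    same number of raters r (here, three labelers per query)."""
    n = len(ratings)         # number of items (labeled queries)
    r = sum(ratings[0])      # raters per item
    k = len(ratings[0])      # number of categories
    # Mean per-item agreement across all rater pairs.
    P_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings
    ) / n
    # Chance agreement from the category marginals.
    p = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)
```

A kappa of 1 means perfect agreement; values near 0 mean the labelers agree no more than chance would predict.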
- Communicate and visualize results by creating a web page to explain and summarize
the results of the exploration and classification. Use what you learned from your searches
on "visualizing text data" to find novel ways of representing the results of your
analysis. Challenge yourself to create an infographic to show results in an interesting
way, especially if you are comparing the browser histories of all your group members.
Challenge Yourself: Try to improve the accuracy of your classifier
without overfitting it. We'll honor the team with the highest-performing classifier.
Activities
- Choose at least one question that can lead to the visualization of some pattern
drawn from either the browser history or the search results. For example, you can show that you
never check emails on Saturdays, or that you have your peak activity on Google searches
before some paper deadline, etc.
- The question(s) you choose will lead you to query the SQLite database. Practice
writing SQL queries that allow you to get only the data you need (for example, only the
Inbox URLs or only the search sessions). You can create a
separate notebook just for practicing this step of getting the data.
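A hedged sketch of narrowing the extraction to search sessions only: a SQL `LIKE` filter keeps just Google result-page URLs, and `urllib.parse` recovers the query text from each one's `q=` parameter. The `urls`/`last_visit_time` names again follow Chrome's schema and should be checked against your own file:

```python
import sqlite3
from urllib.parse import urlparse, parse_qs

SEARCH_SQL = """
SELECT url FROM urls
WHERE url LIKE '%google.com/search?%'
ORDER BY last_visit_time
"""

def extract_queries(db_path):
    """Return the search phrases typed into Google, in visit order."""
    con = sqlite3.connect(db_path)
    try:
        urls = [row[0] for row in con.execute(SEARCH_SQL)]
    finally:
        con.close()
    queries = []
    for u in urls:
        q = parse_qs(urlparse(u).query).get("q", [""])[0]
        if q:
            queries.append(q)
    return queries
```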
- Perform the data exploration step by using as much as possible of what you have
learned so far: pandas, matplotlib or plotly, scipy.stats, BeautifulSoup, regular
expressions, datetime operations, time series, word clouds, nltk. You'll need to
create a separate notebook to work on data exploration.
- Choose at least one question that can be explored with hypothesis testing. Examples
of such questions are "Am I a lazy searcher?" or "Am I a persistent searcher?". The
former could be tested through the hypothesis that your search phrases are on average
two words long (or a single word), while the latter could be tested through the
hypothesis that a search session lasts a certain duration, in number of searches or in
time. Feel free to come up with other hypotheses.
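The "lazy searcher" hypothesis maps naturally onto a one-sample t-test with scipy.stats (one of the libraries listed above): H0 says the mean query length is two words. A minimal sketch; the query list you would pass in comes from your own extracted history:

```python
from scipy import stats

def mean_length_test(queries, mu0=2.0):
    """One-sample t-test of mean query word count against mu0.
    Returns the t statistic and two-sided p-value."""
    lengths = [len(q.split()) for q in queries]
    result = stats.ttest_1samp(lengths, mu0)
    return result.statistic, result.pvalue
```

A small p-value would lead you to reject H0, i.e., conclude your average query is not two words long.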
- Learn a supervised classifier from the labeled data of the Wellesley-related
searches. The classifier will have two classes: informational and navigational.
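One common way to set this up, assuming scikit-learn is available: bag-of-words features plus naive Bayes as a simple baseline. The eight labeled queries below are invented for illustration; the real training data will be the class-labeled Wellesley searches, and the real features will come from parsing the SERP HTML:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled queries (invented for illustration only).
queries = ["facebook login", "youtube home", "gmail inbox", "wellesley sakai",
           "how to cite a book", "what is fleiss kappa",
           "history of web search", "why is the sky blue"]
labels = ["navigational"] * 4 + ["informational"] * 4

# Vectorize the query text, then fit a naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(queries, labels)
```

Once trained, `clf.predict(["some query"])` returns one of the two labels; on real data you would measure accuracy, precision, and recall on a held-out set rather than on the training set.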
- Apply the classifier to predict your own search behavior or that of your peers. To do
this, you will need to create a test dataset yourself from your browser history and
Google searches.
- Compile a report as a web page (like the one for the Wikipedia project) to share your findings from the project.
Outcomes to submit
- Create a blog page linked to your CS 234 portfolio for this project and keep track
over time of all activities related to it as well as your short summary/reflection entries about them.
- Complete all notebooks from the time frame of the project and submit the files as instructed
in previous tasks. Write in your blog page what you learned from working with them and how
that knowledge is useful for this project or the final project.
- Create a final HTML page to communicate the results in
an informative and persuasive way. The page can be hosted by one team member with others linking
to it from their blog pages.
- Upload to your dav/drop/project2 folder every notebook, dataset,
file, image, etc. that was generated by you or your team during the project. Notebooks
should list the name of the author or authors at the top if the work was collaborative.
Explored Concepts
Taxonomy of web searches
Chrome History Database
SQL queries
Connect to sqlite3 with Python
Selenium and Chromedriver
Parsing pages with Python's selenium
Automated Google searches
Breadth-first search (to build crawler)
Labeling related searches to create training dataset
Inter-rater reliability (Fleiss' Kappa)
Supervised Classification
Feature Extractors
Training, validation, and test sets
Accuracy, confusion matrix, recall & precision
Cross-validation
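The breadth-first search listed above, as used to build a crawler, can be sketched independently of any browser machinery. Here `get_links` is a stand-in for the function that would fetch a page and return its outgoing links (with Selenium, or requests plus BeautifulSoup):

```python
from collections import deque

def bfs_crawl(start, get_links, max_pages=50):
    """Visit pages level by level from `start`, returning them in
    discovery order; `max_pages` caps how far the crawl goes."""
    seen = {start}
    order = []
    frontier = deque([start])
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:      # skip already-discovered pages
                seen.add(link)
                frontier.append(link)
    return order
```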