Project Two

Topic: Google Searches

Goal: Study Google searches in two ways: the searches that users do (by analyzing the Chrome history events) and the ones that Google suggests (by automatically capturing and analyzing its search result pages)

Learning Objectives: Practice the steps of the data science cycle, with a focus on the modeling step through supervised classification. Concretely, you will engage in the following steps:

  • Ask interesting questions about the problem we are studying, Google searches. As in the previous project, you are encouraged to come up with interesting questions of your own that can be answered with the kind of data we will learn to collect (browser history and search page results). In addition to pursuing your curiosity, this project also has a common question for everyone: what proportion of a Wellesley student's search queries are informational, and what proportion are navigational? To answer this question, we will learn how to build a supervised classifier, which you can then apply to search history data to predict labels. Of course, the results will not be 100% accurate, but this will give you a chance to learn about the modeling step in the data science cycle.
  • Get the data from different sources: a) from an SQLite database file, using SQL queries; b) from the browser, using automated searches through Selenium. As you get data from the database, you decide which data to extract, since not all of it is relevant. An example was extracting a 20-minute portion of your browser history for the class activity. You will also look at your browser history to decide whether it's safe to share it with others (think about privacy and anonymity).
  • Explore the data through descriptive statistics and visualizations. From the browser history, you can generate time series of your browsing activity, find which websites you visit most often, calculate statistics about your daily behavior on school days and weekends, find prolonged query sessions, or display word clouds of your search queries. Students who are unable to access their histories (but others too) have several options: you can analyze the 600-row file containing the query sessions we did in class about "visualizing text data", you may ask peers at the College or elsewhere to share their browser history, or you may explore through visualizations and statistics the JSON file of Wellesley College searches and/or the HTML archive of SERPs (search-engine result pages). These files can be found in this shared Google folder (accessible only to our class).
  • Model the data by building a classifier to detect informational vs. navigational search queries. In this step, we will practice the entire process of a supervised classification task with machine learning. We will label a dataset of Wellesley College Google searches as a class, with three independent labelers providing a label for each input; calculate Fleiss' Kappa to measure inter-rater reliability; extract features for learning by parsing HTML pages related to the search queries; train one or more machine learning classifiers on our training dataset; evaluate the performance of the classifier on that dataset; apply the learned classifier to data from your browser history and analyze how well it is doing; and perform error analysis to improve the features.
  • Communicate and visualize results by creating a web page to explain and summarize the results of the exploration and classification. Use what you learned from your searches on "visualizing text data" to find novel ways of representing the results of your analysis. Challenge yourself to create an infographic to show results in an interesting way, especially if you are comparing the browser histories of all your group members.
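The exploration step can begin with the standard library alone. As a minimal sketch (the history URLs below are made up for illustration), here is one way to count the most visited domains in a list of history URLs:

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=3):
    """Count the most frequently visited domains in a list of history URLs."""
    domains = Counter(urlparse(u).netloc for u in urls)
    return domains.most_common(n)

# Hypothetical history rows standing in for real Chrome data
history = [
    "https://mail.google.com/mail/u/0/#inbox",
    "https://www.google.com/search?q=visualizing+text+data",
    "https://mail.google.com/mail/u/0/#sent",
    "https://cs.wellesley.edu/~cs234/",
]
print(top_domains(history, 2))
# [('mail.google.com', 2), ('www.google.com', 1)]
```

The same `Counter` result can then be fed into a bar chart or word cloud in the exploration notebook.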

Challenge Yourself: Try to improve the accuracy of your classifier without overfitting it. We'll honor the team with the highest-performing classifier.

Activities

  • Choose at least one question that can lead to the visualization of some pattern drawn from either the browser history or the search results. For example, you can show that you never check email on Saturdays, or that your Google search activity peaks before a paper deadline, etc.
  • The question(s) you choose will lead you to query the SQLite database. Practice writing SQL queries that allow you to get only the data you need (for example, only the Inbox URLs or only search sessions). You can create a separate notebook just for practicing this step of getting the data.
  • Perform the data exploration step by trying to apply as much knowledge as possible from what you have learned so far: pandas, matplotlib or plotly, scipy.stats, BeautifulSoup, regular expressions, datetime operations, time series, word clouds, nltk. You'll need to create a separate notebook to work on data exploration.
  • Choose at least one question that can be explored with hypothesis testing. Examples of such questions are: "Am I a lazy searcher?" or "Am I a persistent searcher?" The former could be tested through the hypothesis "my search phrases are on average two words long" (or a single word), while the latter through the hypothesis that a search session lasts a certain duration, measured in number of searches or in time. Feel free to come up with other hypotheses.
  • Train a supervised classifier on the labeled data of the Wellesley-related searches. The classifier will have two classes: informational and navigational.
  • Apply the classifier to predict your own search behavior or that of your peers. To do this, you will need to create a test dataset yourself from your browser history and Google searches.
  • Compile a report as a web page (like the one for the Wikipedia project) to share your findings from the project.
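For the hypothesis-testing activity, `scipy.stats.ttest_1samp` does the work in one call; as a sketch of what it computes underneath, here is the one-sample t statistic for the "my search phrases are on average two words long" hypothesis, using only the standard library and made-up example queries:

```python
import math
import statistics

def one_sample_t(data, mu):
    """t statistic for H0: the population mean equals mu."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return (mean - mu) / (sd / math.sqrt(n))

# Word counts of hypothetical search phrases
lengths = [len(q.split()) for q in
           ["python datetime", "wellesley", "how to parse html with python",
            "selenium chromedriver", "fleiss kappa example"]]
t = one_sample_t(lengths, mu=2)
print(round(t, 3))
```

A small |t| (as here) means the sample does not give strong evidence against the two-word hypothesis; in the notebook you would read the p-value off `ttest_1samp` instead of computing it by hand.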

Outcomes to submit

  1. Create a blog page linked to your CS 234 portfolio for this project and keep track over time of all activities related to it as well as your short summary/reflection entries about them.
  2. Complete all notebooks from the time frame of the project and submit the files as instructed in previous tasks. Write in your blog page what you learned from working with them and how that knowledge is useful for this project or the final project.
  3. Create a final HTML page to communicate the results in an informative and persuasive way. The page can be hosted by one team member with others linking to it from their blog pages.
  4. Upload in your dav/drop/project2 folder every notebook, dataset, file, image, etc. that was generated by yourself or your team during the project. Notebooks should have at the top the name of the author or authors, if it was collaborative work.

Explored Concepts

Taxonomy of web searches

Chrome History Database

SQL queries

Connect to sqlite3 with Python
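A minimal sketch of connecting to an SQLite database from Python. Chrome's real History file has a richer `urls` table (and must be copied while Chrome is closed, since the browser locks it); the in-memory stand-in below uses a hypothetical subset of its columns:

```python
import sqlite3

# In practice you would copy Chrome's History file and connect to the copy:
#   conn = sqlite3.connect("History-copy")
# Here we build a tiny in-memory stand-in for the `urls` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, title TEXT, visit_count INTEGER)")
conn.executemany("INSERT INTO urls VALUES (?, ?, ?)", [
    ("https://mail.google.com/mail/u/0/#inbox", "Inbox", 42),
    ("https://www.google.com/search?q=sql+tutorial", "sql tutorial", 3),
    ("https://cs.wellesley.edu/~cs234/", "CS 234", 7),
])

# Keep only Google search URLs, most visited first
rows = conn.execute(
    "SELECT url, visit_count FROM urls "
    "WHERE url LIKE '%google.com/search%' ORDER BY visit_count DESC"
).fetchall()
print(rows)
```

The `WHERE ... LIKE` filter is how you extract only the relevant slice of the history instead of pulling the whole table into a notebook.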

Selenium and Chromedriver

Parsing pages with Python's selenium

Automated Google searches

Breadth-first search (to build crawler)
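The crawler's breadth-first search can be sketched independently of any network code. Here `get_links` is a placeholder for a function that would fetch a page and return its outgoing links; the toy graph below stands in for real pages:

```python
from collections import deque

def bfs_crawl(start, get_links, max_pages=10):
    """Visit pages breadth-first, level by level, avoiding repeats."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# A toy link graph standing in for real fetched pages
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_crawl("A", lambda u: graph.get(u, [])))
# ['A', 'B', 'C', 'D']
```

The `seen` set is what keeps the crawler from revisiting pages when two pages link to the same target, and `max_pages` caps how far the crawl spreads.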

Labeling related searches to create training dataset

Inter-rater reliability (Fleiss' Kappa)
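Fleiss' Kappa is simple enough to compute directly. A sketch in plain Python, where `counts[i][j]` holds how many of the raters assigned item `i` to category `j` (the example numbers are invented: 5 queries, 3 labelers, 2 categories):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters r."""
    n_items = len(counts)
    r = sum(counts[0])
    # Observed agreement: average per-item agreement P_i
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (n_items * r)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Columns: [informational, navigational] votes from 3 labelers
labels = [[3, 0], [3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(labels), 3))
# 0.444
```

Values near 1 mean near-perfect agreement, values near 0 mean agreement no better than chance; a moderate score like this one suggests the labeling guidelines need tightening before training.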

Supervised Classification

Feature Extractors
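A feature extractor turns a raw query into the numeric or boolean attributes the classifier learns from. A minimal sketch, where the cue words and feature names are hypothetical choices, not the course's official feature set:

```python
import re

# Hypothetical cue words that often signal a navigational query
NAV_HINTS = {"login", "www", "com", "edu", "facebook", "gmail"}

def extract_features(query):
    """Turn a raw search query into a feature dict for a classifier."""
    words = query.lower().split()
    return {
        "num_words": len(words),
        "has_question_word": bool(words) and words[0] in
            {"how", "what", "why", "who", "when"},
        "looks_like_url": bool(re.search(r"\.(com|edu|org|net)\b", query.lower())),
        "has_nav_hint": any(w in NAV_HINTS for w in words),
    }

print(extract_features("how to compute fleiss kappa"))
print(extract_features("wellesley.edu login"))
```

In the project, features extracted from the SERP HTML (e.g. what kinds of results Google returns) would be added to these query-only features.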

Training, validation, and test sets

Accuracy, confusion matrix, recall & precision
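These evaluation metrics all fall out of the confusion matrix. A sketch with invented predictions, using "nav" as the positive class:

```python
from collections import Counter

def evaluate(y_true, y_pred, positive):
    """Accuracy, precision, and recall for one positive class,
    plus a confusion matrix as a (true, predicted) -> count mapping."""
    confusion = Counter(zip(y_true, y_pred))
    tp = confusion[(positive, positive)]
    fp = sum(v for (t, p), v in confusion.items() if p == positive and t != positive)
    fn = sum(v for (t, p), v in confusion.items() if t == positive and p != positive)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, confusion

# Hypothetical gold labels vs. classifier predictions
y_true = ["nav", "nav", "info", "info", "info"]
y_pred = ["nav", "info", "info", "info", "nav"]
acc, prec, rec, _ = evaluate(y_true, y_pred, positive="nav")
print(acc, prec, rec)
# 0.6 0.5 0.5
```

Precision answers "of the queries I called navigational, how many really were?", while recall answers "of the truly navigational queries, how many did I catch?" — both matter when the two classes are imbalanced.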

Cross-validation
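In practice scikit-learn's `KFold` produces cross-validation splits for you; as a sketch of the underlying idea, here is a plain-Python generator of k-fold train/test index splits (unshuffled, for clarity):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation.
    Each of the n examples lands in the test set exactly once."""
    indices = list(range(n))
    fold_size, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = fold_size + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

for train, test in k_fold_splits(6, 3):
    print(train, test)
```

Averaging the classifier's score over the k folds gives a more honest performance estimate than evaluating on the training data itself, which is exactly the overfitting trap the "Challenge Yourself" section warns about.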