Topic: Digital Natives
Goal: In this project you will identify one or more sources of data that will allow
you to study your generation, the so-called "digital natives", who grew up with
technology. The data have to be "traces", that is, content or interactions that are
captured automatically by digital devices. These can be your own traces (for example,
all your emails, your browsing history, all your text messages, your Instagram
interactions, your Facebook history, online games, etc.), or you can collect the
traces of your friends in order to make broader statements. Ambitious projects will
try to collect multiple sources of traces to paint a more nuanced picture of how
digital natives use technology across different platforms and of how much technology
is part of their daily lives.
Objective: Showcase your mastery of all the different steps of
the data science cycle: asking interesting questions; getting the data; exploring
the data; modeling the data; and communicating the results of the analysis.
What to submit
- In the folder dav/drop/project3, upload all the notebooks you created to perform
the analysis, all the datasets you worked with, and a file README.txt that describes
in one sentence what each file contains.
- Make sure that the notebooks are carefully written, like the ones you receive
in class, meaning that they show step by step how you did something, explain why you
made certain decisions, note any non-standard Python modules that you needed to
install, etc.
- In the folder public_html/cs234/digital upload the file index.html that will
contain the HTML report you'll write for the project. This report should be composed of
several sections: abstract, introduction, data collection, methods for analysis,
results, conclusion. Use graphs and tables to convey the information. Well-written
reports should cite literature (about digital natives and/or about other studies that deal
with similar data or similar questions).
- In the file public_html/cs234/digital/blog.html (which you should have
created at the start of the project) there should be frequent entries over the
project period showing the progress you're making from day to day. Think of this
blog as your "blue notebook" in a science lab, where you keep notes on the
experiments you run and the decisions that change the course of the project. This
blog should also link to the final version of your notebooks (as HTML pages).
- After the deadline (Dec 21st, 4:00PM), send an email to Eni explaining which level
of project you worked toward (see the kinds of projects in the Grading section)
and to what extent you believe you accomplished this goal.
Grading
The two previous projects in the class didn't have a grade, but this project will.
This grade will weigh heavily in determining your final grade. However, the quality
and completeness of all course work counts in determining jumps up and down the
grade scale.
Here is Wellesley's grade policy:
- Grade A is given to students who meet with conspicuous excellence every demand which can fairly be made by the course.
- Grade B is given to those students who add to the minimum of satisfactory attainment excellence in not all, but some of the following: organization, accuracy, originality, understanding, insight.
- Grade C is given to those students who have attained a satisfactory familiarity with the content of a course and who have demonstrated ability to use this knowledge in a satisfactory manner.
- Grade D is a passing grade.
Based on the policy above, here is some guidance for deciding what kind of project
to do:
- To get a passing grade: get a readily available dataset (it can be something other
than digital traces, such as surveys of millennials, etc.), ask interesting questions,
explore them through some visualizations and descriptive statistics, and write a final
report (as an HTML page) describing your findings.
- To get a grade of C: collect one source of digital traces; process the data to make
it ready for the other steps of the data science cycle; ask a question; do
exploration; perform one hypothesis test; and write a final report.
- To get a grade of B: in addition to what is required for the C grade, you will look
at more than one question and also attempt to show mastery of the modeling step
(classification, regression, or clustering).
- To get a grade of A: you need more than one source of digital traces so that you
can combine or compare the sources; ask several questions; perform exploration,
hypothesis testing, and modeling; and communicate the results.
Questions
- Q1: I'm not sure whether my project meets the A-level requirements, what should I do?
- A1: Come talk to Eni during office hours or send an email to find a time to meet and discuss.
- Q2: Can you convert the expectation of "several questions" into a concrete number?
- A2: This is dependent on how complex the questions are and whether they allow you to showcase different things you know how to do. If a question requires you to process the
data in non-obvious ways, then 3 well-formulated questions that can be answered either
through visualization or hypothesis testing are sufficient. However, if the questions
are simple and their answers are revealed through simple visualizations, you would
need at least 5 questions.
- Q3: What is an example of a simple question vs. a complex question?
- A3: Simple questions are the ones that involve only one variable. For example:
which was my most active day in my browser history? This involves only counting
entries by day and finding the maximum count; it can be done very easily with the
resample method of a pandas time series. Complex questions involve two or more
variables. Here is an example: is my browsing behavior different on weekdays vs.
weekends? This is a complex question because it can be answered in many ways. Here
is one possible way to answer it, by framing the question in terms of the variables
in our browser history: is there any overlap between the websites that I visit the
most on weekends and those I visit the most on weekdays? To answer this question,
you need a multi-step process: first create a new variable, day_kind, with two
values (weekend, weekday) and group visits by it; then extract the domain names;
find the top domain names for each group separately; and finally find the overlap
between the groups.
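To make the two kinds of questions concrete, here is a minimal sketch in pandas.
The toy DataFrame, its column names (url, domain, day_kind), and the URLs are all
hypothetical; your own history export will look different, so treat this as an
illustration of the steps, not a recipe for your exact data.

```python
import pandas as pd

# Hypothetical browser history: a datetime-indexed DataFrame with a "url" column.
history = pd.DataFrame(
    {"url": [
        "https://www.wellesley.edu/cs",
        "https://docs.google.com/document/d/1",
        "https://www.youtube.com/watch?v=x",
        "https://docs.google.com/document/d/2",
        "https://www.youtube.com/watch?v=y",
    ]},
    index=pd.to_datetime([
        "2019-12-02 09:15", "2019-12-02 10:40",  # a Monday
        "2019-12-07 21:05", "2019-12-07 21:30",  # a Saturday
        "2019-12-08 14:00",                      # a Sunday
    ]),
)

# Simple question (one variable): which was my most active day?
daily_counts = history["url"].resample("D").count()
busiest_day = daily_counts.idxmax()

# Complex question (two variables): do top weekday and weekend domains overlap?
history["domain"] = history["url"].str.extract(r"https?://([^/]+)", expand=False)
history["day_kind"] = history.index.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
top_by_kind = {
    kind: set(group["domain"].value_counts().head(10).index)
    for kind, group in history.groupby("day_kind")
}
overlap = top_by_kind["weekday"] & top_by_kind["weekend"]
```

With this toy data, the busiest day is the Monday and the only overlapping domain
is docs.google.com; on a real history you would look at the top 10-20 domains per
group rather than all of them.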
- Q4: What are the best kind of questions to choose?
- A4: The best questions derive from a strong and meaningful initial idea and hypothesis.
Let's assume that your main hypothesis is: my browser history shows I'm a workaholic,
because I spend my time on work-related websites (Google Docs, Gradescope,
wellesley.edu) instead of fun websites (YouTube, Instagram, Pinterest). Then, a way
to test this hypothesis would be to ask a few related questions that lead you to at
least one specific question that can be tested via a formal hypothesis testing
process and that also allows the creation of a model for prediction or exploration
of your data. For the concrete scenario ("I'm a workaholic"), one question that will
help you toward your goal would be: what websites do I visit on a regular basis?
This is different from asking which websites have the greatest visit count, because,
as we have seen with the fivethirtyeight website, some websites cause "URL
pollution". It is a complex question, because it will require you to calculate the
percentage of days on which you visited each domain name, so you'll have to create
two new variables: domain name and percentage of visits (number of distinct days
with a visit / total number of days in your history). Since finding the percentage
requires finding all the distinct days, this question can lead to a second question
that you can explore through a visualization: a heatmap of domain names vs. days,
colored by the frequency of visits on each day. Notice that once you have such a
heatmap, you can also find on which days you were the busiest and which websites
you were visiting on those days. This heatmap, though, will be very big (128 days
as columns by 20-30 top websites). You can collapse it into 24 rows: 12 for the
weekdays and 12 for the weekends. Notice how you have now created a visualization
for the question:
is my browsing different on weekdays vs. weekends? If you then take the data for the
top 50 websites and express them as two vectors, one containing the average number
of daily visits on a weekday and the other the average number of daily visits on a
weekend (aggregating over all weeks), you have created two paired samples, because
each website has two counts associated with it. We can agree that a workaholic is
someone who visits the same websites independently of the day, because they are
always mostly thinking about work. You can then run a paired t-test on these two
samples. Note that the null hypothesis of a paired t-test is that there is no
difference between the paired weekday and weekend counts; under our definition of
a workaholic, that null corresponds to "you are a workaholic". A small p-value
would be evidence that your browsing does change with the day (against the
workaholic hypothesis), while failing to reject the null is consistent with being
a workaholic, though failing to reject is weaker evidence than rejecting. Finally,
one can imagine building a classifier with 6 labels (morning-weekday,
morning-weekend, midday-weekday, midday-weekend, evening-weekday, evening-weekend)
that learns to predict the time of day and kind of day based on what your list of
websites visited in a certain period looked like.
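The t-test step above can be sketched with scipy. The two vectors of per-website
averages here are made-up numbers standing in for the aggregates you would compute
from your own history; scipy's ttest_rel performs the paired t-test, whose
statistical null hypothesis is that the mean of the paired differences is zero.

```python
import numpy as np
from scipy import stats

# Hypothetical averages: mean daily visits per website on weekdays vs. weekends,
# paired by website (the same five top websites, in the same order, in both vectors).
weekday_avg = np.array([12.0, 9.5, 7.2, 4.1, 3.3])
weekend_avg = np.array([11.5, 9.0, 7.8, 4.4, 3.0])

# Paired t-test: null hypothesis is that the mean of the paired differences
# (weekday - weekend) is zero, i.e., no systematic difference between day kinds.
t_stat, p_value = stats.ttest_rel(weekday_avg, weekend_avg)

# Compare the p-value against a significance level (e.g., 0.05) to decide
# whether to reject the null hypothesis of no difference.
reject_null = p_value < 0.05
```

With only five pairs this test has little power; in your project you would use the
top 50 websites, as described above, and report the t-statistic and p-value in your
results section.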