Topic: Digital Natives
Goal: In this project you will identify one or more sources of data that will allow
you to study your generation, the so-called "digital natives", who grew up with
technology. The data have to be "traces", that is, content or interactions that are
captured automatically by digital devices. These can be your own traces (for example,
all your emails, your browsing history, all your text messages, your Instagram
interactions, your Facebook history, online games, etc.), or you can collect the
traces of your friends in order to make broader statements. Ambitious projects will
try to collect multiple sources of traces to paint a more nuanced picture of how
digital natives use technology across different platforms and of how much technology
is part of their daily lives.
Objective: Showcase your mastery of all the different steps of
the data science cycle: asking interesting questions; getting the data; exploring
the data; modeling the data; and communicating the results of the analysis.
What to submit
- In the folder dav/drop/project3, upload all the notebooks you created to perform
the analysis, all the datasets you worked with, and a file README.txt that describes
in one sentence what each file contains.
- Make sure that the notebooks are carefully written, like the ones you receive
in class, meaning that they show step by step how you did something, explain why you
made certain decisions, note any non-standard Python modules that you needed to
install, etc.
- In the folder public_html/cs234/digital upload the file index.html that will
contain the HTML report you'll write for the project. This report should be composed of
several sections: abstract, introduction, data collection, methods for analysis,
results, conclusion. Use graphs and tables to convey the information. Well-written
reports should cite literature (about digital natives and/or about other studies that deal
with similar data or similar questions).
- In the file public_html/cs234/digital/blog.html (which you should have
created at the start of the project) there should be frequent entries over the
project period showing the progress you're making from day to day. Think of this
blog as your "blue notebook" in a science lab, where you keep notes on the
experiments you run and the decisions that change the course of the project. This
blog should also link to the final version of your notebooks (as HTML pages).
- After the deadline (Dec 21st, 4:00PM), send an email to Eni explaining which level
of project you worked toward (see the kinds of projects in the Grading section)
and to what extent you believe you accomplished this goal.
Grading
The two previous projects in the class didn't have a grade, but this project will.
This grade will weigh heavily in determining your final grade. However, the quality
and completeness of all course work counts in determining jumps up and down the
grade scale.
Here is Wellesley's grade policy:
- Grade A is given to students who meet with conspicuous excellence every demand which can fairly be made by the course.
- Grade B is given to those students who add to the minimum of satisfactory attainment excellence in not all, but some of the following: organization, accuracy, originality, understanding, insight.
- Grade C is given to those students who have attained a satisfactory familiarity with the content of a course and who have demonstrated ability to use this knowledge in a satisfactory manner.
- Grade D is a passing grade.
Based on the policy above, here is some guidance for deciding what kind of project
to do:
- To get a passing grade: get a readily available dataset (it can be something other
than digital traces, such as surveys of millennials, etc.), ask interesting questions,
explore them through some visualizations and descriptive statistics, and write a final
report (as an HTML page) describing your findings.
- To get a grade of C: collect one source of digital traces; process the data to make
it ready for the other steps of the data science cycle; ask a question; do
exploration; perform one hypothesis test; and write a final report.
- To get a grade of B: in addition to what is required for the C grade, you will look
at more than one question and also attempt to show mastery of the modeling step
(classification, regression, or clustering).
- To get a grade of A: you need more than one source of digital traces so that you
can combine or compare the sources; ask several questions; perform exploration,
hypothesis testing, and modeling; and communicate the results.
Questions
- Q1: I'm not sure whether my project meets the A-level requirements, what should I do?
- A1: Come talk to Eni during office hours or send an email to find a time to meet and discuss.
- Q2: Can you convert the expectation of "several questions" into a concrete number?
- A2: This is dependent on how complex the questions are and whether they allow you to showcase different things you know how to do. If a question requires you to process the
data in non-obvious ways, then 3 well-formulated questions that can be answered either
through visualization or hypothesis testing are sufficient. However, if the questions
are simple and their answers are revealed through simple visualizations, you would
need at least 5 questions.
- Q3: What is an example of a simple question vs. a complex question?
- A3: Simple questions are the ones that involve only one variable. For example:
which was my most active day in my browser history? This involves only counting
entries by day and finding the maximum count; it can be done very easily with the
resample method of a pandas time series. Complex questions involve two or more
variables. Here is an example: is my browsing behavior different on weekdays vs.
weekends? This is a complex question because it can be answered in many ways. Here
is one possible way to answer it, by framing the question in terms of the variables
in our browser history: is there any overlap between the websites that I visit the
most on weekends and those I visit the most on weekdays? To answer this question,
you need a multi-step process: first create a new variable, day_kind, with two
values (weekend, weekday) and group visits by it; then extract the domain names;
find the top domain names for each group separately; and finally find the overlap
between the groups.
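To make the two kinds of questions concrete, here is a minimal sketch in pandas.
The toy DataFrame, its column names (url, domain, day_kind), and the URLs are all
hypothetical; your own history export will look different, so treat this as an
illustration of the steps, not a recipe for your exact data.

```python
import pandas as pd

# Hypothetical browser history: a datetime-indexed DataFrame with a "url" column.
history = pd.DataFrame(
    {"url": [
        "https://www.wellesley.edu/cs",
        "https://docs.google.com/document/d/1",
        "https://www.youtube.com/watch?v=x",
        "https://docs.google.com/document/d/2",
        "https://www.youtube.com/watch?v=y",
    ]},
    index=pd.to_datetime([
        "2019-12-02 09:15", "2019-12-02 10:40",  # a Monday
        "2019-12-07 21:05", "2019-12-07 21:30",  # a Saturday
        "2019-12-08 14:00",                      # a Sunday
    ]),
)

# Simple question (one variable): which was my most active day?
daily_counts = history["url"].resample("D").count()
busiest_day = daily_counts.idxmax()

# Complex question (two variables): do top weekday and weekend domains overlap?
history["domain"] = history["url"].str.extract(r"https?://([^/]+)", expand=False)
history["day_kind"] = history.index.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
top_by_kind = {
    kind: set(group["domain"].value_counts().head(10).index)
    for kind, group in history.groupby("day_kind")
}
overlap = top_by_kind["weekday"] & top_by_kind["weekend"]
```

With this toy data, the busiest day is the Monday and the only overlapping domain
is docs.google.com; on a real history you would look at the top 10-20 domains per
group rather than all of them.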
- Q4: What are the best kind of questions to choose?
- A4: The best questions derive from a strong and meaningful initial idea and hypothesis.
Let's assume that your main hypothesis is: my browser history shows I'm a workaholic,
because I spend my time on work-related websites (Google Docs, Gradescope,
wellesley.edu) instead of fun websites (YouTube, Instagram, Pinterest). Then, a way
to test this hypothesis would be to ask a few related questions that lead you to at
least one specific question that can be tested via a formal hypothesis testing
process and that also allows the creation of a model for prediction or exploration
of your data. For the concrete scenario ("I'm a workaholic"), one question that will
help you toward your goal would be: what websites do I visit on a regular basis?
This is different from asking which websites have the greatest visit count, because,
as we have seen with the fivethirtyeight website, some websites cause "URL
pollution". It is a complex question, because it will require you to calculate the
percentage of days on which you visited each domain name, so you'll have to create
two new variables: domain name and percentage of visits (number of distinct days
with a visit / total number of days in your history). Since finding the percentage
requires finding all the distinct days, this question can lead to a second question
that you can explore through a visualization: a heatmap of domain names vs. days,
colored by the frequency of visits on each day. Notice that once you have such a
heatmap, you can also find on which days you were the busiest and which websites
you were visiting on those days. This heatmap, though, will be very big (128 days
as columns by 20-30 top websites). You can collapse it into 24 rows: 12 for the
weekdays and 12 for the weekends. Notice how you have now created a visualization
for the question:
is my browsing different on weekdays vs. weekends? If you then take the data for the
top 50 websites and express them as two vectors, one containing the average number
of daily visits on a weekday and the other the average number of daily visits on a
weekend (aggregating over all weeks), you have created two paired samples, because
each website has two counts associated with it. We can agree that a workaholic is
someone who visits the same websites independently of the day, because they are
always mostly thinking about work. You can then run a paired t-test on these two
samples. Note that the null hypothesis of a paired t-test is that there is no
difference between the paired weekday and weekend counts; under our definition of
a workaholic, that null corresponds to "you are a workaholic". A small p-value
would be evidence that your browsing does change with the day (against the
workaholic hypothesis), while failing to reject the null is consistent with being
a workaholic, though failing to reject is weaker evidence than rejecting. Finally,
one can imagine building a classifier with 6 labels (morning-weekday,
morning-weekend, midday-weekday, midday-weekend, evening-weekday, evening-weekend)
that learns to predict the time of day and kind of day based on what your list of
websites visited in a certain period looked like.
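The t-test step above can be sketched with scipy. The two vectors of per-website
averages here are made-up numbers standing in for the aggregates you would compute
from your own history; scipy's ttest_rel performs the paired t-test, whose
statistical null hypothesis is that the mean of the paired differences is zero.

```python
import numpy as np
from scipy import stats

# Hypothetical averages: mean daily visits per website on weekdays vs. weekends,
# paired by website (the same five top websites, in the same order, in both vectors).
weekday_avg = np.array([12.0, 9.5, 7.2, 4.1, 3.3])
weekend_avg = np.array([11.5, 9.0, 7.8, 4.4, 3.0])

# Paired t-test: null hypothesis is that the mean of the paired differences
# (weekday - weekend) is zero, i.e., no systematic difference between day kinds.
t_stat, p_value = stats.ttest_rel(weekday_avg, weekend_avg)

# Compare the p-value against a significance level (e.g., 0.05) to decide
# whether to reject the null hypothesis of no difference.
reject_null = p_value < 0.05
```

With only five pairs this test has little power; in your project you would use the
top 50 websites, as described above, and report the t-statistic and p-value in your
results section.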