Analyzing the Connections Between Famous People's Wikipedia Pages

  • To complete this project, generate the following files:
    • GraphAnalysis.java, containing your analysis code. Its main method should print out answers to all the questions asked below.
    • answers.pdf, containing answers to the data analysis questions in the parts below. For each question where you compute a number (such as a fraction or a representation index), include the number you computed as your answer to that question.

Submit both files to the project assignment on Gradescope.


In your project, you will explore and analyze a dataset consisting of the Wikipedia pages of famous people and the hyperlinks between them. You'll represent the data as a graph, where people's Wikipedia pages are the nodes and hyperlinks between their pages are the arcs (such that if page A links to page B, there is an arc from the vertex for page A to the vertex for page B). You will analyze the graph for interesting properties and biases of the connections between people.

Learning Goals

This project is an opportunity for you to learn, practice, and demonstrate mastery with the following skills:

  • Using graphs and graph traversals.
  • Using hash tables and reasoning about the complexity of practical code.
  • Reasoning about the real-world meaning of data represented using a Graph ADT.

The Data and Starter Code

The starter code for this project includes:

  • the data (in the datasets/ directory, one file for vertices and one for arcs) -- namely:
    • The full dataset, stored in: datasets/pantheon_nodes_all.csv and datasets/pantheon_edges_all.csv
    • A 1000-node subgraph for testing, stored in: datasets/pantheon_nodes_1000.csv and datasets/pantheon_edges_1000.csv
  • AdjListsGraph.java, a graph implementation for you to use (you do not need to implement any graph operations or modify this file)
  • a Reader class to load the graph from the data files, along with a library, fastcsv, which Reader uses (you need not modify Reader or fastcsv, and you need not use fastcsv directly).

Download it here: link

The dataset consists of a graph of Wikipedia pages representing human beings, where pages are nodes and links between them are arcs. Pages are annotated with information including the gender of the page's person and the domain of which the person is a notable member (such as the arts, business & law, etc.). You can use person.getData("name") or person.getData("gender") or person.getData("domain") to get these pieces of data from Person objects, which are used to represent the vertices of the Wikipedia graph.

Note that there is also a subgraph dataset, in the files datasets/pantheon_nodes_1000.csv and datasets/pantheon_edges_1000.csv, in case your code takes a long time to run and you need a smaller dataset for testing. HOWEVER, if your code is slow, this probably means you've done something very inefficiently -- keep working on your code, but come talk with us about it!

Your Tasks

Step 1: Getting to Know Your Data

Start by using the provided Reader class to read the dataset in. Read the Javadoc for Reader for details on how to instantiate a new Reader and to get an AdjListsGraph from it containing the data. Then, use the graph to answer these questions:

  1. Which node or nodes in the graph have the highest out-degree? (The out-degree of a node is the number of arcs going from that node to other nodes.) In your answers.pdf (as for all questions in this document), give the name of the node(s) and the out-degree.

  2. How many nodes have out-degree 0? How many have out-degree 1?
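The out-degree bookkeeping might be sketched as follows. This is a minimal, self-contained version that uses a plain Map<String, List<String>> adjacency list as a stand-in for the course's AdjListsGraph, with made-up graph contents for illustration:

```java
import java.util.*;

public class OutDegreeSketch {
    // Return the node(s) whose successor list is longest (highest out-degree).
    static List<String> highestOutDegree(Map<String, List<String>> adj) {
        int max = -1;
        List<String> best = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : adj.entrySet()) {
            int deg = e.getValue().size();
            if (deg > max) { max = deg; best.clear(); best.add(e.getKey()); }
            else if (deg == max) { best.add(e.getKey()); }   // keep ties
        }
        return best;
    }

    // Count how many nodes have exactly the given out-degree (e.g., 0 or 1).
    static int countOutDegree(Map<String, List<String>> adj, int target) {
        int count = 0;
        for (List<String> succs : adj.values())
            if (succs.size() == target) count++;
        return count;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("A", Arrays.asList("B", "C"));
        adj.put("B", Arrays.asList("C"));
        adj.put("C", Collections.emptyList());
        System.out.println(highestOutDegree(adj)); // [A]
        System.out.println(countOutDegree(adj, 0)); // 1
    }
}
```

In the real project the keys would be Person objects and the successor lists would come from the graph, but the counting logic is the same.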

  3. Which node in the graph has the highest in-degree? (The in-degree of a node is the number of arcs coming into that node from other nodes.) To answer this question, compute the in-degree for every node and store it in a Hashtable<Person, Integer>.

Update: The simplest version of solving this problem calls getPredecessors() on each node, calls size() on the returned predecessors list, and stores the result in the hash table with the node as the key and the size as the value. You may submit this less algorithmically efficient solution for full credit. If you do, you should run your code on the small, 1000-node graph, since the getPredecessors() version takes 10 minutes to run on the full graph on my laptop, which will make it very hard for you to debug!

Note (Update: this version is now extra credit): AdjListsGraph has a getPredecessors() method, but calling it on every single vertex is very inefficient, since in an adjacency-list implementation getPredecessors() must itself examine every arc in the graph by iterating over every vertex's adjacency list. Design a method which uses a hash table (you should use Java's Hashtable class) to make this process more efficient. My solution has complexity O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges, whereas calling getPredecessors() on each node could have complexity on the order of O(|V|^3) for a very dense graph. We recommend that you keep this inDegrees hash table around -- it may be useful for future tasks.
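The single-pass idea can be sketched like this, again using a plain Map adjacency list as a stand-in for AdjListsGraph; the counting logic carries over directly:

```java
import java.util.*;

public class InDegreeSketch {
    // One pass over every arc gives O(|V| + |E|) total work, versus
    // repeating a whole-graph getPredecessors()-style scan per vertex.
    static Hashtable<String, Integer> inDegrees(Map<String, List<String>> adj) {
        Hashtable<String, Integer> in = new Hashtable<>();
        for (String v : adj.keySet()) in.put(v, 0);   // every vertex starts at 0
        for (List<String> succs : adj.values())
            for (String target : succs)               // each arc is counted exactly once
                in.put(target, in.getOrDefault(target, 0) + 1);
        return in;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("A", Arrays.asList("B", "C"));
        adj.put("B", Arrays.asList("C"));
        adj.put("C", Collections.emptyList());
        Hashtable<String, Integer> in = inDegrees(adj);
        System.out.println(in.get("C")); // 2
    }
}
```

In the project, the keys would be Person objects and the arcs would come from the graph's adjacency lists, but the structure of the two loops is the same.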

  4. The in-degree can be interpreted as the prominence of a node in a graph, since higher in-degree means more ways to reach that node (and, in the case of a Wikipedia page, learn about that person). Which woman has the highest in-degree in the graph? How many men have higher in-degrees than her? (Note: If you find it useful for this task, you might try sorting a Vector using its built-in sorting capability, which takes a Comparator. Check out the API docs for more details. Feel free to use whatever approach works best for you, though.)
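The Vector-plus-Comparator suggestion might look like this sketch. The keys here are name strings with made-up counts; in the project they would be the Person keys of your inDegrees table:

```java
import java.util.*;

public class RankByInDegree {
    // Sort the keys of an in-degree table in decreasing order of in-degree,
    // using Vector's built-in sort with a Comparator.
    static Vector<String> rank(Hashtable<String, Integer> inDegrees) {
        Vector<String> names = new Vector<>(inDegrees.keySet());
        // Compare b to a (not a to b) so the largest in-degree comes first.
        names.sort((a, b) -> inDegrees.get(b).compareTo(inDegrees.get(a)));
        return names;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> in = new Hashtable<>();
        in.put("Ada Lovelace", 5);     // counts invented for illustration
        in.put("Alan Turing", 9);
        in.put("Grace Hopper", 7);
        System.out.println(rank(in)); // [Alan Turing, Grace Hopper, Ada Lovelace]
    }
}
```

Once the Vector is sorted, walking it from the front until you hit the first woman answers both parts of the question.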

  5. Spend 30 minutes answering another question or two of your choice about the data, and write down your question(s) and answers to those questions. You might ask questions about particular people; to find some interesting people, try printing out the names of all the nodes by printing reader.getNameMap().keySet(). (The name map is Reader's way of providing you with an easy way to get Person objects based on a person's name; it's a Hashtable<String, Person> which maps people's names to the Person objects corresponding to their Wikipedia pages.)

Step 2: Finding shortest paths between nodes

Breadth first search finds shortest paths between nodes in unweighted graphs.

  1. Find the shortest path between "Madeleine Albright" '59 and "J. R. R. Tolkien". (Use Reader.getNameMap().get(name) to get a Person object to use.) Repeat this with another pair of people of your choice.

  2. What person or persons are farthest away from Secretary Albright (that is, whose shortest path from her node is longest)?
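A minimal BFS sketch over a Map-based adjacency list (again a stand-in for AdjListsGraph), recording each node's predecessor so the shortest path can be rebuilt by walking backwards from the goal:

```java
import java.util.*;

public class BfsPath {
    // Standard breadth-first search from start; returns the shortest
    // path to goal as a list of nodes, or null if goal is unreachable.
    static List<String> shortestPath(Map<String, List<String>> adj,
                                     String start, String goal) {
        Map<String, String> pred = new HashMap<>(); // node -> node it was reached from
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String cur = queue.poll();
            if (cur.equals(goal)) break;
            for (String next : adj.getOrDefault(cur, Collections.emptyList())) {
                if (seen.add(next)) {        // true only the first time we see next
                    pred.put(next, cur);
                    queue.add(next);
                }
            }
        }
        if (!goal.equals(start) && !pred.containsKey(goal)) return null;
        LinkedList<String> path = new LinkedList<>();
        for (String v = goal; v != null; v = pred.get(v)) path.addFirst(v);
        return path;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("A", Arrays.asList("B", "C"));
        adj.put("B", Arrays.asList("D"));
        adj.put("C", Arrays.asList("D"));
        adj.put("D", Collections.emptyList());
        System.out.println(shortestPath(adj, "A", "D")); // [A, B, D]
    }
}
```

For the "farthest person" question, note that the same traversal can instead record each node's distance from the start; the nodes with the maximum finite distance are the farthest reachable ones.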

Step 3: Bias in the connectivity of Wikipedia pages

This step is for OPTIONAL extra credit

If 50% of the nodes in this graph are women, then we would expect about half of all links in the graph to go to women's nodes. The same logic holds for men, as well as for domains such as the arts, humanities, and science and technology. In this step, we will test whether this is true in this sample of Wikipedia.

Define the expected in-group link fraction for a group to be the fraction of all nodes which are part of that group. So if 35% of nodes are men, then the expected in-group link fraction for men is 35%.

  1. Determine the fraction of the graph consisting of a) men and b) women. Unfortunately, this dataset has only binary labels for the genders of people represented in its pages. These numbers will be used in our analysis going forward.

Now, define the representation index for a group as follows:

  • For each node in the group (e.g., for each page representing a man), compute the fraction of links on that page which go to other pages in the group (e.g., links to other men's pages). Associate each vertex with that fraction (what data structure can be used to do this mapping from Person objects to numbers efficiently?). Call this associated value the in-group link fraction for each node.
  • Average the in-group link fractions of all nodes in the group to derive an average in-group link fraction across the whole group.
  • Divide the averaged in-group fraction for the group by the expected in-group link fraction for the group. This number is the group's representation index.
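The three steps above might be sketched as follows, once more on a Map-based stand-in graph. Note that the handout does not specify how to treat group members with zero outgoing links, so this sketch simply omits them from the average; that is an assumption, and you should decide how to handle such nodes yourself:

```java
import java.util.*;

public class RepresentationIndex {
    // For one group: average, over the group's nodes, the fraction of each
    // node's outgoing links that stay inside the group; then divide that
    // average by the group's share of all nodes (the expected fraction).
    static double representationIndex(Map<String, List<String>> adj,
                                      Map<String, String> groupOf,
                                      String group) {
        double totalNodes = 0, groupNodes = 0, fractionSum = 0;
        int nodesWithLinks = 0;
        for (String v : adj.keySet()) {
            totalNodes++;
            if (!group.equals(groupOf.get(v))) continue;
            groupNodes++;
            List<String> succs = adj.get(v);
            if (succs.isEmpty()) continue;   // assumption: skip link-less nodes
            int inGroup = 0;
            for (String t : succs)
                if (group.equals(groupOf.get(t))) inGroup++;
            fractionSum += (double) inGroup / succs.size();
            nodesWithLinks++;
        }
        double avgInGroupFraction = fractionSum / nodesWithLinks;
        double expectedFraction = groupNodes / totalNodes;
        return avgInGroupFraction / expectedFraction;
    }

    public static void main(String[] args) {
        // Tiny invented example: two "M" nodes, one "F" node.
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("M1", Arrays.asList("M2", "W1"));
        adj.put("M2", Arrays.asList("M1"));
        adj.put("W1", Arrays.asList("M1"));
        Map<String, String> groupOf = new HashMap<>();
        groupOf.put("M1", "M");
        groupOf.put("M2", "M");
        groupOf.put("W1", "F");
        // Average in-group fraction for M is (1/2 + 1)/2 = 0.75;
        // expected fraction is 2/3, so the index is 0.75 / (2/3) = 1.125.
        System.out.println(representationIndex(adj, groupOf, "M"));
    }
}
```

Written modularly like this, with the group labels passed in as a map, the same method covers both the gender analysis and the domain analysis.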

See the accompanying figure/example for a small-scale demonstration of computing the representation index. (TODO: Create this figure)

  2. Write a short paragraph interpreting the representation index as a statistical idea: what does a representation index equal to, greater than, or less than 1 mean?

  3. Compute the representation index for men and the representation index for women in this dataset. Write the two indices in answers.pdf, then briefly interpret your result. Are there any flaws in the representation index as a measure of linking bias?

  4. Repeat this analysis for the "domains" into which the famous people in this graph fall: ({'ARTS', 'BUSINESS & LAW', 'EXPLORATION', 'HUMANITIES', 'INSTITUTIONS', 'PUBLIC FIGURE', 'SCIENCE & TECHNOLOGY', 'SPORTS'}). The representation index is exactly the same statistical concept for these domains; there are just more than two categories. Report the representation indices for these domains and comment briefly. Note that if you've written your code modularly, running this second analysis should be fairly easy.

  5. Take 20 minutes to write an interpretation and commentary on your results. What are some reasons that you might have gotten the result you got? Could this be an artifact of our data source? What cultural effects could lead to this result? Are there other numbers you could compute that would better represent this data?