GraphAnalysis.java
, containing your analysis code. Its main method should
print out answers to all the questions asked below.answers.pdf
, containing answers to the data analysis questions in the parts below. For each question where you compute a number (such as a fraction or a representation index), include the number you computed as your answer to that question.Submit both files to Gradescope's Project
In your project, you will explore and analyze a dataset consisting of the Wikipedia pages of famous people and the hyperlinks between them. You'll represent the data as a graph, where people's Wikipedia pages are the nodes and hyperlinks between their pages are the arcs (such that if page A links to page B, there is an arc from the vertex for page A to the vertex for page B). You will analyze the graph for interesting properties and biases of the connections between people.
This project is an opportunity for you to learn, practice, and demonstrate mastery with the following skills:
The starter code for this project includes:
datasets/
directory, one file for vertices and one for arcs) -- namely:
datasets/pantheon_nodes_all.csv
and datasets/pantheon_edges_all.csv
datasets/pantheon_nodes_1000.csv
and datasets/pantheon_edges_1000.csv
AdjListsGraph.java
, a graph implementation for you to use (you do not need to implement any graph operations or modify this file)Reader
class to load the graph from the data files, along with a library fastcsv
which Reader
uses (you need not modify Reader
or fastcsv
, and you need not directly use fastcsv
.)Download it here: link
The dataset consists of a graph of Wikipedia pages representing human beings, where pages are nodes and links between them are arcs. Pages are annotated with information including the gender of the page's person and the domain of which the person is a notable member (such as the arts, business & law, etc.). You can use person.getData("name")
or person.getData("gender")
or person.getData("domain")
to get these pieces of data from Person
objects, which are used to represent the vertices of the Wikipedia graph.
Note that there is also a subgraph dataset, in the files with names dataset/pantheon_nodes_1000.csv
and dataset/pantheon_edges_1000.csv
, in case your code takes a long time to run and you need to run it on a smaller dataset to test it. HOWEVER, if your code is slow, this probably means you've done something very inefficiently -- keep working on your code, but come talk with us about this!
Start by using the provided Reader
class to read the dataset in. Read the Javadoc for Reader
for details on how to instantiate a new Reader
and to get an AdjListsGraph
from it containing the data. Then, use the graph to answer these questions:
Which node or nodes in the graph has the highest out-degree? (The out-degree of a node is the number of arcs which go from the node to another node). In your answers.pdf
(as for all questions in this document), give the name of the node(s) and the out-degree.
How many nodes have out-degree 0? How many have out-degree 1?
Hashtable<Person, Integer>
Update:
The simplest version of solving this problem calls getPredecessors()
on each
node, calls size()
on the predecessors list returned, and stores them in the
hash table with the node as the key and the size as the value. You may submit
this less algorithmically efficient solution for full credit. If you do this,
you should run your code on the small, 1000 node graph, since the getPredecessors
version of this code takes 10 minutes to run on the full graph on my laptop,
which will make it very hard for you to debug it!
Note: (Update: this version is now extra credit): AdjListsGraph
has a getPredecessors()
method, but calling it on every single vertex is very inefficient, since in an adjacency list implementation, the getPredecessors()
method itself has to look at every edge in the graph by iterating over all vertices of the graph itself. Design a method which uses a hash table (you should use Java's Hashtable
class) to make this process more efficient. My solution has complexity O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges, whereas the solution of calling getPredecessors()
on each node could have a complexity of something like O(|V|^3) for a very dense graph. We recommend that you keep this inDegrees hash table around -- it may be useful for future tasks.
(this should be 4.) The in-degree can be interpreted as the prominence of a node in a graph, since higher in-degree means more ways to reach that node (and, in the case of a Wikipedia page, learn about that person). Which woman has the highest in-degree in the graph? How many men have higher in-degrees than her? (Note: If you find it useful for this task, you might try sorting a Vector
using its built in sorting capability, which takes a Comparator
. Check out the API docs for more details. Feel free to use whatever approach works best for you, though.)
reader.getNameMap().keySet()
. (The name map is Reader
's way of providing you with an easy way to get Person
objects based on a person's name; it's a Hashtable<String, Person>
which maps people's names to the Person
objects corresponding to their Wikipedia pages.)Breadth first search finds shortest paths between nodes in unweighted graphs.
Find the shortest path between "Madeleine Albright" '59 and "J. R. R. Tolkien". (Use Reader.getNameMap().get(name)
to get a Person
object to use. Repeat this with another pair of people of your choice.
This step is for OPTIONAL extra credit
If 50% of the nodes in this graph are women, then we would expect about half of all links in the graph to go to women's nodes. The same logic holds for men, as well as for domains such as the arts, humanities, and science and technology. In this step, we will test whether this is true in this sample of Wikipedia.
Define the expected in-group link fraction for a group to be the fraction of all nodes which are part of that group. So if 35% of nodes are men, then the expected in-group link fraction for men is 35%.
Now, define the representation index for a group as follows:
Person
objects to numbers efficiently?). Call this associated value the in-group link fraction for each node.See the accompanying figure/example for a small-scale demonstration of computing the representation index. (TODO: Create this figure)
Write a short paragraph which interprets the representation index as a statistical idea: what does a representation index which is equal to, greater than, or less than 1 mean?
Compute the representation index for men and the representation index for women in this dataset. Write the two indices in answers.pdf
, then write briefly interpreting your result. Are there any flaws with the representation index as a measure of linking bias?
Repeat this analysis for the "domains" into which the famous people in this graph fall: ({'ARTS', 'BUSINESS & LAW', 'EXPLORATION', 'HUMANITIES', 'INSTITUTIONS', 'PUBLIC FIGURE', 'SCIENCE & TECHNOLOGY', 'SPORTS'}). The representation index is exactly the same statistical concept for these domains; there are just more than two categories. Report the representation indices for these domains and comment brifely. Note that if you've written your code modularly, running this second analysis should be fairly easy.