CS249 | MongoDB Wellesley Data

Wellesley Courses Data stored in MongoDB

In AM4 we worked with the Wellesley Courses website that contained about 770+ courses from Spring 2015. Then, for AM5 we created a historical dataset of all course offerings in the past 15 years, which can be found here. However, dealing with historical data manually is not an easy task, even when you know what you want. As computer scientists in training we would like to be able to write code to do the work for us. We discussed a Python script that aggreggated the historical enrollments for a given course. But, if we have another question, we would need to write a new script. This is the problem with imperative languages, we, the programmers have to explicitly define how we want something done. This is where systems that support declarative languages come to our rescue. We just talked about MongoDB and how we could write simple queries to get back results without explaining how the query should be implemented. We will do the same with the Wellesley Courses historical data.

In our MongoDB server, I have created a database named "wellesley" that contains about 20,000+ courses of the past 15 years in a collection named courses. In the following sections, I explain first the structure of the documents. Then, I show a series of examples of how you can answer your questions with the database. Finally, I also show how you can store the results in a JSON file to use in your program.

Document Structure

In the following screenshots, you can see the old JSON course description that comes directly from the Course Browser and the new course description that I generated with a Python script. There are several changes and additions.

On the left side old JSON for CS 111, on the right side format of document stored in the MongoDB.

On the left side old JSON for CS 112 Lab, on the right side format of document stored in the MongoDB.

In the following, I have summarized all changes and additions:

All field names were converted to lower-case.
If a field didn't have a value, (e.g., "Additional Instructor(s)" is almost always empty), the field is not included in the new document. This way, we can write queries such as {"other_instructor": {$exists: true}} to find courses that have more than one instructor.
The value for the field "Day(s)" was converted from a string to an array of comma-separated day initials.
The field "Seats Available" was split into two fields: "free_seats" and "capacity", both of which have now an integer value.
The fields "Title", "Course", "CRN", and "Instructor" have remained the same, with the expection of the lowercasing of the field name.
Based on the content of "Course" value, three new fields were added:
- "subject" - that corresponds to the first part of "Course" value, such as "CS" or "MATH". To make it easy to group courses by subjects, I have replaced the four different tracks of POL to one single subject "POL".
- "level" - that corresponds to the second part of the value but is rounded to be one of the three choices: "100", "200", or "300". This can allow us to compare the different level enrollments in different subjects.
- "lab/discussion" - that corresponds to the third part of the value (for labs or discussion, this value is L01 or D01 or LA1, etc.). If a course is a lab or discussion, than it will have this field set to true. This way, we can find all lab sections by simply writing {"lab/discussion": {$exists: true}}.
The field "Current Enrollment" was renamed "enrollment" and its valued converted into a number, so that we can add up values of different courses.
The field "Locations" was stored as an array, because there are occasions when a course meets in different locations.
The field "Distributions" was also stored as an array, to account for multiple distributions. For the years in which the "Distributions" field was empty, I added the distributions that the course had in the more recent years, hoping that it hasn't changed. For circa 3500+ course offering it was not possible to set the "distributions" value, because they had been discontinued or the CRN value had changed, so it was not possible to match them to existing courses automatically.
The field "Meeting Time(s)" was changed from a string (e.g. "09:50 am - 11:00 am"), to an object of objects. This was done to take into account that sometimes different days have different times, and also to make it easier to find enrollment data for a certain day at a certain time.
Finally, three new fields were added that had to do with the semester:
- "semester_code" - to store the 6-digit string such as "201502". Using this field, we can find all courses in a particular semester.
- "year" - to store the year, (2014, 2015, etc.) separately, in order to sort or group based on year
- "semester" - to store one of the five categories: "Fall", "Spring", "Winter", "Summer I", and "Summer II". This way, we don't need to generate codes such as "201001" to find courses offered in Winter. Instead, we can ask for all years in which Winter courses were offered: {semester: "Winter"}

New field "building"

After I had written these notes, I added one more field to each document: "building". I did this to make it easy to count number of students in one building, given that the "location" field had the room information as well.

Creating the Database

Once all 76 JSON files were processed by the Python script to convert every JSON entry into a new JSON document, all the entries were stored in a single list in the file wellesley_courses.json. This is a big file, about 9.5 MB. In case you would like to inspect the content of this file outside the database, I'm making it available here (access through Google Drive for students enrolled in this course).

This file was then imported into the MongoDB server with the following command:

mongoimport -d wellesley -c courses --jsonArray wellesley_courses.json

This command creates the database "wellesley" with the option -d, then the collection "courses" with the option -d, and then takes the array of JSON entries in the file and inserts them one by one in the database's collection. There is a total of 20,469 documents in the wellesley database.

In the next section, we will see some examples of querying the database to answer simple questions.

Query Examples

Login remotely to the CS server (using ssh or an equivalent). Then invoke the mongo client with mongo. Then enter the following commands one by one:

> use wellesley;
> db.courses.count(); // returns the total number of all courses
> db.courses.findOne({subject: "CS"}); //returns the first CS course it finds, Fall 2000

Now that you got access to the database, let's try to answer some questions.

What were the enrollments in CS 111 in the past years?

You might remember that you had a Python script to perform this operation. Below is the query in MongoDB. We are restricting it to semesters in the Fall and Spring, given that the other semesters can make the data uneven.

db.courses.find({course: {$regex: /CS 111/}, semester: {$in: ['Fall', 'Spring']}, 
                 'lab/discussion': {$exists: false}}, 
                 {course: 1, _id:0, year: 1, semester: 1, enrollment:1}).sort({year:1})

Explanation: You remember that we can use regular expressions to search text. MongoDB offers a $regex operator that will allow us to search either the course names or titles (or whatever field you want). Simply put the expression you're looking for within the slash characters and the rest will be taken care by the database engine. In addition to the operator $or that we saw in the lecture notes, we can use the operator $in when we are looking for one possible value. The results will contain all documents for which the "semester" value is either 'Fall' or 'Spring'. Additionally, we don't want to see the lab sections, because the same number of students in the lectures will be spread out among the lab sections. Once we create the query selector with these three fields, we can specify as the second argument the projection: the fields that we want to be displayed or not in the results. Finally, we sort by the year (ascending order, first 2000, 2001, and so on).

Show all courses that a certain professor has taught

Let's assume we are interested in all courses taught by Prof. Metaxas. The query follows:

db.courses.find({instructor: {$regex: /Metaxas/}}, 
                {course: 1, _id: 0, semester_code: 1}).sort({year: 1})

By default, the database will display the first 20 results. If you type "it" (for iterate), you can get to see the rest of the results.

Show all courses that are being taught in Spring 2015 on a certain day at 8:30 am

This time, we will make use of the composite structure of the "times" field. Let's assume we want courses that meet on Monday at 8:30:

db.courses.find({semester_code: "201502", "times.M.start" : "08:30 am"}, 
                {enrollment: 1, locations: 1, _id: 0, course: 1})

Important note: when we use the dot operator to access nested fields of an object, we need to enclose the whole expression in quotes.

While these simple queries give us important information, MongoDB can do more. In the next section we will talk about aggregation.

Aggregation Queries

Let's start by asking this question: What is the total number of students enrolled in CS courses in one semester? This question is different from the ones we saw in the previous section, because it asks to calculate a sum over the enrollment field in all CS courses. MongoDB has a dedicated method called .aggregate() which provides this functionality.

Let's try the following query and see the result:

db.courses.aggregate({$match: {subject: "CS", "lab/discussion": {$exists: false}, 
                               semester: {$in: ["Fall", "Spring"]}}},                      
                     {$group : {_id : "$semester_code", total_students : {$sum : "$enrollment"}}},                                            {$sort: {_id: -1}})

In this query, we use the $match operator to find the documents we are interested in (only CS lectures from Fall or Spring). Then, these documents will be given to the $group operator to perform the aggregation. Since we'll be creating new documents, we have to supply what value we want for the obligatory _id field. Then, we create a new field, "total_students" which will store the value being calculated as the sum of all enrollment values. Finally, we sort the new documents in the descending order of their _id values (start with Spring 2015 and move downward).

Here is a second query that will answer the question: Which departments had the largest enrollments in a certain semester?

db.courses.aggregate({$match: {semester_code: '201502', "lab/discussion": {$exists: false}}}, 
                     {$group : {_id : "$subject", total_students : {$sum : "$enrollment"}}}, 
                     {$sort: {total_students : -1}})

You can notice that this query is very similar to the first one, we simply are choosing other fields for the $match operation.

Here is a third query to answer the question: Who are the instructors that have taught the most number of students?

db.courses.aggregate({$match: {instructor: {$exists: true}}},
                     {$group : {_id : "$instructor", total_students : {$sum : "$enrollment"}}},                                                {$sort: {total_students : -1}})

Here is a fourth query to answer the question: How many courses are taught in every academic buildindg?

db.courses.aggregate({$match: {semester_code: "201502"}}, 
                     {$group: {_id: "$building", courses_offered: {$sum: 1}}}, 
                     {$sort: {courses_offered: -1}})

Finally, here is a query that answers the question: How many students are in each building at 8:30 am on a Monday?

db.courses.aggregate({$match: {semester_code: "201502", "times.M.start" : "08:30 am"}}, 
                     {$group: {_id: "$building", students_in: {$sum: "$enrollment"}}}, 
                     {$sort: {students_in: -1}})

Exporting Data

Once we calculate aggregations to answer our questions, we might want to store the results as a JSON file to use in our Javascript code. This turns out to be really easy in MongoDB.

Use the operator $out at the end of aggregate in order to store the new documents in a new collection.

Use the command mongoexport outside the mongo shell to export the data from the collection to a file.

Here is some code to save the results of Monday, 8:30 am.

db.courses.aggregate({$match: {semester_code: "201502", "times.M.start" : "08:30 am"}}, 
                     {$group: {_id: "$building", students_in: {$sum: "$enrollment"}}}, 
                     {$sort: {students_in: -1}}, 
                     {$out : "monday830"}
                     )

We can check that the results was stored in a new collection, by searching:

db.monday830.find();

We close the mongo shell by typing quit(), then, in the command line, we write:

/usr/bin/mongoexport -d wellesley -c monday830 --jsonArray -o monday830data.json

This is very similar to the mongoimport command we saw above, and it specifies an output file with "-o" followed by filename (file doesn't need to exist).

Wellesley Courses MongoDB