In AM4 we worked with the Wellesley Courses website that contained about 770+ courses from Spring 2015. Then, for AM5 we created a historical dataset of all course offerings in the past 15 years, which can be found here. However, dealing with historical data manually is not an easy task, even when you know what you want. As computer scientists in training we would like to be able to write code to do the work for us. We discussed a Python script that aggreggated the historical enrollments for a given course. But, if we have another question, we would need to write a new script. This is the problem with imperative languages, we, the programmers have to explicitly define how we want something done. This is where systems that support declarative languages come to our rescue. We just talked about MongoDB and how we could write simple queries to get back results without explaining how the query should be implemented. We will do the same with the Wellesley Courses historical data.
In our MongoDB server, I have created a database named "wellesley" that contains about
20,000+ courses of the past 15 years in a collection named courses
.
In the following sections, I explain first the structure of the documents. Then,
I show a series of examples of how you can answer your questions
with the database. Finally, I also show how you can store the results in a JSON file to
use in your program.
In the following screenshots, you can see the old JSON course description that comes directly from the Course Browser and the new course description that I generated with a Python script. There are several changes and additions.
In the following, I have summarized all changes and additions:
{"other_instructor": {$exists: true}}
to
find courses that have more than one instructor.{"lab/discussion": {$exists: true}}
.{semester: "Winter"}
After I had written these notes, I added one more field to each document: "building". I did this to make it easy to count number of students in one building, given that the "location" field had the room information as well.
Once all 76 JSON files were processed by the Python script to convert every JSON
entry into a new JSON document, all the entries were stored in a single list in
the file wellesley_courses.json
. This is a big file, about 9.5 MB.
In case you would like to inspect the content of this file outside the database,
I'm making it available here (access through Google Drive for students enrolled
in this course).
This file was then imported into the MongoDB server with the following command:
mongoimport -d wellesley -c courses --jsonArray wellesley_courses.json
This command creates the database "wellesley" with the option -d, then the collection "courses" with the option -d, and then takes the array of JSON entries in the file and inserts them one by one in the database's collection. There is a total of 20,469 documents in the wellesley database.
In the next section, we will see some examples of querying the database to answer simple questions.
Login remotely to the CS server (using ssh or an equivalent). Then invoke the
mongo client with mongo
. Then enter the following commands one by one:
> use wellesley; > db.courses.count(); // returns the total number of all courses > db.courses.findOne({subject: "CS"}); //returns the first CS course it finds, Fall 2000
Now that you got access to the database, let's try to answer some questions.
You might remember that you had a Python script to perform this operation. Below is the query in MongoDB. We are restricting it to semesters in the Fall and Spring, given that the other semesters can make the data uneven.
db.courses.find({course: {$regex: /CS 111/}, semester: {$in: ['Fall', 'Spring']}, 'lab/discussion': {$exists: false}}, {course: 1, _id:0, year: 1, semester: 1, enrollment:1}).sort({year:1})
Explanation: You remember that we can use regular expressions to search text. MongoDB offers
a $regex operator that will allow us to search either the course names or titles (or whatever
field you want). Simply put the expression you're looking for within the slash characters and
the rest will be taken care by the database engine. In addition to the operator $or that we
saw in the lecture notes, we can use the operator $in
when we are looking for one possible
value. The results will contain all documents for which the "semester" value is either 'Fall'
or 'Spring'. Additionally, we don't
want to see the lab sections, because the same number of students in the lectures will be
spread out among the lab sections. Once we create the query selector with these three fields,
we can specify as the second argument the projection: the fields that we want to be displayed
or not in the results. Finally, we sort by the year (ascending order, first 2000, 2001, and so on).
Let's assume we are interested in all courses taught by Prof. Metaxas. The query follows:
db.courses.find({instructor: {$regex: /Metaxas/}}, {course: 1, _id: 0, semester_code: 1}).sort({year: 1})
By default, the database will display the first 20 results. If you type "it" (for iterate), you can get to see the rest of the results.
This time, we will make use of the composite structure of the "times" field. Let's assume we want courses that meet on Monday at 8:30:
db.courses.find({semester_code: "201502", "times.M.start" : "08:30 am"}, {enrollment: 1, locations: 1, _id: 0, course: 1})
Important note: when we use the dot operator to access nested fields of an object, we need to enclose the whole expression in quotes.
While these simple queries give us important information, MongoDB can do more. In the next section we will talk about aggregation.
Let's start by asking this question: What is the total number of
students enrolled in CS courses in one semester? This question is
different from the ones we saw in the previous section, because it asks to
calculate a sum over the enrollment field in all CS courses. MongoDB has
a dedicated method called .aggregate()
which provides this functionality.
Let's try the following query and see the result:
db.courses.aggregate({$match: {subject: "CS", "lab/discussion": {$exists: false}, semester: {$in: ["Fall", "Spring"]}}}, {$group : {_id : "$semester_code", total_students : {$sum : "$enrollment"}}}, {$sort: {_id: -1}})
In this query, we use the $match
operator to find the documents we
are interested in (only CS lectures from Fall or Spring). Then, these documents
will be given to the $group
operator to perform the aggregation.
Since we'll be creating new documents, we have to supply what value we want
for the obligatory _id
field. Then, we create a new field, "total_students"
which will store the value being calculated as the sum of all enrollment values. Finally,
we sort the new documents in the descending order of their _id values (start with Spring 2015
and move downward).
Here is a second query that will answer the question: Which departments had the largest enrollments in a certain semester?
db.courses.aggregate({$match: {semester_code: '201502', "lab/discussion": {$exists: false}}}, {$group : {_id : "$subject", total_students : {$sum : "$enrollment"}}}, {$sort: {total_students : -1}})
You can notice that this query is very similar to the first one, we simply are choosing other fields for the $match operation.
Here is a third query to answer the question: Who are the instructors that have taught the most number of students?
db.courses.aggregate({$match: {instructor: {$exists: true}}}, {$group : {_id : "$instructor", total_students : {$sum : "$enrollment"}}}, {$sort: {total_students : -1}})
Here is a fourth query to answer the question: How many courses are taught in every academic buildindg?
db.courses.aggregate({$match: {semester_code: "201502"}}, {$group: {_id: "$building", courses_offered: {$sum: 1}}}, {$sort: {courses_offered: -1}})
Finally, here is a query that answers the question: How many students are in each building at 8:30 am on a Monday?
db.courses.aggregate({$match: {semester_code: "201502", "times.M.start" : "08:30 am"}}, {$group: {_id: "$building", students_in: {$sum: "$enrollment"}}}, {$sort: {students_in: -1}})
Once we calculate aggregations to answer our questions, we might want to store the results as a JSON file to use in our Javascript code. This turns out to be really easy in MongoDB.
$out
at the end of aggregate in order to store
the new documents in a new collection.mongoexport
outside the mongo shell to export the
data from the collection to a file.Here is some code to save the results of Monday, 8:30 am.
db.courses.aggregate({$match: {semester_code: "201502", "times.M.start" : "08:30 am"}}, {$group: {_id: "$building", students_in: {$sum: "$enrollment"}}}, {$sort: {students_in: -1}}, {$out : "monday830"} )
We can check that the results was stored in a new collection, by searching:
db.monday830.find();
We close the mongo shell by typing quit()
, then, in the command line,
we write:
/usr/bin/mongoexport -d wellesley -c monday830 --jsonArray -o monday830data.json
This is very similar to the mongoimport command we saw above, and it specifies an output file with "-o" followed by filename (file doesn't need to exist).