MongoDB and NoSQL Databases

MongoDB is one of a class of so-called "NoSQL" database systems. It has some nice performance properties that account for its recent surge in popularity and the decline in market share for relational database systems (RDBMS) like MySQL. Please read the following brief introductions.

  • Will NoSQL Databases Live Up to Their Promise, by Neal Leavitt (PDF). This is the best description and comparison with RDBMSs that I've read. It's only three pages long. If you read nothing else, read this. Here's a local copy of that PDF: Will NoSQL Databases Live Up to Their Promise
  • The NoSQL Wikipedia article gives more information and motivation. It's about 10 pages, but I suggest reading just through the section entitled "document store", which is the first 3 pages.
  • NoSQL Explained. This article is from the folks at MongoDB, so it's a little biased, but it does a good job and is a bit more comprehensive than the first two. This is about 8 pages long. It repeats some of the ideas from the first two, but with a bit more detail.
  • MongoDB Tutorial. This is a multiple web-page tutorial. Just read the first two, please. Each page is short, bulked up with lots of ads.

Our goal for this reading is not to learn how to use MongoDB. Instead, we want to learn the basic concepts and ideas behind the choice between traditional Relational Database Systems and newer NoSQL database systems, such as MongoDB. There are some practical examples at the end, but just to make it concrete, not to start you coding with MongoDB.

MongoDB vs SQL

Take a moment to look at the table below, which helps to translate the terminology that we are familiar with (tables, rows) to the NoSQL analogs (collections, documents). (This table is a subset drawn from comparison with SQL.)

SQL Terms/Concepts    MongoDB Terms/Concepts
database              database
table                 collection
row                   document or BSON document
column                field
index                 index
table joins           $lookup, embedded documents
primary key           primary key
transactions          transactions (see footnote 1)

Denormalized

There's a crucial footnote to that table, talking about a denormalized data model. What does that mean? Trivially, it means that the database is not normalized. Normalization is a lengthy topic that we'll only have a little time to touch on, but it's very important, so we'll take some time for it now.

(Normalization is a topic I have taught for many years in CS 304, but it didn't make the cut this year. If you want to know more about it, see my description here: Normalization.)

Meanwhile, let's learn the basic ideas of normalization. Suppose we want to have a set-valued attribute, such as someone's hobbies (Homer's are tv, doughnuts and beer), or the genres of a movie (IMDB categorizes Dune as Action, Adventure, and Drama). We learned earlier in the course that, at least for large, open-ended sets like hobbies, we need to create a 1-to-many relationship. So we should not use the following denormalized one-table database:

id  name   address                hobbies
1   Homer  742 Evergreen Terrace  tv
1   Homer  742 Evergreen Terrace  doughnuts
1   Homer  742 Evergreen Terrace  beer
2   Lisa   742 Evergreen Terrace  reading
2   Lisa   742 Evergreen Terrace  politics

but instead a normalized database with two tables, like this:

id  name   address
1   Homer  742 Evergreen Terrace
2   Lisa   742 Evergreen Terrace

and

id  hobby
1   tv
1   doughnuts
1   beer
2   reading
2   politics

Such a representation is normalized and avoids redundancy in the form of storing Homer's address multiple times.

Key idea: Normalization eliminates redundancy, and that avoids certain anomalies: update anomalies, insertion anomalies and deletion anomalies.

Anomalies

Another brief detour about anomalies. Suppose we have a denormalized database and one person is updating Homer's address ("742 Evergreen Lane"), while another person is inserting a new hobby of his ("pizza"). You can see how, if these two transactions are happening concurrently, we might end up with the following mess:

id  name   address                hobbies
1   Homer  742 Evergreen Lane     tv
1   Homer  742 Evergreen Lane     doughnuts
1   Homer  742 Evergreen Lane     beer
1   Homer  742 Evergreen Terrace  pizza

That is, the transaction to change the address grabbed (maybe locked) three rows and the transaction to insert the new hobby got the data for the other fields from the old rows, so the address is wrong.
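
The interleaving can be sketched in a few lines of plain JavaScript. This is only an illustration, not how a real DBMS schedules transactions: the array rows stands in for the denormalized table, the column name hobby and the variable names are my own, and the two "transactions" are just statements run in an unlucky order.

```javascript
// Denormalized table: one row per (person, hobby) pair.
const rows = [
  { id: 1, name: 'Homer', address: '742 Evergreen Terrace', hobby: 'tv' },
  { id: 1, name: 'Homer', address: '742 Evergreen Terrace', hobby: 'doughnuts' },
  { id: 1, name: 'Homer', address: '742 Evergreen Terrace', hobby: 'beer' },
];

// Transaction B, step 1: read an existing row to copy the name and address from.
const template = { ...rows.find((r) => r.id === 1) };

// Transaction A: update the address in every row that exists right now.
for (const r of rows) {
  if (r.id === 1) r.address = '742 Evergreen Lane';
}

// Transaction B, step 2: insert the new hobby using the stale copy.
rows.push({ ...template, hobby: 'pizza' });

// Homer now has two different addresses in the table.
const addresses = new Set(rows.map((r) => r.address));
console.log([...addresses]);
```

In the normalized two-table design, the address lives in exactly one row, so the update and the insert touch different tables and cannot disagree in this way.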

Space

Moreover, redundantly storing the address (and other data) in each of 3-4 rows for Homer is clearly a waste of storage space. Avoiding wasted space mattered a lot in the days when disks were small, but nowadays disks are enormous and DBMS (Database Management Systems) designers feel free to squander space in order to gain performance.

NoSQL Representation

How would the database above about Homer and his hobbies be represented in MongoDB? First of all, MongoDB and other NoSQL DBMSs give up on the constraint that rows have a fixed format and size (see footnote 2); instead of rows, they have documents that can contain embedded information. So, the representation might be:

{id: 1,
 name: "Homer",
 address: "742 Evergreen Terrace",
 hobbies: ["tv", "doughnuts", "beer"]
 }

If this reminds you of JSON, that's exactly right. Items in a MongoDB database collection (table) are JSON-like documents. Internally, MongoDB stores them in a binary encoding of JSON, called BSON.

Important Note: Because of this denormalized representation, we don't have to do a join in order to get Homer's hobbies.
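
To make the difference concrete, here is a small sketch in plain JavaScript, with arrays standing in for tables and collections. No database is involved, and the function names hobbiesByJoin and hobbiesByLookup are made up for this illustration.

```javascript
// Normalized: two "tables", linked by id.
const people = [
  { id: 1, name: 'Homer', address: '742 Evergreen Terrace' },
  { id: 2, name: 'Lisa', address: '742 Evergreen Terrace' },
];
const hobbies = [
  { id: 1, hobby: 'tv' },
  { id: 1, hobby: 'doughnuts' },
  { id: 1, hobby: 'beer' },
  { id: 2, hobby: 'reading' },
  { id: 2, hobby: 'politics' },
];

// Getting someone's hobbies means matching rows across both tables (a join).
function hobbiesByJoin(name) {
  const person = people.find((p) => p.name === name);
  return hobbies.filter((h) => h.id === person.id).map((h) => h.hobby);
}

// Denormalized: one "collection" of documents with the hobbies embedded.
const documents = [
  { id: 1, name: 'Homer', address: '742 Evergreen Terrace',
    hobbies: ['tv', 'doughnuts', 'beer'] },
  { id: 2, name: 'Lisa', address: '742 Evergreen Terrace',
    hobbies: ['reading', 'politics'] },
];

// Finding the document is enough; the hobbies come along with it.
function hobbiesByLookup(name) {
  return documents.find((d) => d.name === name).hobbies;
}

console.log(hobbiesByJoin('Homer'));
console.log(hobbiesByLookup('Homer'));
```

Both functions return the same list; the second one just never has to consult a second table.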

NoSQL Motivations

Now that we understand some of these ideas and terminology, let's briefly describe some of the motivations of NoSQL databases like MongoDB, compared to RDBMSs (Relational DBMSs) like MySQL:

  • unstructured data. A NoSQL document can have whatever fields we want, while an RDBMS requires structured data, where each row has the same columns.
  • denormalized data. A NoSQL database lets you represent 1:N relationships by embedding the "many" side in the document, so you can avoid joins.
  • No SQL. MongoDB and its ilk don't use SQL. Hence the name. SQL is a language of tables, columns and joins. We have to learn a new language to use MongoDB.
  • sharding. Because NoSQL avoids joins, the database can be spread across multiple servers, yielding greater concurrency. This is called sharding (each server holds a "shard" of the whole). RDBMSs are harder to spread across multiple servers because of joins.
  • space. NoSQL databases are willing to spend additional space in order to gain speed and allow sharding.
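
Here is a toy sketch of the sharding idea in plain JavaScript. It is not MongoDB's actual shard-key or balancing mechanism; the hash function, the shard count and the names shardFor, insert and findByName are all assumptions made for the illustration.

```javascript
const SHARD_COUNT = 3; // hypothetical number of servers

// Toy hash: add up the character codes of the key, then take the remainder.
function shardFor(key) {
  let sum = 0;
  for (const ch of String(key)) sum += ch.codePointAt(0);
  return sum % SHARD_COUNT;
}

// Each "server" is just an array of documents here.
const shards = Array.from({ length: SHARD_COUNT }, () => []);

function insert(doc) {
  shards[shardFor(doc.name)].push(doc);
}

// Only one shard is consulted, because the document carries its own
// embedded data and no join with another shard is needed.
function findByName(name) {
  return shards[shardFor(name)].find((d) => d.name === name);
}

insert({ name: 'Homer', hobbies: ['tv', 'doughnuts', 'beer'] });
insert({ name: 'Lisa', hobbies: ['reading', 'politics'] });
console.log(findByName('Lisa'));
```

If hobbies lived in a separate table that was sharded differently, answering the same question could require talking to more than one server.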

Consequences.

  • NoSQL databases often give up the ACID properties that we discussed in the class on transactions. (Though some of these properties are coming back to newer implementations.)
  • Consistency is sometimes sacrificed. The same query might yield slightly different results depending on which server it goes to and other details. But in many applications, that doesn't matter.
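
The consistency point can be illustrated with a tiny simulation. This is a sketch under simplifying assumptions, not MongoDB's replication protocol: two arrays play the role of two servers, and the names primary, secondary, pending and replicate are invented for the example.

```javascript
// Two copies of the same collection on different (simulated) servers.
const primary = [];
const secondary = [];
const pending = []; // writes not yet copied to the secondary

function insert(doc) {
  primary.push(doc);
  pending.push(doc); // replication happens later, not immediately
}

function replicate() {
  while (pending.length > 0) secondary.push(pending.shift());
}

insert({ thing: 'raindrops on roses' });
replicate();
insert({ thing: 'whiskers on kittens' }); // not yet replicated

// The same "count everything" query gives different answers,
// depending on which server it goes to.
const countOnPrimary = primary.length;
const countOnSecondary = secondary.length;
console.log(countOnPrimary, countOnSecondary);

// Once replication catches up, the two servers agree again.
replicate();
console.log(primary.length, secondary.length);
```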

Now that we've discussed some of the concepts and motivation behind databases like MongoDB, let's learn a little about MongoDB. We won't be implementing anything using MongoDB in CS 304, so the details are not important.

MongoDB

MongoDB has a client-server architecture just like MySQL does. There is a daemon process that controls the database files; it's called mongod. You can connect to the database using the client program, which is called mongo. Both are installed on the CS server.

Commands to know

  • help
  • db.help()
  • db.collection.help()
  • show dbs
  • use <db>
  • show collections

MongoDB shell

MongoDB is installed on the CS server. You will need to log in to the CS server, and then you can run the mongo shell as shown in the example interaction below. Note that a collection of warning messages is printed when it starts; you can ignore those.

Unlike MySQL, where the database administrator (me) has to create a database for you, MongoDB creates them on the fly. I suggest that you use your username, followed by the letters "db" as the name of your database. So, Hermione Granger would say use hgrangerdb. Below, I'll use scottdb.

Similarly, MongoDB creates collections (tables) on the fly, so there's no separate "create table" step. You can just insert data, without any prior notice.

Here's an example interaction, where I insert some actor data into a collection of actors in the scottdb database:

$ mongo
MongoDB server version: 4.2.10
> use scottdb;
switched to db scottdb
> db.actors.find() // empty
> db.actors.insertOne({"name":"Colin Firth"})
// Salma has more info; no need to be consistent 
> db.actors.insertOne({"name":"Salma Hayek","birthdate":"9/2/1966"})
> db.actors.find();
{ "_id" : ObjectId("535dfee55ed6d98999b62c71"), "name" : "Colin Firth" } 
{ "_id" : ObjectId("535dfef95ed6d98999b62c72"), "name" : "Salma Hayek", "birthdate" : "9/2/1966" } 
> db.actors.find().pretty();
{ "_id" : ObjectId("535dfee55ed6d98999b62c71"), "name" : "Colin Firth" } 
{ 
"_id" : ObjectId("535dfef95ed6d98999b62c72"), 
"name" : "Salma Hayek", 
"birthdate" : "9/2/1966" 
} 

You'll notice that each document has an _id field. That's a unique identifier that is automatically assigned to the document by MongoDB.
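
One detail worth knowing: the first four bytes (eight hex digits) of an ObjectId are a timestamp, in seconds since the Unix epoch, recording when the id was generated. The sketch below decodes that timestamp from the hex string in plain JavaScript; the helper name objectIdTime is made up for this example and is not part of any MongoDB API.

```javascript
// Decode the creation time embedded in an ObjectId's hex string.
function objectIdTime(hex) {
  // The first 8 hex digits are seconds since 1970-01-01 UTC.
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// The _id of the Colin Firth document from the session above.
const created = objectIdTime('535dfee55ed6d98999b62c71');
console.log(created.toISOString());
```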

MongoDB, Node.js and Callbacks

MongoDB also supports the event-loop, non-blocking, asynchronous I/O model that we discussed in the context of Node.js. Consequently, Node.js and MongoDB are often used together.

As we learned, though, asynchronous I/O means we don't get return values; instead we have to supply a callback function. However, the Mongo shell application above actually returns values or seems to, so what about this callback-oriented programming style?

Indeed, the mongo shell returns values, just as we would expect. But if we connect to the MongoDB server from a Node.js program, the calls are asynchronous, so we use callback-style coding (or promises; see the aside below). We'll see that in a moment.

Aside: Promises

There's a version of the MongoDB API that uses promises, async and await, which are relatively recent additions to the JavaScript language. I've avoided these new features below, since we don't know those features of JavaScript, but they should go on your long list of "things I should learn more about". I've yet to find or write a good introduction/tutorial, but this Primer on Promises by Jake Archibald looks good and is also amusing.
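
If you're curious what that style looks like, here is a minimal sketch that needs no database at all. The function findThings is a made-up stand-in for a callback-style database call; util.promisify is a real Node.js utility that wraps any function whose last argument is an (err, result) callback so that it returns a promise instead.

```javascript
const { promisify } = require('util');

// A stand-in for a callback-style database call (hypothetical; it just
// hands back a fixed list on a later turn of the event loop).
function findThings(query, callback) {
  setImmediate(() => callback(null, ['raindrops on roses', 'whiskers on kittens']));
}

// Wrap the callback-style function so that it returns a promise.
const findThingsAsync = promisify(findThings);

// With async/await the code reads top to bottom, but it is still asynchronous.
async function main() {
  const docs = await findThingsAsync({});
  docs.forEach((thing, i) => console.log(i + ': ', thing));
  return docs.length;
}

const pending = main();
pending.catch((err) => console.error(err));
```

Notice that main() returns immediately with a promise; the listing is printed later, once the stand-in "query" has completed.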

Practical Examples

There are some examples that you can run in the course account. (You can copy them to your own account in the usual way if you want to edit or adapt them, but that's not necessary, and not copying them saves us some disk space. See me if you want to do that.)

You can run my examples by just cd-ing to the folder in the course account:

cd ~cs304/pub/downloads/mongo

There are several scripts in there that create/read/delete some things in my database. The collection is called things because the example is to have a collection of "my favorite things".

Running the examples

Whether you make your own copy or use the one in the course folder, you run the examples like this:

node list-things.js 

Here's an example:

[cs304@tempest mongo] node list-things.js 
Connected successfully to server
after executing findThings
Found the following documents:
0:  raindrops on roses
1:  whiskers on kittens
2:  warm woolen mittens
3:  brown paper packages
4:  chocolate
after listing all documents
after closing database
[cs304@tempest mongo] 

Here's a useful sequence of things to try: It lists the (empty) collection of things, inserts some things, lists them again, inserts Dr. Zhivago (a more complex document), lists things again, deletes them all, and lists the empty collection. This demonstrates most of our CRUD operations: create, read, update and delete. (It only omits update.)

node list-things.js 
node insert-things.js 
node list-things.js 
node insert-zhivago.js 
node insert-things.js 
node list-things.js 
node delete-things.js 
node list-things.js 

You can read over the source code in the downloads/mongo folder.

We'll run a few of these in class if you're interested.

Query example

I won't show and discuss all the code, but we'll look at one of those scripts, namely the list-things.js script that prints every document in a collection (table) called things in the scottdb database. It runs to completion and exits, rather than starting up a server, but you still run it using node. We'll see the output after looking at the code. Remember, this is JavaScript, which many of you don't know, so try not to get tripped up on syntax. We'll walk through this code in class.

// Followed example at
// http://mongodb.github.io/node-mongodb-native/3.2/tutorials/connect/

const MongoClient = require('mongodb').MongoClient;
const assert = require('assert');

// Connection URL
const url = 'mongodb://localhost:27017';

// Database Name
const dbName = 'scottdb';

// Create a new MongoClient
const client = new MongoClient(url, { useUnifiedTopology: true} );

var findThings = function(db, callback) {
    var col = db.collection('things');
    col.find({}).toArray( function(err, docs) {
        assert.equal(null, err);
        console.log('Found the following documents:');
        for( var i = 0; i < docs.length; i++ ) {
            console.log(i+': ',docs[i].thing);
        }
        console.log('after listing all documents');
        callback();
    });
};

// Use connect method to connect to the Server
client.connect(function(err) {
    assert.equal(null, err);
    console.log("Connected successfully to server");

    const db = client.db(dbName);
    var close = function () {
        client.close();
        console.log('after closing database');
    }
    findThings(db, close);
    console.log('after executing findThings');
});

Things to note in the code:

  • the accumulation of callbacks, though promises would ameliorate that
  • the way that the database connection is closed after we are done, namely by passing a callback function through several layers of function calls so that after we are done with the database, it gets closed.

When we run this with an empty collection, the output looks like this:

$ node list-things.js 
Connected successfully to server
after executing findThings
Found the following documents:
after listing all documents
after closing database
$

Note the order in which the "after" strings are printed when the code is run. This demonstrates exactly the execution order from our very first node example, where 'World' is printed before the readFile completes:

const fs = require('fs');

console.log('Hello');
fs.readFile('/path/to/file', function(err, data) {
    // do something ...
});
console.log(' World');
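
You can reproduce the same ordering without a database. In the sketch below, fakeFind is a made-up stand-in for the asynchronous query (it simply defers its callback with setImmediate), and the messages are collected in an array called events so that the order is easy to inspect. The structure mirrors list-things.js, but it is a simplified illustration, not the script itself.

```javascript
const events = [];

// Stand-in for an asynchronous database query (hypothetical): it calls
// back with an empty result on a later turn of the event loop.
function fakeFind(callback) {
  setImmediate(() => callback(null, []));
}

// Same shape as findThings in list-things.js: do the query, then call done.
function findThings(done) {
  fakeFind((err, docs) => {
    events.push('Found the following documents:');
    docs.forEach((doc, i) => events.push(i + ': ' + doc));
    events.push('after listing all documents');
    done();
  });
}

// The "close" callback is passed down and runs last.
function close() {
  events.push('after closing database');
  console.log(events.join('\n'));
}

events.push('Connected successfully to server');
findThings(close);
events.push('after executing findThings');
```

When the last line runs, the query callback has not fired yet, which is why 'after executing findThings' lands ahead of 'Found the following documents:' in the printed list.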

A web application combining Node.js with MongoDB

A few years ago, students asked me to create a complete, albeit small, web application using Node.js and MongoDB. If you're interested, you can read more about this web app with Node.js and MongoDB, including source code.

Summary

  • Node.js and MongoDB are real players in the web and database world for good reason: performance
  • They do certain things very well:
    • Inserting documents (rows)
    • Handling unstructured data
    • Iterating over documents to search
    • Handling I/O-bound HTTP requests
    • Sharding to spread databases over multiple servers.
  • They (mostly) don't do other things:
    • Joins, so no normalized databases
    • Compute-bound HTTP requests (though they can with additional threads).

  1. For many scenarios, the denormalized data model (embedded documents and arrays) will continue to be optimal for your data and use cases instead of multi-document transactions. That is, for many scenarios, modeling your data appropriately will minimize the need for multi-document transactions. 

  2. In the very olden days, rows in relational database tables were fixed size, and there was no varchar() datatype, only char(). The advent of varchar() meant that rows aren't quite fixed size anymore, but they still aren't hierarchical or as complex as MongoDB documents.