Node.js

A new player on the scene is Node.js. In its short life, it's become very popular, so we need to understand why that is, and also its limitations. In class, we'll do some hands-on work so that you'll get some experience with a Node.js application.

Because they often work together, we'll also talk about MongoDB, a NoSQL database management system, but next time.

A few years ago, Monica Feldman did a thesis with me, using Node.js and MongoDB. (She's now at Apple.)

Outside Readings¶

To prepare for class, I'd like you to read the following:

The about Node.js page from their website. It's a concise 3 pages or so.
The discussion of blocking versus non-blocking page from their website. It's about 5 pages and covers some important concepts.
The following is about the event-loop in JavaScript. It's pretty succinct and clear: Concurrency Model and the Event Loop.

Some important features to note about Node.js are:

It is not multi-threaded. In fact, there's only one thread.
It uses a non-blocking I/O style.
It uses callbacks and event handlers (instead of return values) for I/O.

Callbacks instead of Return Values¶

One weird thing about programming using Node.js and MongoDB is that, because it's event-driven rather than multi-threaded, functions that do I/O pass the data to callback functions instead of returning values.

For example, instead of the following:

function foo() { 
    var val = bar(1,2,3);  // bar might do some I/O or something like that. 
    // do something with val ... 
    doSomething(val); 
}

which requires foo to wait for bar to return, we instead do:

function foo() { 
    bar(1,2,3,doSomething); 
}

So, we pass a callback function to bar, and bar invokes that callback with the result of its computation or I/O once the result is available.
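To see the difference in control flow, here is a minimal runnable sketch. The body of bar is a stand-in assumed for illustration (a timer that pretends to be slow I/O and then adds its arguments), and the events array is only there to record the order in which things happen.

```javascript
// Hypothetical stand-in for a function that does I/O: it finishes later
// and hands its result to the callback instead of returning it.
function bar(a, b, c, callback) {
    setTimeout(function () {
        callback(a + b + c);
    }, 10);
}

var events = [];   // records the order in which things happen

function doSomething(val) {
    events.push('doSomething got ' + val);
    console.log(events.join(', then '));
}

function foo() {
    bar(1, 2, 3, doSomething);
    // bar has only *started* its work; foo does not wait for it
    events.push('foo returned');
}

foo();
// prints: foo returned, then doSomething got 6
```

Notice that foo returns before doSomething ever runs, which is exactly what the blocking version cannot do.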

Amsler Talk¶

Thomas Amsler's talk discusses the reasons that Node.js has taken the Web Development world by storm. I won't go through the whole talk, but I will discuss a few of the most important parts.

A few observations to start:

Created by Ryan Dahl in 2009. First presented at JSConf EU.
Node is a platform built on Chrome's V8 JavaScript runtime for easily building fast, scalable network applications.
Node uses an event-driven, non-blocking I/O model that makes it lightweight and efficient.
Node is for data-intensive real-time applications that run across distributed devices.

I/O is not Free¶

I/O takes time, a lot of time. Those of you who have taken CS 240 might recall part of this:

L1: 3 cycles
L2: 14 cycles
RAM: 250 cycles
DISK: 41,000,000 cycles
NETWORK: 240,000,000 cycles

http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
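To make those cycle counts more concrete, the sketch below converts them to wall-clock time and to multiples of a RAM access. The 3 GHz clock rate is my assumption, chosen only to give the units some meaning; the cycle counts are the ones listed above.

```javascript
// Latencies from the list above, in CPU cycles.
var latencyCycles = { L1: 3, L2: 14, RAM: 250, DISK: 41000000, NETWORK: 240000000 };

// ASSUMPTION: a 3 GHz clock, picked only for illustration.
var ASSUMED_CLOCK_HZ = 3e9;

function cyclesToMilliseconds(cycles, clockHz) {
    return (cycles * 1000) / clockHz;
}

function timesSlowerThanRam(name) {
    return latencyCycles[name] / latencyCycles.RAM;
}

Object.keys(latencyCycles).forEach(function (name) {
    var ms = cyclesToMilliseconds(latencyCycles[name], ASSUMED_CLOCK_HZ);
    console.log(name + ': ' + ms + ' ms, ' + timesSlowerThanRam(name) + ' times a RAM access');
});
```

Under that assumption, a disk access costs roughly 14 ms and a network round trip about 80 ms, while a RAM access is well under a microsecond.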

If your program does some I/O, what does it do while waiting for the I/O to complete?

Currently (meaning, if we program the way we learned in CS 111), it blocks. That is, the program just waits, taking up memory and doing nothing. When the I/O completes, the program is awakened and keeps running.

This is called synchronous I/O:

Synchronous I/O¶

In Python, we can write a function to read the contents of a file as a string:

def file_get_contents(filename):
    with open(filename, 'r') as fin:
        return fin.read()

Suppose we use that function in some code like this:

print('hello')
file_get_contents('/path/to/some/file')
print('world')

The printing of "world" has to wait until we have read the file. That's what we mean by synchronous or blocking I/O.
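Node.js can do blocking I/O too, so here is the same pattern in JavaScript as a minimal sketch. fs.readFileSync is the synchronous counterpart of fs.readFile; the throwaway file name and its contents are made up so that the example is self-contained.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Create a small throwaway file (hypothetical name and contents).
var demoFile = path.join(os.tmpdir(), 'sync-demo-' + process.pid + '.txt');
fs.writeFileSync(demoFile, 'file contents');

var steps = [];

steps.push('hello');
// readFileSync blocks: nothing else in this program runs until it returns.
var contents = fs.readFileSync(demoFile, 'utf8');
steps.push('read: ' + contents);
steps.push('world');

fs.unlinkSync(demoFile);   // clean up
console.log(steps.join(' | '));
// prints: hello | read: file contents | world
```

The steps always come out in program order, because each statement waits for the previous one to finish.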

Threads¶

In the preceding example, what, exactly, is doing the waiting? A thread. Whatever thread is executing that code.

Each Thread takes memory
Threads introduce additional complexity
- Race conditions
- Deadlock
- Additional setup overhead
- ...

Asynchronous I/O¶

Instead, we can switch to using asynchronous or non-blocking I/O. The following is JavaScript code, running in Node.js, but otherwise completely analogous to the Python code above:

var fs = require('fs');

console.log('Hello');
fs.readFile('/path/to/file', function(err, data) {
    // do something ...
});
console.log(' World');

The fs.readFile function does not block, so the ' World' string is printed immediately: as soon as the I/O operation starts, rather than when it completes. (Remember, I/O takes forever.)

The second argument to the fs.readFile function is a callback. It gets invoked once the I/O completes. It gets two arguments:

an error object, which will be null if there's no error
the contents of the file: a Buffer of raw bytes by default, or a string if you pass an encoding such as 'utf8'

The convention in Node.js is that the first argument to the callback is always an error object.
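Here is a small sketch of the error-first convention in use, covering both the success case and the failure case. The helper describeFile and the file names are hypothetical; fs.readFile with a 'utf8' encoding argument is the standard Node.js call and delivers a string.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Read a file and report the outcome through an error-first callback.
function describeFile(filename, callback) {
    fs.readFile(filename, 'utf8', function (err, data) {
        if (err) {
            // Pass the error along; data is undefined in this case.
            return callback(err);
        }
        callback(null, filename + ' has ' + data.length + ' characters');
    });
}

// A throwaway file that exists, and a name that does not.
var goodFile = path.join(os.tmpdir(), 'async-demo-' + process.pid + '.txt');
var missingFile = goodFile + '.does-not-exist';
fs.writeFileSync(goodFile, 'Hello World');
process.on('exit', function () {
    try { fs.unlinkSync(goodFile); } catch (e) { /* already gone */ }
});

function report(err, summary) {
    console.log(err ? 'failed: ' + err.code : summary);
}

describeFile(goodFile, report);      // ... has 11 characters
describeFile(missingFile, report);   // failed: ENOENT
```

Checking err first, before touching the data, is the habit to build: when the read fails there is no data to work with.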

Callbacks instead of Threads¶

In synchronous or blocking I/O, the thread is what sits around and waits, waiting to be awoken by the lower levels of the software once the I/O completes.

In the asynchronous, non-blocking I/O, nothing needs to sit around and wait, and so we don't need threads.

Instead, these callback functions just sit in a data structure somewhere, and whenever the event that a callback is waiting for occurs, that callback gets executed.

Let's compare blocking and non-blocking I/O. The subsequent computation after the I/O completes is either:

the thread in the blocking I/O example, or
the callback in the non-blocking I/O example

Thus, the thread and the callback are similar in that they both represent the rest of the computation. However, they differ in that a thread needs memory (a stack) and a callback doesn't.¹

Event Loop¶

What we've described is a completely different architecture called an event loop:

Efficient (if used asynchronously)
Only one stack
No memory overhead
Simpler model (no deadlocks, no race conditions ...)

Of course, we have to learn how to program with callbacks...
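The single stack has a visible consequence: a callback never interrupts running code. It waits in a queue until the current code finishes, and queued callbacks then run one at a time, in order. The sketch below uses zero-delay timers as stand-ins for completed I/O events, which is an assumption made only to keep the example self-contained.

```javascript
var order = [];

// Hand two callbacks to the event loop. A delay of 0 means "as soon as
// possible", which is still *after* the currently running code finishes.
setTimeout(function () { order.push('callback A'); }, 0);
setTimeout(function () { order.push('callback B'); }, 0);

order.push('main start');
var total = 0;
for (var i = 0; i < 1000000; i++) { total += i; }   // some ordinary work
order.push('main end');

// Queued last, so it runs after A and B.
setTimeout(function () {
    console.log(order.join(' -> '));
    // prints: main start -> main end -> callback A -> callback B
}, 0);
```

Because only one thing runs at a time, none of these callbacks needs a lock to push onto the shared array.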

Why do this? Concurrency

The C10K Problem¶

C10K refers to the problem of optimizing a web server to handle a large number of clients at the same time.
C = CONCURRENT
Apache uses one thread per request, as does Flask
NGINX doesn't use multiple threads but instead uses an event loop
NGINX and Node.js are similar in that both use an event loop to achieve high concurrency with low latency

So, by comparing Apache (threaded) with Nginx (event loop) we can see the advantages of the event loop architecture.

Requests Per Second¶

Here are a couple of references from Amsler's talk:

The following picture is from the second reference. It shows a small static file being requested at high rate using Apache and Nginx.

Nginx and Apache with small static file

We can see that Nginx is able to serve 10,000 requests per second even as the number of concurrent requests (connections) increases. Apache declines to under 3000.

But here comes the best bit: because Nginx is event-based it doesn’t need to spawn new processes or threads for each request, so its memory usage is very low. Throughout my benchmark it just sat at 2.5MB of memory while Apache was using a lot more:

Nginx and Apache memory use

So, even though we learned that threads are good because they don't require much memory, an event loop is even more frugal.

But you have to learn to program using callbacks in JavaScript.

Why JavaScript?¶

Any programming language could use an event loop, so why JavaScript?

JS developers already think asynchronously (Browsers + AJAX)
JS is fast and getting faster
JS is quickly becoming a compilation target. See languages that compile to JavaScript
Code sharing between the client and server
- Maybe common libs ...

Google's V8 JavaScript Engine¶

The basis for node is Google's V8 JavaScript engine

V8 is a JavaScript engine specifically designed for fast execution of large JavaScript applications
Used in Google's Chrome browser
Written in C++
Fast property access
Dynamic machine code generation
Efficient garbage collection
v8
code optimizations

The Node.js Darkside¶

Bad idea to do raw computation in an event loop. Use node-webworker.
Debugging is hard but will significantly improve in future versions
Callbacks make for weird coding:

doA( argA, function(err, x) {
    doB(argB, function(err, y) {
            doC(argC, function(err, z) {
            // etc.
            });
    });
});

However, the callback situation has improved with the addition of promises to the JavaScript language. Promises still use callbacks, but the syntactic issue of rightward creep and the semantic issues of chaining and waiting are all handled in a better way.

I won't cover Promises in this reading.

This section is the end of my adaptation from Amsler's talk.

Compute-bound versus I/O bound¶

In the last section, I quoted Amsler saying that it was a bad idea in Node.js to do raw computation. Another way to say that is that Node.js is poor for compute-bound code but good for I/O bound, which is based on some important concepts that we need to understand.

In Algorithms (CS 231), we talk a lot about the time-complexity of algorithms and we try to make them faster. But that typically assumes that the limitation is the amount of computation that needs to be done, as opposed to the amount of I/O (input and output).

Let's think about speeding up programs in a much broader, higher-level way. We can generally put them into two categories:

compute-bound, where the program does a lot of computation and so speeding up the algorithm or the processor speed will have a big effect.
I/O-bound, where the program does a lot of I/O and the way to speed it up is to speed up the I/O, say by moving data to RAM or onto faster disks.

Example of Compute-Bound versus I/O Bound¶

If you're puzzled by the distinction between compute-bound and I/O bound, perhaps a realistic example will help.

Suppose we have a program that reads 100x100 matrices from disk and computes their inverses and writes them out. How long does the program take to run?

Matrix inversion is O(n³) for Gauss-Jordan elimination. This can be reduced to about O(n^2.807) with Strassen's algorithm. Etc.
Suppose each CPU operation takes 5 nanoseconds (1ns = 1 billionth of a second)
Each disk read/write from the hard drive takes 30ms (1ms = 1 thousandth of a second)
Suppose each matrix requires one disk read/write.

(FYI, some sources say a blink of an eye is 300-400 ms; others 100-400ms. Clearly, greater than 100ms, which is forever in computer time.) So, don't think of the disk as slow; think of the processor as mind-bogglingly fast.

So, let's do the math:

N³ where N=100 is 1,000,000.
So, 5,000,000 ns compute time, or
5,000 μseconds (micro seconds, millionths of a second), or 5 milliseconds
I/O, remember, was 30ms.
So, I/O time is six times longer than CPU time.
Program spends 6/7 of the time waiting for I/O, and only 1/7 of the time computing.
Converting to Strassen from Gauss-Jordan will speed up the 1/7th, but won't touch the 6/7.
This program is I/O bound, meaning that its time is dominated by I/O time rather than processor time.
Example: suppose the program takes 70 seconds to process a batch of data: 60 seconds for I/O and 10 seconds for computation.
- upgrade the algorithm to one that is 60% faster, the program now takes 64 seconds.
- instead, upgrade the processor to one that is 5 times faster, the program now takes 62 seconds.
- instead, upgrade the hardware to a RAID array that is twice as fast, and the program now takes 40 seconds (30 for I/O and 10 for computation).

Figure: bar charts showing improvements. Improving the algorithm or buying a faster CPU shrinks the green computation part of the program, but doesn't change the blue I/O part. Buying faster disks shrinks the blue I/O part, but doesn't change the green computation part. Still, doubling the disk speed has more effect than quintupling the CPU speed.
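The arithmetic above is easy to check in code. This sketch just recomputes the numbers from the text; the one interpretation I've assumed is that "60% faster" means the computation takes 60% less time, which is what gives the 64 seconds quoted above.

```javascript
// Matrix example: N^3 operations at 5 ns each versus one 30 ms disk access.
var N = 100;
var computeMs = (N * N * N * 5) / 1e6;     // nanoseconds -> milliseconds
var ioMs = 30;
var ioShare = ioMs / (ioMs + computeMs);   // fraction of the time spent on I/O

// Batch example: 60 s of I/O plus 10 s of computation.
function totalSeconds(ioSeconds, computeSeconds) {
    return ioSeconds + computeSeconds;
}
var baseline = totalSeconds(60, 10);
var betterAlgorithm = totalSeconds(60, 10 * 40 / 100);   // 60% less compute time
var fasterCpu = totalSeconds(60, 10 / 5);                // CPU 5 times faster
var fasterDisk = totalSeconds(60 / 2, 10);               // disks twice as fast

console.log(computeMs, ioShare.toFixed(3));
// prints: 5 0.857   (that is, 6/7 of the time is I/O)
console.log(baseline, betterAlgorithm, fasterCpu, fasterDisk);
// prints: 70 64 62 40
```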

However this discussion is just to introduce the notion of I/O-bound programs. The advantage of non-blocking I/O is not speed but avoiding the memory consumption of threads.

I/O bound and Event Loops¶

A lot of computer activities are I/O bound, which means the CPU spends most of its time waiting. Equivalently, most processes/threads spend most of their time waiting. In particular, our Flask apps mostly are waiting for I/O from the database or from the disk. They don't actually do a lot of computation.

The reason that the concept is important is that Node.js and other event loop architectures work well when I/O dominates. If CPU dominates, then we need to start web-workers (essentially, threads) to do that extra computation, which is what Amsler was referring to.
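To see why computation is a problem for an event loop, consider this sketch. The helper computeFor is a hypothetical stand-in for any long calculation: it simply spins for a given number of milliseconds. While it runs, the one and only thread cannot run any callbacks, so a timer that asked for 10 ms has to wait.

```javascript
// Busy-wait for the given number of milliseconds: pure computation, no I/O.
function computeFor(ms) {
    var end = Date.now() + ms;
    while (Date.now() < end) { /* spin */ }
}

var started = Date.now();
var timerDelay = null;

// Ask for a callback in 10 ms ...
setTimeout(function () {
    timerDelay = Date.now() - started;
    console.log('timer asked for 10 ms, ran after ' + timerDelay + ' ms');
}, 10);

// ... but then hog the thread for 200 ms.
computeFor(200);
```

In a web server, the delayed callback would be some other user's request, which is why compute-bound work is best moved off the event loop.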

Node Examples¶

In the rest of this reading, I'll show four examples of node code, in increasing complexity.

A simple demo that just says 'hello' and counts the number of accesses
A demo that computes the next number in the Collatz sequence and handles a query string
A demo that uses a routing system called express (like Flask)
A demo that serves a form and logs the submissions, using callbacks for file I/O

With Node.js, as with Flask, you're responsible for the entire server. You open up a port to do so. Use your UID as the port number, as with Flask. Many Node tutorials you see online use port 3000 (Flask's default is 5000).

Node Shell¶

Like Python, we can just run node to get an interaction loop, where we can try code. Again, this is just like Python, but it's JavaScript, running on the server rather than in a browser. Run it with the node command, and exit by typing control-d.

$ node
> 3+4
7
> function foo(x,y) { return x+y; }
undefined
> foo(3,4)
7
> ^d
$

Simplest Example¶

Our first example is a trivial web server. Here's the code:

var myPort = 1942;              // modify this

var http = require("http");

var numGreetings = 0;           // incremented before each reply, so the first reply says 1

function responder(request, response) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    console.log(request.url);
    numGreetings++;
    response.write("hi "+numGreetings);
    response.end();
}

http.createServer(responder).listen(myPort);
console.log("Listening on http://0.0.0.0:"+myPort);

Note that our responder function takes two arguments, a request object and a response object. Both of these are roughly equivalent to the ones in Flask.

After defining the responder, we give it to createServer which sets it up as a callback to be invoked whenever a request comes in.

Collatz Example¶

Our second example is less trivial; it computes the Collatz sequence, using a GET request with the start number in the query string. It parses the URL to find the start key, get the desired input number, and computes the result. It then writes out a result page that includes a hyperlink to get the next in the sequence.

Note that we use the Collatz sequence not because it's interesting, but because the output depends on the input while the actual computation is only one line of code, so we don't clutter the example with the computation.

Here's the code:

/* This code returns the next number in the Collatz conjecture sequence.

*/

var myPort = 1942;

var http = require('http');
var url = require('url');
var querystring = require('querystring');

function collatzNext(n) {
    return ( n % 2 === 0 )? n / 2 : 3 * n + 1;
}

function responder(request, response) {
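    if(request.url === '/favicon.ico') {
        // no icon to serve; answer anyway so the browser isn't left waiting
        response.writeHead(404);
        response.end();
        return;
    }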
    console.log('\nurl is ',request.url);

    try {
        var urlObj = url.parse(request.url);
        var query = querystring.parse(urlObj.query);
        console.log("query object is ",JSON.stringify(query));

        // get the value we need; query values are strings, so convert to a number
        var num = parseInt(query.start, 10);

        // a simple computation
        var next = collatzNext(num);
        // var resp = 'num is '+num+' and next is '+next;
        var resp = 'num is '+num+' and next is <a href="?start='+next+'">'+next+'</a>';
        console.log(resp);

        response.writeHead(200, {'Content-Type': 'text/html'});
        response.write(resp);
        response.end();
    } catch ( e ) {
        response.writeHead(500, {'Content-Type': 'text/plain'});
        response.end('Oops. An error occurred: '+e+'\n');
    }
}

http.createServer(responder).listen(myPort);
console.log("Listening on http://0.0.0.0:"+myPort);
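The server only ever computes one step, and the user walks the sequence by clicking the link. The sketch below follows that same chain without a browser. It is my own illustration, not part of the server: startFromUrl and collatzChain are hypothetical helpers, and I've used the built-in URL class rather than url.parse to pull out the query parameter.

```javascript
function collatzNext(n) {
    return (n % 2 === 0) ? n / 2 : 3 * n + 1;
}

// Pull the start value out of a request URL such as '/?start=6'.
// Query-string values are always strings, so convert explicitly.
function startFromUrl(requestUrl) {
    // The base is a placeholder; only the path and query matter here.
    var parsed = new URL(requestUrl, 'http://localhost');
    return parseInt(parsed.searchParams.get('start'), 10);
}

// Follow the chain the way a user clicking the hyperlink repeatedly would,
// stopping at 1 (which every start value tried so far eventually reaches).
function collatzChain(start) {
    if (!Number.isInteger(start) || start < 1) {
        throw new RangeError('start must be a positive integer');
    }
    var chain = [start];
    while (chain[chain.length - 1] !== 1) {
        chain.push(collatzNext(chain[chain.length - 1]));
    }
    return chain;
}

console.log(collatzChain(startFromUrl('/?start=6')).join(' '));
// prints: 6 3 10 5 16 8 4 2 1
```

The validation step matters in a real handler too: a missing or non-numeric start parameter parses to NaN, and it is better to reject it than to compute with it.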

Modules and NPM¶

Out of the box, Node doesn't do much. Even in the examples above, we had to load several modules (these three ship with Node, so nothing had to be installed):

var http = require('http'); 
var url = require('url'); 
var querystring = require('querystring');

By convention, the module is loaded and held in a global variable that matches the name of the module, so the syntax for using something in the module is very much like in Python:

module_name.function_name(arg1, arg2);
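As a concrete instance of that pattern, here is a short sketch using two modules that ship with Node. The values being encoded are made up for illustration.

```javascript
var querystring = require('querystring');
var path = require('path');

// module_name.function_name(arguments), just as in Python
var encoded = querystring.stringify({ start: 27, mode: 'next' });
console.log(encoded);
// prints: start=27&mode=next

var decoded = querystring.parse(encoded);
console.log(decoded.start, decoded.mode);
// prints: 27 next   (both values are strings)

console.log(path.extname('questform.html'));
// prints: .html
```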

Additional modules can be installed using the Node Package Manager (NPM), which is very much like pip.

Our next example requires a bunch of libraries. Just as we used pip to install Python libraries, including Flask, PyMySQL and bcrypt, we'll use NPM to install JavaScript libraries. I won't walk through installing the packages we'll use; contact me if you'd like to do that and you need help.

The Express Module¶

A popular module for routing and many other things is called Express. Here is some information on routing. To install it, see installing. I won't go over that.

Express Example¶

Here's the code for the express example. As you can see, the single responder callback has been broken up into a collection of handlers for particular routes that are matched against the URL, just like in Flask.

One important difference is that in Flask we can handle both GET and POST in a single handler (distinguishing them by looking at request.method), but in Express each method gets its own handler: we say app.get() for GET, and we would say app.post() to support POST.

var myPort = 1942;

var express = require('express');

var app = express();

// respond with "hello world" when a GET request is made to the homepage
app.get('/', function (req, res) {
  res.send('hello world');
});

app.get('/hello/', function (req, res) {
    res.send('Hello to you, too!');
});

app.get('/bye/', function (req, res) {
    res.send('Come back soon!');
});

function collatzNext(n) {
    return ( n % 2 === 0 )? n / 2 : 3 * n + 1;
}

app.get('/collatz/:start', (req, res) => {
        // a simple computation
        var num = req.params.start;
        var next = collatzNext(num);
        // var resp = 'num is '+num+' and next is '+next;
        var resp = 'num is '+num+' and next is <a href="/collatz/'+next+'">'+next+'</a>';
        console.log(resp);
        res.send(resp);
});

app.listen(myPort, () => console.log(`Example app listening on port http://0.0.0.0:${myPort}!`));

Notice the parameterized URL, this time using colons instead of angle brackets: /collatz/:start rather than /collatz/<start>.

So, given your knowledge of Flask, the learning curve for Node and Express is much less steep.

QuestForm Example¶

Our last example has two routes, both matching /

GET reads an HTML file from disk. That HTML file contains a form to fill out. The form posts to the / route.
POST gets the form data that was submitted, and writes (appends) it to a log file of the form submissions.

Both reading the HTML file from disk and appending to the log demonstrate the callback-style coding.

Here's the code:

var myPort = 1942;

var express = require('express');
var fs = require('fs');
var bodyParser = require('body-parser');

var app = express();

// set up static file serving, served out of the 'static' folder:
// we'll use this for CSS and our logo

app.use(express.static('static'))

// for parsing application/x-www-form-urlencoded (ordinary HTML form posts)
app.use(bodyParser.urlencoded({ extended: true })); 

app.get('/', function (req, res) {
    // A templating system would probably be better, and it is 
    // inefficient to read the file every time, but what the heck
    fs.readFile('questform.html', function (err, data) {
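        if(err) {
            // report the problem to the browser too, so the request doesn't hang
            console.error(err);
            res.writeHead(500, {'Content-Type': 'text/plain'});
            return res.end('Could not read the form');
        }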
        res.writeHead(200, {'Content-Type': 'text/html'});
        res.write(data);
        res.end();
    });
});

// The questform will submit to the empty string,
// which is the same route, but using POST

app.post('/', function (req, res) {
    console.log('body')
    // bodyParser creates this, as a JS object with the form data
    console.log(req.body);
    fs.appendFile('questform.log',
                  // JSON is easy to parse
                  JSON.stringify(req.body)+'\n',
                  function (err) {
                      if(err) {
                          console.log('error writing to questform.log: '+err);
                      }
                  });
    // happens concurrently with writing the log
    res.writeHead(200, {'Content-Type': 'text/html'});
    res.write('Thanks for your submission');
    res.end();
});


const msg = `Example app listening on port http://0.0.0.0:${myPort}!`;
app.listen(myPort, () => console.log(msg));
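Since each submission is appended as one line of JSON, reading the log back is straightforward. The following is a sketch of one way to do it, not part of the QuestForm app: parseLog and readSubmissions are hypothetical helpers, and the form fields in the demo data are made up.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Parse log text in which every non-empty line is one JSON object.
function parseLog(text) {
    return text.split('\n')
        .filter(function (line) { return line.trim() !== ''; })
        .map(function (line) { return JSON.parse(line); });
}

// Read the log and hand the parsed entries to an error-first callback.
function readSubmissions(logPath, callback) {
    fs.readFile(logPath, 'utf8', function (err, text) {
        if (err) { return callback(err); }
        var entries;
        try {
            entries = parseLog(text);
        } catch (e) {
            return callback(e);   // a malformed line in the log
        }
        callback(null, entries);
    });
}

// Demo with a throwaway log file; the field names are invented.
var demoLog = path.join(os.tmpdir(), 'questform-demo-' + process.pid + '.log');
fs.writeFileSync(demoLog,
    JSON.stringify({ name: 'Arthur', quest: 'grail' }) + '\n' +
    JSON.stringify({ name: 'Lancelot', quest: 'grail' }) + '\n');

readSubmissions(demoLog, function (err, entries) {
    fs.unlink(demoLog, function () {});   // clean up; ignore unlink errors
    if (err) { return console.error(err); }
    console.log(entries.length + ' submissions, the first from ' + entries[0].name);
    // prints: 2 submissions, the first from Arthur
});
```

The parsing is kept in its own synchronous function so that it can be tried in the node shell without touching the disk.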

Demonstrations¶

A video demonstrating all these examples is on the videos page.

Node.js Resources¶

If you want to learn more, the following is a good start: Node Beginner, which I've cribbed some from.

The Mozilla Developers Network has some excellent tutorials. I suggest starting with Node and Express Introduction

Summary¶

Node.js is an important player in the web application world for good reason: performance.
It requires learning to code using callbacks.
It does certain things well:
- I/O-bound HTTP requests, where the request is mostly, say, database I/O or file I/O.
It doesn't do certain things well:
- Compute-bound HTTP requests (though it can with additional web-worker threads).

¹ If you read carefully about event loops, you'll see I elided some details. First, the callbacks are executed in order, so if there are several whose I/O has completed or whose event has occurred, they get put on a queue of things to execute. Since they aren't executed simultaneously, that avoids all the deadlocks and race conditions we would have to worry about with threads. Second, the event-loop that executes these callbacks is a normal computer program, so it has a thread, but there's only one thread, so again we can avoid the deadlocks and race conditions.