Unix Symlinks, Copying and Environment Variables

Earlier in the course, we learned some basic Unix skills, including:

  • files
  • folders
  • recursive copy with cp -r

In this reading, we'll level up our Unix skills and learn about:

  • symbolic links (symlinks)
  • recursive copy with symlinks
  • environment variables

These skills will make our use of Node.js and MongoDB a little easier and more sophisticated.

Filesystems

We learned that Unix filesystems are a tree of files and folders. (Pretty much every computer filesystem is, including Windows and Mac OS X.) Actually, to be slightly more accurate, each filesystem is a tree of files and folders and a computer can have access to more than one such filesystem, making a (very small) forest. For example, here are just some of the filesystems on Tempest:

Filesystem                                Size  Used Avail Use% Mounted on
/dev/mapper/cl-root                       105G   78G   27G  75% /
/dev/sda1                                 1.1G  289M  775M  28% /boot
academicstore:/volume16/tempest-home      5.4T  3.9T  1.5T  73% /home
academicstore:/volume17/tempest-students  608G  447G  162G  74% /students
academicstore:/volume11/credlab2022       3.7T  2.7T  977G  73% /credlab2022

The details aren't important; just the idea that a computer has several filesystems, each of which is a tree.

Now, the way that Unix works, the forest of filesystems is treated as a single tree by "mounting" different filesystems into the tree starting at /. So, the /home directory in the / filesystem is associated with another filesystem, this one called academicstore:/volume16/tempest-home.
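
If you want to see which filesystem a particular directory lives on, you can hand a pathname to the df command. This is only a small sketch, assuming a shell with the standard df command; the variable name home_fs is made up for illustration, and the output will differ from machine to machine:

```shell
# Show every mounted filesystem, with human-readable sizes.
df -h

# Show only the filesystem that holds your home directory.
# The last column ("Mounted on") is where that filesystem joins the tree.
df -h "$HOME"

# Save the report in a variable so it can be inspected or searched later.
home_fs=$(df -h "$HOME")
echo "$home_fs"
```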

We don't need to learn more about mounting, but we will have reason to look back at this notion in a short while.

The things in a filesystem are files and folders. Well, almost everything is a file or a folder. There's another kind of thing called a "symbolic link". That thing is a tiny bit of text (stored in a folder) that is a pathname of something else in the filesystem. It's a kind of pointer or a shortcut.

For short, a symbolic link is often called a symlink. I'll use that shorthand.

Here's an example. I keep a folder for each course I teach in a directory tree in my personal account. It might look like:

/home
    anderson
       teaching
           21fall
               cs204
               cs304
           22spring
               cs230
               cs304
           22fall
               cs204
               cs304
           23spring
               cs304
               cs307

Suppose I found it tedious to cd to the current cs304 directory. I could create a symlink in my home directory that says:

304 -> teaching/23spring/cs304

Then, I could login and just do cd 304 instead of cd teaching/23spring/cs304.
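
Here is a sketch of that shortcut that you can try without touching your real files. It assumes a throwaway directory made with mktemp stands in for the home directory; the variable names sandbox and landed are just for illustration:

```shell
# Build a throwaway copy of the directory tree so nothing real is touched.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/teaching/23spring/cs304"

# Create the symlink named 304 in the (pretend) home directory.
cd "$sandbox"
ln -s teaching/23spring/cs304 304

# cd through the symlink; pwd -P prints the physical directory we landed in,
# while a plain pwd would show the path as we typed it (ending in /304).
cd 304
landed=$(pwd -P)
echo "$landed"
cd "$sandbox"
```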

Or, suppose I wanted to put the schedule page of CS 304 into my public_html. I could create a symlink in my personal public_html folder that says:

304sched.html -> /home/cs304node/public_html/top/schedule.html

Then, if someone tries to visit the following URL:

https://cs.wellesley.edu/~anderson/304sched.html

They will get the cs304node schedule page. It's as if the 304 schedule is in two different places on the computer.

It's worth repeating this and expanding on it. The very deepest levels of the operating system understand symlinks, so that if a program tries to open a "file" and it turns out that the "file" is really a symlink to some other file, the operating system will silently and automatically follow the symlink and open the other file instead. The two names become synonyms.
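
To convince yourself that the two names really are synonyms, here is a minimal sketch in a scratch directory. The file contents are invented, and the filenames simply echo the schedule example above:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# A real file, and a symlink with a different name pointing at it.
echo "Week 1: Unix" > schedule.html
ln -s schedule.html 304sched.html

# cat knows nothing about symlinks; the OS follows the link when opening.
via_link=$(cat 304sched.html)
echo "$via_link"

# Writing through the symlink changes the real file too.
echo "Week 2: Node.js" >> 304sched.html
last_line=$(tail -n 1 schedule.html)
echo "$last_line"
```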

Fun fact: I put a symlink from index.html to the real filename, such as unix2.html, in pretty much every lecture and reading folder. That allows me to edit a file called unix2.html while using a shorthand URL like /readings/unix2/ instead of /readings/unix2/unix2.html in the browser.

A symlink can be to a file or a folder; it works the same way.

In fact, a symlink can even link to another symlink, for two or more levels of indirection.
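
A two-level chain might look like the following sketch. The name home.html is hypothetical, added only to have a second link; readlink prints the text stored in a single symlink without following it any further:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

echo "real contents" > unix2.html
ln -s unix2.html index.html      # first level: index.html -> unix2.html
ln -s index.html home.html       # second level: home.html -> index.html

# readlink shows one hop at a time.
first_hop=$(readlink home.html)
second_hop=$(readlink index.html)
echo "$first_hop"
echo "$second_hop"

# Opening the outermost name still reaches the real file.
through_chain=$(cat home.html)
echo "$through_chain"
```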

There is a tiny extra delay as the OS follows a symlink, but we will ignore any "inefficiency" for symlinks.

Note that the symlink can even name a file or folder on a different filesystem. So, for example, a student could put a symlink to the CS304 schedule page in their public_html and that would work even though /home and /students are different filesystems.

So what? Well, we will take advantage of symlinks in our Node.js apps.

We'll talk shortly about how to create them and how to copy them. First, let's motivate them.

Node.js programs require lots of add-on modules in order to do anything interesting. I've tried to keep things lean, but for our demo apps and assignments, I've installed at least 10 modules, each of which loads additional modules as part of its implementation.

These are all installed into a folder called node_modules that is in the main folder of an application.

The result is that the node_modules folder can get big quickly. The one for the cs304node account on Tempest takes up 48MB and 6375 files.
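
If you're curious how numbers like these can be measured, here is one possible way, using du for the size and find for the file count. The sketch builds a tiny stand-in folder so that it runs anywhere; the module names module-a and module-b are made up, and on a real app you would point the same two commands at its node_modules folder:

```shell
# Stand-in for a real node_modules folder (hypothetical contents).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/node_modules/module-a/lib" "$sandbox/node_modules/module-b/lib"
echo "module.exports = {};" > "$sandbox/node_modules/module-a/lib/index.js"
echo "module.exports = {};" > "$sandbox/node_modules/module-b/lib/index.js"
echo "{}" > "$sandbox/node_modules/module-a/package.json"

# Total disk usage of the folder, human-readable.
du -sh "$sandbox/node_modules"

# Number of regular files anywhere under the folder.
# (tr removes the padding that some versions of wc add.)
file_count=$(find "$sandbox/node_modules" -type f | wc -l | tr -d ' ')
echo "$file_count"
```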

Now, to be fair, 48MB isn't that big by modern standards. But if I implement 10 different apps, each of which is 48MB, that's half a gigabyte of storage, 90% of which is perfectly, literally redundant. Multiply that by 30+ people in this course, and the redundant storage becomes a bit more costly.

But disk space isn't even the big problem. The big problem is copying an app to/from your laptop. (You might decide not to copy apps to/from your laptop and, indeed, most students don't. But you should be able to do so without tedious amounts of redundant copying.) If you have a Node.js app for an assignment that is maybe 50K of code and you did the work on your laptop, you'd have to copy your app folder to the server to turn it in. If the 50MB node_modules folder is in your myApp folder, as it must be, then copying your myApp folder to the server is now a 50MB task, which takes about 1000 times longer than copying the 50K of your code.

Symlinks can solve both these problems:

  • an app folder can have a symlink to a shared node_modules folder instead of duplicating the storage, and
  • copying a folder with a symlink doesn't have to involve copying the stuff that the symlink points to.

Now, to be clear, there needs to be a node_modules on both your laptop and the server, since symlinks don't work across networks, but you can install a single 50MB node_modules folder on your laptop once and make as many symlinks to it as you want.

Let's see how this might work in practice.

App Structure

You will build several apps in this course, for assignments (lookup, crud, ajax ...) and various versions of your project (draft, alpha, beta). Each app will be in its own folder, but they will be structured in a similar way:

apps/
    lookup/
        server.js
        public/
        views/
        node_modules/
    crud/
        server.js
        public/
        views/
        node_modules/
    draft/
        server.js
        public/
        views/
        node_modules/
    ...

So far, so good. But, as we've observed, the node_modules folder is both large and identical. So, I'm suggesting that we replace the node_modules folders above with symlinks to a single real node_modules folder that each app shares.

However, we also want to be able to copy folders like lookup to/from our laptops. To do that, we need the symlink to have the same value on both the laptop and the server. We can accomplish that by symlinking to a node_modules in the parent folder. Like this:

apps/
    node_modules
    lookup/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    crud/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    draft/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    ...

So, the apps/node_modules folder could be the real node_modules and each app folder (such as lookup) just symlinks to that one. We can copy an app folder (again, such as lookup) to/from our laptops without having to copy the real 50MB node_modules folder across the network.
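
One way to build that layout from scratch is sketched below, in a scratch directory so that nothing real is disturbed. The README file is a hypothetical stand-in for the real modules, and the empty server.js files are placeholders:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# The one real node_modules folder lives in apps/.
mkdir -p apps/node_modules
echo "placeholder" > apps/node_modules/README   # hypothetical stand-in file

# Each app folder gets a relative symlink to the shared folder.
for app in lookup crud draft; do
    mkdir -p "apps/$app/public" "apps/$app/views"
    touch "apps/$app/server.js"
    ln -s ../node_modules "apps/$app/node_modules"
done

# The link text is the same in every app, and it is relative,
# so it keeps working wherever the apps folder ends up.
link_text=$(readlink apps/lookup/node_modules)
echo "$link_text"

# Reading through any app's symlink reaches the shared folder's contents.
shared=$(cat apps/crud/node_modules/README)
echo "$shared"
```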

We can then introduce a second layer of indirection and replace the apps/node_modules in each student's account with a symlink to a shared node_modules folder in the course account. Like this:

apps/
    node_modules -> /home/cs304node/omnibus/node_modules
    lookup/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    crud/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    draft/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    ...

Apps on your Laptop

On your laptop, the situation is nearly identical. You can have an apps folder someplace and it can be like this:

apps/
    node_modules (the real folder, with 48MB of code)
    lookup/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    crud/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    draft/
        server.js
        public/
        views/
        node_modules -> ../node_modules
    ...

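Notice that each app folder is now identical on the laptop and on the server, so we can copy between them without needing to make any adjustments. The only difference is in the parent folder: apps/node_modules is the real folder on the laptop and a symlink to the course account on the server.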

A symlink is created with the ln command with the -s switch. (The -s says we want a symbolic link; the default is a "hard" link, which we will not be using or discussing, but feel free to talk to me in office hours if you're curious.)

The way I always create a symlink is to go to the directory I want the symlink to be in, and then use the command to specify an existing thing (the target of the symlink). So, to create the symlink in the apps/lookup folder, above, I would do:

cd apps/lookup
ln -s ../node_modules .

Note the dot at the end of the command! It says to create the symlink in the current directory, using the same name. Very convenient.
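
For comparison, here is a sketch of the dot form next to the form where you spell out the link's name yourself. Both are standard uses of ln -s; the scratch directory and the variable names form1 and form2 are only for illustration:

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/apps/node_modules" "$sandbox/apps/lookup" "$sandbox/apps/crud"

# Form 1: the target, then "." meaning "here, using the same name".
cd "$sandbox/apps/lookup"
ln -s ../node_modules .

# Form 2: the target, then an explicit name for the link.
cd "$sandbox/apps/crud"
ln -s ../node_modules node_modules

# Both forms store the same link text.
form1=$(readlink "$sandbox/apps/lookup/node_modules")
form2=$(readlink "$sandbox/apps/crud/node_modules")
echo "$form1"
echo "$form2"

# -e follows the link, so it succeeds only if the target really exists.
[ -e "$sandbox/apps/lookup/node_modules" ] && echo "target exists"
```

The explicit form is handy when you want the link to have a different name than its target, as with the 304 shortcut earlier.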

When we copy a folder with symlinks in it, there's a choice:

  • copy the stuff that the symlink points to (such as the 50MB of code), or
  • copy the text of the symlink (just the shortcut)

The text of the symlink is just a handful of bytes, as you've seen, so the latter is obviously millions of times faster. It's often, but not always, what we want.

Suppose we've already got the draft folder of our project set up, with its symlink, and it's working great. It's time to start working on the alpha version of the project, so we want to start by recursively copying the draft folder to a new folder called alpha.

Furthermore, we want to do the recursive copy in the second way, copying the symlink as a symlink, rather than the stuff it points to. The cp command has a switch for that choice: -d.

(If you're curious about the details of the cp command, you can read the cp manual page. The cp command has a zillion options, but if you search for "dereference" you'll find options to enable or disable dereferencing. To "dereference" means to follow the symlink and copy the stuff it points to, while not dereferencing means to copy the symlink as a symlink.)

Here's how to do the copy:

cd apps/
cp -rd draft alpha

Couldn't be easier.
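
To see the two choices side by side, here is a sketch that copies the same folder both ways. It assumes GNU cp, as found on Linux machines (the -d switch may not exist in other versions of cp); the folder name alpha_full and the file big.js are hypothetical, added only for the comparison:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p apps/node_modules apps/draft
echo "big module" > apps/node_modules/big.js     # hypothetical stand-in
touch apps/draft/server.js
ln -s ../node_modules apps/draft/node_modules

cd apps

# -d: do not dereference, so the symlink is copied as a symlink.
cp -rd draft alpha

# -L: dereference, so the stuff the symlink points to is copied instead.
cp -rL draft alpha_full

# alpha has a symlink; alpha_full has a real folder with copied files.
[ -L alpha/node_modules ] && echo "alpha/node_modules is a symlink"
[ -L alpha_full/node_modules ] || echo "alpha_full/node_modules is a real folder"
alpha_link=$(readlink alpha/node_modules)
echo "$alpha_link"
```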

If you want to see the entries in a folder with information about whether they are symlinks or not, you can get a "long" listing by adding the -l switch to the ls command. Here's what an ls -l might look like in the alpha folder:

$ ls -l
total 200
lrwxrwxrwx. 1 cs304node cs304node     16 Feb 25 16:42 connection.js -> ../connection.js
-rw-r-----. 1 cs304node cs304node   3311 Jan 17 19:34 cs304.js
lrwxrwxrwx. 1 cs304node cs304node     15 Jan 18 17:04 node_modules -> ../node_modules
-rw-r-----. 1 cs304node cs304node    459 Jan 17 19:49 package.json
drwxr-x---. 2 cs304node cs304node   4096 Jan 17 19:34 public
-rw-r-----. 1 cs304node cs304node  14480 Jan 19 16:34 server.js
drwxr-x---. 3 cs304node cs304node   4096 Jan 19 16:31 views

You see that the node_modules entry is a symlink to a folder in the parent folder, and connection.js is likewise a symlink to a file there. Everything else is either a normal file or a directory. The first character on each line tells you the kind of thing the entry is:

  • d is for directories, such as views
  • l is for symlinks, such as node_modules
  • - is for regular files, such as server.js
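
The same kinds of entry can also be checked from a script, using the shell's test operators. This is just a sketch in a scratch directory; the variable names kinds and first_char are made up, and the symlink's target deliberately doesn't exist, to show that a link can be created and listed anyway:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir views
touch server.js
ln -s ../node_modules node_modules   # the target need not exist

# Classify each entry. Test -L first, because -d and -f follow symlinks.
kinds=""
for entry in node_modules server.js views; do
    if [ -L "$entry" ]; then
        kind="symlink"
    elif [ -d "$entry" ]; then
        kind="directory"
    elif [ -f "$entry" ]; then
        kind="regular file"
    else
        kind="other"
    fi
    echo "$entry: $kind"
    kinds="$kinds$kind;"
done

# The first character of a long listing carries the same information.
first_char=$(ls -ld node_modules | cut -c1)
echo "$first_char"
```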

Environment Variables

As long as we are leveling up our Unix knowledge, this is a good time to talk about environment variables.

When a program runs, it often needs configuration information. For example, the "print" command needs to know the name and location of your printer. The "edit" command needs to know my favorite editor (Emacs, not vim). Python needs to know where to load Python modules from. And Node apps will need some configuration values as well.

One way that Unix has used for decades to store those configuration values is environment variables. You can think of these as variables that live outside your programs, but which a program can read if it wants them.

You can configure these environment variables when you login, or you can do it later, whenever you like.

You can find out the value of an environment variable with the echo command in the shell, preceding the variable's name with a dollar sign. Here are just a few; try them:

echo $HOSTNAME # the name of the computer you are logged into
echo $USER     # your username
echo $HOME     # your home directory
echo $PATH     # where to find commands
echo $EDITOR   # the editor to use
echo $NODEJS   # the version of node to use

You can find out all your environment variables with the printenv command:

printenv

You can set an environment variable in your shell like this:

export VAR=value

Try it! Set a variable FAV to your favorite color or song or something:

export FAV='blue'

And test it:

echo $FAV

Note that setting environment variables in the shell with the export command is not permanent. They exist only in that one shell (and in programs started from it) and go away when you log out. If you want them to be permanent, save the settings in a file that is read when you log in.
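
Here is a small sketch of what export buys you, and of the "save it in a file" idea. The variable PLAIN is made up for contrast, and a temporary file stands in for a real startup file (which file your shell reads at login depends on the shell; for bash, ~/.bash_profile is a common choice):

```shell
# A plain shell variable is visible only inside this shell.
PLAIN='red'
# An exported variable is also passed to programs this shell starts.
export FAV='blue'

# Start a child shell and ask it what it can see.
child_plain=$(sh -c 'echo "${PLAIN:-unset}"')
child_fav=$(sh -c 'echo "${FAV:-unset}"')
echo "$child_plain"   # unset
echo "$child_fav"     # blue

# To make a setting survive logout, put the export line in a startup file.
# A temporary file stands in for the real startup file here.
startup=$(mktemp)
echo "export FAV='green'" > "$startup"
. "$startup"          # "." reads commands from a file into this shell
echo "$FAV"           # green
```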

Environment Variables in CS 304

I've created MongoDB accounts for each of you with a random password. When we run the mongo shell (or mongosh), we need a fancy URI to connect, including the username and password. There's a file in your home directory called .cs304env.sh that sets three environment variables using the export command. You can look at it:

cat ~/.cs304env.sh

and you can run it:

source ~/.cs304env.sh

Running it will set the environment variables. You can test that it worked:

echo $MONGO_URI

You can then use the MONGO_URI variable when you run the mongo shell:

mongo "$MONGO_URI"

We'll do that in lab.
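
To give a feel for what such a file does, here is a sketch with a made-up stand-in. Everything in it is an assumption: the names MONGO_USER and MONGO_PASS, the placeholder values, and the shape of the URI are hypothetical, and your real ~/.cs304env.sh may look quite different. Try it in a scratch shell, since it overwrites MONGO_URI:

```shell
# Hypothetical stand-in for ~/.cs304env.sh; the real file may differ.
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
export MONGO_USER='someuser'
export MONGO_PASS='not-a-real-password'
export MONGO_URI="mongodb://${MONGO_USER}:${MONGO_PASS}@localhost/test"
EOF

# Load the file into the current shell.
. "$envfile"     # "." is the portable spelling of bash's "source"

# Check that the variable is set before trying to use it.
if [ -z "${MONGO_URI:-}" ]; then
    echo "MONGO_URI is not set; source the env file first" >&2
else
    echo "MONGO_URI is set"
    # Quoting passes the URI to the program as one unaltered word:
    #   mongo "$MONGO_URI"
fi
```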

Summary

  • symlinks allow us to replace a file or directory with a pointer to another location
  • symlinks allow us to reduce duplication of files and save space and time (in copying)
  • the cp command with the -rd switches will copy recursively, copying symlinks as symlinks
  • environment variables are a handy way to provide configuration information to programs
  • the source command reads shell commands from a file
  • your MONGO username, password and URI are stored in your ~/.cs304env.sh file