Unix Symlinks, Copying and Environment Variables¶
Earlier in the course, we learned some basic Unix skills, including:
- files
- folders
- recursive copy with cp -r
In this reading, we'll level up our Unix skills and learn about:
- symbolic links (symlinks)
- recursive copy with symlinks
- environment variables
These skills will make our use of Node.js and MongoDB a little easier and more sophisticated.
Filesystems¶
We learned that Unix filesystems are a tree of files and folders. (Pretty much every computer filesystem is, including Windows and Mac OS X.) Actually, to be slightly more accurate, each filesystem is a tree of files and folders and a computer can have access to more than one such filesystem, making a (very small) forest. For example, here are just some of the filesystems on Tempest:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/cl-root 105G 78G 27G 75% /
/dev/sda1 1.1G 289M 775M 28% /boot
academicstore:/volume16/tempest-home 5.4T 3.9T 1.5T 73% /home
academicstore:/volume17/tempest-students 608G 447G 162G 74% /students
academicstore:/volume11/credlab2022 3.7T 2.7T 977G 73% /credlab2022
The details aren't important; just the idea that a computer has several filesystems, each of which is a tree.
Now, the way that Unix works, the forest of filesystems is treated as a single tree by "mounting" different filesystems into the tree starting at /. So, the /home directory in the / filesystem is associated with another filesystem, this one called academicstore:/volume16/tempest-home.
We don't need to learn more about mounting, but we will have reason to look back at this notion in a short while.
Symbolic Links¶
The things in a filesystem are files and folders. Well, almost everything is a file or a folder. There's another kind of thing called a "symbolic link". That thing is a tiny bit of text (stored in a folder) that is a pathname of something else in the filesystem. It's a kind of pointer or a shortcut.
For short, a symbolic link is often called a symlink. I'll use that shorthand.
Here's an example. I keep a folder for each course I teach in a directory tree in my personal account. It might look like:
/home
  anderson
    teaching
      21fall
        cs204
        cs304
      22spring
        cs230
        cs304
      22fall
        cs204
        cs304
      23spring
        cs304
        cs307
Suppose I found it tedious to cd to the current cs304 directory. I could create a symlink in my home directory that says:
304 -> teaching/23spring/cs304
Then, I could log in and just do cd 304 instead of cd teaching/23spring/cs304.
Or, suppose I wanted to put the schedule page of CS 304 into my public_html. I could create a symlink in my personal public_html folder that says:
304sched.html -> /home/cs304node/public_html/top/schedule.html
Then, if someone tries to visit the following URL:
https://cs.wellesley.edu/~anderson/304sched.html
they will get the cs304node schedule page. It's as if the 304 schedule is in two different places on the computer.
It's worth repeating this and expanding on it. The very deepest levels of the operating system understand symlinks, so that if a program tries to open a "file" and it turns out that the "file" is really a symlink to some other file, the operating system will silently and automatically follow the symlink and open the other file instead. The two names become synonyms.
Fun fact: I put a symlink from index.html to the real filename, like unix2.html, in pretty much every lecture and reading folder. That allows me to edit a file called unix2.html while using a shorthand URL like /readings/unix2/ instead of /readings/unix2/unix2.html in the browser.
A symlink can be to a file or a folder; it works the same way.
In fact, a symlink can even link to another symlink, for two or more levels of indirection.
There is a tiny extra delay as the OS follows a symlink, but we will ignore any "inefficiency" for symlinks.
Note that the symlink can even name a file or folder on a different filesystem. So, for example, a student could put a symlink to the CS304 schedule page in their public_html and that would work even though /home and /students are different filesystems.
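Here is a minimal sketch of the behavior described above, run in a throwaway directory. The filenames (schedule.html, sched-link.html, sched-link2.html) are made up for illustration; it assumes the readlink command is available, as it is on Linux and macOS.

```shell
# Work in a throwaway directory so nothing real is touched.
sandbox=$(mktemp -d)
cd "$sandbox"

echo "CS 304 schedule" > schedule.html      # a real file
ln -s schedule.html sched-link.html         # a symlink to the file
ln -s sched-link.html sched-link2.html      # a symlink to a symlink

# Opening either name silently follows the chain to the real file.
cat sched-link2.html

# readlink prints the text stored in the symlink itself (one level only).
readlink sched-link2.html
```

The cat command shows the contents of schedule.html, while readlink shows only the pathname stored in the symlink, which is all a symlink really is.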
So what? Well, we will take advantage of symlinks in our Node.js apps.
We'll talk shortly about how to create them and how to copy them. First, let's motivate them.
node_modules and Symlinks¶
Node.js programs require lots of add-on modules in order to do anything interesting. I've tried to keep things lean, but for our demo apps and assignments, I've installed at least 10 modules, each of which loads additional modules as part of its implementation.
These are all installed into a folder called node_modules that is in the main folder of an application.
The result is that the node_modules folder can get big quickly. The one for the cs304node account on Tempest takes up 48MB across 6375 files.
Now, to be fair, 48MB isn't that big by modern standards. But if I implement 10 different apps, each of which is 48MB, that's half a gigabyte of storage, 90% of which is perfectly, literally redundant. Multiply that by 30+ people in this course, and the redundant storage becomes a bit more costly.
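If you'd like to measure a node_modules folder yourself, something like the following sketch should work. It builds a tiny stand-in folder in a throwaway directory (the package names pkg-a and pkg-b are invented), so the numbers it prints will be nothing like the real 48MB.

```shell
# Build a tiny stand-in for node_modules in a throwaway directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/node_modules/pkg-a" "$sandbox/node_modules/pkg-b"
echo "module.exports = {};" > "$sandbox/node_modules/pkg-a/index.js"
echo "module.exports = {};" > "$sandbox/node_modules/pkg-b/index.js"

# Total disk usage of the folder, in human-readable units.
du -sh "$sandbox/node_modules"

# Number of regular files anywhere under the folder.
file_count=$(find "$sandbox/node_modules" -type f | wc -l)
echo "files: $file_count"
```

On a real app, you would run the du and find commands against the app's own node_modules folder instead.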
But disk space isn't even the big problem. The big problem is copying an app to/from your laptop. (You might decide not to copy apps to/from your laptop and, indeed, most students don't. But you should be able to do so without tedious amounts of redundant copying.) If you have a Node.js app for an assignment that is maybe 50K of code and you did the work on your laptop, you'd have to copy your app folder to the server to turn it in. If the 50MB node_modules folder is in your myApp folder, as it must be, then copying your myApp folder to the server is now a 50MB task, which is literally 1000 times larger than the 50K of your code.
Symlinks can solve both these problems:
- an app folder can have a symlink to a shared node_modules folder instead of duplicating the storage, and
- copying a folder with a symlink doesn't have to involve copying the stuff that the symlink points to.
Now, to be clear, there needs to be a node_modules on both your laptop and the server, since symlinks don't work across networks, but you can install a single 50MB node_modules folder on your laptop once and make as many symlinks to it as you want.
Let's see how this might work in practice.
App Structure¶
You will build several apps in this course, for assignments (lookup, crud, ajax ...) and various versions of your project (draft, alpha, beta). Each app will be in its own folder, but they will be structured in a similar way:
apps/
  lookup/
    server.js
    public/
    views/
    node_modules/
  crud/
    server.js
    public/
    views/
    node_modules/
  draft/
    server.js
    public/
    views/
    node_modules/
  ...
So far, so good. But, as we've observed, the node_modules folders are both large and identical. So, I'm suggesting that we replace the node_modules folders above with symlinks to a single real node_modules folder that each app shares.
However, we also want to be able to copy folders like lookup to/from our laptops. To do that, we need the symlink to have the same value on both the laptop and the server. We can accomplish that by symlinking to a node_modules in the parent folder. Like this:
apps/
  node_modules
  lookup/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  crud/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  draft/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  ...
So, the apps/node_modules folder could be the real node_modules and each app folder (such as lookup) just symlinks to that one. We can copy an app folder (again, such as lookup) to/from our laptops without having to copy the real 50MB node_modules folder across the network.
We can then introduce a second layer of indirection and replace the apps/node_modules in each student's account with a symlink to a shared node_modules folder in the course account. Like this:
apps/
  node_modules -> /home/cs304node/omnibus/node_modules
  lookup/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  crud/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  draft/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  ...
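To see the two layers of indirection at work, here is a small sketch that rebuilds the layout inside a throwaway directory. The course/omnibus/node_modules folder in the sandbox is only a stand-in for the real shared folder, and the student folder is likewise invented. It uses pwd -P, which prints the physical directory after every symlink has been followed.

```shell
sandbox=$(mktemp -d)

# Stand-in for the shared course folder (a hypothetical path in the sandbox).
mkdir -p "$sandbox/course/omnibus/node_modules"
mkdir -p "$sandbox/student/apps/lookup"

# First level: apps/node_modules points at the shared folder (absolute path).
ln -s "$sandbox/course/omnibus/node_modules" "$sandbox/student/apps/node_modules"

# Second level: lookup/node_modules points at the parent's entry (relative path).
ln -s ../node_modules "$sandbox/student/apps/lookup/node_modules"

# Follow both links and print where we really end up.
resolved=$(cd "$sandbox/student/apps/lookup/node_modules" && pwd -P)
echo "$resolved"
```

The printed path should end in course/omnibus/node_modules, even though we asked for lookup/node_modules.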
Apps on your Laptop¶
On your laptop, the situation is nearly identical. You can have an apps folder someplace and it can be like this:
apps/
  node_modules (the real folder, with 48MB of code)
  lookup/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  crud/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  draft/
    server.js
    public/
    views/
    node_modules -> ../node_modules
  ...
Notice that each app folder is now identical, and so we can copy between laptop and server without needing to make any adjustments. The difference is in the parent folder.
Creating a Symlink¶
A symlink is created with the ln command with the -s switch. (The -s says we want a symbolic link; the default is a "hard" link, which we will not be using or discussing, but feel free to talk to me in office hours if you're curious.)
The way I always create a symlink is to go to the directory I want the symlink to be in, and then use the command to specify an existing thing (the target of the symlink). So, to create the symlink in the apps/lookup folder, above, I would do:
cd apps/lookup
ln -s ../node_modules .
Note the dot at the end of the command! It says to create the symlink in the current directory, using the same name. Very convenient.
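If you'd like to try this without touching your real apps folder, here is a sketch that does the same thing in a throwaway directory. The marker.txt file is invented, just to show that both names lead to the same folder.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/apps/node_modules" "$sandbox/apps/lookup"
echo "shared" > "$sandbox/apps/node_modules/marker.txt"

cd "$sandbox/apps/lookup"
ln -s ../node_modules .     # creates ./node_modules -> ../node_modules

ls -l                       # the long listing shows the arrow
readlink node_modules       # the text stored in the symlink
cat node_modules/marker.txt # reading through the symlink reaches the real folder
```

Because the symlink's text is the relative path ../node_modules, it means the same thing wherever the apps folder lives.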
Copying a Folder with Symlinks¶
When we copy a folder with symlinks in it, there's a choice:
- copy the stuff that the symlink points to (such as the 50MB of code), or
- copy the text of the symlink (just the shortcut)
The text of the symlink is just a handful of bytes, as you've seen, so the latter is obviously millions of times faster. It's often, but not always, what we want.
Suppose I've already got the draft folder of my project set up, with its symlink, and it's working great. It's time to start working on the alpha version of the project, so we want to start by recursively copying the draft folder to a new folder called alpha.
Furthermore, we want to do the recursive copy in the second way, copying the symlink as a symlink, rather than the stuff it points to. The cp command has a switch for that choice: -d.
(If you're curious about the details of the cp command, you can read the cp manual page. The cp command has a zillion options, but if you search for "dereference" you'll find options to enable or disable dereferencing. To "dereference" means to follow the symlink and copy the stuff it points to, while not dereferencing means to copy the symlink as a symlink.)
Here's how to do the copy:
cd apps/
cp -rd draft alpha
Couldn't be easier.
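Here is a sketch that contrasts the two choices in a throwaway directory. The somepkg folder and the alpha-full name are invented for the demonstration. One caveat: -d is an option of GNU cp, which is what Linux servers typically have; I believe the cp that ships with macOS does not accept -d, and there cp -RP is the closest equivalent.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/apps/node_modules/somepkg" "$sandbox/apps/draft"
echo "big library code" > "$sandbox/apps/node_modules/somepkg/index.js"
echo "// my app" > "$sandbox/apps/draft/server.js"
ln -s ../node_modules "$sandbox/apps/draft/node_modules"

cd "$sandbox/apps"
cp -rd draft alpha          # -d (GNU cp): copy the symlink as a symlink
cp -rL draft alpha-full     # -L: dereference, copying what the link points to

ls -l alpha alpha-full
```

In the listing, alpha/node_modules is still a symlink to ../node_modules, while alpha-full/node_modules is a real directory holding its own copy of the library code.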
Seeing Symlinks¶
If you want to see the entries in a folder with information about whether they are symlinks or not, you can get a "long" listing by adding the -l switch to the ls command. Here's what an ls -l might look like in the alpha folder:
$ ls -l
total 200
lrwxrwxrwx. 1 cs304node cs304node 16 Feb 25 16:42 connection.js -> ../connection.js
-rw-r-----. 1 cs304node cs304node 3311 Jan 17 19:34 cs304.js
lrwxrwxrwx. 1 cs304node cs304node 15 Jan 18 17:04 node_modules -> ../node_modules
-rw-r-----. 1 cs304node cs304node 459 Jan 17 19:49 package.json
drwxr-x---. 2 cs304node cs304node 4096 Jan 17 19:34 public
-rw-r-----. 1 cs304node cs304node 14480 Jan 19 16:34 server.js
drwxr-x---. 3 cs304node cs304node 4096 Jan 19 16:31 views
You can see that the node_modules entry is a symlink into the parent folder, as is the connection.js entry. Everything else is either a normal file or a directory. The first character on each line tells you the kind of thing the entry is:
- d is for directories, such as views
- l is for symlinks, such as node_modules
- - is for regular files, such as server.js
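A script can make the same distinction without parsing ls output. This sketch uses the shell's file tests in a throwaway directory; the entries are created just for the demonstration, and the symlink here dangles (its target doesn't exist in the sandbox), which is fine because we only ask what kind of thing each entry is.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir views
touch server.js
ln -s ../node_modules node_modules   # dangling here, but still a symlink

for entry in views server.js node_modules; do
  # Test for a symlink first, because -d and -f follow symlinks.
  if [ -L "$entry" ]; then
    echo "$entry: symlink -> $(readlink "$entry")"
  elif [ -d "$entry" ]; then
    echo "$entry: directory"
  else
    echo "$entry: regular file"
  fi
done

# find can list only the symlinks under a folder.
symlinks=$(find . -type l)
echo "$symlinks"
```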
Environment Variables¶
As long as we are leveling up our Unix knowledge, this is a good time to talk about environment variables.
When a program runs, it often needs configuration information. For example, the "print" command needs to know the name and location of your printer. The "edit" command needs to know my favorite editor (Emacs, not vim). Python needs to know where to load Python modules from. And Node apps will need some configuration values as well.
One way that Unix has used for decades to store those configuration values is environment variables. You can think of these as variables that live outside your programs, but which a program can read if it wants them.
You can configure these environment variables when you log in, or you can do it later, whenever you like.
You can find out the value of an environment variable with the echo command in the shell, preceding the variable's name with a dollar sign. Here are just a few; try them:
echo $HOSTNAME # the name of the computer you are logged into
echo $USER # your username
echo $HOME # your home directory
echo $PATH # where to find commands
echo $EDITOR # the editor to use
echo $NODEJS # the version of node to use
You can find out all your environment variables with the printenv command:
printenv
You can set an environment variable in your shell like this:
export VAR=value
Try it! Set a variable FAV to your favorite color or song or something:
export FAV='blue'
And test it:
echo $FAV
Note that setting an environment variable in the shell with the export command is not permanent. Such variables exist only in that one shell (and in programs started from it) and go away when you log out. If you want them to be permanent, save the settings in a file that is read when you log in.
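The export keyword is what makes a variable visible to programs started from the shell. Here is a small sketch of the difference, using sh -c to stand in for "some other program"; the bracketed output format is just for the demonstration.

```shell
unset FAV                     # start from a clean slate

# Without export, FAV is a plain shell variable: child programs can't see it.
FAV='blue'
before=$(sh -c 'echo "child sees: [$FAV]"')
echo "$before"

# After export, FAV is in the environment that child programs inherit.
export FAV
after=$(sh -c 'echo "child sees: [$FAV]"')
echo "$after"
```

The first line printed shows empty brackets and the second shows the value, which is why the settings in this reading always use export.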
Environment Variables in CS 304¶
I've created MongoDB accounts for each of you with a random password. When we run the mongo shell (or mongosh), we need a fancy URI to connect, including the username and password. There's a file in your home directory called .cs304env.sh that sets three environment variables using the export command. You can look at it:
cat ~/.cs304env.sh
and you can run it:
source ~/.cs304env.sh
Running it will set the environment variables. You can test that it worked:
echo $MONGO_URI
You can then use the MONGO_URI variable when you run the mongo shell:
mongo $MONGO_URI
We'll do that in lab.
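Why source the file rather than just run it? Here is a sketch using a made-up stand-in for the real file. The filename demo-env.sh, the variables DEMO_USER and DEMO_URI, and the URI itself are all invented; your actual ~/.cs304env.sh will have different contents. The sketch uses "." instead of "source"; the two mean the same thing in bash, and the dot also works in plainer shells.

```shell
sandbox=$(mktemp -d)

# A hypothetical stand-in for ~/.cs304env.sh (not the real contents).
cat > "$sandbox/demo-env.sh" <<'EOF'
export DEMO_USER='student1'
export DEMO_URI="mongodb://example.invalid/$DEMO_USER"
EOF

unset DEMO_USER DEMO_URI

# Running the file starts a child shell; its variables vanish when it exits.
sh "$sandbox/demo-env.sh"
echo "after running:  [${DEMO_URI:-}]"

# Sourcing the file executes its commands in the current shell.
. "$sandbox/demo-env.sh"
echo "after sourcing: [$DEMO_URI]"
```

Only the second echo shows a value, because only sourcing changes the shell you are actually typing in.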
Summary¶
- symlinks allow us to replace a file or directory with a pointer to another location
- symlinks allow us to reduce duplication of files and save space and time (in copying)
- the cp command with the -rd switches will copy recursively, copying symlinks as symlinks
- environment variables are a handy way to provide configuration information to programs
- the source command reads shell commands from a file
- your MONGO username, password and URI are stored in your ~/.cs304env.sh file