CS249 Systems Programming: Unix File Systems

The Unix file system is one of the principal reasons for the success of Unix. From a user and application point of view, the system is remarkably simple, uniform, and easy to use. Perhaps more surprising, the file system abstraction is maintained even within the kernel: operating system routines not directly concerned with a particular class of file system can (and do) treat files and directories uniformly using the low-level Unix I/O interface.

This flexibility and uniform simplicity are a testament to soundly engineered interfaces and abstractions, but they are also due in large measure to the use of an object-oriented approach to system building. None of the kernel code is written in an object-oriented language, of course: it's all still C. But, as we shall see, much of the file system code uses hand-coded objects and methods.

The abstraction layer that presents file systems to the rest of the kernel and thence to user programs is called the Virtual File System (VFS). We shall look at the key data structures (objects) in the VFS below.

Though the Unix virtual file system abstraction is applied to many devices, the easiest way to understand it is to go back to its origins as a model for organizing information on a disk. From this point of view, each partition on a disk is looked upon as a separate disk.

A file system must, at the very minimum, allow the storage and retrieval of information in named files. A file system must also keep track of file metadata, such as who owns a file, and when it was last written. Unix file systems have directories, so there must be some way to record the contents of a directory. Finally, there is information about the file system as a whole, such as its size and quotas.

As you can imagine, this information can be organized in myriad ways, and any particular organization of the data can be mapped to disk in many ways.

It's data, it's metadata, it's SUPERBLOCK!

Most operating systems have a data structure for information about the file system itself, and this data structure is normally stored at the beginning of the disk or partition it describes. (In fact, there is normally some bookkeeping data stored at the very beginning, and the file system proper starts at some fixed displacement.) Unix calls this structure the superblock, and you may also hear it referred to as a file system control block. When Unix mounts a file system (i.e., makes it available by assigning it a place in the global file namespace), the superblock is read off the disk, and an in-memory superblock object is filled in.

The superblock is created when the file system is created (see the manual page for mkfs) and specifies such things as the file system type, device block size, the maximum permitted file size, a name, and information related to file system caching and synchronization.
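Some of this file-system-wide information is visible from user space through the POSIX statvfs() call. The following is a minimal sketch, not a definitive tool: the helper name print_fs_info is made up for illustration, and exactly which values a given file system reports varies.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Print a few file-system-wide values for the file system that
 * contains `path`. Returns 0 on success, -1 if statvfs() fails.
 * (print_fs_info is a hypothetical name chosen for this sketch.) */
int print_fs_info(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) == -1) {
        perror("statvfs");
        return -1;
    }

    printf("preferred block size: %lu bytes\n", (unsigned long)vfs.f_bsize);
    printf("fragment size:        %lu bytes\n", (unsigned long)vfs.f_frsize);
    /* f_blocks and f_bfree are counted in units of f_frsize. */
    printf("total blocks:         %llu\n", (unsigned long long)vfs.f_blocks);
    printf("free blocks:          %llu\n", (unsigned long long)vfs.f_bfree);
    printf("total inodes:         %llu\n", (unsigned long long)vfs.f_files);
    printf("max file name length: %lu\n", (unsigned long)vfs.f_namemax);
    return 0;
}
```

Calling print_fs_info("/") reports on the root file system; any path on a mounted file system will do.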

There are also some fields in the superblock that support the object-oriented model alluded to above. Most notably, the struct that defines a superblock object has a field called s_op (in Linux). This field points to a structure whose members are pointers to functions. These functions manipulate the superblock object (its in-memory and on-disk versions) as well as some of the underlying file system's data structures, such as inodes (see below). For example, there is a member called statfs:


   struct super_operations {
           ...
           int (*statfs) (struct super_block *, struct statfs *);
   };

When you make a statfs() call to get the status information for a file system, the operating system finds the superblock for the file system you are referring to (say it's in a variable called sb), follows its s_op pointer to the operations structure, picks out the statfs entry, and calls that function:


   return sb->s_op->statfs(sb, statfsbuf);

Notice how the superblock is passed in as the first argument. If the kernel were written in an object-oriented language, this would not be necessary. However, there is no way to get a method's parent object in C, so we have to pass it in explicitly. Of course, this is exactly what happens in the compiled code for an object-oriented language: this in Java is really an implicit first argument to all the methods!
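The pattern can be tried out in ordinary user-space C. The sketch below is not kernel code: every name in it (my_super_block, my_super_operations, my_statfs, toyfs_*, generic_statfs) is a hypothetical, much-reduced stand-in invented for illustration. It shows only the shape of the technique: a table of function pointers hung off the object, with the object passed explicitly as the first argument.

```c
struct my_super_block;                  /* forward declaration */

/* Hypothetical stand-in for the result buffer. */
struct my_statfs {
    long f_bsize;                       /* block size in bytes */
};

/* The "method table": one function pointer per operation. */
struct my_super_operations {
    int (*statfs)(struct my_super_block *, struct my_statfs *);
};

/* The "object": data plus a pointer to its method table. */
struct my_super_block {
    long s_blocksize;
    const struct my_super_operations *s_op;
};

/* One file system's implementation of the statfs "method". */
static int toyfs_statfs(struct my_super_block *sb, struct my_statfs *buf)
{
    buf->f_bsize = sb->s_blocksize;
    return 0;
}

static const struct my_super_operations toyfs_super_ops = {
    .statfs = toyfs_statfs,
};

/* "Constructor" for a toyfs superblock object. */
struct my_super_block toyfs_make_super(long blocksize)
{
    struct my_super_block sb = {
        .s_blocksize = blocksize,
        .s_op = &toyfs_super_ops,
    };
    return sb;
}

/* Generic code: it knows the interface, not the file system behind it.
 * The object is passed explicitly, playing the role of `this`. */
int generic_statfs(struct my_super_block *sb, struct my_statfs *buf)
{
    return sb->s_op->statfs(sb, buf);
}
```

A second file system would supply its own operations table; generic_statfs would not change.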

The rest of the disk

All of the rest of the disk contains file information. The Unix designers elected to separate file metadata from file contents. Perhaps more strangely, a file's name is not directly connected to the file's contents or its metadata. Here's how it works:

After the superblock, there is an array of data structures on the disk called inodes (for index node, but you're decidedly uncool if you actually call them that). Inodes function as a disk index, and there is one inode for every file stored on the disk. The inode contains all the file's metadata and also says where on disk the file's contents may be found. It does not contain a name for the file.

Inode numbers start at 1. An inode number of 0 in a directory entry traditionally marks that entry as unused. Inode 1 is often used as a place to keep track of bad disk blocks. Inode 2 is the file system's root directory. The other inodes are allocated as new files and directories are created.

It is the inode that is read off the disk to populate the structure you get back from the stat() family of system calls. Here is that structure (which we will see in the laboratory):

   struct stat {
       dev_t     st_dev;     /* dev for this FS */
       ino_t     st_ino;     /* i-node number */
       mode_t    st_mode;    /* mode bits */
       nlink_t   st_nlink;   /* # hard links */
       uid_t     st_uid;     /* user */
       gid_t     st_gid;     /* group */
       dev_t     st_rdev;    /* if it's a device file */
       off_t     st_size;    /* in bytes */
       time_t    st_atime;   /* access */
       time_t    st_mtime;   /* modify */
       time_t    st_ctime;   /* i-node update */
       blksize_t st_blksize; /* optimal I/O size */
       blkcnt_t  st_blocks;  /* might have holes */
       };

Some Unix systems may have other information in here as well, but this much you can count on.
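As a warm-up for the laboratory, here is a minimal sketch that calls stat() and prints a few of these fields. The helper names print_stat and type_name are made up for this example, and the casts are there only because the exact integer types behind ino_t, off_t, and friends differ between systems.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Describe the file type encoded in the mode bits. */
static const char *type_name(mode_t mode)
{
    if (S_ISREG(mode)) return "regular file";
    if (S_ISDIR(mode)) return "directory";
    return "other";
}

/* Print selected inode metadata for `path`.
 * Returns 0 on success, -1 if stat() fails.
 * (print_stat is a hypothetical name chosen for this sketch.) */
int print_stat(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1) {
        perror("stat");
        return -1;
    }

    printf("inode number: %llu\n", (unsigned long long)st.st_ino);
    printf("hard links:   %lu\n", (unsigned long)st.st_nlink);
    printf("size:         %lld bytes\n", (long long)st.st_size);
    printf("blocks:       %lld\n", (long long)st.st_blocks);
    printf("type:         %s\n", type_name(st.st_mode));
    return 0;
}
```

Notice that nothing printed here is a file name: the name was used only to find the inode.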

The full inode data structure contains quite a bit besides this. There is a pointer to the superblock for the file system that contains the inode, all the information you get in the stat structure, and some synchronization information, of course. There are two fields that are of particular concern here: i_op is a pointer to a structure of inode operations and i_fop is a pointer to a structure of file operations. These are used in the same object-oriented style we saw above.

Another programming practice comes to light when you examine the inode structure field generic_ip (that's what it's called under Linux, anyway).

This field has the type void *. What is it for? It is for filesystem-specific data. The issue is this: VFS demands that all file systems produce inodes upon request and conform to an interface based on the disk model we have been discussing. However, not all file systems really look like this. Some disk file systems, for example, store file metadata with the file contents on disk. Such a file system may require state information to keep track of how it is presenting itself to the VFS. This field allows the file system to store a pointer to such a data structure as part of the in-memory inode object. Then its methods may gain access to this information, cast it to the appropriate type, and continue the masquerade. void * is being used as a substitute for data abstraction!
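The void * idiom is easy to reproduce in user-space C. The sketch below uses invented names throughout (my_inode, i_private, toyfs_inode_info, and the toyfs_* functions are hypothetical, not the kernel's); it shows only how generic code can carry a pointer it never interprets while the file system that stored it recovers the real type.

```c
#include <stdlib.h>

/* Generic object: the same layout for every file system. */
struct my_inode {
    unsigned long i_ino;
    void *i_private;            /* opaque to generic code */
};

/* State that only the hypothetical "toyfs" understands. */
struct toyfs_inode_info {
    long first_extent;
};

/* toyfs attaches its private state to a generic inode.
 * Returns 0 on success, -1 if memory cannot be allocated. */
int toyfs_attach(struct my_inode *inode, long first_extent)
{
    struct toyfs_inode_info *info = malloc(sizeof *info);

    if (info == NULL)
        return -1;
    info->first_extent = first_extent;
    inode->i_private = info;
    return 0;
}

/* A toyfs method recovers the real type from the void pointer.
 * (In C the conversion from void * needs no explicit cast.) */
long toyfs_first_extent(const struct my_inode *inode)
{
    const struct toyfs_inode_info *info = inode->i_private;

    return info->first_extent;
}

/* Release the private state when the inode is discarded. */
void toyfs_detach(struct my_inode *inode)
{
    free(inode->i_private);
    inode->i_private = NULL;
}
```

The compiler cannot check that the pointer really refers to a toyfs_inode_info; that discipline is left to the programmer, which is the price of using void * in place of a real abstraction mechanism.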

That's it! Everything else on the disk is file data stored in disk blocks, and, with only a few exceptions, Unix doesn't concern itself with what's inside those data blocks.

But I'm sure I've seen directories on Unix Systems!

You have. You read some directories in lab. But they are not an explicit part of the disk model. A directory is just a file, and, as a file, it has an inode and file contents. The inode does maintain a bit that says the associated file system object is a directory. The contents of a directory file consist of a list of names and corresponding inode numbers. For example, the name . is associated with the inode of the directory itself, and .. is associated with the inode of the parent directory. Historically, you could even read the contents of a directory using ordinary file open and read operations (many modern systems, Linux included, now refuse a read() on a directory with EISDIR). In any case, the specific data format may vary from file system to file system. That is why the system provides readdir(). This is also one case where Unix cares about the contents of a file (symbolic links and executables are the other two cases).
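The name-to-inode mapping can be seen directly with opendir() and readdir(). This is a minimal sketch: list_dir is a name invented for the example, and for brevity it does not distinguish a readdir() error from the end of the directory.

```c
#include <dirent.h>
#include <stdio.h>

/* Print the inode number and name of every entry in `path`.
 * Returns the number of entries seen, or -1 if the directory
 * cannot be opened. (list_dir is a hypothetical helper name.) */
int list_dir(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL) {
        perror("opendir");
        return -1;
    }

    /* Each entry pairs a name with an inode number, nothing more. */
    while ((entry = readdir(dir)) != NULL) {
        printf("%10llu  %s\n",
               (unsigned long long)entry->d_ino, entry->d_name);
        count++;
    }

    closedir(dir);
    return count;
}
```

On most systems the listing includes the entries . and .. alongside the ordinary names.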

Are directories done using an object-oriented style, too? Yes! In Linux, when you manipulate a directory entry, there is an in-memory data structure called a struct dentry that contains a field called d_op, which is a pointer to a structure of operations on dentry objects. (Do not confuse it with struct dirent, the user-level structure returned by readdir().)

The file structure manipulated by the operating system works in the same way, too. A file object has a field called f_op (under Linux) that is a pointer to a file operations table. The entries in this table are pointers to the actual implementations of the file system calls for that file object.

Links

A reference to an inode in a directory file is called a link or hard link. Creating a file creates a link, of course. You can also make a hard link to an existing file using the ln command or the link() system call. Every time a directory associates a name with an inode (either through file creation or a link() call), the st_nlink field in the inode is incremented, i.e., this field is a reference count for the inode.
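The reference count can be watched from user space. The sketch below creates a scratch file, gives it a second name with link(), and compares st_nlink before and after. The function names and the /tmp scratch location are assumptions made for the example; it also assumes the file system holding /tmp supports hard links.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the link count recorded in the inode for `path`, or -1. */
long link_count(const char *path)
{
    struct stat st;

    return stat(path, &st) == -1 ? -1 : (long)st.st_nlink;
}

/* Create a scratch file, add a second hard link to it, and return
 * the link count observed afterwards (or -1 on any error).
 * Both names are removed again before returning. */
long demo_hard_link(void)
{
    char name[] = "/tmp/linkdemoXXXXXX";   /* template for mkstemp() */
    char alias[64];
    long before, after;
    int fd = mkstemp(name);

    if (fd == -1)
        return -1;
    close(fd);
    snprintf(alias, sizeof alias, "%s.alias", name);

    before = link_count(name);              /* one name so far */
    if (link(name, alias) == -1) {
        unlink(name);
        return -1;
    }
    after = link_count(name);               /* same inode, two names */

    printf("links before: %ld, after: %ld\n", before, after);

    unlink(alias);
    unlink(name);
    return after;
}
```

Both names refer to the same inode, so calling stat() on either one reports the same st_ino and the same link count.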

The rm command stands for “remove.” But this command doesn't necessarily cause a file to be deleted so its disk blocks can be reused. Instead, rm uses the unlink() system call, which does exactly what its name implies: it removes the link to the file from the parent directory. When it does this, it decrements the link count (st_nlink) in the file's inode. If the link count isn't 0, then that's the end of it. If the link count has gone to zero, and no running process has the file open, then the file can be deleted: i.e., its disk blocks and inode can be reused. If the link count is 0 but the file is open, then the operating system waits until no process has it open any more before actually reclaiming the space. (This means that a process can continue to write to a doomed file.)
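The doomed-file behavior can be demonstrated in a few lines. In this sketch (function name and /tmp location are assumptions for illustration) a scratch file is unlinked while still open, then written to and read back through the surviving descriptor. On typical local file systems fstat() then reports a link count of 0; some network file systems behave differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Unlink a scratch file while it is open and keep using it.
 * Returns the link count seen through the open descriptor after
 * the unlink, or -1 if any step fails. */
long demo_unlink_while_open(void)
{
    char name[] = "/tmp/unlinkdemoXXXXXX";  /* template for mkstemp() */
    const char msg[] = "still here\n";
    char buf[sizeof msg];
    struct stat st;
    long nlink = -1;
    int fd = mkstemp(name);

    if (fd == -1)
        return -1;
    if (unlink(name) == -1) {               /* remove the only name */
        close(fd);
        return -1;
    }

    /* The name is gone, but the inode and its blocks survive for as
     * long as this descriptor stays open. */
    if (write(fd, msg, sizeof msg) == (ssize_t)sizeof msg &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof buf) == (ssize_t)sizeof buf &&
        memcmp(buf, msg, sizeof msg) == 0 &&
        fstat(fd, &st) == 0)
        nlink = (long)st.st_nlink;

    close(fd);                              /* now the space is reclaimed */
    return nlink;
}
```

This is a common idiom for temporary files: unlinking immediately after creation guarantees the space is reclaimed even if the program crashes.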

If you read the manual page for ln, you'll discover that you are not allowed to make a hard link to an existing directory, only to a file. The only way to create a new link to a directory is to create a new directory. Why? Here is a hint in the form of a popular MIT AI Lab koan:

One day a student came to Moon and said: "I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons."
Moon patiently told the student the following story:
"One day a student came to Moon and said: `I understand how to make a better garbage collector...'"

A hard link is necessarily a reference to an inode on the same device, so you can't make a hard link to a file on another disk or remote computer.

Symbolic Links

A symbolic link is really just an ordinary file whose contents are a path to another file. It is a way to say, when I operate on file A, I really mean to be operating on file B. Symbolic links overcome the two main limitations of hard links:

Symbolic links can refer to directories.
Symbolic links can refer to objects in other file systems.

The value of symbolic links is that they allow one to create multiple views on a file system namespace.

Because of symbolic links, you must think about whether you want to transparently operate through links or whether you want to manipulate the link itself, and this is precisely the distinction between stat() and lstat(). Both return the file metadata contained in an inode, but when applied to a file that is a symbolic link, stat() will return the metadata associated with the link target while lstat() will return the metadata associated with the link itself. It may seem a burden to have to worry about symbolic links, but the fact that all the system calls take them into account means that links work much better in Unix than they do in other systems (notably DOS and its descendants).
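The difference is easy to observe. The following sketch (the function name and the /tmp scratch location are assumptions for the example) creates a regular file, makes a symbolic link to it, and asks both calls about the link.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if stat() on a symbolic link reports a regular file
 * (the target) while lstat() reports a symbolic link (the link
 * itself), 0 if not, and -1 if the scratch files cannot be set up. */
int demo_stat_vs_lstat(void)
{
    char target[] = "/tmp/statdemoXXXXXX";  /* template for mkstemp() */
    char linkname[64];
    struct stat via_stat, via_lstat;
    int result = -1;
    int fd = mkstemp(target);

    if (fd == -1)
        return -1;
    close(fd);
    snprintf(linkname, sizeof linkname, "%s.lnk", target);

    if (symlink(target, linkname) == 0) {
        if (stat(linkname, &via_stat) == 0 &&
            lstat(linkname, &via_lstat) == 0) {
            /* stat() followed the link; lstat() did not. */
            result = S_ISREG(via_stat.st_mode) &&
                     S_ISLNK(via_lstat.st_mode);
        }
        unlink(linkname);
    }
    unlink(target);
    return result;
}
```

The two structures also carry different inode numbers, since the link and its target are separate files.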

You can put anything you want in a symbolic link. The OS does no verification that, for example, the link target exists. You use ln -s to create a symbolic link. There are no restrictions on how such links are created (unlike hard links).
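The lack of verification can be checked with symlink() and readlink(). In this sketch (the function name, the scratch directory under /tmp, and the target text are all made up for the example) a link is created to a path that does not exist. Creating it succeeds and readlink() returns the stored text unchanged; only following the link fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a dangling symbolic link inside a fresh scratch directory.
 * Returns 1 if the link was created, readlink() gives back exactly
 * the stored text, and stat() through the link fails with ENOENT;
 * 0 if any of those checks fails; -1 on setup errors. */
int demo_dangling_symlink(void)
{
    char dir[] = "/tmp/dangleXXXXXX";       /* template for mkdtemp() */
    const char *target = "no/such/target";  /* deliberately missing */
    char linkname[64], buf[64];
    struct stat st;
    ssize_t n;
    int ok = 0;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(linkname, sizeof linkname, "%s/link", dir);

    if (symlink(target, linkname) == -1) {  /* no check of the target */
        rmdir(dir);
        return -1;
    }

    /* readlink() does not add a terminating NUL; do it by hand. */
    n = readlink(linkname, buf, sizeof buf - 1);
    if (n >= 0) {
        buf[n] = '\0';
        ok = strcmp(buf, target) == 0 &&
             stat(linkname, &st) == -1 && errno == ENOENT;
    }

    unlink(linkname);
    rmdir(dir);
    return ok;
}
```

Because the stored path here is relative, the kernel would resolve it relative to the directory containing the link, not the current working directory of the process.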

Programmers' Summary

From a programming point of view, Unix provides certain file system abstractions: files (map from inode to ordered sequence of bytes stored in disk blocks), directory entries (map from names to inodes), inodes (file metadata and pointer to contents), and mount points (map from place in global filesystem namespace to filesystem on disk partition) [from Love, p. 187]. Within the kernel (and within user-level file system implementations such as NFS), these abstractions are represented in an object-oriented way.