Systems Programming

Low-Level I/O

If you are running on a Unix system, you may choose to use low-level I/O system calls rather than the standard I/O library calls we saw before. Because they are really system calls, low-level I/O operations get into the kernel slightly faster than the library routines, but this does not mean that your code will run faster if you use them. To gain efficiency, you must do very careful planning of your I/O operations. The low-level operations give you a bit more control, particularly of special I/O devices. If you are writing kernel level code, you must use the system calls because the standard I/O library is not available to you.

Be warned that most applications code, particularly that using ordinary files, will be more portable, easier to code, and more efficient with less tuning if it uses the standard I/O library. The low-level facilities give up the user-level buffering and error handling standard I/O provides. But this might be exactly what you want.

After all these caveats and warnings, there is nothing particularly deep or mysterious about system-level I/O. You still create, open, close, read, and write files. You will just get slightly more control and be exposed to more operational details. First some more general information about Unix file systems.

A Masked Ball

Every process has: an effective user ID, an effective group ID, a present (current) working directory, a umask, and information about open files in a file descriptor table. The current working directory says where to look for or create a file given a relative pathname. The effective user and group IDs are used to check whether an action is permitted and are associated with any newly created files. The other two elements are new.

Go ahead and type umask to the shell. You'll get back an octal number that says which bits should be turned off on newly created files. For example, I got back 0007 from my shell, which means that all the permission bits for other should be turned off by default when I create new files.

This is not just a shell feature: every process has a umask associated with it. A process inherits its umask from its parent, i.e., the process that started it. When a process asks to create a file with permissions 666 (user, group, other all have read and write permissions), this value is ANDed with the complement of the umask. In the case of my shell, such a request would yield a file with permissions 0660 which is 0666 & ~0007 or 0666 & 0770.

You can change the umask with the umask() system call:

      mode_t umask(mode_t mask);
      
umask() returns the previous value of the umask, which can be handy when you want to change the umask and then reset it later.

File Descriptors

Every process has an array of file descriptors associated with it. Your program refers to open files by the integer index of the corresponding file descriptor in the table. Under Linux, each table entry is a pointer to a struct file containing a variety of elements, including a pointer to the file operations we discussed last time, as well as the current offset into the file for use with the next I/O operation (often called the file pointer).

When a process begins execution, there are three file descriptors already allocated, as you know. Standard input is file descriptor 0, standard output is file descriptor 1, and standard error is file descriptor 2. Now you know why shell file redirection uses program 2> foo to redirect program's standard error output to the file foo: An integer in a redirection specifier is the file descriptor number! In your C code, you can also refer to the file descriptor numbers of these streams as STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO, which is theoretically more portable.

Opening/Creating/Closing/Deleting Files

       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);
       int creat(const char *pathname, mode_t mode);

       int close(int fd);

       int unlink(const char *pathname);
       
Not surprisingly, you open a file using the open() system call. creat() is an older system call for a very common way to open a file for output, as we'll see below.

The integer returned when you open or create a file is the index of the new file descriptor in the process's file descriptor table, or -1 if there is an error (and errno is set accordingly). The OS always uses the lowest-numbered available file descriptor table entry, which suggests one way to do redirection: close the file associated with a descriptor and then open another file, which will take over the descriptor number you just freed as long as no lower-numbered entry is free.

mode, when specified, is the permissions the process would like the file to have in the event that a file is created. These permissions are ANDed with the complement of the umask as described above to determine the actual file permissions. The result of this computation has no effect on what this process can do with the file: a program may open a newly created read-only file for writing and then write to it. The permissions will apply to whoever opens the file later.

Do not assume that the permission bits are stored in the order we all associate with ls -l. Instead, OR together the appropriate compile-time constants. For example, a mode of

       S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IXOTH
       
means that the owner can read, write and execute the file; a group member may read and execute the file; and everyone else may only execute the file. We would typically describe this as 0751.

There are also shorthands for the common cases in which one category of user has all permissions: S_IRWXU, S_IRWXG, and S_IRWXO.

The flags argument specifies what you want to do with the file. O_RDONLY means to open the file for reading, O_WRONLY means to open the file for writing, and O_RDWR means to open the file for both. ORing together the first two is not the same as the third, because read-only is traditionally indicated by all zeros.

If you want a file to be created in the event the file doesn't already exist, you OR in O_CREAT. If you would like to insist that the file must not already exist, then you OR in O_CREAT | O_EXCL. In this case, the file will be created if it doesn't exist, and you'll get an error if it does exist. This could be useful for checking whether a previous invocation of your program left some temporary files behind or for logging. Sometimes this option is used for locking, but it doesn't always work (e.g., on NFS mounted file systems).

If you would like the file to be truncated to 0 bytes in length before you start working on it, you OR in O_TRUNC. If you want to append, you specify O_APPEND. There are other flags, too, but these will cover most common applications.

So now we can reveal that creat() is just like calling

     open(path, O_WRONLY | O_CREAT | O_TRUNC, perms);
       

When you open an existing file, your privileges are checked against the file's permissions. You must not only have read or write permission for the file, but also execute (search) permission for every directory in the path (execute permission on a directory means that you can cd to it).

To create a file, you must have write permission for the directory, which makes sense since creating a file means making a new link, i.e., updating the contents of the file that represents the directory the file is in.

close() closes a file and allows the file descriptor to be reused. If no other references to the file are alive in the system, the kernel resources associated with the file are freed. If there are no more links to the file, the file is finally removed.

close() does not insist that pending writes be complete and is therefore very fast. Unix performs the actual writes later at its convenience. If there are any errors, you won't find out about them (there is no way to tell you!).

You can force output to go to disk using fsync(). This is not the same as fflush() in the standard I/O library: fflush() simply forces all data in the user-level I/O buffers to be written via the write() system call; it does not guarantee that the data has reached the disk.

To remove a file, call unlink(). As we have already discussed, this doesn't necessarily delete the file. unlink() removes the link to the file specified in the path name, i.e., it removes that entry from the directory and decreases the reference count in the file's inode by one. If the reference count goes to 0, then the system checks whether any processes have the file open. If not, the disk blocks are freed. If so, the file is retained until it is no longer open for any process.

Reading and Writing

       ssize_t read(int fd, void *buf, size_t count);
       ssize_t write(int fd, const void *buf, size_t count);
       
These are similar to what you've seen before. The first argument is the index of the relevant file descriptor, the second is a data buffer in the process's memory, and the third argument is a number of bytes (either the maximum to read in or the number to write out). Each of these functions begins its work at the current file offset. The first read() of a file starts at offset 0 even if the file was opened for appending.

For both reading and writing, getting back a smaller than expected count is not necessarily an error. Here is where working with low-level I/O can be more painful.

read() returns 0 on end of file. On error, read() returns -1 and errno is set. If errno is EINTR, that means that the kernel was interrupted by something else before it got to read any data, and you should just try again. You can get a short count because the kernel was interrupted in the middle of a transfer or the device wasn't ready to give any more data just now (for example, you were reading from a terminal and this is all that's available). You can write a loop to be more forceful, which is part of what the standard I/O library does for you. Here is an example from Rochkind, p. 97:

 
#include <errno.h>
#include <unistd.h>

ssize_t really_read(int fd, void *buf, size_t nbyte)
{
        ssize_t nread = 0, n;

        do {
                if ((n = read(fd,
                              &((char *)buf)[nread],
                              nbyte - nread))
                    == -1) {
                        if (errno == EINTR)
                                continue;       /* interrupted: just retry */
                        else
                                return -1;      /* a real error */
                }
                if (n == 0)
                        return nread;           /* end of file */
                nread += n;                     /* partial read: keep going */
        } while (nread < nbyte);

        return nread;
}

The write() function has some similar complications plus the issue of delayed writes. As discussed above, writes to a file do not go directly to disk. This makes programs run faster, but can create problems in the event of system crashes (thankfully fairly rare when using a stable OS). Not only do writes not take place immediately after the system call, but they do not necessarily proceed in any predictable order. You can use fsync() in cases where you are willing to trade performance for certain guarantees.

As with reading a file, writing fewer bytes than you wanted is not considered an error. You might be writing to a pipe that is full, for example. For regular files, this event will be rare. You can code a loop similar to the one above for writing.

Why do we do this again?

Manipulating data in small chunks or without regard to the underlying device's block size can be very inefficient. The buffering the kernel provides is, as you can see above, fraught with unpredictability. What's not so obvious is that it is not necessarily efficient. Remember that system calls are very expensive because of the context switch from user to kernel space and back. Doing that to copy a small amount of data into kernel buffers represents tremendous overhead. It is usually easier and more efficient to do some buffering at the user level, and this is exactly what the standard I/O library does! To get a handle on the efficiency issues, consider two versions of a program that copies its input to its output.

unbuffered-copy.c copies the data one byte at a time, i.e., it does no user-level buffering.

buffered-copy.c, on the other hand, does user-level buffering with a buffer size of BUFSIZ. In our present server configuration, BUFSIZ is 8192 (8K). This is a compile-time constant, not a value chosen to match any specific device.

Here is a simple test that involved copying a 3MB file using the two programs above:


  % ls -l io-test
  -rw-rw-r--    1 systems  faculty   3408774 Mar  4 00:06 io-test
  %
  % time unbuffered-copy <io-test > /dev/null

  real	0m6.469s
  user	0m1.490s
  sys	0m4.950s
  %
  % time buffered-copy <io-test > /dev/null

  real	0m0.011s
  user	0m0.000s
  sys	0m0.010s
  %
       
Subjectively, there is a noticeable difference in run-time between the two, and the measurements show why. The unbuffered version spends nearly 500 times longer in the kernel (perhaps not too bad when you consider that we make over 8000 times as many read() and write() calls).

Seek and Ye Shall Find

It is often convenient to begin operating at a particular place in a file. A database application might well make use of this ability. We saw fseek() in the standard I/O library, and lseek() is similar:
 
  off_t lseek(int fildes, off_t offset, int whence); 

lseek() does not do any actual I/O; it doesn't even communicate with the device driver that services the file associated with the file descriptor. The offset in the file descriptor is changed. That's all. How it is changed is determined by which of three values is passed in as whence: SEEK_SET sets the offset to offset bytes from the beginning of the file, SEEK_CUR sets it to the current offset plus offset, and SEEK_END sets it to the size of the file plus offset.

Seeking beyond the end of a file is considered perfectly fair. A subsequent read will simply return end of file, but a write will cause the file to be enlarged so that the new bytes go at the prescribed place, leaving a hole in the file. Reads from the hole return all zeros, but the file does not actually get extra data blocks to contain that data. This is why a file's apparent size can be much larger than the space consumed by its allocated disk blocks.

Is seeking efficient?

That depends on exactly how your file is laid out on disk. Unix systems generally use a scheme that makes seeking to a random location in a file fairly fast by storing a file's list of data blocks in a tree structure. In the Linux ext2 system, for example, it works like this: each inode in an ext2 file system contains 15 slots, each of which can hold the address of a data block on disk (the block number). 12 of the 15 slots address blocks that contain actual file data, i.e., assuming 1K data blocks (the smallest size ext2 supports), you can seek to any address in the first 12KB of data with a single disk access once you have the inode. Very small files can be accessed quite efficiently!

The 13th slot points to an indirect data block, i.e., it points to a block of pointers to data blocks. Indirect blocks are the same size as data blocks, so with 1K blocks and 4-byte block addresses, each one holds 256 pointers to actual data.

The 14th slot points to a double indirect data block, i.e., it points to a block of pointers to blocks of pointers to data blocks.

The 15th slot points to a triple indirect data block, i.e., it points to a block of pointers to blocks of pointers to blocks of pointers to data blocks.

If you do the arithmetic with 1K blocks, a file can have at most 12 + 256 + 256^2 + 256^3 blocks (about 16.8 million), or roughly 16GB of data, though the 32-bit file size field in the inode imposes its own 4GB limit. An ext2 file system with 1K blocks can contain no more than 4TB of data in total.

In the worst case, you can get to any byte of a file's data in 4 disk block accesses once you have the inode: three levels of indirect blocks and then the data block itself.

Other Calls for Reading and Writing

    ssize_t pread(int fd, void *buf, size_t count, off_t offset);
    ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
       
Semantically, these are like doing an lseek() followed by a read() or write(). However, they don't actually refer to or affect the file pointer in the file descriptor: they use their offset directly. If the file was opened with the O_APPEND flag, then pwrite() ignores the offset argument and, in essence, always seeks to the end of file before writing anything.

The main advantage to these calls, in addition to providing one call to replace two, is that they are more thread safe. You don't have to worry about other threads of control doing things that affect the file pointer in a shared file descriptor.

       ssize_t readv(int fd, struct iovec *vector, int count);
       ssize_t writev(int fd, const struct iovec *vector, int count);
       
These are the so-called scatter read and gather write calls. We won't go into great detail here. The idea is that vector is an array of struct iovecs, each of which contains a pointer to data and a number of bytes. The memory involved therefore need not be contiguous, though the data will be logically contiguous in the file.

These calls represent a way to do a set of reads and writes all at once, at the cost of setting up the iovec structures for the vector.


Author: Mark A. Sheldon
Modified: 17 March 2008