Be warned that most application code, particularly code that works with ordinary files, will be more portable, easier to write, and more efficient with less tuning if it uses the standard I/O library. The low-level facilities give up the user-level buffering and error handling that standard I/O provides. But that might be exactly what you want.
After all these caveats and warnings, there is nothing particularly deep or mysterious about system-level I/O. You still create, open, close, read, and write files. You will just get slightly more control and be exposed to more operational details. First some more general information about Unix file systems.
A Masked Ball
Every process has: an effective user ID, an effective group ID, a present (current) working directory, a umask, and information about open files in a file descriptor table. The current working directory says where to look for or create a file given a relative pathname. The effective user and group IDs are used to check whether an action is permitted and are associated with any newly created files. The other two elements are new.
Go ahead and type umask
to the shell. You'll
get back an octal number that says which bits should be
turned off on newly created files. For example, I got
back 0007
from my shell, which means that all the
permission bits for other
should be turned off by
default when I create new files.
This is not just a shell feature: every process has a umask
associated with it. A process inherits its umask from its
parent, i.e., the process that started it. When a process asks
to create a file with permissions 666
(user, group,
other all have read and write permissions), this value is ANDed
with the complement of the umask. In the case of my shell, such
a request would yield a file with permissions 0660,
which is 0666 & ~0007, or 0666 & 0770.
You can change the umask with the umask()
system
call:
mode_t umask(mode_t mask);
umask()
returns the previous value of the
umask, which can be handy when you want to change the umask and
then reset it later.
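For example, here is a small sketch of tightening the umask around some sensitive work and then restoring it:
#include <sys/types.h>
#include <sys/stat.h>

mode_t old_mask;

old_mask = umask(077);    /* group and other get no permissions on new files */
/* ... create files that only the owner should be able to touch ... */
umask(old_mask);          /* restore the umask we inherited */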
File Descriptors
Every process has an array of file descriptors associated with it. Your program refers to open files by the integer index of the corresponding file descriptor in the table. Under Linux, each table entry is a pointer to a struct file
containing a variety of elements, including a pointer to the
file operations we discussed last time, as well as the current
offset into the file for use with the next I/O operation (often
called the file pointer).
When a process begins execution, there are three file
descriptors already allocated, as you know. Standard input is
file descriptor 0
, standard output is file
descriptor 1
, and standard error is file descriptor
2
. Now you know why shell file redirection uses
program 2> foo to redirect program's
standard error output to the file foo: an integer
in a redirection specifier is the file descriptor number! In
your C code, you can also refer to the file descriptor numbers
of these streams as STDIN_FILENO
,
STDOUT_FILENO
, and STDERR_FILENO
,
which is theoretically more portable.
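For instance, a quick way to complain on standard error without going through stdio might look like this sketch:
#include <string.h>
#include <unistd.h>

const char msg[] = "something went wrong\n";
write(STDERR_FILENO, msg, strlen(msg));   /* the same stream that 2> foo redirects */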
Opening/Creating/Closing/Deleting Files
int open(const char *pathname, int flags);
int open(const char *pathname, int flags, mode_t mode);
int creat(const char *pathname, mode_t mode);
int close(int fd);
int unlink(const char *pathname);
Not surprisingly, you open a file using the
open()
system call. creat()
is an old system call for a
very common way to open a file for output, as we'll see below.
The integer returned when you open or create a file is the index
of the new file descriptor in the process's file descriptor
table or -1
if there is an error (and
errno
is set accordingly). The OS will always use
the lowest numbered available file descriptor table entry, which
suggests one way to do redirection: close the file
associated with a descriptor and then open another file. The new
file takes over the descriptor you just freed, as long as no
lower-numbered entry in the table is also free.
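As a sketch of that trick (error handling omitted, the file name is just an example, and the flags are explained below):
#include <fcntl.h>
#include <unistd.h>

close(STDOUT_FILENO);                                   /* descriptor 1 is now free      */
open("out.log", O_WRONLY | O_CREAT | O_TRUNC, 0666);    /* takes the lowest free slot: 1 */
/* from here on, anything written to standard output lands in out.log */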
mode, when specified, is the set of permissions the
process would like the file to have in the event that a file is
created. These permissions are ANDed with the complement of the
umask as described above to determine the actual file
permissions. The result of this computation has no effect on
what this process can do with the file: a program may open a
newly created read-only file for writing and then write to it.
The permissions apply to whoever opens the file later.
You should not assume that the permission bits are in the
order we all associate with ls -l. Instead, OR
together the appropriate compile-time constants. For example,
a mode of
S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IXOTH
means that the owner can read, write, and execute the file; a group member may read and execute the file; and everyone else may only execute the file. We would typically describe this as 0751.
There are also shorthands for the common cases in which one
category of user has all permissions: S_IRWXU
,
S_IRWXG
, and S_IRWXO
.
The flags argument specifies what you want to do with the
file. O_RDONLY
means to open the file for
reading, O_WRONLY
means to open the file for
writing, and O_RDWR
means to open the file for
both. ORing together the first two is not the same as the
third, because read-only is traditionally indicated by all
zeros (O_RDONLY is 0).
If you want a file to be created in the event the file
doesn't already exist, you OR in O_CREAT
. If you
would like to insist that the file must not already
exist, then you OR in O_CREAT | O_EXCL
. In this
case, the file will be created if it doesn't exist, and you'll
get an error if it does exist. This could be useful for checking
whether a previous invocation of your program left some
temporary files behind or for logging. Sometimes this option
is used for locking, but it doesn't always work (e.g., on NFS
mounted file systems).
If you would like the file to be truncated to 0 bytes in
length before you start working on it, you OR in
O_TRUNC
. If you want to append, you specify
O_APPEND
. There are other flags, too, but this
will cover most common applications.
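Putting the flags and the mode together, a typical call might look like this sketch (the file name is just an example):
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int fd = open("results.dat", O_WRONLY | O_CREAT | O_TRUNC,
              S_IRUSR | S_IWUSR | S_IRGRP);   /* 0640, before the umask is applied */
if (fd == -1)
    perror("open results.dat");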
So now we can reveal that creat()
is just like
calling
open(path, O_WRONLY | O_CREAT | O_TRUNC, perms);
When you open an existing file, your privileges are checked
against the file's permissions. You must not only have read or
write permission for the file, but you must also have execute
(search) permission for every directory in the path
(execute permission on a directory means that you can
cd
to it).
To create a file, you must have write permission for the directory, which makes sense since creating a file means making a new link, i.e., updating the contents of the file that represents the directory the file is in.
close()
closes a file and allows the file
descriptor to be reused. If no other references to the file
are alive in the system, the kernel resources associated with
the file are freed. If there are no more links to the file,
the file is finally removed.
close()
does not insist that pending
writes be complete and is therefore very fast. Unix performs
the actual writes later at its convenience. If there are any
errors, you won't find out about them (there is no way to tell
you!).
You can force output to go to disk using
fsync()
. This is not the same as
fflush()
in the standard I/O library:
fflush()
simply forces all data in the user-level
I/O buffers to be written via the write()
system
call.
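As a sketch (assuming fd is a descriptor open for writing and buf holds n bytes of data):
#include <stdio.h>
#include <unistd.h>

if (write(fd, buf, n) == -1)
    perror("write");
else if (fsync(fd) == -1)        /* wait until the data has actually reached the device */
    perror("fsync");
if (close(fd) == -1)
    perror("close");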
To remove a file, call unlink()
. As we have
already discussed, this doesn't necessarily delete the file.
unlink()
removes the link to the file specified in
the path name, i.e., it removes that entry from the directory
and decreases the reference count in the file's inode by one.
If the reference count goes to 0, then the system checks
whether any processes have the file open. If not, the disk
blocks are freed. If so, then the file is retained until
it is no longer open for any process.
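One classic use of this behavior is a scratch file that cleans up after itself: create and open it, unlink it immediately, and keep using the descriptor. A sketch (the name is just for illustration; in practice you would pick a unique one):
#include <fcntl.h>
#include <unistd.h>

int tmpfd = open("/tmp/myscratch", O_RDWR | O_CREAT | O_EXCL, 0600);
unlink("/tmp/myscratch");      /* no directory entry left, but tmpfd still works */
/* ... read and write through tmpfd as usual ... */
close(tmpfd);                  /* only now are the disk blocks freed */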
Reading and Writing
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
These are similar to what you've seen before. The first argument is the index of the relevant file descriptor, the second is a data buffer in the process's memory, and the third argument is a number of bytes (either the maximum to read in or the number to write out). Each of these functions begins its work at the current file offset. The first
read()
of a file starts at offset 0 even if the file was opened for
appending.
For both reading and writing, getting back a smaller than expected count is not necessarily an error. Here is where working with low-level I/O can be more painful.
read()
returns 0
on end of file.
On error, read()
returns -1
and
errno
is set. If errno
is
EINTR
, that means that the kernel was interrupted
by something else before it got to read any data, and you
should just try again. You can get a short count because the
kernel was interrupted in the middle of a transfer or the
device wasn't ready to give any more data just now (for
example, you were reading from a terminal and this is all
that's available). You can write a loop to be more forceful,
which is part of what the standard I/O library does for you.
Here is an example from Rochkind, p. 97:
ssize_t really_read(int fd, void *buf, size_t nbyte)
{
    ssize_t nread = 0, n;

    do {
        if ((n = read(fd, &((char *)buf)[nread], nbyte - nread)) == -1) {
            if (errno == EINTR)
                continue;
            else
                return -1;
        }
        if (n == 0)
            return nread;
        nread += n;
    } while (nread < nbyte);
    return nread;
}
The
write()
function has some similar
complications plus the issue of delayed writes. As discussed
above, writes to a file do not go directly to disk. This makes
programs run faster, but can create problems in the event of
system crashes (thankfully fairly rare when using a stable OS).
Not only do writes not take place immediately after the system
call, but they do not necessarily proceed in any predictable
order. You can use fsync()
in cases where you are
willing to trade performance for certain guarantees.
As with reading a file, writing fewer bytes than you wanted is not considered an error. You might be writing to a pipe that is full, for example. For regular files, this event will be rare. You can code a loop similar to the one above for writing.
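Such a loop, modeled on really_read() above, might look like this sketch:
#include <errno.h>
#include <unistd.h>

ssize_t really_write(int fd, const void *buf, size_t nbyte)
{
    ssize_t nwritten = 0, n;

    do {
        n = write(fd, &((const char *)buf)[nwritten], nbyte - nwritten);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted before anything was written; retry */
            return -1;
        }
        nwritten += n;
    } while ((size_t)nwritten < nbyte);
    return nwritten;
}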
Why do we do this again?
Manipulating data in small chunks or without regard to the underlying device's block size can be very inefficient. The buffering the kernel provides is, as you can see above, fraught with unpredictability. What's not so obvious is that it is not necessarily efficient. Remember that system calls are very expensive because of the context switch from user to kernel space and back. Doing that to copy a small amount of data into kernel buffers represents tremendous overhead. It is usually easier and more efficient to do some buffering at the user level, and this is exactly what the standard I/O library does! To get a handle on the efficiency issues, consider two versions of a program that copies its input to its output. unbuffered-copy.c copies the data one byte at a time, i.e., it does no user-level buffering:
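The exact source isn't reproduced here, but a sketch of the idea looks like this:
#include <unistd.h>

/* Copy standard input to standard output one byte at a time:
   one read() and one write() system call per byte of data. */
int main(void)
{
    char c;

    while (read(STDIN_FILENO, &c, 1) == 1)
        write(STDOUT_FILENO, &c, 1);
    return 0;
}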
buffered-copy.c
, on the other hand, does
user-level buffering with a buffer size of
BUFSIZ
. In our present server configuration,
BUFSIZ is 8192 (8K). This is a
compile-time constant; it is not tuned to the particular
device being used.
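Again, a sketch of the idea (the actual buffered-copy.c may differ in its details):
#include <stdio.h>     /* only for the BUFSIZ constant */
#include <unistd.h>

/* Copy standard input to standard output BUFSIZ bytes at a time. */
int main(void)
{
    char buf[BUFSIZ];
    ssize_t n;

    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, n);
    return 0;
}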
Here is a simple test that involved copying a 3MB file using the two programs above:
% ls -l io-test
-rw-rw-r-- 1 systems faculty 3408774 Mar 4 00:06 io-test
%
% time unbuffered-copy <io-test > /dev/null
real 0m6.469s
user 0m1.490s
sys 0m4.950s
%
% time buffered-copy <io-test > /dev/null
real 0m0.011s
user 0m0.000s
sys 0m0.010s
%
Subjectively, there is a noticeable difference in run-time
between the two, and the measurements show why. The unbuffered
version spends nearly 500 times longer in the kernel (perhaps
not too bad when you consider that we make over 8000 times as
many read()
and write()
calls).
Seek and Ye Shall Find
It is often convenient to begin operating at a particular place in a file. A database application might well make use of this ability. We saw fseek()
in the standard I/O
library, and lseek()
is similar:
off_t lseek(int fildes, off_t offset, int whence);
lseek()
does not do any actual I/O —
it doesn't even communicate with the device driver that
services the file associated with the file descriptor. The
offset in the file descriptor is changed. That's all. How it
is changed is determined by which of three values is passed in
as whence
:
SEEK_SET indicates the file pointer should be set to the value of offset, that is, the seek is relative to the start of the file.
SEEK_CUR indicates the file pointer should be incremented by offset, that is, the seek is relative to the current file pointer's position.
SEEK_END indicates the file pointer should be set to the sum of the file size and offset, that is, the seek is relative to the end of the file. The offset may be positive or negative.
Is seeking efficient?
That depends on exactly how your file is laid out on disk. Unix file systems generally make seeking to a random location in a file fairly fast by storing the file's list of data blocks in a tree-like structure. In the Linux ext2
system, for example, it works like this:
Each inode in an ext2
file system contains 15
slots each of which can hold the address of a data block on
disk (the block number). 12 of the 15 slots address blocks
that contain actual file data, i.e., you can seek to any
address in 12KB of data (each data block holds 1K in the
ext2
system) with a single disk access once you
have the inode. Very small files can be accessed quite
efficiently!
The 13th slot points to an indirect data block, i.e., it points to a block of pointers to data blocks. An indirect block is an ordinary 1K disk block full of 4-byte block numbers, so each one holds 256 pointers to actual data.
The 14th slot points to a double indirect data block, i.e., it points to a block of pointers to blocks of pointers to data blocks.
The 15th slot points to a triple indirect data block, i.e., it points to a block of pointers to blocks of pointers to blocks of pointers to data blocks.
If you do the arithmetic with those 1K blocks, the maximum file
size is 12 + 256 + 256^2 + 256^3 blocks, roughly 16 million
blocks or about 16GB. With 4K blocks, each indirect block would
hold 1024 pointers and the scheme could address about a
giga-block, but the 32 bit file size field in the inode caps any
single file at 4GB anyway. An ext2 file system can contain
no more than 4 TB of data in total.
In the worst case, you can get to any data in a file in 4 disk block accesses.
Other Calls for Reading and Writing
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
Semantically, these are like doing an
lseek()
followed by a read()
or write()
.
However, they don't actually refer to or affect the file
pointer in the file descriptor: they use their offset
directly. If the file was opened with the
O_APPEND
flag, then pwrite()
ignores
the offset
argument and, in essence, always seeks
to the end of file before writing anything.
The main advantage to these calls, in addition to providing one call to replace two, is that they are more thread safe. You don't have to worry about other threads of control doing things that affect the file pointer in a shared file descriptor.
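For example, a sketch of fetching a fixed-size record from a known offset without touching the shared file pointer (assuming fd is open; the record size is made up):
#include <stdio.h>
#include <unistd.h>

char record[128];
ssize_t n;

/* Read the 38th 128-byte record, wherever the file pointer happens to be. */
n = pread(fd, record, sizeof(record), (off_t)37 * sizeof(record));
if (n == -1)
    perror("pread");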
ssize_t readv(int fd, struct iovec *vector, int count);
ssize_t writev(int fd, const struct iovec *vector, int count);
These are the so-called scatter read and gather write calls. We won't go into great detail here. The idea is that
vector is an array of struct iovec
structures, each of which contains a pointer to data and a
number of bytes. The memory involved therefore need not be
contiguous, though the data will be logically contiguous in the
file.
These calls represent a way to do a set of reads and writes
all at once, at the cost of having to set up the
structures in the vector.
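As a sketch (assuming fd is an open descriptor; the header and body are just made-up data), a gather write of two separate buffers in one call:
#include <string.h>
#include <sys/uio.h>

char header[] = "record 17: ";
char body[]   = "the actual payload\n";
struct iovec iov[2];

iov[0].iov_base = header;
iov[0].iov_len  = strlen(header);
iov[1].iov_base = body;
iov[1].iov_len  = strlen(body);

writev(fd, iov, 2);   /* the two buffers appear back to back in the file */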
Modified: 17 March 2008