Be warned that most applications code, particularly that using ordinary files, will be more portable, easier to code, and more efficient with less tuning if it uses the standard I/O library. The low-level facilities give up the user-level buffering and error handling standard I/O provides. But this might be exactly what you want.
After all these caveats and warnings, there is nothing particularly deep or mysterious about system-level I/O. You still create, open, close, read, and write files. You will just get slightly more control and be exposed to more operational details. First some more general information about Unix file systems.
A Masked Ball
Every process has: an effective user ID, an effective group ID, a present (current) working directory, a umask, and information about open files in a file descriptor table. The current working directory says where to look for or create a file given a relative pathname. The effective user and group IDs are used to check whether an action is permitted and are associated with any newly created files. The other two elements are new.
Go ahead and type umask to the shell.  You'll
      get back an octal number that says which bits should be
      turned off on newly created files.  For example, I got
      back 0007 from my shell, which means that all the
      permission bits for other should be turned off by
      default when I create new files.
      
This is not just a shell feature: every process has a umask
      associated with it.  A process inherits its umask from its
      parent, i.e., the process that started it.  When a process asks
      to create a file with permissions 666 (user, group,
      other all have read and write permissions), this value is ANDed
      with the complement of the umask.  In the case of my shell, such
      a request would yield a file with permissions 0660
      which is 0666 & ~0007 or 0666 & 0770.
      
You can change the umask with the umask() system
      call:
      
      mode_t umask(mode_t mask);
      
      umask() returns the previous value of the
      umask, which can be handy when you want to change the umask and
      then reset it later.
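The save-and-restore idiom can be sketched as follows. This is a minimal illustration, not part of the original notes: the helper name create_with_umask() is made up, and it relies on open() and fstat(), which are covered further on.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` asking for rw-rw-rw- while the
 * umask is temporarily set to `mask`.  Returns the permission bits the
 * new file actually received, or (mode_t)-1 on error. */
mode_t create_with_umask(const char *path, mode_t mask)
{
        struct stat st;
        mode_t old = umask(mask);       /* umask() hands back the old value */
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL,
                      S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
                      S_IROTH | S_IWOTH);

        umask(old);                     /* put the original umask back */
        if (fd == -1)
                return (mode_t)-1;
        if (fstat(fd, &st) == -1) {
                close(fd);
                return (mode_t)-1;
        }
        close(fd);
        return st.st_mode & (S_IRWXU | S_IRWXG | S_IRWXO);
}
```

With a mask that turns off all bits for other, the file should come back readable and writable by user and group only, matching the 0660 computed above.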
      File Descriptors
Every process has an array of file descriptors associated with it. Your program refers to open files by the integer index of the corresponding file descriptor in the table. Under Linux, each table entry is a pointer to a struct file
      containing a variety of elements, including a pointer to the
      file operations we discussed last time, as well as the current
      offset into the file for use with the next I/O operation (often
      called the file pointer).
      When a process begins execution, there are three file
      descriptors already allocated, as you know.  Standard input is
      file descriptor 0, standard output is file
      descriptor 1, and standard error is file descriptor
      2.  Now you know why shell file redirection uses
      program 2> foo to redirect program's
      standard error output to the file foo: An integer
      in a redirection specifier is the file descriptor number!  In
      your C code, you can also refer to the file descriptor numbers
      of these streams as STDIN_FILENO,
      STDOUT_FILENO, and STDERR_FILENO,
      which is theoretically more portable.
      
Opening/Creating/Closing/Deleting Files
       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);
       int creat(const char *pathname, mode_t mode);
       int close(int fd);
       int unlink(const char *pathname);
       
      Not surprisingly, you open a file using the open()
      system call.  creat() is an old system call for a
      very common way to open a file for output, as we'll see below.
      
      The integer returned when you open or create a file is the index
      of the new file descriptor in the process's file descriptor
      table or -1 if there is an error (and
      errno is set accordingly).  The OS will always use
      the lowest numbered available file descriptor table entry, which
      suggests one way to do redirection: close the file
      associated with a descriptor and then open another file (which
      will replace the file you just closed unless you had another
      empty file descriptor table entry).
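Here is a rough sketch of that close-then-open trick for standard output. The function name redirect_stdout() is invented for illustration, and the sketch assumes descriptor 0 is still open, since otherwise open() would hand back 0 rather than 1. In practice dup2() is the more robust tool for this job.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: make descriptor 1 refer to `path` by relying on
 * open() returning the lowest free descriptor.
 * Returns 0 on success, -1 on failure. */
int redirect_stdout(const char *path)
{
        int fd;

        if (close(STDOUT_FILENO) == -1)
                return -1;
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
        if (fd == -1)
                return -1;
        if (fd != STDOUT_FILENO) {      /* some lower descriptor was free */
                close(fd);
                return -1;
        }
        return 0;
}
```

After a successful call, anything written to descriptor 1 lands in the named file, which is essentially what the shell arranges for program > foo before it starts program.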
      
mode, when specified, is the permissions the
      process would like the file to have in the event that a file is
      created.  These permissions are ANDed with the complement of the
      umask as described above to determine the actual file
      permissions.  The result of this computation has no effect on
      what this process can do with the file: a program may open a
      newly created read-only file for writing and then write to it.
      The permissions will apply to whoever opens the file later.
       
You are not to assume that the permission bits are in the
       order we all associate with ls -l.  You should OR
       together the appropriate compile time constants.  For example,
       a mode of
       
       S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IXOTH
       
       means that the owner can read, write and execute the file; a
       group member may read and execute the file; and everyone else
       may only execute the file.  We would typically describe this as
       0751. 
       There are also shorthands for the common cases in which one
       category of user has all permissions:  S_IRWXU,
       S_IRWXG, and S_IRWXO.
       
The flags argument specifies what you want to do with the
       file.  O_RDONLY means to open the file for
       reading, O_WRONLY means to open the file for
       writing, and O_RDWR means to open the file for
       both.   ORing together the first two is not the same as the
       third, because read only is traditionally indicated by all
       zeros.
       
If you want a file to be created in the event the file
       doesn't already exist, you OR in O_CREAT.  If you
       would like to insist that the file must not already
       exist, then you OR in O_CREAT | O_EXCL.  In this
       case, the file will be created if it doesn't exist, and you'll
       get an error if it does exist.  This could be useful for checking
       whether a previous invocation of your program left some
       temporary files behind or for logging.  Sometimes this option
       is used for locking, but it doesn't always work (e.g., on NFS
       mounted file systems).  
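A small sketch of the "must not already exist" case follows. The helper name create_exclusive() is hypothetical; the only thing it relies on is that open() fails with errno set to EEXIST when O_CREAT | O_EXCL meets an existing file.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if this call created `path`,
 * 0 if the file was already there, and -1 on any other error. */
int create_exclusive(const char *path)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL,
                      S_IRUSR | S_IWUSR);

        if (fd == -1)
                return (errno == EEXIST) ? 0 : -1;
        close(fd);
        return 1;
}
```

Calling it twice in a row on the same fresh path should give 1 and then 0, which is the check you would use to notice a leftover temporary file.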
       
If you would like the file to be truncated to 0 bytes in
       length before you start working on it, you OR in
       O_TRUNC.  If you want to append, you specify
       O_APPEND.  There are other flags, too, but this
       will cover most common applications.
       
So now we can reveal that creat() is just like
       calling  
       
     open(path, O_WRONLY | O_CREAT | O_TRUNC, perms);
       
       When you open an existing file, your privileges are checked
       against the file's permissions.  You must not only have read or
       write permission for the file, but you must also have execute
       (search) permissions for all directory entries in the path
       (execute permission on a directory means that you can
       cd to it).
       
To create a file, you must have write permission for the directory, which makes sense since creating a file means making a new link, i.e., updating the contents of the file that represents the directory the file is in.
close() closes a file and allows the file
       descriptor to be reused.  If no other references to the file
       are alive in the system, the kernel resources associated with
       the file are freed.  If there are no more links to the file,
       the file is finally removed. 
       
close() does not insist that pending
       writes be complete and is therefore very fast.  Unix performs
       the actual writes later at its convenience.  If one of those
       delayed writes fails after close() has returned, you won't
       find out about it (there is no way to tell you!).  
       
You can force output to go to disk using
       fsync().  This is not the same as
       fflush() in the standard I/O library:
       fflush() simply forces all data in the user-level
       I/O buffers to be written via the write() system
       call.
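As a sketch of where fsync() fits, the hypothetical helper below writes a buffer to a file and does not report success until the kernel says the data has been flushed. Treating a short write as a failure is a simplification made for this example.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of `path` with `len` bytes
 * from `buf` and wait for them to be flushed before returning.
 * Returns 0 on success, -1 on any failure. */
int save_durably(const char *path, const void *buf, size_t len)
{
        ssize_t n;
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC,
                      S_IRUSR | S_IWUSR);

        if (fd == -1)
                return -1;
        n = write(fd, buf, len);
        /* A short write is treated as a failure to keep the sketch small. */
        if (n == -1 || (size_t)n != len || fsync(fd) == -1) {
                close(fd);
                return -1;
        }
        return close(fd);
}
```

The fsync() call is the expensive step; leaving it out gives back the speed of delayed writes along with their uncertainty.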
       
To remove a file, call unlink().  As we have
       already discussed, this doesn't necessarily delete the file.
       unlink() removes the link to the file specified in
       the path name, i.e., it removes that entry from the directory
       and decreases the reference count in the file's inode by one.
       If the reference count goes to 0, then the system checks
       whether any processes have the file open.  If not, the disk
       blocks are freed.  If so, then the file is retained until
       it is no longer open for any process.
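This behavior is the basis of a common scratch-file idiom, sketched below with an invented helper name, open_scratch(): create the file, unlink it right away, and keep using the descriptor. The name disappears from the directory at once, and the storage is released when the descriptor is closed or the process exits.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns a read/write descriptor for a file that
 * no longer has a name, or -1 on error. */
int open_scratch(const char *path)
{
        int fd = open(path, O_RDWR | O_CREAT | O_EXCL,
                      S_IRUSR | S_IWUSR);

        if (fd == -1)
                return -1;
        if (unlink(path) == -1) {       /* remove the only link */
                close(fd);
                return -1;
        }
        return fd;                      /* still usable: we hold it open */
}
```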
       
       
Reading and Writing
       ssize_t read(int fd, void *buf, size_t count);
       ssize_t write(int fd, const void *buf, size_t count);
       
       These are similar to what you've seen before.  The first
       argument is the index of the relevant file descriptor, the
       second is a data buffer in the process's memory, and the third
       argument is a number of bytes (either the maximum to read in or
       the number to write out).  Each of these functions begins its
       work at the current file offset.  The first read()
       of a file starts at offset 0 even if the file was opened for
       appending.  
       For both reading and writing, getting back a smaller than expected count is not necessarily an error. Here is where working with low-level I/O can be more painful.
read() returns 0 on end of file.
       On error, read() returns -1 and
       errno is set.  If errno is
       EINTR, that means that the kernel was interrupted
       by something else before it got to read any data, and you
       should just try again.  You can get a short count because the
       kernel was interrupted in the middle of a transfer or the
       device wasn't ready to give any more data just now (for
       example, you were reading from a terminal and this is all
       that's available).  You can write a loop to be more forceful,
       which is part of what the standard I/O library does for you.
       Here is an example from Rochkind, p. 97: 
 
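#include <errno.h>
#include <unistd.h>

/* Keep calling read() until nbyte bytes have arrived, end of file is
 * reached, or a real error occurs.  Returns the bytes read, or -1. */
ssize_t really_read(int fd, void *buf, size_t nbyte) 
{ 
        ssize_t nread = 0, n;
        do {
                if ((n = read(fd, 
                              &((char *)buf)[nread], 
                              nbyte - nread)) 
                    == -1) {
                        if (errno == EINTR)
                                continue;       /* interrupted: try again */
                        else
                                return -1;
                }
                if (n == 0) return nread;       /* end of file */
                nread += n;
        } while ((size_t)nread < nbyte);
        return nread;
}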
       The write() function has some similar
       complications plus the issue of delayed writes.  As discussed
       above, writes to a file do not go directly to disk.  This makes
       programs run faster, but can create problems in the event of
       system crashes (thankfully fairly rare when using a stable OS).
       Not only do writes not take place immediately after the system
       call, but they do not necessarily proceed in any predictable
       order.  You can use fsync() in cases where you are
       willing to trade performance for certain guarantees.
       As with reading a file, writing fewer bytes than you wanted is not considered an error. You might be writing to a pipe that is full, for example. For regular files, this event will be rare. You can code a loop similar to the one above for writing.
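A write loop in the same spirit might look like the sketch below. It is not taken from Rochkind; the name really_write() is simply chosen to mirror the read version.

```c
#include <errno.h>
#include <unistd.h>

/* Keep calling write() until all nbyte bytes have been accepted or a
 * real error occurs.  Returns nbyte on success, -1 on error. */
ssize_t really_write(int fd, const void *buf, size_t nbyte)
{
        const char *p = buf;
        size_t nwritten = 0;

        while (nwritten < nbyte) {
                ssize_t n = write(fd, p + nwritten, nbyte - nwritten);

                if (n == -1) {
                        if (errno == EINTR)
                                continue;       /* interrupted: try again */
                        return -1;
                }
                nwritten += (size_t)n;          /* short count: keep going */
        }
        return (ssize_t)nwritten;
}
```

Note that this only guarantees the kernel has accepted the data, not that it has reached the disk.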
Why do we do this again?
Manipulating data in small chunks or without regard to the underlying device's blocksize can be very inefficient. The buffering the kernel provides is, as you can see above, fraught with unpredictability. What's not so obvious is that it is not necessarily efficient. Remember that system calls are very expensive because of the context switch from user to kernel space and back. Doing that to copy a small amount of data into kernel buffers represents tremendous overhead. It is usually easier and more efficient to do some buffering at the user level, and this is exactly what the standard I/O library does! To get a handle on the efficiency issues, consider two versions of a program that copies its input to its output.
unbuffered-copy.c copies the data one byte at a
       time, i.e., it does no user-level buffering:
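The original listing is not reproduced here. A one-byte-at-a-time copy along these lines might look like the following sketch; the function name copy_unbuffered() is an assumption, and EINTR handling is left out for brevity.

```c
#include <unistd.h>

/* Copy everything from descriptor `in` to descriptor `out`, one byte
 * per read()/write() pair.  Returns the number of bytes copied, or -1
 * on error. */
long copy_unbuffered(int in, int out)
{
        char c;
        long total = 0;
        ssize_t n;

        while ((n = read(in, &c, 1)) == 1) {
                if (write(out, &c, 1) != 1)
                        return -1;
                total++;
        }
        return (n == 0) ? total : -1;   /* 0 means end of file */
}
```

A program built around it would simply call copy_unbuffered(STDIN_FILENO, STDOUT_FILENO).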
       
buffered-copy.c, on the other hand, does
       user-level buffering with a buffer size of
       BUFSIZ.  In our present server configuration,
       BUFSIZ is 8192 (8K).  This is a
       compile-time constant that is not chosen based on the specific
       device in use.
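Again as a sketch rather than the original listing, a copy that moves up to BUFSIZ bytes per system call could be written as below. The name copy_buffered() is made up, and EINTR handling is omitted.

```c
#include <stdio.h>      /* BUFSIZ */
#include <unistd.h>

/* Copy everything from descriptor `in` to descriptor `out` using a
 * BUFSIZ-byte user-level buffer.  Returns the number of bytes copied,
 * or -1 on error. */
long copy_buffered(int in, int out)
{
        char buf[BUFSIZ];
        long total = 0;
        ssize_t n;

        while ((n = read(in, buf, sizeof buf)) > 0) {
                ssize_t off = 0;

                while (off < n) {       /* cope with short writes */
                        ssize_t w = write(out, buf + off,
                                          (size_t)(n - off));
                        if (w == -1)
                                return -1;
                        off += w;
                }
                total += n;
        }
        return (n == 0) ? total : -1;   /* 0 means end of file */
}
```

For a 3MB input and an 8K buffer this comes to a few hundred read() calls instead of a few million.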
       
Here is a simple test that involved copying a 3MB file using the two programs above:
  % ls -l io-test
  -rw-rw-r--    1 systems  faculty   3408774 Mar  4 00:06 io-test
  %
  % time unbuffered-copy <io-test > /dev/null
  real	0m6.469s
  user	0m1.490s
  sys	0m4.950s
  %
  % time buffered-copy <io-test > /dev/null
  real	0m0.011s
  user	0m0.000s
  sys	0m0.010s
  %
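       The difference comes down to the number of system calls: the
       unbuffered version makes one read() and one write() call per
       byte, while the buffered version makes one pair per buffer.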
       
       Seek and Ye Shall Find
It is often convenient to begin operating at a particular place in a file. A database application might well make use of this ability. We saw fseek() in the standard I/O
       library, and lseek() is similar: 
off_t lseek(int fildes, off_t offset, int whence);
lseek() does not do any actual I/O —
       it doesn't even communicate with the device driver that
       services the file associated with the file descriptor.  The
       offset in the file descriptor is changed.  That's all.  How it
       is changed is determined by which of three values is passed in
       as whence:
- SEEK_SET indicates the file pointer should be set to the value of offset, that is, the seek is relative to the start of the file.
- SEEK_CUR indicates the file pointer should be incremented by offset, that is, the seek is relative to the current file pointer's position.
- SEEK_END indicates the file pointer should be set to the sum of the file size and offset, that is, the seek is relative to the end of the file. The offset may be positive or negative.
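The three whence values can be seen together in a small sketch that finds a file's size without disturbing the current offset. The helper names file_size() and path_size() are invented for this example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: size in bytes of the file behind `fd`, leaving
 * the file pointer where it was.  Returns (off_t)-1 on error, e.g. if
 * the descriptor is a pipe and cannot seek. */
off_t file_size(int fd)
{
        off_t here = lseek(fd, 0, SEEK_CUR);    /* where are we now? */
        off_t end;

        if (here == (off_t)-1)
                return (off_t)-1;
        end = lseek(fd, 0, SEEK_END);           /* 0 bytes past the end */
        if (end == (off_t)-1)
                return (off_t)-1;
        if (lseek(fd, here, SEEK_SET) == (off_t)-1)     /* go back */
                return (off_t)-1;
        return end;
}

/* Convenience wrapper that opens the file by name first. */
off_t path_size(const char *path)
{
        off_t size;
        int fd = open(path, O_RDONLY);

        if (fd == -1)
                return (off_t)-1;
        size = file_size(fd);
        close(fd);
        return size;
}
```

None of these lseek() calls transfers any data; each one only reports or changes the stored offset.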
Is seeking efficient?
That depends on exactly how your file is laid out on disk. Unix systems generally use a system that makes seeking to a random location in a file fairly fast by storing a file's list of data blocks in a tree structure. In the Linux ext2 system, for example, it works like this:
	Each inode in an ext2 file system contains 15
	slots each of which can hold the address of a data block on
	disk (the block number).  12 of the 15 slots address blocks
	that contain actual file data, i.e., you can seek to any
	address in 12KB of data (each data block holds 1K in the
	ext2 system) with a single disk access once you
	have the inode.  Very small files can be accessed quite
	efficiently!
	The 13th slot points to an indirect data block, i.e., it points to a block of pointers to data blocks. An indirect block is the same size as a data block and each block number takes 4 bytes, so with 1K blocks each one holds 256 pointers to actual data.
The 14th slot points to a double indirect data block, i.e., it points to a block of pointers to blocks of pointers to data blocks.
The 15th slot points to a triple indirect data block, i.e., it points to a block of pointers to blocks of pointers to blocks of pointers to data blocks.
If you do the arithmetic for 1K blocks, that gets us to
	12 + 256 + 256^2 + 256^3 blocks, or about 16 million blocks
	(roughly 16GB), as the most the slots can address.  If disk
	blocks were 4K, each indirect block would hold 1024 pointers
	and the triple indirect level alone would reach 1 giga-block,
	or about 4TB.  Either figure is beyond the 4GB that the 32 bit
	file size field in the inode can represent.  An ext2 file
	system with 1K blocks can contain no more than 4 TB of data in
	total.
	
In the worst case, you can get to any data in a file in 4 disk block accesses.
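The block arithmetic can be checked with a few lines of C. This sketch assumes 4-byte block numbers and indirect blocks that are the same size as data blocks; it counts only what the 15 slots can address and ignores other limits such as the width of the file size field. The function name is made up.

```c
#include <stdint.h>

/* Number of data blocks reachable from one inode: 12 direct slots plus
 * one single, one double, and one triple indirect block.
 * Assumes each block number occupies 4 bytes. */
uint64_t ext2_addressable_blocks(uint64_t block_size)
{
        uint64_t p = block_size / 4;    /* pointers per indirect block */

        return 12 + p + p * p + p * p * p;
}
```

For 1K blocks this gives 16,843,020 blocks (about 16GB of data), and for 4K blocks 1,074,791,436 blocks, of which the triple indirect level contributes exactly 1024^3.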
Other Calls for Reading and Writing
    ssize_t pread(int fd, void *buf, size_t count, off_t offset);
    ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
       
	
	Semantically, these are like doing an lseek()
	followed by a read() or write().
	However, they don't actually refer to or affect the file
	pointer in the file descriptor:  they use their offset
	directly.  On Linux, if the file was opened with the
	O_APPEND flag, then pwrite() ignores
	the offset argument and, in essence, always seeks
	to the end of file before writing anything.
	The main advantage to these calls, in addition to providing one call to replace two, is that they are more thread safe. You don't have to worry about other threads of control doing things that affect the file pointer in a shared file descriptor.
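A typical use is a file of fixed-size records addressed by index. The sketch below is illustrative only: RECORD_SIZE and the three helper names are assumptions, and short transfers are simply reported as failures.

```c
#define _XOPEN_SOURCE 700       /* ask for pread()/pwrite() declarations;
                                   often unnecessary with default settings */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 16          /* hypothetical fixed record size */

/* Open (creating if needed) a record file for reading and writing. */
int open_records(const char *path)
{
        return open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
}

/* Store record number `index`.  Returns 0 on success, -1 otherwise. */
int put_record(int fd, long index, const char rec[RECORD_SIZE])
{
        ssize_t n = pwrite(fd, rec, RECORD_SIZE,
                           (off_t)index * RECORD_SIZE);

        return (n == RECORD_SIZE) ? 0 : -1;
}

/* Fetch record number `index`.  Returns 0 on success, -1 otherwise. */
int get_record(int fd, long index, char rec[RECORD_SIZE])
{
        ssize_t n = pread(fd, rec, RECORD_SIZE,
                          (off_t)index * RECORD_SIZE);

        return (n == RECORD_SIZE) ? 0 : -1;
}
```

Because neither call touches the file pointer, several threads can share one descriptor and fetch different records without any coordination over lseek().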
       ssize_t readv(int fd, struct iovec *vector, int count);
       ssize_t writev(int fd, const struct iovec *vector, int count);
       
       These are the so-called scatter read and gather
       write calls.  We won't go into great detail here.  The idea
       is that vector is an array of struct iovec
       elements, each of which contains a pointer to data and a
       number of bytes.  The memory involved therefore need not be
       contiguous, though the data will be logically contiguous in the
       file. 
       These calls represent a way to do a set of reads and writes
       all at once, but with the imposition of the need to create the
       structures for the vector.
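As a minimal sketch of a gather write, the hypothetical helper below sends two separate strings with a single writev() call. The name write_two_parts() is made up; the struct iovec fields iov_base and iov_len are the standard ones.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write `header` followed by `body` with one
 * system call.  Returns the total bytes written, or -1 on error. */
ssize_t write_two_parts(int fd, const char *header, const char *body)
{
        struct iovec parts[2];

        /* iov_base is a plain void *, so the const must be cast away;
         * writev() does not modify the data. */
        parts[0].iov_base = (void *)header;
        parts[0].iov_len  = strlen(header);
        parts[1].iov_base = (void *)body;
        parts[1].iov_len  = strlen(body);

        return writev(fd, parts, 2);
}
```

The two strings may live anywhere in memory, yet they arrive in the file back to back. As with write(), a short count is possible and would need the same kind of retry loop.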
        
        
Modified: 17 March 2008
