CS249 Systems Programming: Midterm Exam 1 Solutions

This midterm exam contains a few pencil and paper questions and a main programming exercise. Do not save the programming problem until the last minute!

The non-programming questions are short-answer: Answer the question and explain your answer in full sentences (not just "yes" or "no" or "1024"). At most a couple a paragraphs should suffice. Brevity and concision will be important criterial for evaluating your response.

Create a directory called midterm1 in your cs249 directory. All the code for your programming task will go in there. For the short-answer questions, you may either turn in hardcopy (by slipping your stapled solutions under my door) or softcopy by putting your answers into text files in your midterm1 directory. If you are submitting softcopy text, put the answer to each question in a file named short-answer-n, where n is the question number.

When you are done, make a tarball and submit the result using the drop program.

You may not discuss the exam with your classmates at all, not even in high-level English. This is different from a regular homework. Bring all questions specifically about the exam to me. That way, I can ensure that everyone has the same information.

You can use any of the course notes and, of course, the relevant man pages for the exam.

Question 1 [10pts]

What is the generic pointer type in C and why do we need one? I.e., why can't we just choose something, (char *) for example, and use that as a generic pointer type?

Answer

[2pts] C's generic pointer type is (void *).
[4pts] The C standard guarantees that that any pointer can be converted to a (void *) and back again without loss of data. The generic pointer must, therefore, be at least as large as the largest pointer type. While programmers often assume that all pointers are the same size, and on some computers this is true, it need not always be true. Intel processors have more than one pointer size, for example: a long pointer contains a segment and an offset while a short pointer just contains an offset. C implementations for such systems may encode certain pointers using the short format for efficiency. Converting a pointer to (void *) forces the compiler to use the long version.
[4pts] The main motivation for having a generic pointer type is abstraction. If a value has the generic pointer type, then all you know is that it is a pointer to something, but you don't know what. This means that you can't dereference the pointer. So if you convert a pointer to (void *), then this presents an obstacle to those who would attempt to poke through your abstraction boundary and manipulate your data elements directly. Of course, C will allow someone to cast a (void *) to whatever they want, and if they happen to know your representation, they can subvert your abstractions. But they can't argue it was an accident!

Question 2 [10pts]

What, if any, limitations does the Unix file system model (embedded in the Virtual File System (VFS) specification) place on:

The length of file names.
The length of file extensions (e.g., .txt, .texinfo, etc.).
The number of files in a directory.

Briefly explain your answers.

Answer

There are no a priori restriction on any of these items in the abstract VFS model.

[2pts] The only place a file name is stored is as the contents of a directory, which is itself a file. The only real limit is thus the portion of the file system devoted to holding data — you could have a file name that takes up all the your disk space (or up to your quota if quotas are enforced)!
[2pts] As far as the kernel is concerned, file name extensions don't exist: they are just part of the file name and are merely a user and application convention. A file name doesn't have to have an extension at all (like Makefile). Or the whole file name could be after a ., and so the theoretical limit is one character less than the maximum file name (because the . is included).
[2pts] A directory is an ordinary file, so the maximum number of directory entries is however many will fit in all the data blocks in the file system. You could also argue that the maximum number of files in a directory is the maximum number of files you can have in the system, which is the same as number of inodes. However, it is possible to have multiple hard links to the same file, so you could have more directory entries than there are files in the system (but they wouldn't all be distinct).

The overall quality of the explanation(s) was worth an additional 4 points.

You were not responsible for the following information, but you will not be surprised to know that, while the general model doesn't have any restrictions, practical implementations do. In particular, some system calls will only accept path names of a limited size (due to fixed sized internal buffers). For example, open() may return an error indication of “name too long.” You can use the system calls pathconf() and fpathconf() to find out various system limitations, including the maximum path length for a given file system.

Some file systems may, for their own internal reasons, impose restrictions. The ext2 file system on Linux insists that file names be no more than 255 characters long.

Question 3 [15pts]

You are working for a company that is extending Linux in various ways. Your boss, N. C. Kyoor, reminds you that current versions of Unix/Linux associate a file with, and specify permissions for, the file's owner, a single group, and the rest of the world. N. C. proposes a new scheme under which a file is associated with an explicit list of users. For example, you could create a file and then say that you and your partner may read and write the file, the other members of this class (each listed by user name) may read the file, and every one else is blocked from any access.

What operating system/file system data structure(s) would have to change to support this security model, and how would they have to change? (You do not have to write structure declarations, just an English description.) Would this be a small change to Linux?

Answer

In Unix, we would try to get this behavior by creating new accounts and groups. In the example above, you and your partner could create a team account. The rest of the class could be in a new group. This is unsatisfying on two levels: First, creating accounts and groups is a heavy-weight task that requires root permission. Second, it doesn't give you enough control. There can be only one group associated with the file, and everyone else must fall into the other categorty. It also becomes very awkward when you and your partner create files owned by one of you rather than by the team account.

[6pts] Unfortunately, this is a rather fundamental change to the Unix security model. The notions of user, group, and other are deeply embedded in Unix. Not only would ls have to change, but the kernel code associated with validating access permissions would need to change drastically.

[6pts] The data structures of interest are inodes and the stat structure. Rather than 9 mode bits (OK, a few more for regular file, setuid, etc.), you would need to add a way to store the various by-user permissions. On way to do this would be to include a linked list of userID × mode_t. Or, you could create a list of permission groups in which each group has a list of users.

[2pts] In either case, you are disrupting an imporant assumption in the design of the Unix file system model: VFS relies on inodes having a fixed size. This change affects several parts of the kernel code/algorithms, data structures, and would greatly complicate the layout of the data on disk. I wouldn't promise a quick turnaround on this project!

The overal quality of the explanation was was worth an additional 6 points.

Question 4 [10 points]

In the object-oriented representations we saw in class (in the VFS model and in the counter class example), the operations/methods were part of a separate structure referenced by the structure that represented an object. For example, a file object is a struct file that contains a field f_op, which is a pointer to a table of operations (a

struct
 file_operations

), which in turn has a field called write. Likewise, a counter instance was a structure that contained an ops field that pointed to a structure of pointers to functions that implemented the counter methods.

Why? Why not represent a counter as a structure with a decrement field, or a file as a structure with a write field? What are the advantages and disadvantages of the strategy shown?

It turns out that Java actually does the same thing. The methods you define in a class are stored in a single dispatch table that lives in the data structure that represents the class. Each instance only has to store a reference to its class. Invoking an instance method indirects through the class and then through a dispatch table. After this exam, you understand why!

There are two motivations for this approach:

The main advantage is that it saves space. There is exactly one table of operations for each kind of file system no matter how many file objects are created. All the file instances have only one field for the operations, and all instances point to the same operation table. Similarly, each counter instance has one field to point to the operation table, and all instances share the same operation table. This allows each instance to be smaller: one pointer field rather than a pointer for each operation.
The space savings over directly embedding a pointer to each operation in each object is:
(num_operations − 1) × num_objects
This can represent a considerable amount of storage, especially when the number of objects created is large (as in FILE objects). Given that the Linux kernel runs in physical, rather than virtual memory, having the space be proportional to the number of file system types rather than the number of file objects can mean the difference between robust and useless (especially on a file server).
Similarly, in Java, the overhead of storing a method dispatch table is paid only once for each class rather than once per object.
The second consideration is time. It takes little time to initialize a single pointer, and num_operations times more time to initialize a full table of operations.

The drawback to this approach is the cost of the indirection. Rather than directly accessing the function (or method in Java) you want, you have to first indirect through a pointer to the dispact table. So every method invocation for this object structure has to pay this time overhead.

Question 5 [45pts]

Programming a Growable String Buffer

In this task, you will implement a growable string buffer. You have noticed how troublesome programming with fixed sized items and manual memory management can be. A growable string buffer is a data object that a client program can continue to append new data to without worrying about all the messy details of memory management. Java has a StringBuffer class for this sort of thing, and you will implement a basic version of it in C.

A string buffer has 3 elements of state:

the contents, which contains a sequence of characters (possibly including embedded NUL characters);
the size, which is the number of characters currently used in the buffer; and
the capacity, which is the maximum number of characters the buffer can hold at the present time.

You can find the API documented in string_buffer.h. Read this file now or have it open while you read the following description.

A client program will create a new string buffer (using mk_strbuff()), which will have a size of 0 and a capacity of 1 character. The client can then append new characters to the end of the buffer as necessary (using strbuff_append()). Your implementation will append the new character to the existing contents if there is sufficient room. If there is not room, you will double the capacity of the string buffer and then do the append. To append a string, you will just append the characters one at a time.

Clients may also extract or modify characters in the buffer by index. And, of course, when the client is done with a string buffer, it may free its space by calling strbuff_destroy(). A client may reuse a string buffer by tuncating it to a size of 0.

The API requires the use of a couple of symbols defined by errno.h, so don't forget to include it.

You may find the memcpy() function useful.

Requirements:

Put your string buffer implementation into a file called string_buffer.c.
Your code must not crash and you must not abort the program inside your implementation (i.e., you must not call exit() in string_buffer.c). The API defines the implementation's behavior in all error cases (and if it doesn't, then you must find and document a way to avoid crashing, and preferably notify the client of the problem).
Create a test program with source code in test_string_buffer.c, which includes string_buffer.h and any other header files that are necessary.
It should be possible for a client to have more than one string buffer active at a time with no interactions between the two.
Once the client program creates a new string buffer and gets a pointer to it, it must never need to change the pointer for that string buffer. In other words, all the allocation and deallocation you do inside your implemenation are not visible to the client, who has a pointer to a single data value that represents the buffer.
Your test cases should test every function in the given string buffer API. (Testing each function individually is called unit testing.) You should also check all feasible error modes. You may or may not be able so simulate running out of memory, but all the bounds checking, for example, should be tested.

Answer

Structures that can grow dynamically are extremely useful, and this sort of problem comes up over and over again. As the problem observed, the Java language includes pre-defined classes for growable string buffers and vectors that use the technique of repeated doubling and copying. In C, it is an idiom that one sees fairly often. (I urge you to look at the Java StringBuffer API if you haven't already.) The beauty of this approach is that you can load data of unknown size into a structure in linear time, and then have constant time access to the data elements. (In Algorithms, you will see how you can prove that the cost of all the copying is amortized in such a way that the asymptotic running time remains linear.) Another advantage in a language like C without automatic storage management (garbage collection) is that it isolates the storage management associated with the structure in one place, thus improving modularity.

The point breakdown is as follows: 5 points for the data type declaration and 4 points for each function (there are 10 functions). 1 point out of the 4 for each function comes from the testing.

Here is my solution:

This solution is based more directly on the Java StringBuffer class than your code was required to be: It extends the API with a function called strbuff_ensure_capacity() that, if the string buffer's current capacity is not adequate, expands the string buffer's capacity to the larger of the requested capacity and twice the current capacity (plus two). This rather complex specification ensures that the rate of growth is large enough to cover the copying overhead.

Using strbuff_ensure_capacity(), one can take a different approach to strbuff_append_str(). Rather than have a loop that appends each character of the string in turn, one can make sure the string buffer has the space for the string, and then just copy its contents in.

Things that were important to look out for:

When allocating the string buffer structure, you also had to allocate a 1-character content field (and, naturally, check that both malloc() calls succeeded).
In strbuff_to_string(), you had to malloc() the result and copy the data in. (See Question 5 for more about this copying.)
You had to check the capacity and double the size of the content storage if necessary in strbuff_append().
In strbuff_truncate(), you did not have to free the contents or alter the contents in any way. You just had to set the size of the string buffer to the appropriate value.
In strbuff_destroy(), you had to free both the contents and the string buffer structure itself (and order is important).
Error checking is a must!

The harness I built for unit testing is not terribly robust, but it illustrates the kind of debugging infrastructure one can create for oneself as you go along. Many of you realized that part of building debugging tools is debugging the tools! A nice way to do it is to build a test suite that grows in tandem with your code base. Add a function, add some testing code for the function. You can see a transcript of a test run with my implementation. This test includes a test that reads string_buffer.c into a string buffer, a 3K file. I have tested the implementation up to 9MB.

In this case, a lot of the testing code involves ensuring that the code matches the specification in error conditions. We also want to cover the other specified behaviors as well as any other programming situations we can anticipate.

A truly robust test suite would have a way to stress the code under memory limitations. This was not required, nor is it illustrated here. One way to do it would be to use the limit command of the shell (or the equivalent run-time functionality) to reduce the size of the heap available to your program. In fact, test_string_buffer could take a heap size argument.

(One known bug in the testing code is that if the file you give it to read into the testing buffer cannot be read (due to permissions problems or because it doesn't exist), then some tests will fail even if the string_buffer code is correct.)

Question 6 [10pts]

Consider your strbuff_to_string() implementation. Briefly discuss the advantages and disadvantages of each of strcpy(), memcpy(), and memmove() for getting the data into the output string.

Answer

This question got at some interesting issues. The specification said that your string buffer implementation should be able to contain embedded NUL characters ('\0'). (This means that you don't have to check for '\0' in strbuff_append().) But NUL characters are the conventional string terminator in C. So, if your client will treat the result of strbuf_to_string() as a conventional string, they'll only see up to the first NUL character. But you really have to give back all the characters currently in the buffer, because the caller can find out the size of the buffer and may want to access the result beyond the first NUL character. That is, they may be using your code to handle characters including '\0'.

[2.5 pts] strcpy() and strncpy()are thus not good choices for copying the data into the storage for the result. They will copy only up to the first NUL. strcpy() does no length checking, so you you would have to ensure that your buffer contents always ends with a NUL character to avoid a wild trip through lots of inappropriate memory.

[2.5 pts] memcpy() handles embedded NULs just fine, and it does length checking. The only consideration is that it is not guaranteed to work if the source and target memory areas overlap. In our case, you are required to allocate new storage for the result string (to avoid the interactions outlawed by the API). Thus, the memory is guaranteed not to overlap.

[2.5 pts] memmove() handles embedded NULs and overlapping memory. So, in a belt and suspenders way, it would be the safest choice. But you do pay some efficiency cost for its flexibility, and we know that our source and target memory won't overlap.

Overall quality of the answer was an additional 2.5 points.

Author: Mark A. Sheldon
Modified: 23 May 2008