Systems Programming

A Whirlwind Tour of C

Since you already know Java, and many of you have seen C++, C will look very familiar. The basic syntax and control constructs are very much the same. Before we get too involved, however, let's look at a typical program.

Anatomy of a C Program

Several things to notice:

Here's a session on a Wellesley machine that shows how to compile and run a program (and includes some other typical interactions with the shell).

Let's consider another example (stolen from this online C tutorial).

Note the use of #define to create compile-time constants. It is conventional to use lower case for variables and UPPER CASE for compile-time constants. Tokens beginning with # are sometimes called compiler directives, but they are actually handled by a separate program called the C pre-processor, cpp. Here is what really happens when you compile your program.
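As a small illustration of compile-time constants, here is a sketch of my own (it is not the program from the tutorial). The names LBS_PER_KG, MAX_KG, kg_to_lbs, and print_table are made up for this example.

```c
#include <stdio.h>

/* Compile-time constants: cpp replaces each name with its text
   before the compiler proper ever sees the program. */
#define LBS_PER_KG 2.2046   /* hypothetical conversion factor */
#define MAX_KG     5        /* hypothetical table size */

/* Convert a weight in kilograms to pounds. */
double kg_to_lbs(double kg)
{
    return kg * LBS_PER_KG;
}

/* Print a small conversion table for 1..MAX_KG kilograms. */
void print_table(void)
{
    int kg;

    for (kg = 1; kg <= MAX_KG; kg++)
        printf("%d kg = %.1f lbs\n", kg, kg_to_lbs(kg));
}
```

If you are curious, running gcc -E on a file like this shows the text after the pre-processor has made its substitutions.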

Note the variable declarations and initializations above. While Java allows you to declare a variable anywhere, and gcc will let you do this in C, we will follow the ANSI C standard and limit declarations to top level and the start of a block. We'll discuss declarations more below.

Here's another example that defines and uses a function:

Declarations

The previous examples show a variety of data and function declarations. A declaration in C defines a name and its properties. Some declarations only introduce type names, but most introduce a name for an actual object. The properties of object names include their type (e.g., int and function returning int), the status of their storage requirements, and their visibility.

The type of an object implies a size, or the amount of storage the object requires (and you can ask for the size of any type using the sizeof operator). The other properties are determined by a rather complicated set of rules. Data declarations can include initializations, but there are some restrictions.

Declaring a name with extern says that the compiler should not allocate any space for it because the name is defined (and its space is allocated) externally to this program file, i.e., it will come from somewhere else during program linking.

All names in C must be declared before they are used. A declaration that occurs before the definition of a function or variable (i.e., a prototype which does not allocate space or specify a function body) is called a forward declaration. This is useful for organizing input files with utility functions down at the bottom and for mutually recursive definitions.
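Mutual recursion is the case where a forward declaration cannot be avoided. A minimal sketch, using two made-up functions is_even and is_odd:

```c
/* Forward declaration (prototype): no body, no storage allocated.
   Without it, the call to is_odd inside is_even would use an
   undeclared name. */
int is_odd(unsigned int n);

/* Returns 1 if n is even, 0 otherwise. */
int is_even(unsigned int n)
{
    return (n == 0) ? 1 : is_odd(n - 1);
}

/* Returns 1 if n is odd, 0 otherwise. */
int is_odd(unsigned int n)
{
    return (n == 0) ? 0 : is_even(n - 1);
}
```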

Names declared at top-level, i.e., not nested inside a function, are allocated at link/load time and are neither on the stack nor the heap. Functions are allocated in the code segment (called the text segment in Unix), other data are allocated in the data segment. Constant data may actually be placed in either (they are read-only data).

Top-level definitions described as static are only visible in the current file: they cannot be linked to by other code modules. Top-level initializations must use constant (compile-time computable) data.

Normal non-top-level declarations allocate space on the stack, and are called automatic. This space is deallocated on exit from the function or program block in which the declaration occurs. Initializations (if any) are performed upon every entry to the block. Automatic variables are visible only within the block in which they are declared. Declaring a local variable (in a block) to be static is a way of allocating the storage statically (like a top-level declaration) but making the name visible only to a given block. From an initialization point of view, they are treated like top-level declarations. This may seem strange, but it can be very useful as a way to give a function persistent state. A random number generator, for example, might want to have a seed persist across invocations without letting callers have access to it. (Such data can frustrate the development of multi-threaded code, as we shall see in a few weeks.)
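Here is a sketch of the random number generator idea. The function name next_random is made up, and the arithmetic is the simple linear congruential recipe that the C standard gives as a sample implementation of rand(); treat it as an illustration of static storage, not as a good generator.

```c
/* Return a pseudo-random number in the range 0..32767.
   The seed is static: it is allocated once, initialized once (before
   the program starts running), and keeps its value between calls,
   yet no code outside this function can name it. */
int next_random(void)
{
    static unsigned long seed = 1;

    seed = seed * 1103515245UL + 12345UL;
    return (int)((seed / 65536UL) % 32768UL);
}
```

Had seed been an ordinary automatic variable, it would be reset to 1 on every call and the function would return the same number each time.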

C Base Types

Ignoring floating point numbers, there are really only two kinds of primitive data objects in C: integers (of various sizes, with and without signs), and pointers (which are often confused with integers). From a type-theoretic point of view, a pointer is a derived type, and I'll therefore describe pointers later. However, it is important to keep in mind that pointers are fundamental to C's worldview. C only supports four built-in data types:

Integers come in short, long, and unsigned flavors. Integer constants look normal, most of the time. 123L forces the integer to be long. Integer constants beginning with a 0 are interpreted as being in octal, which is base 8. Integer constants that begin with a 0x or 0X are interpreted as hexadecimal (base 16) numbers.

Caution: The treatment of leading zeros can sometimes be confusing. For example, when initializing a table of identification numbers (maybe part numbers in an inventory or employee numbers for a payroll program), it can be tempting to make all the numbers the same size for readability. But 10 is not the same as 010 (the latter is the same as 8).
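To make the caution concrete, here is a sketch with two made-up tables, padded_ids and intended_ids, and a helper count_surprises that compares them.

```c
/* ID numbers padded with leading zeros "for readability".
   The leading 0 makes 010 and 001 octal constants. */
static const int padded_ids[] = { 100, 010, 001 };

/* The same numbers as the author of the table meant them. */
static const int intended_ids[] = { 100, 10, 1 };

/* Count the table entries whose value is not what was intended. */
int count_surprises(void)
{
    int i, surprises = 0;

    for (i = 0; i < 3; i++)
        if (padded_ids[i] != intended_ids[i])
            surprises++;
    return surprises;
}
```

Only the middle entry differs: 010 is 8, while 001 happens to be 1 in any base.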

Floating point numbers (of whatever precision) are probably the same as whatever you are used to. We won't use them very much, but it is worth pointing out that you are not allowed to use floating point facilities when writing Linux kernel code.

Characters are really just a special kind of integer, and knowing this helps explain the baroque conversions that the C standard describes. A character is a one-byte integer; whether a plain char is signed or unsigned is up to the implementation. Character constants come between single quotes, e.g., 'a', and include a variety of escape sequences for non-printing characters like '\n' for newline. You can specify an arbitrary character using '\ddd' where ddd is one to three octal digits representing the numeric value of a character in the implementation's coding scheme, ASCII for example.

Character constants are automatically promoted to the integer value of the corresponding character in the local collating sequence. This gives meaning to character comparisons such as ((c >= 'a') && (c <= 'z')). It is also relied on for various input and output operations as we'll see later. It also means that you can do arithmetic on characters: ('z' - 'a') != 25 means that we're not in ASCII anymore.
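A short sketch of character comparison and arithmetic. The function names is_lower and letter_index are made up, and both assume a character set such as ASCII in which the lower-case letters are contiguous.

```c
/* Nonzero if c is a lower-case letter. Assumes a character set, such
   as ASCII, in which 'a'..'z' are contiguous. */
int is_lower(int c)
{
    return (c >= 'a') && (c <= 'z');
}

/* Position of a lower-case letter in the alphabet: 'a' gives 0 and
   'z' gives 25 (again assuming ASCII); -1 for anything else. */
int letter_index(int c)
{
    return is_lower(c) ? c - 'a' : -1;
}
```

In real programs the library functions declared in ctype.h, such as islower(), are the portable way to ask these questions.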

You will notice some omissions: There are no strings or boolean values. There is special support for string constants, but don't expect them to work the way they do in Java!

In C, 0 is false; all other numeric values are considered to be true. Boolean operators, like &&, return 0 for false and 1 for true. An if is treated as if it compared its test value against zero:

if (test) ...
is the same as
if ((test) != 0) ...

It's common to use this property of integers, integer expressions, and pointers (though the case with pointers is a bit complicated under the covers):

while (things_to_do) {
        ...
        things_to_do--;
}
Modern usage discourages this shorthand in favor of clearer formulations:
while (things_to_do > 0) {
        ...
        things_to_do--;
}

There is no string type in C, though there are string constants. As we shall see below, a string is the same as an array of characters (bytes) terminated by the null character '\0' (in ASCII called NUL) whose value is guaranteed to be zero. An alternative view is that a string is a pointer to a sequence of characters (terminated by NUL). For example, the string "foo" represents a sequence of four characters: 'f', 'o', 'o', '\0'. If you write a string constant, the terminating '\0' is inserted for you by the compiler. But when you manipulate strings, you need to be aware of its existence. (Caution: the symbol NUL is sometimes used for '\0'. The symbol NULL, with two L's, is something else, and confusing these can cause strange problems.)
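To see the terminating '\0' at work, here is a sketch of a string-length function. The name my_strlen is made up (the library already provides strlen), as is foo_storage.

```c
#include <stddef.h>

/* Count the characters before the terminating '\0'
   (a hand-written stand-in for the library's strlen). */
size_t my_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}

/* Storage needed for the constant "foo": three letters plus '\0'. */
size_t foo_storage(void)
{
    return sizeof("foo");
}
```

Note the difference: the length of "foo" is 3, but it occupies 4 bytes.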

void

The type name void has three distinct uses. It should not be confused with the type-theoretic void, which is the type with no values (and code of this type never returns) — you may meet it in a programming languages course.

In C, void as the return type of a function means that the function is only to be called for side effect and must not be used in a context where a return value is expected.

If a function prototype contains the argument list (void), then the function takes no arguments and the compiler will not let you supply any. This is different from an empty argument list in a function prototype, which is taken to mean "I don't want to say what the arguments are, allow any number and trust me."
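The first two uses of void can be seen together in this sketch. The counter hits and the functions record_hit and hit_count are made up for illustration.

```c
static int hits = 0;   /* hypothetical counter, visible only in this file */

/* Returns nothing: called purely for its side effect on hits. */
void record_hit(void)
{
    hits++;
}

/* (void) in the parameter list: this function takes no arguments, and
   the compiler rejects a call such as hit_count(3). */
int hit_count(void)
{
    return hits;
}
```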

void as the target of a pointer type has another meaning, which we'll discuss below.

Derived/Aggregate Data Types

C provides 4 derived/aggregate data types: arrays, structures, unions, and pointers. C also provides an enumerated type.

Arrays

An array is a sequence of values of a specified type:
int part_numbers[256];
declares a sequence of 256 integers indexed from 0 through 255.

C guarantees that array elements are stored in successive memory locations (which is important when we talk about pointers and arrays). Also, C has no built-in support for array bounds checking.

There are no multi-dimensional arrays, as such, in C: one simply uses an array with an array at each element:

double soil_samples[100][500];
is an array of 100 elements. Each of those elements is itself an array of 500 double precision floating point elements. Thus, C exposes the representation of matrices as being in row-major order.

The following code fragment prints out the values in our soil_samples matrix with one row (of 500 elements) per line:

for (i = 0; i < 100; i++) {
        for (j = 0; j < 500; j++)
                printf("%f\t", soil_samples[i][j]);
        printf("\n");
}
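Row-major order can be checked directly by measuring how far an element lies from the start of the matrix. This is a sketch with a small made-up matrix grid and a helper offset_of_element.

```c
#include <stddef.h>

#define ROWS 3
#define COLS 4

static double grid[ROWS][COLS];   /* hypothetical small matrix */

/* Distance in bytes from the start of grid to grid[i][j]. */
size_t offset_of_element(int i, int j)
{
    return (size_t)((char *)&grid[i][j] - (char *)grid);
}
```

In row-major order, offset_of_element(i, j) is (i * COLS + j) * sizeof(double): all of row 0 comes first, then all of row 1, and so on.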
Here is our weight conversion program re-written to use arrays:

One view of strings is as a sequence/array of characters. Sometimes you'll see a variable that will represent a string of some maximum length declared like this:

char pathname[MAX_PATH_LEN];
In C, the size of an array must be specified when its space is allocated, and its size cannot change. This sort of declaration for a string is therefore common when allocating a static area for a string. It is also common when creating a buffer into which characters will be placed for parsing/processing.

You can leave out an array size if you have an explicit initialization:

char hello[] = "hello";
static int backward_digits[] 
        = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

ANSI C allows initialization of any array with constant data, but traditional C does not allow initialization of automatic (stack allocated) arrays.
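When the size is left out, the compiler counts the initializers for you, and sizeof can recover the count. A sketch follows; the macro ARRAY_LEN and the two helper functions are made up for this example.

```c
#include <stddef.h>

/* Number of elements in an array. This only works on a real array,
   not on a pointer to one. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static char hello[] = "hello";   /* 5 letters plus the '\0' */
static int backward_digits[] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

/* Bytes of storage the compiler set aside for hello. */
size_t hello_size(void)
{
    return sizeof(hello);
}

/* Number of elements the compiler counted in backward_digits. */
size_t digit_count(void)
{
    return ARRAY_LEN(backward_digits);
}
```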

Arrays (and functions) cannot be passed to or returned from functions, though pointers to them can.

Structures

Structures in C are data aggregates, like arrays. However, there are two key differences: components have names rather than integer indices; and they are heterogeneous, i.e., they can contain values of different types.
struct employee {
        char *name;
        int number;
        float salary;
};
is a structure of 3 fields: a string for the employee's name, an integer for the employee number, and a floating point number for their salary. One extracts or sets a field in a structure with dot notation:
an_employee.number = 1234;
Declaring structures can seem a bit strange. The name of the structure is optional, and these names inhabit a separate namespace from other type names. We can use the employee structure above directly in a declaration (and we could leave out the name):
struct {
        char* name;
        int number;
        float salary;
} an_employee;
or, given the named structure declaration above, we could declare an array of employees:
struct employee sales_personnel[300];
You might like to use the name employee as a type, but you can't. That is why you will often see declarations like the struct employee followed immediately by a type declaration
typedef struct employee employee_t;
which allows employee_t to be used like any type name:
float give_raise(employee_t emp, float pct_incr);
You might think that, by analogy with arrays, a name of a structure (e.g., emp inside the give_raise() function above) would be equivalent to a pointer to the first location of the structure. That would be wrong. A structure has the same status as an integer or a pointer in that a name (or expression) with a structure type stands for the entire structure value. That means that when you pass a structure to a function as an argument, the bytes are copied into new storage for a new structure that will be manipulated by the function. An assignment of a structure to a structure variable similarly copies the structure's contents into the structure denoted by the variable on the left of the assignment. Refer to the code in struct_test.c. This program demonstrates that a structure that contains a fixed-sized array is not considered the same as one that contains a pointer to the array's type and that structure values are copied on assignment or when passed as arguments to a function.
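The copying behavior can be sketched as follows. This is not struct_test.c, only a small stand-in; give_raise_in_place is a made-up companion to give_raise that takes a pointer instead of a structure.

```c
typedef struct employee {
    char *name;
    int number;
    float salary;
} employee_t;

/* Receives a copy of the structure: the caller's salary is unchanged.
   Returns the new salary. */
float give_raise(employee_t emp, float pct_incr)
{
    emp.salary += emp.salary * pct_incr / 100.0f;
    return emp.salary;
}

/* Receives a pointer: the caller's structure is updated in place. */
void give_raise_in_place(employee_t *emp, float pct_incr)
{
    (*emp).salary += (*emp).salary * pct_incr / 100.0f;
}
```

After calling give_raise(e, 10.0f), the caller's e.salary is what it was before; after give_raise_in_place(&e, 10.0f), it has changed.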

Unions

Union types are used to set aside memory for data that may represent data of one of several types. A union holds only one data element at a time, and the space reserved for a union is large enough for the largest possibility. A compiler may want a union type to represent a manifest constant in a source program:
union u_const {
        int i;
        float f;
        char c;
        char *s;
} constval; 
One uses the same dot notation to access/update the contents of a union.

It is the programmer's job to keep track of what sort of value is represented by constval. This is why unions are often combined with structures that contain a union together with a tag, whose value says which of the various possibilities that particular union value holds.

typedef struct const_s {
        int const_tag;
        union u_const const_val;   /* which member is valid depends on const_tag */
} prog_constant;
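Here is a sketch of how code might consult the tag before touching the union. The TAG_ constants, the member name val, and the function numeric_value are all made up for this example.

```c
/* Hypothetical tag values saying which union member is in use. */
#define TAG_INT    0
#define TAG_FLOAT  1
#define TAG_CHAR   2
#define TAG_STRING 3

typedef struct const_s {
    int const_tag;
    union u_const {
        int i;
        float f;
        char c;
        char *s;
    } val;
} prog_constant;

/* Return the constant as a double, reading only the union member
   that the tag says is valid. */
double numeric_value(prog_constant pc)
{
    switch (pc.const_tag) {
    case TAG_INT:   return pc.val.i;
    case TAG_FLOAT: return pc.val.f;
    case TAG_CHAR:  return pc.val.c;
    default:        return 0.0;   /* strings have no numeric value here */
    }
}
```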

Enumerations

Enumerations are useful for declaring a set of named constants. They are often used in place of #defined constants. For example,
enum code_quality {good, bad, ugly} my_program;
declares the variable my_program to have the type enum code_quality. I can then test and assign any one of the enumerated names to the variable:
my_program = ugly;
C exposes the fact that the listed identifiers are actually integers (whose size is implementation-dependent), and they may be freely used in integer contexts (testing them for equality is a natural place). The above declaration therefore defines 5 names: the type enum code_quality; the integer constants good, bad, and ugly; and the variable my_program of type enum code_quality.

By default, the first enumerated name has the value 0, and each successive name is one more than the previous one. This can be changed through explicit assignment:

enum code_quality {good = 1, bad, ugly = 0} my_program;
The above code creates an enumeration data type in which good has the value 1, bad has the value 2 (if there is no assigned value, it is one more than the previous value), and ugly has the value 0.
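Because enumeration constants are integers, they can be used as case labels. A sketch, with a made-up function quality_name:

```c
enum code_quality {good = 1, bad, ugly = 0};

/* Map an enumeration value to a printable name. */
const char *quality_name(enum code_quality q)
{
    switch (q) {
    case good: return "good";
    case bad:  return "bad";
    case ugly: return "ugly";
    }
    return "unknown";   /* q held some other integer value */
}
```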

No name may be defined in common between two enumerations.

Pointers

Pointers are memory addresses. Implementations can, in theory, create some abstraction for pointers, but they must behave like addresses into a byte-addressed memory. Here is a pointer to an integer:
int *int_ptr;
You get the value pointed to by such a variable by dereferencing the pointer. So int_ptr is a pointer to (an address of) an integer, and *int_ptr is the integer it points to. We can assign to pointer variables, and until you get very used to this, I recommend that you draw pictures to represent your pointer structures. For example, int_ptr = p; does not change the memory location int_ptr was pointing to: it makes int_ptr point to some new place in memory. *int_ptr = 3; changes the contents of the memory location pointed to by int_ptr.

You can get the address of a variable by using the & operator.

int size;
int *p = &size;
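The difference between assigning to a pointer and assigning through one can be sketched like this. The functions pointer_demo and swap are made up for this example.

```c
/* Exchange the integers that a and b point to. */
void swap(int *a, int *b)
{
    int tmp = *a;

    *a = *b;
    *b = tmp;
}

/* Assign through a pointer, then re-point it.
   Returns 10 * x + y so both final values can be inspected. */
int pointer_demo(void)
{
    int x = 1, y = 2;
    int *int_ptr = &x;

    *int_ptr = 3;        /* changes x; int_ptr still points to x */
    int_ptr = &y;        /* int_ptr now points to y; x is untouched */
    *int_ptr = 4;        /* changes y */
    return x * 10 + y;   /* x is 3 and y is 4 */
}
```

swap shows why pointers matter for function arguments: C passes arguments by value, so the only way for a function to change the caller's variables is to be handed their addresses, as in swap(&a, &b).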

Declarations of pointers are a source of confusion to students at first. The logic is that type (and optional storage class) appears first followed by a list of variable expressions. The variable expressions use the declared variable in an expression that shows how you use it to get to the declared type. Thus, declaration of int_ptr above should be understood to mean “dereferencing int_ptr (*int_ptr) will yield an int.” This is important when declaring multiple variables:

int i, *ret_code, num_files;
declares three variables: the integer variables i and num_files and a pointer to an integer ret_code.

There is a distinguished pointer, NULL or 0, that is guaranteed not to point to any object. So it is common to use this pointer to terminate lists or, when used as a return value of a function, to indicate failure. It is not guaranteed that the representation of the null pointer is all zero bits, only that writing a constant 0 in a pointer context will be compiled to the null pointer (in fact, NULL is typically #defined to be 0 or ((void *)0)).

NULL is a perfectly legal pointer, but it is an error to try to dereference it. Unfortunately, not all implementations will generate an error (they'll just silently give you garbage).

One use of arrays and pointers is important to most C programs: in the argument list. C assumes that every main() is actually a function of two arguments:

int main(int argc, char *argv[])
where argc is the number of arguments specified on the command line that invoked the program, and argv is an array (a vector) of strings, one for each input argument. argv[0] is the name of the program being invoked, and argv[1] up to argv[argc-1] are the rest. C also guarantees that argv[argc] is the NULL pointer.
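Because of the terminating NULL, code can walk argv without consulting argc at all. A sketch, with a made-up helper count_args:

```c
#include <stddef.h>

/* Count the strings in an argv-style vector by walking to the
   terminating NULL pointer, without consulting argc. */
int count_args(char *argv[])
{
    int n = 0;

    while (argv[n] != NULL)
        n++;
    return n;
}
```

Called from main as count_args(argv), this returns the same number as argc.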

The type pointer to void, written (void *), has special meaning. It represents a pointer to an unknown type, and C guarantees that any pointer may be converted to a (void *) and back without any loss of information. In some implementations, pointers may come in various sizes, so a pointer to void has the largest possible pointer size. This pointer type is typically used as a kind of abstraction device: functions can take and return pointers to an abstract data value without revealing to a client what the value actually looks like.
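The standard library's qsort is a familiar example of this use of (void *): it sorts an array of anything, and leaves it to a comparison function you supply to convert the pointers back. In this sketch, compare_ints and sort_ints are made-up names; qsort itself is declared in stdlib.h.

```c
#include <stdlib.h>

/* qsort knows nothing about ints: it hands the comparison function
   two void pointers, which we convert back to the type we know
   they really are. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    return (x > y) - (x < y);   /* negative, zero, or positive */
}

/* Sort n integers into ascending order. */
void sort_ints(int *v, size_t n)
{
    qsort(v, n, sizeof(int), compare_ints);
}
```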

Pointers, Arrays, Strings, and Address Arithmetic

C guarantees and exposes certain low-level details about how memory and data structures are laid out. An unsubscripted use of an array name is interpreted as the address of the first array element (except that sizeof will yield the size of the whole array and not the size of a pointer).

Array elements are in successive memory locations, so if you know the address of one element, the address of the next one is the sum of the current address and the size of an array element. Further, arithmetic on pointers, including increment and decrement, adjusts the pointer by the size of the data pointed to. (You can find out the size of a datatype by using the sizeof operator.) The address of the element of an integer array arr with index i is therefore arr + i, which as a raw byte address is the address of arr plus i * sizeof(int).
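Here is a sketch of address arithmetic in action. The functions sum and byte_offset are made up for this example.

```c
#include <stddef.h>

/* Sum n integers by stepping a pointer through the array. p++ moves
   the pointer forward by one element (sizeof(int) bytes), not by
   one byte. */
int sum(const int *arr, size_t n)
{
    const int *p;
    int total = 0;

    for (p = arr; p < arr + n; p++)
        total += *p;
    return total;
}

/* Distance in bytes from arr[0] to arr[i]. */
size_t byte_offset(const int *arr, size_t i)
{
    return (size_t)((const char *)(arr + i) - (const char *)arr);
}
```

byte_offset(arr, i) comes out as i * sizeof(int), even though the source code only ever adds i.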

In the case of strings, which are sequences of characters, it is conventional to use the declaration

char *s;
to refer to a string, unless the size of the string or the buffer containing the string is an issue.

Note: Declaring an array allocates space for the array; declaring a pointer to a type only allocates space for the pointer.

int i;
char *s = "Hello!";

for (i = 0; s[i] != '\0'; i++)
        s[i] = tolower(s[i]); 
converts the string in s to lower case. (Actually, there is a subtle bug — we'll come back to that.) But so does the function str_lower() in the following program:

This program illustrates quite a few interesting things. There are a variety of ways to declare and initialize arrays of characters (strings). Interestingly, they are not all exactly the same: str_lower(oh) produces a segmentation fault on my computer, but only under certain circumstances. Why?

This program also illustrates how a string constant can be used like an array and how array subscripting is commutative! That latter property is a consequence of the fact that C defines array subscripting in terms of address arithmetic.
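Both properties can be sketched in a few lines. The functions same_element and hex_digit are made up for this example.

```c
/* a[i] is defined as *(a + i), and addition commutes, so i[a] names
   the same element. Returns nonzero if the two addresses agree. */
int same_element(const int *a, int i)
{
    return &a[i] == &i[a];
}

/* A string constant can be subscripted like any other array.
   Returns the hexadecimal digit for the low four bits of n. */
char hex_digit(int n)
{
    return "0123456789abcdef"[n & 15];
}
```

Writing i[a] in real code is a good way to annoy your colleagues, but it is worth seeing once.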

There are also two occurrences of the ternary operator for conditional expressions, which has nothing to do with arrays or pointers or strings, but is cool nonetheless. This operator also exists in Java, by the way.


Author: Mark A. Sheldon
Modified: 23 January 2008