CS249 Systems Programming: Compiling C Programs

The C compiler is not a monolithic transformation from source code to executable. Compilation passes through four phases, and cc or gcc accepts options that control each phase and let you stop and examine the result at any phase boundary.

1. The C pre-processor handles pre-processor directives (lines beginning with #). Each #include line is replaced by the contents of the referenced file (the search rules differ for names in quotes and names in angle brackets). Names introduced with #define are replaced with their definitions throughout the program, with macro arguments expanded as necessary. #if and its relatives are evaluated, and only the selected branch is kept. You can invoke the pre-processor independently with the cpp command, or examine its output with gcc -E.
2. The compiler proper translates pre-processed source into assembly language. You can examine the assembly language output with gcc -S. Assembly language file names normally end with .s on Unix-like systems.
3. The assembler converts the assembly language source into an object (.o) file. An object file is not an executable: it may require definitions from other files, including libraries. The assembler can be run separately with the as command. You can stop the compilation process here with gcc -c and get an unlinked object file.
4. The linker resolves all the references in a set of object (.o) files and library or archive (.a) files and produces an executable image. The linker can be run separately with the ld command. A useful debugging hint: an error message preceded by ld: is a link-time error, most often caused by a missing object file or library.

When you try to run a program, the operating system creates a new process (with its attendant resources), loads the executable image into memory, and then runs the process.

There are two important gcc command line arguments you should get in the habit of using. -Wall tells gcc to enable a large set of commonly useful warnings (despite the name, it does not enable every warning; -Wextra adds more). These warnings often help you spot a surprising number of errors that won't stop the program from compiling but will make it run incorrectly. The other argument you should always specify is -g, which tells gcc to emit debugging information that the gdb debugger can use to help you debug your program.

A Common Confusion

As stated above, the result of each phase of compilation can be viewed using the appropriate compiler arguments. In practice, it is unusual to stop compilation after the pre-processing phase or after the assembly code has been generated. It is common, however, and in fact the usual routine when building practical systems, to stop the compiler after it produces object (.o) files and to run it again in a separate linking step.

Beginners get confused about this because, unfortunately, we use the same shell command (gcc, for example) both for producing object files and for linking them together. Despite the shared command name, the two activities are different, and each requires different inputs.

Suppose a program in a file called control-panel.c needs to use a linked list package and a specialized graphics package whose sources are in linked-list.c and window-toolkit.c, respectively. We want to build a control-panel program, but how do we do this?

We will proceed in two phases:

1. Convert all the source files to object files, and
2. Link all the object files together into an executable.

To do the first job, we perform the first three phases of compilation on each .c file individually; the other .c files are not needed for this. The compiler only needs to know the types of any variables or functions that are used in a particular file but defined elsewhere. For example, the code in control-panel.c will refer to list and graphics functions like cons() and resize_window(), but their definitions will be in the other .c files. Their declarations, which carry the type information, are written in corresponding .h header files that are #included by control-panel.c. That is, control-panel.c will contain lines like this:

#include "linked-list.h"
#include "window-toolkit.h"

It is important to understand that this only provides the compiler with type information, so that it knows how big the data values returned from external functions are, how many arguments the functions take, and so on. This is enough to produce the object code for control-panel.c. To get the object file for control-panel.c, we tell the compiler not to produce an executable but to stop after phase 3 (the assembler), using the -c compiler switch:

gcc -Wall -g -o control-panel.o -c control-panel.c

If you omit the -c, gcc will assume you want an executable program, but when it gets to phase 4 (linking) it will find that it doesn't have the actual definition of, say, cons(). You'll get an error about an undefined reference to cons(), and you'll be told that ld failed, i.e., the program could not be linked.

We repeat this procedure for every source file in the system we are building. We then have a collection of .o files that refer to values and functions they don't yet have access to.

The final build phase happens after all the object files are made. The executable program will need the actual definitions of externally defined items in order to run, so the object files must be linked together. That is, we need to perform phase 4 of the compilation process. This time we already have all the object files, and what remains is to resolve the references among them. We don't need the header files any more, nor do we need the C source files.

gcc -Wall -g -o control-panel control-panel.o linked-list.o window-toolkit.o

We shall see that this build process, which can get very involved, can be automated.

Author: Mark A. Sheldon
Modified: 5 March 2008