Anatomy of a C Program
Several things to notice:
- main(): There must be exactly one in every runnable program. Tricky note: functions without declared return values are assumed to return int. What does this function return? Type echo $? to the shell after running it and see!
- Comments. No // comments. gcc will let you use them, but, in the interest of writing portable code, you should stick to standard C comments. P.S. Comments do not nest, and therefore, some people recommend using conditional compilation (we'll talk about this later) to comment out blocks of code.
- #include: Incorporates another file into this one. Include files, also known as header files, contain declarations that a program might find useful, typically the interface for a function or library. stdio.h contains the interface for the standard I/O library.
- Curly brackets enclose a block, and a function body is always a block. Variables are supposed to be declared at the start of a block. gcc will let you declare variables anywhere (C++/Java style), but let's stick to the standard.
- String constants are enclosed in double quotes, but they can't be freely concatenated as in Java. Special characters, like the newline ('\n'), are embedded in strings using the escape character (\).
Here's a session on a Wellesley machine that shows how to compile and run a program (and includes some other typical interactions with the shell).
Let's consider another example (stolen from this online C tutorial).
Note the use of #define to create compile-time constants. It is conventional to use lower case for variables and UPPER CASE for compile-time constants. Tokens beginning with # are sometimes called compiler directives, but they are actually handled by a separate program called the C pre-processor, cpp.
Here is what really happens when you compile your program.
Note the variable declarations and initializations above. While Java allows you to declare a variable anywhere, and gcc will let you do this in C, we will follow the ANSI C standard and limit declarations to top level and the start of a block. We'll discuss declarations more below.
Here's another example that defines and uses a function:
Declarations
The previous examples show a variety of data and function declarations. A declaration in C defines a name, and its properties. Some declarations only introduce type names, but most introduce a name for an actual object. The properties of object names include their type (e.g., int and Function returning int), the status of their storage requirements, and their visibility.
The type of an object implies a size, or the amount of storage the object requires (and you can ask for the size of any type using the sizeof operator). The other properties are determined by a rather complicated set of rules. Data declarations can include initializations, but there are some restrictions.
Declaring a name with extern says that the compiler should not allocate any space for it because the name is defined (and its space is allocated) externally to this program file, i.e., it will come from somewhere else during program linking.
All names in C must be declared before they are used. A declaration that occurs before the definition of a function or variable (i.e., a prototype which does not allocate space or specify a function body) is called a forward declaration. This is useful for organizing input files with utility functions down at the bottom and for mutually recursive definitions.
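As a minimal sketch of the mutually recursive case, the fragment below declares one function before defining it so that the other can call it. The names is_even and is_odd are illustrative and do not come from the examples in these notes.

```c
/* Forward declaration (a prototype): announces is_odd and its type
   without supplying a body. Both function names are hypothetical. */
int is_odd(unsigned int n);

/* is_even can call is_odd because the prototype above is in scope. */
int is_even(unsigned int n)
{
    if (n == 0)
        return 1;
    return is_odd(n - 1);
}

/* The definition that matches the forward declaration. */
int is_odd(unsigned int n)
{
    if (n == 0)
        return 0;
    return is_even(n - 1);
}
```

Without the first line, the call to is_odd inside is_even would refer to a name that has not yet been declared.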
Names declared at top-level, i.e., not nested inside a function, are allocated at link/load time and are neither on the stack nor the heap. Functions are allocated in the code segment (called the text segment in Unix), other data are allocated in the data segment. Constant data may actually be placed in either (they are read-only data).
Top-level definitions described as static are only visible in the current file: they cannot be linked to by other code modules. Top-level initializations must use constant (compile-time computable) data.
Normal non-top-level declarations allocate space on the stack, and are called automatic. This space is deallocated on exit from the function or program block in which the declaration occurs. Initializations (if any) are performed upon every entry to the block. Automatic variables are visible only within the block in which they are declared. Declaring a local variable (in a block) to be static is a way of allocating the storage statically (like a top-level declaration) but making the name visible only to a given block. From an initialization point of view, they are treated like top-level declarations. This may seem strange, but it can be very useful as a way to give a function persistent state. A random number generator, for example, might want to have a seed persist across invocations without letting callers have access to it. (Such data can frustrate the development of multi-threaded code, as we shall see in a few weeks.)
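The difference between automatic and static locals can be sketched with two small functions. This is an illustration only; next_ticket and always_one are made-up names, and a counter stands in for the random-number seed mentioned above.

```c
/* Hypothetical example: a static local keeps its value between calls. */
int next_ticket(void)
{
    static int ticket = 0;  /* static storage: initialized once */

    ticket++;
    return ticket;
}

/* Hypothetical example: an automatic local starts over on every call. */
int always_one(void)
{
    int n = 0;              /* automatic: initialized on every entry */

    n++;
    return n;
}
```

Successive calls to next_ticket() return 1, 2, 3, and so on, while always_one() returns 1 every time. No code outside next_ticket can name the variable ticket.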
C Base Types
Ignoring floating point numbers, there are really only two kinds of primitive data objects in C: integers (of various sizes, with and without signs), and pointers (which are often confused with integers). From a type-theoretic point of view, a pointer is a derived type, and I'll therefore describe pointers later. However, it is important to keep in mind that pointers are fundamental to C's worldview. C only supports four built-in data types:
- int: An integer, traditionally the size of the local machine's word (though on many 64-bit systems an int is 32 bits).
- float: Single precision floating point number.
- double: Double precision floating point number.
- char: A character is one byte in size. More recent C standards include wide characters, but we probably won't see much of them. In fact, a character is really just a special size of integer from C's point of view.
Integers come in short, long, and unsigned flavors. Integer constants look normal, most of the time. 123L forces the integer to be long. Integer constants beginning with a 0 are interpreted as being in octal, which is base 8. Integer constants that begin with a 0x or 0X are interpreted as hexadecimal (base 16) numbers.
Caution: The treatment of leading zeros can sometimes be confusing. For example, when initializing a table of identification numbers (maybe part numbers in an inventory or employee numbers for a payroll program), it can be tempting to make all the numbers the same size for readability. But 10 is not the same as 010 (the latter is the same as 8).
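The leading-zero trap is easy to see in a small table. The array below is a hypothetical illustration (the name padded_ids and its contents are not from these notes).

```c
/* Hypothetical ID table, zero-padded "for readability".
   The leading 0 turns the second and third entries into octal
   constants; the last entry is hexadecimal. */
int padded_ids[] = { 100, 010, 001, 0x10 };
```

The four elements hold the decimal values 100, 8, 1, and 16, which is probably not what the author of such a table intended.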
Floating point numbers (of whatever precision) are probably the same as whatever you are used to. We won't use them very much, but it is worth pointing out that you are not allowed to use floating point facilities when writing Linux kernel code.
Characters are really just a special kind of integer, and knowing this helps explain the baroque conversions that the C standard describes. A character is a one-byte integer; whether a plain char is signed or unsigned is left to the implementation. Character constants come between single quotes, e.g., 'a', and include a variety of escape sequences for non-printing characters like '\n' for newline. You can specify an arbitrary character using '\ddd' where ddd is one to three octal digits representing the numeric value of a character in the implementation's coding scheme, ASCII for example.
Character constants are automatically promoted to the integer value of the corresponding character in the local collating sequence. This gives meaning to character comparisons such as ((c > 'a') && (c < 'z')). It is also relied on for various input and output operations as we'll see later. It also means that you can do arithmetic on characters: ('z' - 'a') != 25 means that we're not in ASCII anymore.
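Character arithmetic of this kind might look like the following sketch. The function names are invented for illustration. The digit conversion relies only on the C standard's guarantee that '0' through '9' are consecutive; the lower-case test additionally assumes an ASCII-like character set in which the letters are contiguous.

```c
/* Hypothetical helper: numeric value of a decimal digit character,
   or -1 if c is not a digit. Portable: C guarantees that the
   digits '0'..'9' have consecutive values. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

/* Hypothetical helper: 1 if c is a lower-case letter, else 0.
   ASSUMPTION: letters are contiguous, as in ASCII. */
int is_lower_ascii(char c)
{
    return c >= 'a' && c <= 'z';
}
```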
You will notice some omissions: There are no strings or boolean values. There is special support for string constants, but don't expect them to work the way they do in Java!
In C, 0 is false; all other numeric values are considered to be true. Boolean operators, like &&, return 0 for false and 1 for true. An if is treated as if it compared its test value against zero:
if (test) ...
is the same as
if ((test) != 0) ...
It's common to use this property of integers, integer expressions, and pointers (though the case with pointers is a bit complicated under the covers):
while (things_to_do) { ... things_to_do--; }
Modern usage discourages this shorthand for clearer formulations:
while (things_to_do > 0) { ... things_to_do--; }
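The two loop styles can be compared directly in a small sketch. The function names here are hypothetical; each one returns the number of iterations performed.

```c
/* Shorthand test: loops while things_to_do is non-zero.
   Only call this with a non-negative argument; a negative count
   would keep decrementing away from zero. */
int countdown_shorthand(int things_to_do)
{
    int steps = 0;

    while (things_to_do) {          /* same as (things_to_do != 0) */
        things_to_do--;
        steps++;
    }
    return steps;
}

/* Explicit test: also terminates immediately for negative counts. */
int countdown_explicit(int things_to_do)
{
    int steps = 0;

    while (things_to_do > 0) {
        things_to_do--;
        steps++;
    }
    return steps;
}

/* Relational and logical operators yield exactly 0 or 1. */
int both_positive(int a, int b)
{
    return (a > 0) && (b > 0);
}
```

The explicit comparison states the intended range of the counter, which is one reason it is usually considered the clearer form.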
There is no string type in C, though there are string constants. As we shall see below, a string is the same as an array of characters (bytes) terminated by the null character '\0' (in ASCII called NUL) whose value is guaranteed to be zero. An alternative view is that a string is a pointer to a sequence of characters (terminated by NUL). For example, the string "foo" represents a sequence of four characters: 'f', 'o', 'o', '\0'. If you write a string constant, the terminating '\0' is inserted for you by the compiler. But when you manipulate strings, you need to be aware of its existence. (Caution: the symbol NUL is sometimes used for '\0'. The symbol NULL, with two L's, is something else, and confusing these can cause strange problems.)
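The role of the terminator can be made concrete with a hand-written length function. This is a sketch for illustration: my_strlen and foo_storage are invented names, and real programs would normally call the standard strlen() instead.

```c
#include <stddef.h>

/* Hypothetical re-implementation of strlen: counts the characters
   before the terminating '\0', which is not itself counted. */
size_t my_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}

/* The array holding "foo" needs room for the terminator too,
   so its size is one more than the string's length. */
size_t foo_storage(void)
{
    char foo[] = "foo";

    return sizeof(foo);
}
```

Here my_strlen("foo") is 3 while foo_storage() is 4: the fourth byte is the '\0' that the compiler appended.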
void
The type name void has three distinct uses. It should not be confused with the type-theoretic void, which is the type with no values (and code of this type never returns); you may meet it in a programming languages course.
In C, void as the return type of a function means that the function is only to be called for side effect and must not be used in a context where a return value is expected.
If a function prototype contains the argument list (void), then the function takes no arguments and the compiler will not let you supply any. This is different from an empty argument list in a function prototype, which is taken to mean "I don't want to say what the arguments are, allow any number and trust me."
void as the target of a pointer type has another meaning, which we'll discuss below.
Derived/Aggregate Data Types
C provides 4 derived/aggregate data types:
- Arrays
- Structures
- Unions
- Pointers
Arrays
An array is a sequence of values of a specified type:
int part_numbers[256];
declares a sequence of 256 integers indexed from 0 through 255.
C guarantees that array elements are stored in successive memory locations (which is important when we talk about pointers and arrays). Also, C has no built-in support for array bounds checking.
There are no multi-dimensional arrays, as such, in C: one simply uses an array with an array at each element:
double soil_samples[100][500]
is an array of 100 elements. Each of those elements is itself
an array of 500 double precision floating point elements.
Thus, C exposes the representation
of matrices as being in row-major order.
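Row-major layout can be checked by comparing byte offsets. The following is a minimal sketch using a small hypothetical matrix; ROWS, COLS, and is_row_major are names chosen for this illustration.

```c
#include <stddef.h>

#define ROWS 3
#define COLS 4

/* Returns 1 if element [i][j] lies (i * COLS + j) elements past the
   start of the matrix, i.e., rows are stored one after another. */
int is_row_major(int i, int j)
{
    static double m[ROWS][COLS];
    size_t byte_offset = (size_t)((char *)&m[i][j] - (char *)m);

    return byte_offset == (size_t)(i * COLS + j) * sizeof(double);
}
```

For any valid i and j the function returns 1, because all of row i is stored before row i + 1 begins.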
The following code fragment prints out the values in our soil_samples matrix with one row (of 500 elements) per line:
for (i = 0; i < 100; i++) {
    for (j = 0; j < 500; j++)
        printf("%f\t", soil_samples[i][j]);
    printf("\n");
}
Here is our weight conversion program re-written to use arrays:
One view of strings is as a sequence/array of characters. Sometimes you'll see a variable that will represent a string of some maximum length declared like this:
char pathname[MAX_PATH_LEN];
In C, the size of an array must be specified when its space is
allocated, and its size cannot change. This sort of
declaration for a string is therefore common when allocating a
static area for a string. It is also common when creating a
buffer into which characters will be placed
for parsing/processing.
You can leave out an array size if you have an explicit initialization:
char hello[] = "hello";
static int backward_digits[] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
ANSI C allows initialization of any array with constant data, but traditional C does not allow initialization of automatic (stack allocated) arrays.
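When the compiler works out an array's size from its initializer, a common idiom recovers the element count by dividing sizes. The sketch below reuses the backward_digits array from the text; the function name digit_count is an invention for this example.

```c
#include <stddef.h>

static int backward_digits[] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

/* Number of elements = total bytes / bytes per element.
   This only works where backward_digits is visible as an array,
   not through a pointer to its first element. */
size_t digit_count(void)
{
    return sizeof(backward_digits) / sizeof(backward_digits[0]);
}
```

Adding or removing initializers changes the result of digit_count() without any other edit to the code.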
Arrays (and functions) cannot be passed to or returned from functions, though pointers to them can.
Structures
Structures in C are data aggregates, like arrays. However, there are two key differences: components have names rather than integer indices; and they are heterogeneous, i.e., they can contain values of different types.
struct employee {
    char *name;
    int number;
    float salary;
};
is a structure of 3 fields: a string for the employee's name, an integer for the employee number, and a floating point number for their salary. One extracts or sets a field in a structure with dot notation:
an_employee.number = 1234;
Declaring structures can seem a bit strange. The name of the structure is optional, and these names inhabit a separate namespace from other type names. We can use the employee structure above directly in a declaration (and we could leave out the name):
struct {
    char* name;
    int number;
    float salary;
} an_employee;
or, given the named structure declaration above, we could declare an array of employees:
struct employee sales_personnel[300];
You might like to use the name employee as a type, but you can't. That is why you will often see declarations like the struct employee followed immediately by a type declaration
typedef struct employee employee_t;
which allows employee_t to be used like any type name:
float give_raise(employee_t emp, float pct_incr);
You might think that, by analogy with arrays, a name of a structure (e.g., emp inside the give_raise() function above) would be equivalent to a pointer to the first location of the structure. That would be wrong. A structure has the same status as an integer or a pointer in that a name (or expression) with a structure type stands for the entire structure value. That means that when you pass a structure to a function as an argument, the bytes are copied into new storage for a new structure that will be manipulated by the function. An assignment of a structure to a structure variable similarly copies the structure's contents into the structure denoted by the variable on the left of the assignment. Refer to the code in struct_test.c. This program demonstrates that a structure that contains a fixed-sized array is not considered the same as one that contains a pointer to the array's type and that structure values are copied on assignment or when passed as arguments to a function.
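The copying behavior can be sketched independently of struct_test.c, whose contents are not shown here. The structure and function names below are hypothetical.

```c
/* Hypothetical structure containing a fixed-size array. */
struct pair {
    int a;
    int b[2];
};

/* Receives its own copy of the structure, so these writes are
   invisible to the caller. */
void clobber(struct pair p)
{
    p.a = -1;
    p.b[0] = -1;
}

/* Returns 1 if the original survives both a call and an assignment. */
int copies_are_independent(void)
{
    struct pair original = { 1, { 2, 3 } };
    struct pair copy;

    clobber(original);      /* passes a copy of all the bytes */
    copy = original;        /* copies the embedded array too */
    copy.b[1] = 99;         /* changes only the copy */

    return original.a == 1 && original.b[0] == 2
        && original.b[1] == 3 && copy.b[1] == 99;
}
```

Note that the embedded array b is copied along with the rest of the structure, even though a bare array cannot be assigned or passed by value.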
Unions
Union types are used to set aside memory for data that may represent data of one of several types. A union holds only one data element at a time, and the space reserved for a union is large enough for the largest possibility. A compiler may want a union type to represent a manifest constant in a source program:
union u_const {
    int i;
    float f;
    char c;
    char *s;
} constval;
One uses the same dot notation to access/update the contents of a union.
It is the programmer's job to keep track of what sort of value is represented by constval. This is why unions are often combined with structures that contain a union together with a tag, whose value says which of the various possibilities that particular union value holds.
typedef struct const_s {
    int const_tag;
    union u_const val;  /* the member name "val" is illustrative */
} prog_constant;
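A tagged union along these lines might be used as in the sketch below. The tag values TAG_INT and TAG_STRING, the member name val, and the helper functions are assumptions made for this illustration.

```c
#define TAG_INT    0    /* hypothetical tag values */
#define TAG_STRING 1

union u_const {
    int i;
    float f;
    char c;
    char *s;
};

struct tagged_const {
    int const_tag;          /* says which union member is valid */
    union u_const val;
};

/* Build a constant holding an integer. */
struct tagged_const make_int(int i)
{
    struct tagged_const k;

    k.const_tag = TAG_INT;
    k.val.i = i;
    return k;
}

/* Build a constant holding a string. */
struct tagged_const make_string(char *s)
{
    struct tagged_const k;

    k.const_tag = TAG_STRING;
    k.val.s = s;
    return k;
}

/* Read the integer only if the tag says one is stored. */
int int_or(struct tagged_const k, int fallback)
{
    return k.const_tag == TAG_INT ? k.val.i : fallback;
}
```

Checking the tag before reading a member is what keeps the program from interpreting, say, the bytes of a pointer as an integer.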
Enumerations
Enumerations are useful for declaring a set of named constants. They are often used in place of #defined constants. For example,
enum code_quality {good, bad, ugly} my_program;
declares the variable my_program to have the type enum code_quality. I can then test and assign any one of the enumerated names to the variable:
my_program = ugly;
C exposes the fact that the listed identifiers are actually integers (whose size is implementation-dependent), and they may be freely used in integer contexts (testing them for equality is a natural place). The above declaration therefore defines 5 names: the type enum code_quality; the integer constants good, bad, and ugly; and the variable my_program of type enum code_quality.
By default, the first enumerated name has the value 0, and each successive name is one more than the previous one. This can be changed through explicit assignment:
enum code_quality {good = 1, bad, ugly = 0} my_program;
The above code creates an enumeration data type in which good has the value 1, bad has the value 2 (if there is no assigned value, it is one more than the previous value), and ugly has the value 0.
No name may be defined in common between two enumerations.
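The numbering rules can be confirmed with a short sketch based on the enumeration above. The function quality_value is a hypothetical helper that simply exposes the underlying integer.

```c
enum code_quality { good = 1, bad, ugly = 0 };

/* Enumeration values convert freely to int. */
int quality_value(enum code_quality q)
{
    return q;
}
```

Here quality_value(good) is 1, quality_value(bad) is 2, and quality_value(ugly) is 0.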
Pointers
Pointers are memory addresses. Implementations can, in theory, create some abstraction for pointers, but they must behave like addresses into a byte-addressed memory. Here is a pointer to an integer:
int *int_ptr;
You get the value pointed to by such a variable by dereferencing the pointer. So int_ptr is a pointer to (an address of) an integer, and *int_ptr is the integer it points to. We can assign to pointer variables, and until you get very used to this, I recommend that you draw pictures to represent your pointer structures. For example, int_ptr = p; does not change the memory location int_ptr was pointing to: it makes int_ptr point to some new place in memory. *int_ptr = 3; changes the contents of the memory location pointed to by int_ptr.
You can get the address of a variable by using the & operator.
int size;
int *p = &size;
Declarations of pointers are a source of confusion to students at first. The logic is that type (and optional storage class) appears first followed by a list of variable expressions. The variable expressions use the declared variable in an expression that shows how you use it to get to the declared type. Thus, declaration of int_ptr above should be understood to mean “dereferencing int_ptr (*int_ptr) will yield an int.” This is important when declaring multiple variables:
int i, *ret_code, num_files;
declares three variables: the integer variables i and num_files and a pointer to an integer ret_code.
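The difference between assigning to a pointer and assigning through a pointer can be traced in a short sketch. The function pointer_demo and the variable other are made up for this illustration.

```c
/* Returns size * 10 + other after a sequence of pointer operations. */
int pointer_demo(void)
{
    int size = 10;
    int other = 20;
    int *int_ptr = &size;   /* int_ptr holds the address of size */

    *int_ptr = 3;           /* writes through the pointer: size is 3 */
    int_ptr = &other;       /* re-points int_ptr; size is untouched */
    *int_ptr = 4;           /* now writes to other: other is 4 */

    return size * 10 + other;
}
```

The function returns 34: size was changed to 3 by the first write and left alone afterwards, and other was changed to 4 by the second.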
There is a distinguished pointer, NULL or 0, that is guaranteed not to point to any object. So it is common to use this pointer to terminate lists or, when used as a return value of a function, to indicate failure. It is not guaranteed that the null pointer is represented by all-zero bits, only that writing a constant 0 where a pointer is expected will be compiled to the null pointer (in fact, NULL is typically #defined to be 0 or ((void *)0)).
NULL is a perfectly legal pointer, but it is an error to try to dereference it. Unfortunately, not all implementations will generate an error (they'll just silently give you garbage).
One use of arrays and pointers is important to most C programs: in the argument list. C assumes that every main() is actually a function of two arguments:
int main(int argc, char *argv[])
where argc is the number of arguments specified on the command line that invoked the program, and argv is an array (a vector) of strings, one for each input argument. argv[0] is the name of the program being invoked, and argv[1] up to argv[argc-1] are the rest. C also guarantees that argv[argc] is the NULL pointer.
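Because the vector ends in NULL, it can be walked without consulting argc at all. The helper below is a sketch with an invented name, count_args; it works on any NULL-terminated array of strings, including the argv that main() receives.

```c
#include <stddef.h>

/* Counts entries by walking until the NULL sentinel is reached.
   For the argv passed to main(), the result equals argc. */
int count_args(char *argv[])
{
    int n = 0;

    while (argv[n] != NULL)
        n++;
    return n;
}
```

Inside main(int argc, char *argv[]), a call such as count_args(argv) would be expected to return the same value as argc.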
The type pointer to void, written (void *), has special meaning. It represents a pointer to an unknown type, and C guarantees that any pointer may be converted to a (void *) and back without any loss of information. In some implementations, pointers may come in various sizes, so a pointer to void has the largest possible pointer size. This pointer type is typically used as a kind of abstraction device: functions can take and return pointers to an abstract data value without revealing to a client what the value actually looks like.
Pointers, Arrays, Strings, and Address Arithmetic
C guarantees and exposes certain low-level details about how memory and data structures are laid out. An unsubscripted use of an array name is interpreted as the address of the first array element (except that sizeof() will yield the size of the array and not the size of the first element). Array elements are in successive memory locations, so if you know the address of one element, the address of the next one is the sum of the current address and the size of an array element. Further, arithmetic on pointers, including increment and decrement, adjusts the pointer by the size of the data pointed to. (You can find out the size of a datatype by using the sizeof() operator.) The address of an element of an integer array arr with index i is therefore arr + i, which lies i * sizeof(int) bytes past the start of arr.
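These rules can be checked in a few lines. The sketch below uses a small hypothetical array; the function names are invented, and the argument i is assumed to be a valid index from 0 through 4.

```c
#include <stddef.h>

/* Returns 1 if the address rules hold for element i (0 <= i <= 4). */
int address_arithmetic_holds(int i)
{
    int arr[5] = { 10, 20, 30, 40, 50 };
    int *p = arr;           /* the array name yields &arr[0] */
    ptrdiff_t bytes = (char *)(arr + i) - (char *)arr;

    return (&arr[i] == arr + i)                       /* same address */
        && (bytes == (ptrdiff_t)(i * sizeof(int)))    /* scaled by size */
        && (*(p + i) == arr[i]);                      /* same element */
}

/* sizeof applied to the array name sees the whole array. */
int sizeof_sees_whole_array(void)
{
    int arr[5];

    return sizeof(arr) == 5 * sizeof(int);
}
```

Adding i to an int pointer moves it by i elements, not i bytes; the conversion to char * is only there to measure the distance in bytes.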
In the case of strings, which are sequences of characters, it is conventional to use the declaration
char *s;
to refer to a string, unless the size of the string or the
buffer containing the string is an issue.
Note: Declaring an array allocates space for the array; declaring a pointer to a type only allocates space for the pointer.
int i;
char *s = "Hello!";
for (i = 0; s[i] != '\0'; i++)
    s[i] = tolower(s[i]);
converts the string in s to lower case. (Actually, there is a subtle bug; we'll come back to that.) But so does the function str_lower() in the following program:
This program illustrates quite a few interesting things. There are a variety of ways to declare and initialize arrays of characters (strings). Interestingly, they are not all exactly the same: str_lower(oh) produces a segmentation fault on my computer, but only under certain circumstances. Why?
This program also illustrates how a string constant can be used like an array and how array subscripting is commutative! That latter property is a consequence of the fact that C defines array subscripting in terms of address arithmetic.
There are also two occurrences of the ternary operator for
conditional expressions, which has nothing to do with arrays
or pointers or strings, but is cool nonetheless. This
operator also exists in Java, by the way.
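The commutativity of subscripting mentioned above follows from the definition of a[i] as *(a + i). A minimal sketch, using invented function names and a string constant as the array:

```c
/* All three functions read the same character, 'b'. */

char second_letter_a(void)
{
    return "abc"[1];        /* ordinary subscripting of a constant */
}

char second_letter_b(void)
{
    return 1["abc"];        /* legal: *(1 + "abc") */
}

char second_letter_c(void)
{
    return *("abc" + 1);    /* the address arithmetic spelled out */
}
```

The second form is valid C but is best treated as a curiosity rather than a style to imitate.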
Modified: 23 January 2008