code Brew a command parser with C pointers and arrays.

Contents

Overview

[Preparation: If you did not do the later Predictions sections or a few of the Practice Pointer Programming examples on the pointers lab, we recommend working on those before starting this assignment.]

The main task of this assignment is to implement a shell command parser as a small library in C. The shell is the program that reads and interprets the commands you type at a command line terminal. Later in the semester, you will build a full shell. For now, we focus on the first step toward a shell: splitting a user’s command line into its meaningful parts.

This parser takes a command line, stored as a single string, and converts it to an array of strings representing the command and its space-separated arguments. This may sound trivial given your previous programming experience with strings, but you will be using a substantially different and lower-level toolbox to accomplish the job. We prohibit the use of string manipulation functions from the C standard library to force you to confront how strings are implemented and manipulated in memory.

The goals of this assignment are:

  • to become familiar with the byte-addressable memory model, C pointers, arrays, and the link between pointer arithmetic and array indexing;
  • to practice principled techniques for reasoning about program execution, using assertions and structured debugging.

Please skim this document before starting work. Pay special attention to orange boxes like this one.

Plan, build incrementally, test, and commit.

Programming with pointers is like brewing a potion. Both tend to explode spectacularly in your face unless you brew them just right.

For each stage in the suggested workflow:

  1. Plan carefully before typing any code.
  2. Implement one step at a time (with useful assertions).
  3. Test each step extensively before implementing the next.
  4. Commit each tested step before implementing more.

This careful process will save time by catching bugs early and saving working versions you can recover if things start going wrong later.

Starter Code

Starter code for this assignment is a shared Mercurial repository. Get a copy of the code as follows (need help?):

  1. Fork wellesleycs240 / cs240-pointers once per team to create your own copy on Bitbucket.
  2. Add bpw and your partner as Admins on the repository.
  3. Clone your Bitbucket repository to your local computer account:
    hg clone ssh://hg@bitbucket.org/yourusername/cs240-pointers
Do true pair programming.

If you have chosen to work with a partner, do your programming together, sitting in front of one screen. If both partners will use a copy of the code, please read the Mercurial Team Workflow before trying that.

Your working copy should contain several files covering this week’s lab material and this assignment:

  • Makefile: recipes to compile the various parts
  • command.c: a file where you will implement a parser for simple shell commands
  • command.h: a C header file describing the functions you will implement in command.c
  • command_test.c: tests for your command parser

To compile the command-parsing code, run make command_test. Run it with ./command_test.

Command Structure and Parsing

Your task will be to write a handful of functions to convert command lines (strings) to command arrays (arrays of one-word strings) and work with command arrays. This section describes command lines, command arrays, and parsing in general. The next section describes the specific functions you will write.

Command Lines

A command line is a string. Here are two examples:

  • "ls -l cs240-pointers"
  • "emacs cs240-pointers/command.c &"

The command-line string may be comprised of several words (contiguous non-space characters) that indicate substructure:

  • The first word is the name of the command to run: ls or emacs
  • All remaining words are arguments to pass to this command: -l and cs240-pointers or cs240-pointers/command.c.
  • An optional & character at the end is never an argument (or part of an argument). For the parser, all we need to do is report the presence or absence of & and leave it out of the result we return.
    • The & is metadata indicating that the command should be run as a background command. Command lines not ending with & indicate foreground commands. When executing a foreground command, the shell waits for the command to finish before giving the next command prompt. When executing a background command, the shell gives the next prompt immediately, while the command continues to run in the background. Implementing command behavior will be part of our later shell implementation.

Valid Command Lines

All command lines without the & character are valid. If the & character appears in a command line, it may appear only as the last non-space character of the command line. For example, these are valid command lines:

"emacs &"
"emacs&   "
" sleep 10    & "

These are invalid command lines (for this assignment – some are valid in other shells):

"&uhoh"
"  & uh oh"
"uh & oh"
"uh& &oh"
"uh oh & &"

Command Arrays

The parser converts a command line into a command array, an array of strings representing the words of the command line, in order, terminated by a NULL element. Recall that a string in C is not a special type; it is just an array of chars terminated by a null character ('\0'). A command array is thus an array of pointers to arrays of chars and has the type char**. All arrays in this structure must be null-terminated, using the right “null” for each array.

'\0' is not NULL. 'a' is not "a".

'\0' is the null character. As a character, it is one byte in size. NULL is the null address. As an address, it is one machine word in size (32 or 64 bits, depending). Do not mix them up!

Literal char values in C are given in single quotes: 'a'. Literal string values are given in double quotes: "a". These are not the same. Comparing these with == will yield a false result. The former is the one-byte ASCII encoding of a, a value of type char. The latter is the address of the start of the one-character null-terminated string "a" in memory, a one-word value of type char*.

Here is an example command array for the command line string "ls -l cs240pointers":

Command Array:             Null-Terminated Strings
Index   Contents           (stored elsewhere in memory)
      +----------+
    0 |  ptr  *----------> "ls"
      +----------+
    1 |  ptr  *----------> "-l"
      +----------+
    2 |  ptr  *----------> "cs240-pointers"
      +----------+
    3 |   NULL   |
      +----------+

Here is the same array drawn another way and showing how the array is arranged in memory, with each element’s offset from the base address of the array. This assumes addresses (and hence pointers) are 32 bits (4 bytes), so indices are related to offsets by a factor of 4.

Index:        0       1       2       3 
Offset:   +0      +4      +8      +12     +16
          +-------+-------+-------+-------+
Contents: |   *   |   *   |   *   | NULL  |
          +---|---+---|---+---|---+-------+
              |       |       |
              V       V       V
              "ls"    "-l"    "cs240-pointers"

Although we draw “strings” in the above pictures, this is an abstraction. Each string is actually represented by a '\0'-terminated array of 1-byte characters in memory. Since each element is one byte, the offset is identical to the index.

Index:      0     1     2
Offset:  +0    +1    +2    +3
         +-----+-----+-----+
         | 'l' | 's' |'\0' |
         +-----+-----+-----+

Note: Practice reading memory diagrams every which way. Pay attention to what direction addresses grow. We are intentionally using different conventions in different drawings to help you get used to thinking on your feet…

Specification

Your task is to write four functions (plus any necessary helper functions) supporting command parsing in command.c according to the headers in command.h. If you write helper functions, be sure to read about how C function declarations work. We give a plan for implementation and testing below.

  • char** command_parse(char* line, int* foreground)

    Parse a command-line string line and:

    • If the command line is valid:
      • Store 0 where the foregroundpointer points if the command line contains & after the last word. Store 1 if it contains no &.
      • Return a command array corresponding to the command-line string.
    • If the command line contains invalid use of &, return NULL without setting foreground/background status.
  • void command_print(char** command)

    Print a command array in the form of a command line, with the command words separated by spaces. Do not include a newline at the end. (We will rely on that when reusing this code later.) You will want to learn how to use printf.

  • void command_show(char** command)

    Print the structure of a command array to aid in debugging and data inspection.

    The output should make it clear exactly what strings the command array holds, but the format is left to you. For example, given the command line:

      "ls -l cs240-pointers   "

    the output of a command_show implementation using helpful formatting will distinguish a correct command array where all spaces are stripped away:

      command array:
          - "ls"
          - "-l"
          - "cs240-pointers"
      end

    from an incorrect command array that contains some trailing spaces in words:

      command array:
          - "ls "
          - "-l "
          - "cs240-pointers   "
      end

    Calling command_print, which does not delineate the bounds of words explicitly, may not make this so clear:

      ls  -l  cs240-pointers
  • void command_free(char** command)

    Free all parts of a command array structure previously created by command_parse and not previously freed.

Memory Allocation

All data structures used by the parser must be stored somewhere in memory. Consider what code should be responsible for managing this memory and deciding when it is no longer needed. Follow this discipline in your code:

  • Command-line strings are “owned” and managed by code outside of the command library. The parser should never free or mutate command-line strings.
  • Command arrays should be allocated dynamically by the parser with malloc. Think carefully about how many allocations to make and how big they should be. Since command_parse hands these command arrays over to the caller, the command_free function is designed to make it easy for the caller to free the whole command array structure when the caller is done with the command array.

Coding Rules

You are free to use array indexing while you develop your code, but for full credit, your final version should use only pointers and pointer arithmetic. Choosing pointer arithmetic over array indexing is not necessarily the best choice for all code – here we require you to use pointers to help you learn how arrays and array operations are represented under the hood.

If you have first developed your code using array indexing, hg commit your work before converting to pointer arithmetic. Replace all uses of array indexing in command.c with pointer arithmetic. For example, x = a[i]; could, at minimum, be transformed to x = *(a + i);. However, you should also transform the code to an idiomatic pointer style.

An idiomatic loop over an array with array indexing typically uses an integer index variable incremented on each iteration. Keep the scope of the index variable (or “loop variable”) as small as possible for the task.

// pre:  a is the address of a null-terminated string.
// post: replaces all characters in a by 'Z'
for (int i = 0; a[i] != '\0'; i++) {
    a[i] = 'Z';
}

An idiomatic loop over an array with pointer arithmetic typically uses a cursor pointer that is incremented to point to the next element on each iteration. Keep the scope of the cursor pointer variable as small as possible for the task.

// pre:  a is the address of a null-terminated string.
// post: replaces all characters in a by 'Z'
for (char* p = a; *p != '\0'; p++) {
    *p = 'Z';
}

Your final code should contain zero array[index] operations.

Grading

To receive full credit for the assignment:

  • Your code must have good design, organization, style, documentation, and tests: 30%
    • Here is a good baseline style guide. In addition:
    • We provide documentation in the form of specification comments.
    • You should document how your code implements the specifications.
    • Document pre- and post-conditions and use assertions to check assumptions.
  • Your code must work according to the specification: 70%
    • This implies that your code always returns valid results for valid inputs and that it is free of compiler warnings, runtime memory errors, or other crashes.
    • Fix warnings, use assertions, and run many tests under valgrind to help ensure correctness.

Workflow and Tools

Follow the workflow below, much of which also corresponds closely to the functionality of command_parse as described in command.c.

  1. If you did not finish the Predictions or Practice Programming sections of the pointers lab, we recommend finishing those before starting to build your command parser.

  2. Add more hard-coded command array test cases in command_test.c.

  3. Implement command_show and command_print first. They are only a few lines of code each. Test them on the constant, statically allocated command arrays in command_test.c.

  4. Implement command_parse in stages, testing each stage on several inputs and committing a working version before continuing.

    1. Ignore detection of foreground/background commands to start. Assume commands contain no & characters.

    2. Count the number of words in line.

    3. Allocate the top-level command array.

    4. Copy each word in line into a newly allocated string and save it in the command array.

    5. Add foreground/background status parsing.

      • Commands containing & must place it as the last non-space character of the command line.
      • Other placements of & should cause command_parse to return NULL.
  5. Implement command_free.

  6. Convert functions from array indexing to pointer arithmetic if you have not already.

Testing and Debugging

Test and debug wisely.

Programming with pointers in C is error-prone, even for experts. Haphazard print-based testing is an inefficient (and often ineffective) use of your time. Wield the tools in this section to catch errors sooner and faster.

Assert the truth.

Assertions are “executable documentation” or “executable specifications”: they document rules about how code should be used or how data should be structured, but they also make it easier to detect violations of these rules (a.k.a. bugs!). Use the assert(...) tool in C by including assert.h and asserting expected properties. For example, the provided code already includes code that asserts that the arguments to command_ functions are not NULL. Thus if a NULL argument is ever passed to these functions, an error message will be printed and execution will halt immediately. Detecting errors early like this (vs. crashing or corrupting data later wherever the code depends on this assumption) saves a lot of time. Add assertions to make the “rules” of your code clear wherever you make assumptions.

Write extensive tests.

And other advice:

  • Write tests driven from the main function in command_test.c, build your tests with make command_test for just this part, and run them with ./command_test. If you inspect the Makefile’s rule for command_test, you will see it is produced by compiling command_test.c and command.c.
  • Do not add a main function in command.c – this will cause problems down the line.
  • Call command_free only on command arrays returned by command_parse (not on statically declared command arrays).

Use Valgrind.

Valgrind is an extended memory error checker. It helps catch pointer errors, misuse of malloc and free, and more. Run valgrind on your compiled program like this: valgrind ./command_test. Valgrind will run your program and watch it execute to detect errors at run time. See the tools page for additional reference links.

Use GDB.

Use GDB to help debug your programs. (Review previous lab activities for tips.) When debugging programs with pointers, pay special attention to the pointer values your program generates. Inspect them like other variables or use the address-of (&) and dereference (*) operators at the gdb prompt to help explore program state. See the tools page for additional reference links.

Does GDB tell you that it cannot display something due to compiler optimizations? If so, turn off compiler optimizations by adding -O0 (that’s “space dash Capital-Oh zero”) at the end of the CFLAGS = ... line in your Makefile. This will take effect the next time you use make to compile your code. The CFLAGS variable here determines what flags are passed to the compiler. -O0 enables level zero of optimization (i.e., none).

Emacs Tips

  • Compile from within Emacs by typing M-x compile and then editing the make command to refer to the executable you want to compile (probably command_test). Now you can click on errors to jump to the source line involved. Note: all the mini buffers where you type in commands, etc., have history accessible with up and down arrows.
  • Run GDB from with Emacs with M-x gdb, starting from a C buffer.
    • Click left margin to set/delete breakpoints.
    • Use buttons at top to run, continue, step, next, navigate up and down the stack, etc.

C Function Declarations

In C, a function may be used only after (i.e., later in the file than) its declaration. This differs from Java, which allows you to refer to later methods. When declaring helper functions, you can do one of a few things to deal with this:

  1. Just declare your helper function before the functions that use it.
  2. Write a function header earlier in the file and the actual definition later in the file. The function header just describes the name and type of the function, much like an interface method in Java. For example:

     // A function header declares that such a function exists,
     // and will be implemented elsewhere.
     int helper(int x, int y);
    
     // Parameter names are optional in headers.
     int helper2(char*);
    
     void needsHelp() {
         // OK, because header precedes this point in file.
         helper(7, 8);
         helper2("hello");
     }
    
     int helper(int x, int y) {
         return x + y;
     }
    
     int helper2(char* str) {
         return 7;
     }
  3. If the functions would likely get used elsewhere, then put the header in a header file, a file ending in .h that contains only function headers (for related functions) and data type declarations. For example, if you added another general function (not just a helper function) for manipulating commands (not required in this assignment), it would be best to place a function header for it with the other function headers in command.h so that users of your command library can call it.

    Header files are included (essentially programmatically copy-pasted) by the #include directive you often see at the tops of C source files. Then these functions can be used and their implementations will later be found elsewhere if compiled correctly.