Contents

Overview

The main task of this assignment is to implement a shell command parser as a small library in C. The shell is the program that reads and interprets the commands you type at a command line terminal. Later in the semester, you will build a full shell. For now, we focus on the first step: splitting a command line into its meaningful parts.

The parser takes a command line, a single string, and converts it to a command array, an array of strings representing the command and its space-separated arguments. The string split function available in many programming languages’ standard libraries would accomplish most of the task, but you will essentially implement split at a low level. We prohibit the use of string manipulation functions from the C standard library to force you to confront how strings are implemented and manipulated in memory.

Goals

  • To become familiar with the byte-addressable memory model through pointers and arrays at the C programming language level of abstraction.
  • To understand the link between pointer arithmetic and array indexing and representation.
  • To practice principled reasoning about program execution, using assertions, testing, and structured debugging.
  • To practice using tools for memory safety checking and debugging.
  • To have at least one aha! moment in appreciating how storage abstractions in your favorite programming language relate to (and hide!) details of the byte-addressable memory model.

Advice

Programming with C pointers has a tendency go wrong in entertaining or painful ways unless you practice careful discipline in programming.

For each stage in the suggested workflow:

  1. Plan carefully before typing any code.
  2. Implement one step at a time (with useful assertions).
  3. Test each step extensively, including with tools for detecting pointer errors, before implementing the next.
  4. Commit each tested step before implementing more.

This careful process will save time by catching bugs early and saving working versions you can recover if things start going wrong later.

Do true pair programming.

We recommend working in pairs on this assignment. Assuming you work with a partner, do your programming together, sitting in front of one screen. Take turns driving. If both partners will use a copy of the code, use the Team Workflow.

Time Reports

According to self-reported times on this assignment from Fall 2018:

  • 25% of students spent <= 7 hours.
  • 50% of students spent <= 9 hours.
  • 75% of students spent <= 12 hours.

Setup

Get your repository with cs240 start pointers, whether working alone or with a teammate.

Your repository initially contains the following files:

  • command.c: a file where you will implement a parser for simple shell commands
  • command.h: a C header file describing the functions you will implement in command.c
  • command_test.c: tests for your command parser
  • Makefile: recipes to compile the various parts

Compile the command-parsing code with the test harness with the command make (or make command_test). Run the compiled executable with ./command_test.

You must use a CS 240 computing environment for CS 240 code assignments.

Tasks

You will modify command.c to implement a small library for parsing shell commands with the interface declared in command.h and the functionality described by the Specification section below.

char** command_parse(char* line, int* foreground);
void   command_print(char** command);
void   command_show(char** command);
void   command_free(char** command);

The core of your library is the command_parse function, which takes a command line – a string capturing the command typed in the terminal – and parses it to produce a command array – an array of strings capturing the individual words of the command line. You will implement the command_parse function, plus three other simple functions for displaying and freeing command arrays.

The provided command_test.c file implements a testing harness for your parsing library and includes a few simple test inputs. You will add several additional test inputs for more robust testing of your command parsing library implementation.

Grading considers design, documentation, style, correctness, and performance.

The remainder of this document describes:

  1. Preparatory exercises to help you navigate and plan the assignment.
  2. The definition of command lines and command arrays.
  3. The specification of the functions you must implement.
  4. The suggested implementation workflow.
  5. Documentation of how to compile, test, and debug your implementation.
  6. The grading criteria.
  7. Additional tips.

Preparatory Exercises

As you read this document, complete these exercises to familiarize yourself with command lines, command arrays, and the recommended workflow.

Preparation is your ticket for assistance.

  1. Complete at least the practice.c pointer programming examples (Part 1) in the Pointer Practice exercises before starting this assignment. The rest of the exercises are also highly useful in filling out your mental model of programming with memory. The GDB and Valgrind exercises (Parts 4 and 8) are especially relevant for debugging and testing.
  2. Is the & character ever considered part of a command word?
  3. What is the command to compile testing code for this assignment?
  4. What is the command to test your parsing code for this assignment?
  5. Why does step 5a come before step 5b in the workflow?
  6. Are you allowed to mutate the contents of a command line string in command_parse?

Command Structure

Software driven by a command line interface (CLI), such as most software used in CS 240, interprets textual commands typed by a user at a command prompt. To interpret a command, a system must extract the command’s parts to understand its structure. In this assignment, your command parsing library will convert between two representations of simple multi-word commands:

  • A command line is a string capturing the full text of a command, just as the user types it in a terminal.
  • A command array is an array of strings, where each string captures one meaningful token in the command, in order.

The remainder of this section defines the structure of commands and these two representations as used in this assignment.

Command Lines

A command line is a null-terminated string containing zero or more space-separated words, with an optional single background status indicator character, '&', after the final word.

  • Spaces may appear anywhere.
  • Word characters include all printable characters except space (' ') and ampersand ('&').
  • A word is a contiguous sequence of word characters delineated on each end by a non-word character or the beginning or end of the string.
  • Ampersand ('&') is a special status indicator character, not a word character.
    • In a valid command, ampersand ('&') may appear at most once, and only after the final word, followed only by zero or more spaces to the end of the string. Any other placement of ampersand is invalid.1
    • A valid command containing no ampersand is a foreground command.
    • A valid command containing one ampersand is a background command.

Command Line Examples:

Here are two typical command lines:

  • The command line "ls -l cs240-pointers" indicates a foreground command containing the words "ls", "-l", and "cs240-pointers".
  • The command line "emacs cs240-pointers/command.c &" indicates a background command containing the words "emacs" and "cs240-pointers/command.c".

The definition above allows any spacing. The following lines are all valid command lines indicating the foreground command containing the words "ls", "-l", and "cs240-pointers":

"ls -l cs240-pointers"
"ls       -l cs240-pointers  "
"       ls   -l    cs240-pointers   "

The definition above requires that, if present, ampersand ('&') must be the last non-space character of the command line. It may appear adjacent to or separate from the last word. Regardless of placement, ampersand ('&') is not a word character and is never part of any command word. For example, these valid command lines all indicate the background command containing the single word "emacs":

"emacs &"
"emacs&"
"   emacs&   "

The following are examples of invalid command lines, with ampersand misused:

"&uhoh"
"  & uh oh"
"uh & oh"
"uh& &oh"
"uh oh & &"

Command Arrays

The parser converts a command line into a command array, an array of strings representing the words of the command line, in order, terminated by a NULL element. Recall that a string in C is not a special type; it is just an array of chars terminated by a null character ('\0'). A command array is thus an array of pointers to arrays of chars and has the type char**. All arrays in this structure must be null-terminated, using the right notion of nullness for each array.

'\0' is not NULL. 'a' is not "a".

'\0' is the null character. As a character (char), it is one byte in size.
NULL is the null address. As an address (pointer value), it is one machine word in size.
Do not mix them up!

Literal character (char) values in C are given in single quotes: 'a'.
Literal string values are given in double quotes: "a".
Do not mix them up!

Comparing 'a' == "a" will yield a false result. 'a' is the one-byte ASCII encoding of the letter a, a value of type char. "a" is the address of the start of the one-character null-terminated string a in memory, a one-word value of type char*.

Here is an example command array for the command line string "ls -l cs240pointers":

Command Array:             Null-Terminated Strings
Index   Contents           (stored elsewhere in memory)
      +----------+
    0 |  ptr  *----------> "ls"
      +----------+
    1 |  ptr  *----------> "-l"
      +----------+
    2 |  ptr  *----------> "cs240-pointers"
      +----------+
    3 |   NULL   |
      +----------+

Here is the same array drawn another way and showing how the array is arranged in memory, with each element’s offset from the base address of the array. Addresses grow left to right, and are assumed to be 64 bits (8 bytes), so indices are related to offsets by a factor of 8.

Index:        0       1       2       3 
Offset:   +0      +8      +16     +24     +32
          +-------+-------+-------+-------+
Contents: |   *   |   *   |   *   | NULL  |
          +---|---+---|---+---|---+-------+
              |       |       |
              V       V       V
              "ls"    "-l"    "cs240-pointers"

Although we draw “strings” in the above pictures, this is an abstraction. Each string is actually represented by a '\0'-terminated array of 1-byte characters in memory. Since each element is one byte, the addresses of adjacent characters differ by 1 and the offset is identical to the index.

Index:      0     1     2
Offset:  +0    +1    +2    +3
         +-----+-----+-----+
         | 'l' | 's' |'\0' |
         +-----+-----+-----+

Note: Practice reading memory diagrams every which way. Pay attention to what direction addresses grow. We are intentionally using different conventions in different drawings to help you get used to thinking on your feet…

Specification

You must write four functions (plus any necessary helper functions) supporting command parsing in command.c according to the headers in command.h.

If you write helper functions, be sure to read about how C function declarations work. We give a plan for implementation and testing below.

Function Specifications

  • char** command_parse(char* line, int* foreground)

    Parse the command-line string, line, and:

  • void command_print(char** command)

    Print a command array in the form of a command line, with the command words separated by spaces. Do not include quotes. Do not include a newline ('\n') at the end. You will want to learn how to use printf.

  • void command_show(char** command)

    Print the structure of a command array to aid in debugging and data inspection.

    The output should make it clear exactly what strings the command array holds, but the format is left to you. For example, given the command line

      "ls -l cs240-pointers   "

    the output of a command_show implementation using helpful formatting will distinguish a correct command array where all spaces are stripped away:

      command array:
          - "ls"
          - "-l"
          - "cs240-pointers"
      end

    from an incorrect command array that contains some trailing spaces in words:

      command array:
          - "ls "
          - "-l "
          - "cs240-pointers   "
      end

    Calling command_print, which does not delineate the bounds of words explicitly, may not make this so clear:

      ls  -l  cs240-pointers
  • void command_free(char** command)

    Free all parts of a command array structure previously created by command_parse and not previously freed.

Memory Rules

Naturally, your code should be free of memory safety violations (as detected by the valgrind memory error detector):

  • No use of uninitialized memory contents.
  • No out-of-bounds accesses.
  • free only pointers to allocated blocks returned by malloc, and only once per block.

In addition, your code must follow these stricter memory allocation rules:

  • Clients of the command library own and manage the memory representing command line strings.
    • Command library functions must never free nor mutate command line strings.
    • Clients may free or mutate command lines at any time outside a call to command library functions.
  • The command_parse function must allocate command array structures dynamically with malloc and return them to the client. Once returned, these structures are own by the client.
    • Clients will not mutate command array structures.
    • Clients may call command_free to free a command array at most once, at any time outside a command library function.
  • The command library functions must not allocate any memory that they do not return as part of a command array structure.
  • Assuming that the client code eventually calls command_free on every command array returned by command_parse, all memory allocated by the command library functions should be freed.

Coding Style Rules

Follow this general C style guide as a baseline, along with the following specific rules:

  • Do not use any C standard library string functions in command.c. You must define all string processing from scratch.
  • Do not use array notation in this assignment. Use idiomatic pointer style instead. See below.
  • Document in comments how your code implements the specifications provided, connecting your C implementation to the specification.
    • Our sample solution for command_parse is under 70 clean lines of code and 20–30 lines of comments.
  • Document pre- and post-conditions and use assertions to check assumptions where reasonable.
  • Use well-scoped helper functions where reasonable.
    • Clean solutions are possible with and without helper functions.
    • If using helper functions, match them to well-defined sub-problems with clear arguments and results.
    • Be sure to order function headers properly.
  • Prefer simple efficient code over complicated code.
    • Our sample solution for command_parse is under 70 clean lines of code and 20–30 lines of comments.
  • Ensure your code is free of compiler warnings (and, obviously, errors).
  • Ensure your code is free of run-time errors, including memory errors and leaks as detected by valgrind.

Idiomatic Pointer Style

For full credit, your submitted code must use only pointers and pointer arithmetic, with no array notation. Choosing pointer arithmetic over array indexing is not necessarily the best choice for clear code in all cases – here we require you to use pointers to learn how arrays and array operations work under the hood.

A simple way to think with arrays but write with pointers is to use *(a + i) wherever you think a[i]. However, this simple transformation rarely matches an idiomatic pointer style.

An idiomatic loop over an array with array indexing typically uses an integer index variable incremented on each iteration. Keep the scope of the index variable (or “loop variable”) as small as possible for the task.

// pre:  a is the address of a null-terminated string.
// post: replaces all characters in a by 'Z'
for (int i = 0; a[i] != '\0'; i++) {
    a[i] = 'Z';
}

An idiomatic loop over an array with pointer arithmetic typically uses a cursor pointer that is incremented to point to the next element on each iteration. Keep the scope of the cursor pointer variable as small as possible for the task.

// pre:  a is the address of a null-terminated string.
// post: replaces all characters in a by 'Z'
for (char* p = a; *p != '\0'; p++) {
    *p = 'Z';
}

Your final code should contain zero array[index] operations.

Implementation Workflow

Follow this workflow to understand and implement command_parse and friends.

  1. Finish the prep exercises before starting to build your command parser.

  2. Add several more hard-coded command array test cases to command_test.c.

  3. Implement and test command_show and command_print first. They are only a few lines of code each. Test them on the constant, statically allocated command arrays in command_test.c.

  4. Add several more hard-coded valid and invalid command lines to command_test.c to test many aspects of the specification.

  5. Implement and test command_parse in stages, testing each stage on several inputs and committing a working version before continuing.

    1. Count the number of words in line and detect use of &, returning NULL for invalid commands and marking the foreground/background status for valid commands.

    2. Allocate the top-level command array.

    3. For each word in line, allocate properly sized space to hold the word as a null-terminated string, copy the word into this space, and save it in the command array.

    Follow the memory allocation and coding style rules.

  6. Implement and test command_free.

Compile, Test, Debug

The main test harness function is in command_test.c. To compile the command-parsing code with the test harness, run make (or make command_test) to compile a test harness executable called command_test. Run the compiled executable with ./command_test.

Test and debug wisely.

Programming with pointers in C is error-prone, even for experts. Haphazard print-based debugging is an inefficient (and often ineffective) use of your time. Wield the tools in this section to catch errors sooner and faster.

Test aggressively.

Assert the truth.

Assertions are “executable documentation” or “executable specifications”: they document rules about how code should be used or how data should be structured, but they also make it easier to detect violations of these rules (a.k.a. bugs!). Use the assert(...) statment in C by including assert.h and asserting expected properties. For example, the provided code already includes code that asserts that the arguments to command_ functions are not NULL. Thus if a NULL argument is ever passed to these functions, an error message will be printed and execution will halt immediately. Detecting errors early like this (vs. crashing or corrupting data later wherever the code depends on this assumption) saves a lot of time. Add assertions to make the “rules” of your code clear wherever you make assumptions.

Write extensive tests.

Develop your own large suite of test inputs to help check that your code meets the specification. We will not grade your tests themselves, but preparing and using them will help you ensure your code is correct and efficient. It is easy to add new tests to the harness in command_test.c.

  • Add more command line strings to the COMMAND_LINES array in command_test.c, keeping the NULL terminator entry at the end.
  • For early testing of command_print and command_show before implementing command_parse, add command arrays to COMMAND_ARRAYS.

Debug methodically.

valgrind and gdb are the tools of choice. (See the Pointer Practice exercises, which introduce their usage.) Avoid print-based debugging. Backtrack from symptoms to cause.

Use valgrind.

Valgrind is an extended memory error checker. It helps catch pointer errors, misuse of malloc and free, and more. Run valgrind on your compiled program like this: valgrind ./command_test. Valgrind will run your program and observe its execution to detect errors at run time. See the tools page for additional reference links.

Always run your code under valgrind during development and testing to help catch memory errors early.

Use gdb.

Use GDB to help debug your programs when you need more information than valgrind offers. Review previous lab activities for tips. When debugging programs with pointers, pay special attention to the pointer values your program generates. Inspect them like other variables or use the address-of (&) and dereference (*) operators at the gdb prompt to help explore program state.

Documentation and quick reference for gdb:

Does GDB tell you that it cannot display something due to compiler optimizations? If so, turn off compiler optimizations by adding -O0 (that’s “space dash Capital-Oh zero”) at the end of the CFLAGS = ... line in your Makefile. The CFLAGS variable here determines what flags are passed to the compiler. -O0 enables level zero of optimization (i.e., none). Run make clean and then make to compile your code again without optimizations.

Debug by backtracking.

Did your program hit an explicit error? An assertion failure? A segfault? Debug by collecting evidence and backtracking methodically to find the cause of this symptom.

  • Localize the crash. Exactly what line of code, what operation in this line, what value(s) used by this operation manifested what error and crashed? This symptom does not necessarily indicate the cause, but it marks where to begin the search.
    • valgrind will provide most of this information for every memory error.
    • If more information is needed, use gdb to run the code until it crashes, then find the line that crashed (and the context in which it executed) with bt, backtrace, or where.
    • Inspect the most relevant line of code and the error report to determine what the error means. Was it a segfault? If it was an assertion failure, what was the assertion checking?
    • Inspect the arguments shown in the backtrace or print variables to determine what ill-formed pointer was dereferenced to cause a segmentation fault or what illegal values failed an assertion.
  • Trace backward through the dependences of this broken operation to the original logic error.
    • Invalid values: Did any of this operation’s arguments hold invalid values? Did this operation load any invalid values from memory?
      • What operations produced these invalid values?
        • This is easy to answer for invalid arguments.
        • For invalid values loaded from memory, consider what earlier operations (perhaps in completely separate function calls) may have saved (or failed to save) this value.
      • Continue tracing backwards, treating these producer operations as broken operations.
    • Invalid choices: Was it invalid to apply this operation here given correct arguments (or regardless of the arguments)?
      • What control flow decision led to execution of this operation?
      • Is the logic of the decision correct?
      • If so, continue tracing backwards treating the control flow decision as a broken operation.
    • Callers: If using helper functions, you may need to trace back beyond the function where the crash happened.
      • Get a backtrace to see where the current function was called. (Don’t just assume you know which function called this one. Get the truth.)
      • If you want to inspect local variables in functions that called this one, use up and down in gdb to step up and down the call stack, then print what you want to see.
  • Minimize the input. Try to write a small command line that captures the essence of the command line on which your command parser encounters an error. A smaller input is easier to reason about in full.

Avoid print-based debugging.

Avoid using printing for debugging if at all possible. If resorting to printing (try valgrind and gdb first), use fprintf(stderr, "...\n", ...) instead of printf("...\n", ...) and be sure to remember to include an ending newline character. Disable all printing in command.c (except where specified for command_print and command_show) in the final version.

Submission

Before submitting, disable any diagnostic printing in command.c. Only command_print and command_show should print, as specified.

Submit: The course staff will collect your work directly from your hosted repository. To submit your work:

  1. Test your source code files one last time. Make sure that, at a minimum, submitted source code is free of syntax errors and any other static errors (such as static type errors or name/scope errors). In other words: the code does not need to complete the correct computation when invoked, but it must be a valid program. We will not grade files that do not pass this bar.

  2. Make sure you have committed your latest changes. (Replace FILES with the files you changed and MESSAGE with your commit message.)

    $ git add FILES
    $ git commit -m "MESSAGE"
  3. Run the command cs240 sign to sign your work and respond to any assignment survey questions.

    $ cs240 sign
  4. Push your signature and your latest local commits to the hosted repository.

    $ git push

Confirm: All local changes have been submitted if the output of git status shows both:

  • Your branch is up to date with 'origin/master', meaning all local commits have been pushed
  • nothing to commit, meaning all local changes have been committed

Resubmit: If you realize you need to change something later, just repeat this process.

Grading

Your grade is derived as follows:

We compute the correctness and efficiency components of your grade using a private suite of test inputs in addition to the test inputs provided with the starter code. You should extend command_test.c with your own large suite of test inputs and run under valgrind to help check that your code meets the specification and is free of memory safety violations. We will not grade your tests themselves, but preparing and using them will help you ensure your code is fully correct and efficient.

More Tips

C Function Headers and Declaration Order

In C, a function is allowed to be used only after (i.e., later in the file than) its declaration. This differs from Java, which allows you to refer to later methods. When declaring helper functions, you can do one of a few things to deal with this:

  1. Just declare your helper function before the functions that use it.
  2. Write a function header earlier in the file and the actual definition later in the file. The function header just describes the name and type of the function, much like an interface method in Java. For example:

     // A function header declares that such a function exists,
     // and will be implemented elsewhere.
     int helper(int x, int y);
    
     // Parameter names are optional in headers.
     int helper2(char*);
    
     void needsHelp() {
         // OK, because header precedes this point in file.
         helper(7, 8);
         helper2("hello");
     }
    
     int helper(int x, int y) {
         return x + y;
     }
    
     int helper2(char* str) {
         return 7;
     }
  3. If the functions would likely get used elsewhere, then put the header in a header file, a file ending in .h that contains only function headers (for related functions) and data type declarations. For example, if you added another general function (not just a helper function) for manipulating commands (not required in this assignment), it would be best to place a function header for it with the other function headers in command.h so that users of your command library can call it.

    Header files are included (essentially programmatically copy-pasted) by the #include directive you often see at the tops of C source files. Then these functions can be used and their implementations will later be found elsewhere if compiled correctly.

Extra Fun Extensions

This part is not required. Extra Fun indicates ideas for optional extensions or challenges to try if you are having fun after completing the main assignment. Instructors will be happy to discuss ideas or review your solutions, but Extra Fun means what it says: it’s for extra fun, not extra credit.

Before trying either of these extensions:

  1. Finish and submit your work for the required parts with a descriptive commit message.
  2. Make a copy of your command_parse function and rename the copy command_parse_standard.
  3. Replace entire the body of the command_parse function with return command_parse_standard(line, foreground);. Now, command_parse is just a wrapper for whatever version you are working on.

After completing a working extension, make sure to test again before submitting any work on the extensions.

Extra Fun Extend the Command Language with Quoting

Start by making a copy of command_parse_standard and renaming it command_parse_quotes, then changing the body of command_parse to call this new function.

Extend your command-line parser to support quoting with the single-quote (') or double-quote (") characters. Note:

  • After a starting quote character, all characters (including ampersand & and space) until the matching close quote are part of the current word.
  • The delimiting quote characters are not part of the word.
  • A quoted section immediately adjacent to a non-quoted word (with no intervening space) together form a single word.
  • A literal quote character may be used as part of a quoted or unquoted word by using the escape character \ to prefix it.

Here are some examples using single-quote:

  • Basic quoting example with the single-quote character:
    • Command line:
      • As a C string literal: "echo 'hello && world' si'ngle word' &"
      • When printed: echo 'hello && world' si'ngle word' &
    • Command array:
      • "echo"
      • "hello & world"
      • "single word"
      • NULL
  • With escape characters to allow quote characters inside quoted elements:
    • Command line:
      • As a C string literal: "echo 'Don\\'t unquote until here,' I\\'m sure."
      • When printed: echo 'Don\'t unquote until here,' I\'m sure.
    • Command array:
      • "echo"
      • "Don't unquote until here,"
      • "I'm"
      • "sure."
      • NULL

Extra Fun Replace Multi-Pass Loop Code with Single There-and-Back-Again2 Recursive Code

Define a small family of recursive helper functions that accomplish the processes of counting words, counting letters, allocating, copying, etc., without using any loops. Use only recursion. Replace the body of command_parse with a single call to the top-level recursive function you define.

There are many levels of progressively more sophisticated recursive formulations.

Your loop-based code probably made several repeated scans of the command-line string or its subsections in order to avoid additional allocations. In the most sophisticated recursive versions, it is possible to accomplish the entire parsing process with only a single pass over the command-line string, with exactly one recursive function call and one command-line character array access for each character of the command line (plus null terminator). Local variables at each layer of recursion can remember (as recursion progresses deeper) and later reuse relevant input characters and other information (as recursion returns from deeper levels). This is not entirely free: no extra heap space or command-line scans are required, but the call stack will store information about each recursive call in memory.

License

Creative Commons License
Pointers by Benjamin P. Wood at Wellesley College is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Source code and evaluation infrastructure for Pointers are available upon request for reuse by instructors in other courses or institutions.

  1. This formulation restricts placement of ampersand further than most shells, for simplicity. 

  2. There and Back Again. Olivier Danvy and Mayer Goldberg. Fundamenta Informaticae, vol. 66, pp. 397–413. IOS Press. 2005.