Pointer Potions
code Brew a command parser with C pointers and arrays.
- Assign: Monday, 6 March
- Checkpoint: aim to complete at least word counting and top-level array allocation by 11:59pm Thursday, 9 March
- Due: (hard due date) 11:59pm Monday, 13 March
- Collaboration: Pair (recommended) or individual code assignment policy, as defined by the syllabus and honor code.
- Partner Search: Find a partner here.
- Starter Code: (Need help?)
- If partnered: Designate one partner as the
owner
and grant the other partner write permission on the owner'scs240-pointers-owner
repository on Bitbucket. - Everyone:
hg clone ssh://hg@bitbucket.org/cs240codetub/cs240-pointers-owner
- If partnered: In your repository working directory, run
./partner.py
to record your partner's Bitbucket username.
- If partnered: Designate one partner as the
- Submit: Commit and push your final revision and double check. (Need help?)
- Relevant Reference:
We recommend working in pairs on this assignment. Assuming you work with a partner, do your programming together, sitting in front of one screen. Take turns driving. If both partners will use a copy of the code, please read the Mercurial Team Workflow before trying that.
Contents
- Overview
- Setup
- Preparatory Exercises
- Command Structure
- Tasks and Specification
- Implementation Workflow
- Compile, Test, Debug
- Grading
- More Tips
Overview
The main task of this assignment is to implement a shell command parser as a small library in C. The shell is the program that reads and interprets the commands you type at a command line terminal. Later in the semester, you will build a full shell. For now, we focus on the first step: splitting a command line into its meaningful parts.
The parser takes a command line, a single string, and converts it to
a command array, an array of strings representing the command and
its space-separated arguments. The string split
function available
in many programming languages’ standard libraries would accomplish
most of the task, but you will essentially implement split
at a low
level. We prohibit the use of string manipulation functions from the
C standard library to force you to confront how strings are
implemented and manipulated in memory.
The goals of this assignment are:
- to become familiar with the byte-addressable memory model, C pointers, arrays, and the link between pointer arithmetic and array indexing;
- to practice principled techniques for reasoning about program execution, using assertions and structured debugging.
Please skim this document before starting work. Pay special attention to orange boxes like this one.
Programming with C pointers is like brewing a potion. Both tend to flare up in your face unless you brew them just right.
For each stage in the suggested workflow:
- Plan carefully before typing any code.
- Implement one step at a time (with useful assertions).
- Test each step extensively before implementing the next.
- Commit each tested step before implementing more.
This careful process will save time by catching bugs early and saving working versions you can recover if things start going wrong later.
Setup
As with
Bit Transfiguration and
all code assignments, the assignment manifest
gives steps for obtaining starter code and extra steps for forming a
pair/team in your Bitbucket assignment repository. For this
assignment, where we allow/encourage pair programming, these extra
steps are required to attribute your code to (and share it with) both
you and your partner. If you work alone, you need only hg clone
.
Your Potion Ingredients Kit contains the following files:
command.c
: a file where you will implement a parser for simple shell commandscommand.h
: a C header file describing the functions you will implement incommand.c
command_test.c
: tests for your command parserMakefile
: recipes to compile the various parts
Compile the command-parsing code with the test harness with the
command make
(or make command_test
). Run the compiled
executable with ./command_test
.
Preparatory Exercises
Answer these questions (ungraded) as you read over this document. You must complete these prep exercises before requesting help with coding in this assignment.
- Complete the
practice.c
pointer programming examples in the pointers lab before starting this assignment. - Is the
&
character ever considered part of a command word? - What is the command to compile testing code for this assignment?
- What is the comamnd to test your parsing code for this assignment?
- Why does step 5a come before step 5b in the workflow?
- Are you allowed to mutate the contents of a command line string in
command_parse
?
Command Structure
You will write a handful of functions to convert command lines (strings) to command arrays (arrays of one-word strings) and work with command arrays. This section describes command lines and command arrays. The next section describes the specific functions you will write.
Command Lines
A command line is a null-terminated string containing zero or more
space-separated words, with an optional single background status
indicator character, '&'
, after the final word.
- Spaces may appear anywwhere.
- Word characters include all printable characters except space
(
' '
) and ampersand ('&'
). - A word is a contiquous sequence of word characters delinated on each end by a non-word character or the beginning or end of the string.
- Ampersand (
'&'
) is a special status indicator character, not a word character.- In a valid command, ampersand (
'&'
) may appear at most once, and only after the final word, followed only by zero or more spaces to the end of the string. Any other placement of ampersand is invalid.1 - A valid command containing no ampersand is a foreground command.
- A valid command containing an ampersand is a background command.
- In a valid command, ampersand (
Command Line Examples:
Here are two typical command lines:
- The command line
"ls -l cs240-pointers"
indicates a foreground command containing the words"ls"
,"-l"
, and"cs240-pointers"
. - The command line
"emacs cs240-pointers/command.c &"
indicates a background command containing the words"emacs"
and"cs240-pointers/command.c"
.
The definition above allows any spacing. The following lines are all
valid command lines indicating the foreground command containing the
words "ls"
, "-l"
, and "cs240-pointers"
:
"ls -l cs240-pointers"
"ls -l cs240-pointers "
" ls -l cs240-pointers "
The definition above requires that, if present, ampersand ('&'
) must
be the last non-space character of the command line. It may appear
adjacent to or separate from the last word. Regardless of placement,
ampersand ('&'
) is not a word character and is never part of any
command word. For example, these valid command lines all indicate
the background command containing the single word "emacs"
:
"emacs &"
"emacs&"
" emacs& "
The following are examples of invalid command lines, with ampersand misused:
"&uhoh"
" & uh oh"
"uh & oh"
"uh& &oh"
"uh oh & &"
Command Arrays
The parser converts a command line into a command array, an array of strings representing the words of the command line, in order, terminated by a NULL
element. Recall that a string in C is not a special type; it is just an array of char
s terminated by a null character ('\0'
). A command array is thus an array of pointers to arrays of char
s and has the type char**
. All arrays in this structure must be null-terminated, using the right notion of nullness for each array.
'\0'
is the null character. As a character (char
), it is one byte in size.
NULL
is the null address. As an address (pointer value), it is one machine word in size.
Do not mix them up!
Literal character (char
) values in C are given in single quotes: 'a'
.
Literal string values are given in double quotes: "a"
.
Do not mix them up!
Comparing 'a' == "a"
will yield a false result. 'a'
is the
one-byte ASCII encoding of the letter a, a value of type char
.
"a"
is the address of the start of the one-character null-terminated
string a in memory, a one-word value of type char*
.
Here is an example command array for the command line string "ls -l cs240pointers"
:
Command Array: Null-Terminated Strings
Index Contents (stored elsewhere in memory)
+----------+
0 | ptr *----------> "ls"
+----------+
1 | ptr *----------> "-l"
+----------+
2 | ptr *----------> "cs240-pointers"
+----------+
3 | NULL |
+----------+
Here is the same array drawn another way and showing how the array is arranged in memory, with each element’s offset from the base address of the array. Addresses grow left to right, and are assumed to be 64 bits (8 bytes), so indices are related to offsets by a factor of 8.
Index: 0 1 2 3
Offset: +0 +8 +16 +24 +32
+-------+-------+-------+-------+
Contents: | * | * | * | NULL |
+---|---+---|---+---|---+-------+
| | |
V V V
"ls" "-l" "cs240-pointers"
Although we draw “strings” in the above pictures, this is an abstraction. Each string is actually represented by a '\0'
-terminated array of 1-byte characters in memory. Since each element is one byte, the addresses of adjacent characters differ by 1 and the offset is identical to the index.
Index: 0 1 2
Offset: +0 +1 +2 +3
+-----+-----+-----+
| 'l' | 's' |'\0' |
+-----+-----+-----+
Note: Practice reading memory diagrams every which way. Pay attention to what direction addresses grow. We are intentionally using different conventions in different drawings to help you get used to thinking on your feet…
Tasks and Specification
You must write four functions (plus any necessary helper functions) supporting command parsing in command.c
according to the headers in command.h
. If you write helper functions, be sure to read about how C function declarations work. We give a plan for implementation and testing below.
Function Specifications
-
char** command_parse(char* line, int* foreground)
Parse a command-line string
line
and:- If the command line is valid:
- Store
0
(false) in memory at the address given by the pointerforeground
if the command is a background command or store1
(true) at this location if the command is a foreground command. - Return a command array containing the words of the command line, in order.
- Store
- If the command line is invalid, return
NULL
without setting foreground/background status.
- If the command line is valid:
-
void command_print(char** command)
Print a command array in the form of a command line, with the command words separated by spaces. Do not include quotes. Do not include a newline (
'\n'
) at the end. You will want to learn how to use printf. -
void command_show(char** command)
Print the structure of a command array to aid in debugging and data inspection.
The output should make it clear exactly what strings the command array holds, but the format is left to you. For example, given the command line
"ls -l cs240-pointers "
the output of a
command_show
implementation using helpful formatting will distinguish a correct command array where all spaces are stripped away:command array: - "ls" - "-l" - "cs240-pointers" end
from an incorrect command array that contains some trailing spaces in words:
command array: - "ls " - "-l " - "cs240-pointers " end
Calling
command_print
, which does not delineate the bounds of words explicitly, may not make this so clear:ls -l cs240-pointers
-
void command_free(char** command)
Free all parts of a command array structure previously created by
command_parse
and not previously freed.
Memory Rules
Naturally, your code should be free of memory safety violations (as
detected by the valgrind
memory error detector):
- No use of uninitialized memory contents.
- No out-of-bounds accesses.
free
only pointers to allocated blocks returned bymalloc
, and only once per block.
In addition, your code must follow these stricter memory allocation rules:
- Clients of the command library own and manage the memory
representing command line strings.
- Command library functions must never free nor mutate command line strings.
- Clients may free or mutate command lines at any time outside a call to command library functions.
- The
command_parse
function must allocate command array structures dynamically withmalloc
and return them to the client. Once returned, these structures are own by the client.- Clients will not mutate command array structures.
- Clients may call
command_free
to free a command array at most once, at any time outside a command library function.
- The command library functions must not allocate any memory that they do not return as part of a command array structure.
- Assuming that the client code eventually calls
command_free
on every command array returned bycommand_parse
, all memory allocated by the command library functions should be freed.
Coding Style Rules
Follow this general C style guide as a baseline, along with the following specific rules:
- Do not use any C standard library string functions in
command.c
. You must define do all string processing from scratch. - Do not use array notation in this assignment. Use idiomatic pointer style instead. See below.
- Document in comments how your code implements the specifications
provided, connecting your C implementation to the specification.
- Our sample solution for
command_parse
is under 70 clean lines of code and 20–30 lines of comments.
- Our sample solution for
- Document pre- and post-conditions and use assertions to check assumptions where reasonable.
- Use well-scoped helper functions where reasonable.
- Clean solutions are possible with and without helper functions.
- If using helper functions, match them to well-defined sub-problems with clear arguments and results.
- Be sure to order function headers properly.
- Prefer simple efficient code over complicated code.
- Our sample solution for
command_parse
is under 70 clean lines of code and 20–30 lines of comments.
- Our sample solution for
- Ensure your code is free of compiler warnings (and, obviously, errors).
- Ensure your code is free of run-time errors, including memory errors
and leaks as detected by
valgrind
.
Idiomatic Pointer Style
For full credit, your submitted code must use only pointers and pointer arithmetic, with no array notation. Choosing pointer arithmetic over array indexing is not necessarily the best choice for clear code in all cases – here we require you to use pointers to learn how arrays and array operations work under the hood.
A simple way to think with arrays but write with pointers is to use
*(a + i)
wherever you think a[i]
. However, this simple
transformation rarely matches an idiomatic pointer style.
An idiomatic loop over an array with array indexing typically uses an integer index variable incremented on each iteration. Keep the scope of the index variable (or “loop variable”) as small as possible for the task.
// pre: a is the address of a null-terminated string.
// post: replaces all characters in a by 'Z'
for (int i = 0; a[i] != '\0'; i++) {
a[i] = 'Z';
}
An idiomatic loop over an array with pointer arithmetic typically uses a cursor pointer that is incremented to point to the next element on each iteration. Keep the scope of the cursor pointer variable as small as possible for the task.
// pre: a is the address of a null-terminated string.
// post: replaces all characters in a by 'Z'
for (char* p = a; *p != '\0'; p++) {
*p = 'Z';
}
Your final code should contain zero array[index]
operations.
Implementation Workflow
Follow this workflow to understand and implement command_parse
and
friends.
-
Finish the prep exercises before starting to build your command parser.
-
Add several more hard-coded command array test cases to
command_test.c
. -
Implement and test
command_show
andcommand_print
first. They are only a few lines of code each. Test them on the constant, statically allocated command arrays incommand_test.c
. -
Add several more hard-coded valid and invalid command lines to
command_test.c
to test many aspects of the specification. -
Implement and test
command_parse
in stages, testing each stage on several inputs and committing a working version before continuing.-
Count the number of words in
line
and detect use of&
, returningNULL
for invalid commands and marking the foreground/background status for valid commands. -
Allocate the top-level command array.
-
For each each word in
line
, allocate properly sized space to hold the word as a null-terminated string, copy the word into this space, and save it in the command array.
Follow the memory allocation and coding style rules.
-
-
Implement and test
command_free
.
Compile, Test, Debug
The main test harness function is in command_test.c
. To compile the
command-parsing code with the test harness, run make
(or make
command_test
) to compile a test harness executable called
command_test
. Run the compiled executable with ./command_test
.
Programming with pointers in C is error-prone, even for experts. Haphazard print-based debugging is an inefficient (and often ineffective) use of your time. Wield the tools in this section to catch errors sooner and faster.
Assert the truth.
Assertions are “executable documentation” or “executable specifications”: they document rules about how code should be used or how data should be structured, but they also make it easier to detect violations of these rules (a.k.a. bugs!). Use the assert(...)
statment in C by including assert.h
and asserting expected properties. For example, the provided code already includes code that asserts that the arguments to command_
functions are not NULL
. Thus if a NULL
argument is ever passed to these functions, an error message will be printed and execution will halt immediately. Detecting errors early like this (vs. crashing or corrupting data later wherever the code depends on this assumption) saves a lot of time. Add assertions to make the “rules” of your code clear wherever you make assumptions.
Write extensive tests.
Develop your own large suite of test inputs to help check that your
code meets the specification. We will not grade your tests
themselves, but preparing and using them will help you ensure your
code is correct and efficient. It is easy to add new tests to the
harness in command_test.c
.
- Add more command line strings to the
COMMAND_LINES
array incommand_test.c
and updateNUM_COMMAND_LINES
to match. - For early testing of
command_print
andcommand_show
before implementingcommand_parse
, add command arrays toCOMMAND_ARRAYS
.
Debug methodically.
valgrind
and gdb
are the tools of choice.
Avoid print-based debugging. Backtrack from symptoms to
cause.
Use valgrind
.
Valgrind is an extended memory error checker. It helps catch pointer
errors, misuse of malloc
and free
, and more. Run valgrind on your
compiled program like this: valgrind ./command_test
. Valgrind will
run your program and observe its execution to detect errors at run
time. See the tools page
for additional reference links.
Always run your code under valgrind
during development and testing
to help catch memory errors early.
Use gdb
.
Use GDB to help debug your programs when you need more information
than valgrind
offers. Review previous lab activities
for tips. When debugging programs with pointers, pay special
attention to the pointer values your program generates. Inspect them
like other variables or use the address-of (&
) and dereference (*
)
operators at the gdb
prompt to help explore program state.
Documentation and quick reference for gdb
:
- CS 240 GDB reference sheet (pdf, txt)
- GDB Quick Reference card
- GDB manual
- For documentation from the command line, type
help
at thegdb
command prompt, or typeman gdb
, orinfo gdb
in the shell. Some people also like to rungdb
under gdb-mode in emacs.
Does GDB tell you that it cannot display something due to compiler
optimizations? If so, turn off compiler optimizations by adding
-O0
(that’s “space dash Capital-Oh zero”) at the end of the CFLAGS
= ...
line in your Makefile
. This will take effect the next time
you use make
to compile your code. The CFLAGS
variable here
determines what flags are passed to the compiler. -O0
enables level
zero of optimization (i.e., none).
Debug by backtracking.
Did your program hit an explicit error? An assertion failure? A segfault? Debug by collecting evidence and backtracking methodically to find the cause of this symptom.
- Localize the crash. Exactly what line of code, what
operation in this line, what value(s) used by this operation
manifested what error and crashed? This symptom does not
necessarily indicate the cause, but it marks where to begin the
search.
valgrind
will provide most of this information for every memory error.- If more information is needed, use
gdb
torun
the code until it crashes, then find the line that crashed (and the context in which it executed) withbt
,backtrace
, orwhere
. - Inspect the most relevant line of code and the error report to determine what the error means. Was it a segfault? If it was an assertion failure, what was the assertion checking?
- Inspect the arguments shown in the backtrace or
print
variables to determine what ill-formed pointer was dereferenced to cause a segmentation fault or what illegal values failed an assertion.
- Trace backward through the dependences of this broken
operation to the original logic error.
- Invalid values: Did any of this operation’s arguments hold
invalid values? Did this operation load any invalid values from
memory?
- What operations produced these invalid values?
- This is easy to answer for invalid arguments.
- For invalid values loaded from memory, consider what earlier operations (perhaps in completely separate function calls) may have saved (or failed to save) this value.
- Continue tracing backwards, treating these producer operations as broken operations.
- What operations produced these invalid values?
- Invalid choices: Was it invalid to apply this operation here
given correct arguments (or regardless of the arguments)?
- What control flow decision led to execution of this operation?
- Is the logic of the decision correct?
- If so, continue tracing backwards treating the control flow decision as a broken operation.
- Callers: If using helper functions, you may need to
trace back beyond the function where the crash happened.
- Get a
backtrace
to see where the current function was called. (Don’t just assume you know which function called this one. Get the truth.) - If you want to inspect local variables in functions that
called this one, use
up
anddown
ingdb
to step up and down the call stack, thenprint
what you want to see.
- Get a
- Invalid values: Did any of this operation’s arguments hold
invalid values? Did this operation load any invalid values from
memory?
- Minimize the input. Try to write a small command line that captures the essence of the command line on which your command parser encounters an error. A smaller input is easier to reason about in full.
Avoid print-based debugging.
Avoid using printing for debugging if at all possible. If resorting
to printing (try valgrind
and gdb
first), use fprintf(stderr,
"...\n", ...)
instead of printf("...\n", ...)
and be sure to
remember to include an ending newline character. Disable all
printing in command.c
(except where specified for command_print
and command_show
) in the final version.
Grading
Submit your work by committing and pushing your latest work to your Bitbucket repository, as indicated in the assignment manifest.
Before submitting, disable any diagnostic printing in command.c
. Only
command_print
and command_show
should print, as specified.
Your grade is derived as follows:
- Functional Correctness (50 points):
- When called on well-structured inputs, your code must behave according to the specification.
- Memory Safety and Efficiency (20 points):
- Your code results in no run-time memory safety violations
(as detected by
valgrind
). - Your code allocates no more memory than strictly necessary.
- Your code does not suffer memory leaks (as detected by
valgrind
).
- Your code results in no run-time memory safety violations
(as detected by
- Design, style, and documentation (30 points):
- Your code follows idiomatic pointer style.
- Your code uses clean design and style.
- Your code is well documented.
We compute the correctness and efficiency components of your grade
using a private suite of test inputs in addition to the test inputs
provided with the starter code. You should extend command_test.c
with your own large suite of test inputs and run under
valgrind
to help check that your code meets the
specification and is free of memory safety violations. We will not
grade your tests themselves, but preparing and using them will help
you ensure your code is fully correct and efficient.
More Tips
Emacs Tips
- Compile from within Emacs by typing
M-x compile
, editing themake
command that appears in the mini-buffer (justmake
should work), and hit enter. The compiler output will appear in an Emacs buffer where you can click on errors to jump to the source line involved. Note: the mini-buffers where you type in commands have history accessible with up and down arrows. - Run GDB from with Emacs with
M-x gdb
, starting from a C buffer.- Click left margin to set/delete breakpoints.
- Use buttons at top to run, continue, step, next, navigate up and down the stack, etc.
C Function Declarations
In C, a function is allowed to be used only after (i.e., later in the file than) its declaration. This differs from Java, which allows you to refer to later methods. When declaring helper functions, you can do one of a few things to deal with this:
- Just declare your helper function before the functions that use it.
-
Write a function header earlier in the file and the actual definition later in the file. The function header just describes the name and type of the function, much like an interface method in Java. For example:
// A function header declares that such a function exists, // and will be implemented elsewhere. int helper(int x, int y); // Parameter names are optional in headers. int helper2(char*); void needsHelp() { // OK, because header precedes this point in file. helper(7, 8); helper2("hello"); } int helper(int x, int y) { return x + y; } int helper2(char* str) { return 7; }
-
If the functions would likely get used elsewhere, then put the header in a header file, a file ending in
.h
that contains only function headers (for related functions) and data type declarations. For example, if you added another general function (not just a helper function) for manipulating commands (not required in this assignment), it would be best to place a function header for it with the other function headers incommand.h
so that users of your command library can call it.Header files are included (essentially programmatically copy-pasted) by the
#include
directive you often see at the tops of C source files. Then these functions can be used and their implementations will later be found elsewhere if compiled correctly.
-
This assignment restricts placement of ampersand further than most shells, for simplicity. ↩