🔬 Lab
CS 240 Lab 6
Pointers in C
CS 240 Lab 6
In this lab, you will get practice working with C programs that use pointers, including predicting results of functions that use pointers, writing pointer code, and analyzing incorrect code.
You will also be introduced to some new tools which will be quite helpful in understanding and solving problems with your programs. The GNU debugger gdb allows you to see what is going on “inside” a program while it executes—or what it was doing at the moment it crashed. The debugger allows you to display values of variables and examine contents of memory, which will be handy in understanding the effect of your programs on the hardware of the system.
Another tool you will use is Valgrind, an instrumentation framework for building dynamic analysis tools (pronounced like “grinned” not “grind” as it’s named after a gate in Norse mythology). The Valgrind module we will be using, called Memcheck, is for memory error detection.
Setup
On a lab computer booted to Linux, open VSCode (you can also use your own laptop if necessary).
To get all the files for this week’s lab, in the Terminal from the Linux command line enter:
cd cs240-repos
cs240 start cmemory
cd cmemory
Warm-up
These exercises will help check your basic knowledge of C arrays and strings.
Arrays
In C, an array variable can be declared by putting square brackets after a variable, optionally with a number inside to indicate the size. Curly braces can be used to store fixed data in an array. For example:
int x[3] = {11, 22, 33};
("%d, %d, %d\n", x[0], x[1], x[2]);
printf[2] += 11;
x("%d\n", x[2]); printf
This code will print:
11, 22, 33 44
(this slide has a printf
refresher)
Exercise 1:
In C, each variable holds a value which can be stored in a single register (sometimes these are saved in memory, but they need to fit into a register so that the CPU can load them from memory to operate on them). In the code above, where are the numbers 11, 22, and 33 stored when the array is created?
They are
stored in whichever register is used for
x
.
They
are stored in the computer’s memory somewhere.
They are
stored in a series of registers in the CPU.
Explanation: Note that option 1 is wrong, because you cannot
store three numbers in a single register (imagine if each number were
very large). Option 3 is more plausible, but breaks down if you imagine
an array of 100 or 1000 numbers.
What value is actually stored in the register (or memory location) used for the variable x?
The first of
the three numbers, 11.
The size of
the array, 3.
The
memory address at which the first number is
stored.
In the HW ISA, which instruction would be used to access a value in
the array using the variable’s value? Correct
answer: LW Explanation: For example, if x was stored in register R2, the
instruction LW R3, 0(R2)
would load x[0]
into
register R3.
In addition to the value of the variable x, what other piece of information needs to be kept around in order to safely use the array?
Example answer: The length of the array (3 in this case) needs to be retained, since otherwise you won’t be able to do things like iterate over it until the end. Unlike with a string, there won’t be any indication in the stored data that you’ve reached the end of the array.
Strings
A string in C can be declared using the type char *
and
can be given a value using a double-quoted sequence of characters. For
example:
char *greeting = "Hello!";
The *
indicates that strings are pointers to characters,
and they work just like arrays of characters, except that they are
null-terminated: the last character in the array will
always be the special '\0'
NUL byte which
has numeric value 0. This means that unlike arrays, you can figure out
the length of a string based on the pointer value: just follow that
pointer and then march along until you see the NUL.
Exercise 2:
The NUL terminator means that in addition to the characters specified in the double quotes (stored in 8 bits each using the ASCII encoding), C stores a single byte of all zeroes at the end of each string. How many bytes are used to store the string “Hello!” in this example, including the NUL byte? Correct answer: 7
What is the length of this string, in characters? Correct answer: 6
Name one benefit and one drawback of using a NUL byte:
Hint: One benefit relates to the arrays part above; one drawback relates to memory use.
A benefit: unlike arrays, you don’t have to store the length of a string and make sure to pass it around wherever you use the string. Anyone with a string can figure out its length by iterating until they find a NUL.
A drawback: unlike arrays, each string requires at least 8 extra bits. If you have many very small strings (e.g., 1-character strings) then in the worst case, the total storage requirement can be twice as much as it would be without sentinel values.
Another more subtle drawback: string values themselves cannot include NUL bytes. So there are only 255 possible characters to use instead of 256.
Of course, a string can be as long or short as it needs to be, it
could be thousands of characters long. But remember: in C each variable
is a value that can be stored in a register. For the variable
greeting
above, what value is stored in a register?
The
address in memory where the ‘H’ is stored.
The address
in memory where the final 0 byte is stored.
The address
in memory of all of the stored characters.
The
ASCII-encoded bytes for the ‘H’ and the ‘e’.
Explanation: Answer 2 is wrong because if we stored that
address, there would be no way for us to figure out where the string
started. Answer 3 is wrong because each character is at a different
address: “all the stored characters” don’t actually have one address.
Answer 4 is wrong because even if you tried to store pieces of the
string in subsequent registers, you’d very quickly run out of registers.
In contrast, RAM has lots of space available.
Which character in the code above serves as an indication that the
greeting
variable is holding an address, not a direct
value?
Example
answer: The *
next to
greeting
means that the type of the variable is “pointer to
a character” rather than just “a character.”
Unlike Python, in C you cannot use ‘+’ to concatenate strings (instead there are a few different functions you can use for this). Given that strings are stored as pointers rather than direct values, describe what would be necessary to concatenate two different strings and return a third new string:
Example answer: You would need to copy the characters from both input strings into a new memory location next to each other, put a new sentinel NUL at the end, and then return the address of the start of the new memory location. This would take time proportional to the length of the result.
Pointers Practice
For this part of the lab, you’ll be deducing the values of various expressions involving pointers.
You’ll be filling in values based on this code:
char *p = (char*) 0x35a0;
char *q = (char*) 0x35b0;
// Line 1
int *ip = (int*) p;
int *iq = (int*) q;
// Line 2
[0] = 2;
p[1] = 1;
p[2] = 0;
p[3] = 0;
p// Line 3
[2] = 1;
p*ip = 515;
// Line 4
For each line in the table below, give the type and value of the result of the listed expression. Each expression should be evaluated as if it were run at a specific point in the code, identified by one of the ‘Line’ comments. The first line has been filled in as an example.
- All of your answers in column 3 should be numbers. Write
addresses in hexadecimal with a leading
0x
to indicate their base, and write integer and character values in decimal. - Assume that you are using a machine with 32-bit integers and 64-bit addresses using little-endian byte ordering.
- The
sizeof
function returns the number of bytes taken up by one value of the specified type. - The type for any
sizeof
result should besize_t
, and the type for any result which is a difference between two addresses should beptrdiff_t
. - Remember that when adding to or subtracting from a pointer to do address arithmetic, the integer added or subtracted will automatically be multiplied by the size (in bytes) of the kind of value that the pointer points to. This ensures that the new address is properly aligned.
Exercise 3:
Evaluate these expressions as if they were run where it says “Line 1:”
Expression | Type | Value |
---|---|---|
p |
char * |
0x35a0 |
&p[1] |
Correct answer: char * | Correct answer: 0x35a1 |
&p[-1] |
Correct answer: char * | Correct answer: 0x359F |
&p[0] |
Correct answer: char * | Correct answer: 0x35a0 |
&p[1] - &p[0] |
Correct answer: ptrdiff_t | Correct answer: 1 |
&p[8] |
Correct answer: char * | Correct answer: 0x35a8 |
(p + 1) - p |
Correct answer: ptrdiff_t | Correct answer: 1 |
&p[16] - p |
Correct answer: ptrdiff_t | Correct answer: 16 |
q - p |
Correct answer: ptrdiff_t | Correct answer: 16 |
sizeof(p) |
Correct answer: size_t | Correct answer: 8 |
sizeof(*p) |
Correct answer: size_t | Correct answer: 1 |
These expressions should be evaluated as if the code had advanced to
the “Line 2” marker, so that the ip
and iq
variables are defined:
Expression | Type | Value |
---|---|---|
&ip[0] |
Correct answer: int * | Correct answer: 0x35a0 |
&ip[1] |
Correct answer: int * | Correct answer: 0x35a4 |
&ip[1] - &ip[0] |
Correct answer: ptrdiff_t | Correct answer: 1 |
((char*) &ip[1]) - p |
Correct answer: ptrdiff_t | Correct answer: 4 |
sizeof(ip) |
Correct answer: size_t | Correct answer: 8 |
sizeof(*ip) |
Correct answer: size_t | Correct answer: 4 |
&ip[sizeof(int)] |
Correct answer: int * | Correct answer: 0x35b0 |
ip + sizeof(int) |
Correct answer: int * | Correct answer: 0x35b0 |
ip + 1 |
Correct answer: int * | Correct answer: 0x35a4 |
p + sizeof(int) |
Correct answer: char * | Correct answer: 0x35a4 |
iq - ip |
Correct answer: ptrdiff_t | Correct answer: 4 |
&iq[-1] - ip |
Correct answer: ptrdiff_t | Correct answer: 3 |
These expressions should be evaluated as if the code was at “Line 3,”
where the items in the p
array have been set to new
values:
Expression | Type | Value |
---|---|---|
*p |
Correct answer: char | Correct answer: 2 |
*(p + 1) |
Correct answer: char | Correct answer: 1 |
*ip |
Correct answer: int | Correct answer: 258 |
These expressions should be evaluated as if the code was at “Line 4,”
after *ip
has been updated.
Expression | Type | Value |
---|---|---|
p[0] |
Correct answer: char | Correct answer: 3 |
p[1] |
Correct answer: char | Correct answer: 2 |
p[2] |
Correct answer: char | Correct answer: 0 |
*ip |
Correct answer: int | Correct answer: 515 |
*((char *)ip) |
Correct answer: char | Correct answer: 3 |
*(((char *)ip) + 1) |
Correct answer: char | Correct answer: 2 |
(*((char *)ip)) + 1 |
Correct answer: char | Correct answer: 4 |
gdb
(GNU Debugger)
Next, let’s look at the program main4
from the
cmemory
folder in gdb
, the GNU debugger, which
will allow us to execute our program and examine variables and memory
along the way to find problems.
For extra details, you may want to consult the official
gdb
manual, or some of the links on our tools page, in particular this CS240 reference sheet and this general
quick-reference card.
Exercise 4:
First, build the main4.bin
program (“bin” stands for
“binary” and refers to the fact that it’s in machine code form, ready to
run on the CPU) using make
on the command line:
make main4.bin
If this command doesn’t work, double check that
you’ve done the setup steps from the beginning of the lab. You need to
be in the cmemory
directory that was created when you ran
cs240 start cmemory
.
Next, start the program from the command prompt using
gdb
:
gdb main4.bin
You should now get a prompt that says (gdb)
. First,
we’ll view the code of the program from within gdb
:
(gdb) list // note that line 7 contains the assignment of ‘B’ to ptr[20]
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 int main(void) {
5 char ch = 'A';
6 char* ptr = &ch;
7 ptr[20] = 'B';
8 printf("%c\n", ptr[20]);
9 } 10
Set a breakpoint:
(gdb) break 7
Run the program (it will hit the breakpoint)
(gdb) run
Breakpoint 1, main () at main4.c:7 7 ptr[20] = 'B';
The program is paused at this point, and is ready to execute line 7 next.
You can display the contents of variables or memory locations using
the print
command (the /x
means display the
value in hexadecimal notation). Print the pointer (address in memory) as
a hexadecimal number:
(gdb) print /x ptr
Paste what is displayed, and explain it:
$1 = 0x7fffffffd7e7
This shows the value of the pointer ptr
(in hex), which
is the memory address of the character stored in variable
ch
.
Extra detail: because the address starts with 0x7ff
,
it’s on the stack.
Prints value being pointed to as a hexadecimal number:
(gdb) print /x *ptr
Paste what is displayed, and explain it:
$2 = 0x41
This shows one character (the letter ‘A’) as hex. It’s the value in memory pointed to by the pointer.
Print a value as a character (/c):
(gdb) print /c *ptr
Paste what is displayed, and explain it:
$3 = 65 ‘A’
Shows both the hex value for this byte and the ASCII symbol.
Print a value located at offset from a base address, using ptr notation to refer to the value:
(gdb) print /c *(ptr+20)
Paste what is displayed, and explain it:
$1 = -9 ‘\367’
This is an uninitialized value, so it could be anything. Here we see the number -9, hex 0xf7, represented with a 3-digit octal escape, because it’s a non-printable ASCII character.
I looked it up and apparently octal escape codes are common for the non-printing ASCII control characters (or at least they were around the time GDB was designed).
You can also print a value using array notation to refer to value offset from a base address:
(gdb) print /c ptr[20]
Next, display multiple memory values using x (stands for eXamine). You can specify multiple values to be displayed, started at the specified address:
(gdb) x /24bx ptr //will display 24 bytes of data, starting at ptr in memory
Paste what is displayed, and explain it:
0x7fffffffd7e7: 0x41 0xe7 0xd7 0xff 0xff 0xff 0x7f 0x00
0x7fffffffd7ef: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x7fffffffd7f7: 0x00 0x55 0xf5 0xa2 0xf7 0xff 0x7f 0x00
This shows the memory contents starting at 0x7fffffffd7e7, which is where the ‘A’ is stored on the stack (your specific memory addresses may differ). The ‘A’ (0x41) is shown first in the top left, and the 0xe7 is next, reading to the right and then down. 8 values are shown per row, with 24 total values.
The byte at ptr[20] is actually part of a larger value being stored on the stack for later use: 0x7ffff7a2f55 Do you see how you can read that value from the display? (Start at byte 22 and read backwards). Why do you have to read backwards?
Example
answer: The debugger displays bytes left to
right, with the byte with the lowest address leftmost (the opposite of
little-endian, where the lowest byte holds the least-significant value,
but that least-significant value is written to the right.). So, to write
a multi-byte value with the most significant byte at the highest address
(little-endian order, as we prefer), you need to reverse the order of
the bytes as shown in gdb
.
Now, execute a single instruction using step
:
(gdb) step
At this point, the ptr[20] = ‘B’; instruction would be executed, and you would expect the memory location 20 away from ptr to have a new value. Examine it as you did before:
(gdb) x /24bx ptr
Paste the output and explain which byte got which new value:
0x7fffffffe077: 0x41 0x77 0xe0 0xff 0xff 0xff 0x7f 0x00
0x7fffffffe07f: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x7fffffffe087: 0x00 0x55 0xf5 0xa2 0x42 0xff 0x7f 0x00
The byte 4th from the right in the bottom row has the value
0x42
now, which is the ASCII value for uppercase ‘B’.
Continue after the breakpoint:
(gdb) continue // will continue running after the breakpoint to complete the program
What happens next?
First, B is printed. Then the program generates a segmentation fault. The message reads (note you may have a different specific address):
Program received signal SIGSEGV, Segmentation fault. 0x00007fff42a2f555 in ?? ()
Notice that gdb
tells you where the segmentation fault
occurred. What is significant about that address?
It
contains the 0x42
byte that we wrote into
memory.
It matches
the 8-byte address stored directly after the ‘A’ in
memory.
It doesn’t
start with 7fff… so it’s not a stack
address.
If you get a segmentation fault when running your program from the
command prompt, you can try running it in gdb
, as well,
because you can get more information in the debugger about what is
causing the problem.
Now we’re ready to move on, so exit gdb
by typing
quit
:
(gdb) quit
valgrind
(valgrind
is pronounced like “val
grinned,” after the gate to Valhalla from Norse mythology.)
To understand how pointers work, it is helpful to consider some more
examples where code for allocating memory does not produce the desired
result. In this exercise, you will run your code using
valgrind
, which should help you understand why your code is
incorrect. The following files are in the cmemory
repository:
fibloop2.c
Makefile
search.c
strings3.c
fibloop.c
matvec.c
strings2.c
strings.c
Exercise 5:
Compile the buggy programs by running:
make bugs
Open the file fibloop.c
and examine the code. Explain
how it works:
This
malloc
slide has a quick overview of what the
malloc
function does.
Example answer: It loops through indices putting the fibonacci number in each index by adding together the two numbers in the two previous slots.
Run the program:
./fibloop.bin
Does the program seem to work correctly? Yes
No
Now, run the program using valgrind
:
valgrind ./fibloop.bin
Examine the output carefully (it should mostly match what’s shown here):
==23213== Memcheck, a memory error detector
==23213== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==23213== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==23213== Command: fibloop
==23213==
==23213== Invalid write of size 4
==23213== at 0x400625: fibs (fibloop.c:23)
==23213== by 0x400651: main (fibloop.c:29)
==23213== Address 0x4c2e054 is 0 bytes after a block of size 20 alloc'd
==23213== at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==23213== by 0x4005BF: fibs (fibloop.c:17)
==23213== by 0x400651: main (fibloop.c:29)
==23213==
==23213== Invalid write of size 4
==23213== at 0x400625: fibs (fibloop.c:23)
==23213== by 0x40065F: main (fibloop.c:30)
==23213== Address 0x4c2e0bc is 0 bytes after a block of size 28 alloc'd
==23213== at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==23213== by 0x4005BF: fibs (fibloop.c:17)
==23213== by 0x40065F: main (fibloop.c:30)
==23213==
result = 0 1 1 2 3
==23213== Invalid read of size 4
==23213== at 0x4006DE: main (fibloop.c:41)
==23213== Address 0x4c2e0bc is 0 bytes after a block of size 28 alloc'd
==23213== at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==23213== by 0x4005BF: fibs (fibloop.c:17)
==23213== by 0x40065F: main (fibloop.c:30)
==23213==
result = 0 1 1 2 3 5 8 13
==23213==
==23213== HEAP SUMMARY:
==23213== in use at exit: 0 bytes in 0 blocks
==23213== total heap usage: 2 allocs, 2 frees, 48 bytes allocated
==23213==
==23213== All heap blocks were freed -- no leaks are possible
==23213==
==23213== For counts of detected and suppressed errors, rerun with: -v ==23213== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 6 from 6)
Based on this output, describe the problem with the program.
Example
answer: The loop in fibs has the wrong
continuation condition (i <= n
instead of
i < n
), causing it to run one extra time and write to
memory beyond the end of the allocated block. One of the two prints in
main also prints out this extra value.
Fix the problem by modifying the code, and save the code file. Before
re-running it with valgrind
, what command do you need to
issue in the shell?
Example
answer: You need to run make
(or make bugs
or make fibloop.bin
) to re-build
the machine code from the C code. Otherwise, valgrind
will
be running the old machine code that still has the bugs in it.
Now, re-run it, verifying that valgrind
no longer
produces errors.
Which lines of code needed to be changed to get rid of the
valgrind
errors?
Line 19 had the bug that caused an invalid write. Changing
<=
to <
on this line fixes it. Line 40
had a similar bug, with the same solution.
Alternatively, changing 7 to 8 on line 30 would generate an extra number so that the loop on line 40 would be valid.
Exercise 6:
Now, try the same thing with fibloop2.c
(examine code,
run under valgrind
, analyze and fix errors).
Describe the problem(s) you fixed:
Same <= n issue with the loop, and the return value (now just a
number) is also wrong because it returns the item after the end of the
array. Modifying the malloc
call to allocate one extra spot
is the simplest/best way to fix this because it keeps the semantics
(return the nth fibonacci number).
ALSO, the allocated memory is never freed, so more and more memory
gets wasted every time you call fib
(valgrind
reports a “leak summary” about this).
Predicting the Behavior of Mystery Functions
To get more practice with pointers, we’ll predict the behavior of some functions and verify by running the programs and observing the output.
Exercise 7:
Examine the files main0.c
through main3.c
from the repository and predict the output of each (you already saw
main4.c
in the gdb
section). Feel free to give
a vague answer if you’re not sure, but do actually read the code:
main0:
Example answer: Prints the letter ‘A’
main1:
Example answer: Prints the letter ‘A’
main2:
Example answer: Prints the address of the variable ‘ch’, which will be somewhere on the stack.
main3:
Example answer: Prints a character on or after the stack 20 characters beyond ‘ch’. We don’t know what this character will be and it may change every time we run the program.
Compile all the files by running:
make predictions
Note: You can also compile any of the individual
filename.c
files by running make filename.bin
.
In general, if the makefile
is set up properly, typing
make <name>
will create the file named
<name>
.
Run main0 through main4 to test your predictions by entering the commands as shown below at the command prompt. The expected output is also displayed.
Note: the $
shown here takes the
place of the shell prompt and is included to indicate which lines you
should run vs. which are output.
$ ./main0.bin
A
$ ./main1.bin
A
$ ./main2.bin
0x7ffe8bacf967
$ ./main3.bin
?
Note that the value you see for main2.bin
may be
different, and the character printed by main3.bin
is
effectively random.
The value printed by main2.bin
is the address in memory
that ptr
refers to on the machine you are using to run the
program. A large address in this range (starting with 7ff) in our
machine is understood to be part of the stack, which is
the area of memory used for storing local variables for a function
(along with other important information about which we will soon be
learning more).
The value printed by main3.bin
is whatever is 20 spots
beyond the local variable ch
. We don’t know what will be
there, and a security measure called address
space layout randomization changes what will be there each time the
program runs (this is also why the value for main2.bin
varies). You can try running main2.bin
and/or
main3.bin
several times to see if you get different
results.
Writing Code using Pointers
This section of the lab will give you some practice writing functions with pointers for string manipulation, which are very similar to some you will need for the next assignment.
Exercise 8:
Open practice.c
, and write the code for the following
functions that are declared but are unimplemented:
Write code for string_length_a
. Compare your code to the
implementation shown here:
/**
* Return the length of str, not counting the null terminator
* character.
*
* Precondition: str is a well-formed C string.
*
* Use array indexing. Do not use any other functions.
*/
int string_length_a(char str[]) {
int count = 0;
while (str[count]) {
+= 1;
count }
return count;
}
After writing string_length_a
, compile and run with:
$ make practice.bin
$ ./practice.bin
To be able to scroll through the output, you could also run:
$ ./practice.bin | less
If you use ‘less’, you can hit ‘q’ to quit. Note that in less, you can also hit ‘/’ to search the output.
You should look at the first block of output labeled “Testing
string_length_a
” and ensure that it says “0 FAIL” at the
end.
Next, write the code for string_length_p:
/**
* Return the length of str, not counting the null terminator
* character.
*
* Precondition: str is a well-formed C string.
*
* Do not use array indexing. [] is banned. Use pointer arithmetic instead.
* Do not use any other functions.
*/
int string_length_p(char* str) {
int count = 0;
while (*str) {
+= 1;
str += 1;
count }
return count;
}
Once more, you can compile and run with:
$ make practice.bin
$ ./practice.bin
Check that the tests for string_length_p
succeeded.
Now write code for contains_char_a
. Note that there are
multiple valid approaches here; the solution below is just one of
them.
/**
* Return 1 if the string given by haystack contains the character
* given by needle.
* Return 0 otherwise.
*
* Precondition: haystack is a valid pointer to a well-formed string.
*
* Use array indexing.
* Do not use any other functions.
*/
int contains_char_a(char* haystack, char needle) {
int index = 0;
char this = 1;
while (this != 0) {
= haystack[index];
this += 1;
index if (this == needle) {
return 1;
}
}
return 0;
}
Once more, compile your code and check that the tests passed.
Next, write code for contains_char_p
:
/**
* Return 1 if the string given by haystack contains the character
* given by needle.
* Return 0 otherwise.
*
* Precondition: haystack is a valid pointer to a well-formed string.
*
* Do not use array indexing. [] is banned. Use pointer arithmetic instead.
* Do not use any other functions.
*/
int contains_char_p(char* haystack, char needle) {
while (*haystack) {
if (*haystack == needle) {
return 1;
}
+= 1;
haystack }
return 0;
}
Run the file to make sure your code passes the tests.
Extra Exercises: Why Pointers?
We’ve already established that arrays and strings simply cannot be stored in registers, because they have variable length, unlike integers or characters. Let’s look at an example of that in assembly code.
Consider the following HW ISA assembly code “function”:
Note: in assembly code, ;
is
sometimes used instead of //
to introduce a
comment.
Address | Machine Code | Assembly | Comment |
---|---|---|---|
… | |||
30 | 4004 | OR, R0, R0, R3 | ; Set R3 to 0, this will be our index/counter |
32 | 2245 | ADD, R2, R3, R5 | ; Add R2 and R3 to compute the current address. Store it in R5 |
34 | 0560 | LW R6, 0(R5) | ; Load from memory using R5 as the address, storing into R6 |
36 | 7602 | BEQ R6, R0, 2 | ; Skip next 2 instructions if the R6 value is 0 |
38 | 2144 | ADD R1, R3, R3 | ; R3 += 1 |
3A | 8019 | JMP 25 | ; Jump to instruction #25 (address 0x32) to restart the loop |
3C | FFFF | HALT | ; Replace w/ appropriate jump to continue instead |
This code implements a loop which reads data from sequential memory
addresses until it finds a whole word that’s 0, and then stops. It’s
stored starting at address 0x30, so there’s room for other code before
and after it. You can “call” it by jumping to address 0x30, using the
instruction JMP 24, encoded as 0x8018
.
Exercise 9:
If we were to use this assembly code as part of a larger program, which register should we put a value into before jumping into this code to “call” it?
Correct answer: R2
If we change the final HALT instruction to jump back into other code that we’re “calling” this code from, which register can that code read from to get the number of non-zero memory values that this function counted?
Correct answer: R3
If we treat this as a function with one argument and a return value using the registers you just identified, and we assume that the argument is supposed to be a string, how would you describe what this function does?
It measures the length of a string.
Small caveat: this function requires the string to end with 16 0 bits, rather than just 8, because it loads an entire 16-bit word at a time.
Assume that you have the following data stored in memory starting at address 5:
Address | Value | ASCII Letter |
---|---|---|
5 | 0x48 | H |
6 | 0x45 | E |
7 | 0x4C | L |
8 | 0x4C | L |
9 | 0x4F | O |
A | 0x00 | NUL |
B | 0x00 | NUL |
What steps do you need to take to measure the length of the string
"HELLO"
by “calling” our assembly code “function?” Describe
specific assembly instructions and which register(s) need to hold what
data (but you don’t have to actually list every required
instruction).
First, you need to store the value 5 in register R2. This can be accomplished by using a few ADD instructions starting with R1 as an input, since R1 always has the value 1. You can do 1 + 1 = 2, then 2 + 2 = 4, and finally 4 + 1 = 5.
Next, you need to “call” the function by issuing a “JMP 24” instruction.
Finally, you’d need to modify the final HALT in the function to jump back to your code (wherever it’s being written) so that you can then read the value of register R3 which will have your result.
What would the result (stored in R3) be if we set R2 to 7 and then ran this code?
Correct answer: 3
Extra Exercises: More Predictions
For each of main9.c
, mystery0.c
, and
mystery1.c
: open the file, read the C program, predict the
printed output, then compile and run the program to verify your
predictions. Explain any surprising behavior in the actual output.
Exercise 10:
Predict the output of main9.bin
:
Prints the following:
- The letter ‘P’ (hex 0x50 is ASCII capital P; that’s the byte stored at the start of the 4 bytes that together form the 4th integer in the array ‘a’ (at index 3).
- The letter ‘A’ (hex 0x41, stored in the MSB of a[0], which we get because we cast ‘a’ to a char* before adding 3).
- The letters “D C B A H G F E L K J I P O N M” (we’re iterating through the whole array of ints printing each byte as a char, but they’re in little-endian order.
Was there anything surprising about the output of
main9
?
Example answer: The fact that the numbers are little-endian and so you don’t get the alphabet in order is probably surprising…
Predict the output of mystery0.bin
:
Prints:
a = "hello!"
b = "hello!"
a = "cs 240"
b = "hello!"
a = "cs0xF" b = "hello!"
Was there anything surprising about the output of
mystery0.bin
?
Example answer: The behavior of the final block is a bit tricky. Since the mystery function (which copies strings) writes a NUL byte at the end, the “cs 240” string is effectively shortened.
Predict the output for mystery1.bin
:
Note: this one is basically impossible to predict correctly, even ignoring that you can’t predict stack addresses!
Prints:
x = 19
p = <some stack address starting with 0x7ff…>
a = "Hello!" b = "Hi!"
Now the message “Hi, CS 240!” gets written starting from b, and since
b only has space for 4 bytes, the “Hi,” goes into b, and further bytes
overwrite other values. It would be reasonable to assume that the 4
bytes used to store the integer x are next after the array b, followed
by the 8 bytes for a and then the 8 bytes for p. However, if we run the
program with gdb
we can use the print command to show the
addresses of our variables: print &p
then
print &a
, etc. It turns out that the ordering is b,
then a, then p, then x, and there’s a 4-byte gap between p and x. The
ordering probably has to do with their types, although I’m not certain
on the details of how the compiler makes these decisions. The gap for x
has to do with “alignment” which we’ll see more about later.
This means that the rest of the string, including the NUL terminator,
fits neatly into a
.
So we print:
x = 19
p = <same stack address as before>
a = "CS 240!" b = "Hi, CS 240!"
Importantly, even though b is only 4 bytes, when we print it, the print code runs until it sees the NUL byte, regardless of how big b is “supposed” to be.
Next up, we have the string “What happens if we use a really really long string?” This will overwrite not only all of the variables on the stack, but also bookkeeping information like the return address, causing a segmentation fault when the function returns (but it’s not going to do that just yet…). These bytes are assigned to our variable as follows:
- “What” to b
- ” happens” to a
- ” if we u” to p
- ” rea” to x after placing “se a” into the gap between p and x
- The rest of the string overwrites critical stack information beyond the variables.
Hex values for ” if we u” are 0x20, 0x69, 0x66, 0x20, 0x77, 0x65, 0x20, and 0x75. They will form 0x7520657720666920 in little-endian order, so that’s what p will show as.
Meanwhile, the values for x (” rea”) are 0x20, 0x72, 0x65, and 0x61, which in little-endian order is hex 0x61657220. A calculator tells me that that value is 1634038304 in decimal, which is what will be printed for x.
So we print:
x = 1634038304
p = 0x7520657720666920
a = " happens if we use a really really long string?"
b = "What happens if we use a really really long string?"
Note that once again the a and b prints don’t care that information is written ‘outside’ of the variable; they just continue printing until NUL is found.
Finally, we try to copy ‘Hi?’ to p, but since the value of p got scrambled, we get a segmentation fault here.
Surprises in actual output for mystery1:
Everything is a surprise! The exact way memory gets corrupted on the second and third writes is surprising, and the segmentation fault or other weird behavior for the last block is definitely surprising.
Especially surprising to me was the way that the variables were laid
out in memory on the stack. I had to triple-check with gdb
to know what their addresses were in order to predict the results from
the really long string, which I got wrong on my first 2-3 attempts.
Extra Exercises: Memory Diagrams
The strings3.c
file contains the following code:
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv) {
// dynamically allocate space for 3 pointers
char** commandA = (char**) malloc( 3 * sizeof(char*) );
// commandA is an array of pointers to strings.
// commandA itself is also the address in memory where the first pointer is stored.
// each pointer will be the starting address of a string of characters.
// Initialize the strings for commandA
[0] = "emacs";
commandA[1] = "strings.c";
commandA[2] = NULL; /* mark the end of the commandA array by assigning a
commandA NULL to the last pointer in the array */
// deallocate the memory used for the commandA array
(commandA);
free
// allocate a new array of 3 pointers called commandB
char** commandB = (char**)malloc( 3 * sizeof(char*) );
//initialize the strings for commandB
[0] = "ls";
commandB[1] = "cs240-pointers";
commandB[2] = NULL;
commandB
//change the second string of the commandA array
[1] = "uh oh";
commandA
//print the strings of each array
("A: %s %s\n", commandA[0], commandA[1]);
printf("B: %s %s\n", commandB[0], commandB[1]);
printf
// deallocate the commandB array
(commandB);
free
return 0;
}
The first four statements in the main function are:
char** commandA = (char**)malloc( 3 * sizeof(char*) );
[0] = "emacs";
commandA[1] = "strings.c";
commandA[2] = NULL; commandA
In previous classes like CS 111 and CS 230, you’ve used memory diagrams to help you understand how data is organized, particularly when it contains references, which arrays/lists in Java and Python do. Now you get to actually see how memory is really used:
- An arrow in a memory diagram represents a pointer, with the numerical address of whatever is at the head of the arrow stored in the box that holds the tail of the arrow.
- A small box in a memory diagram that can hold a value represents a particular spot in memory, or possibly (when used for a variable) a register in the CPU. These boxes can hold a specific amount of bits based on the hardware design of the CPU (typically 32 or 64).
- A large box (such as for an array or list) represents a contiguous block of memory that holds multiple values (the small boxes inside).
Instead of drawing a typical memory diagram, we will now draw memory as a column of rectangles, each representing one word of memory (64 bits for now, since the CS department server is a “64-bit” machine that has 64-bit registers in its CPU). The rectangles each have an address written in the left-hand margin, which you can make up for now. In addition to drawing arrows in this diagram that point upwards or downwards to other boxes, you will write the address of the box they point to in the box containing the arrow, so that the arrows are purely a visual aid, and the boxes represent the actual data stored.
Here’s an example of an abstract memory diagram that you’re familiar with on the left, plus the same data shown in the new style we’ll be using in labs. Note that we need to make up an address for each piece of data, for now you can use any addresses as long as you’re consistent. The actual specifics of which values are on the stack, on the heap, or stored in a program’s data segment are things we’ll learn in the coming weeks.
This is the memory diagram format we’ve used in previous classes:
Here is the same data in CS 240 lab style:
This table represents the same data (with the same layout) as the new-style diagram, but it may be more accessible to people using a keyboard to navigate and/or using a screen reader.
Variable | Address | Data | Comments |
---|---|---|---|
0x200 | 240 | Third int in the array that
starts below |
|
0x1F8 | 230, 111 | Two 4-byte int in this 8-byte
slot |
|
… | |||
C | 0xA0 | 0x1F8 | Type int * . Points to start
of block above. |
… | |||
0x68 | 0x0, 0x53, 0x43 | Three bytes ‘C’ = 0x43, ‘S’ = 0x53, and NUL = 0x0. Little endian. | |
… | |||
0x38 | 0x0 | “null” pointer | |
0x30 | 0x68 | Type char * . Points to start
of block above. |
|
… | |||
B | 0x18 | 0x68 | Type char * . Points to start
of third block up. |
A | 0x10 | 0x30 | Type char ** . Points to start
of next block up. |
Notes on the new style of memory diagrams:
- You get to make up whatever addresses you want, except that data
which is by definition contiguous, like string characters or array
entries, must be shown at contiguous addresses.
- Smaller addresses start at the bottom. Note: You
will see some contexts, like
gdb
“examine” output, where the convention is the opposite - When addresses are not contiguous, you should add ellipses to indicate this.
- In this diagram, A and B did not need to be drawn together, and C was placed separately to illustrate this point. But drawing C next to A and B would have saved some ellipses and would also be fine.
- Smaller addresses start at the bottom. Note: You
will see some contexts, like
- Each box is 64 bits. When multiple smaller values are packed into
one box, we sub-divide the box appropriately.
- Each address is a multiple of 8 bytes, so the last digit in hex is either a 0 or an 8.
- Little-endian ordering affects where things are placed. The eight individual bytes in each box are addressed from right to left.
- Values are numbers, but may be written as hexadecimal (prefixed with
0x) or decimal as appropriate.
- 1-byte character values can be written as single-quoted letters (not shown here; see below).
- Interpretations of the data such as string values belong in the “comments” column.
- Address values should always be written in hex (including pointers).
- The arrows always point to the address matching the value in the box. They only exist to help you trace where pointers refer to; they are not actually data stored in the system.
Here is a memory diagram for the commandA
variable from
the code above:
Here’s the equivalent table:
Variable | Address | Data | Comments |
---|---|---|---|
0x140 | ‘\0’, ‘c’ | continued from below | |
0x138 | ‘.’, ‘s’, ‘g’, ‘n’, ‘i’, ‘r’, ‘t’, ‘s’ | “strings.c” | |
… | |||
0x100 | 0x00, 0x73, 0x63, 0x61, 0x6D, 0x65 | “emacs” | |
… | |||
0xD0 | 0x0 | null | |
0xC8 | 0x138 | Type char * |
|
0xC0 | 0x100 | Type char * |
|
… | |||
commandA | 0x48 | 0xC0 | Type char ** |
Here are the declarations for commandB
:
char** commandB = (char**)malloc( 3 * sizeof(char*) );
[0] = "ls";
commandB[1] = "cs240-pointers";
commandB[2] = NULL; commandB
Draw a memory diagram for commandB
similar to the one
above, and check it against the table below (note that your addresses
will likely be different, which is fine as long as they’re consistent
with each other).
Variable | Address | Data | Comments |
---|---|---|---|
0xD8 | ‘\0’, ‘s’, ‘r’, ‘e’, ‘t’, ‘n’, ‘i’ | continued | |
0xD0 | ‘o’, ‘p’, ‘-’, ‘0’, ‘4’, ‘2’, ‘s’, ‘c’ | “cs240-pointers” | |
… | |||
0xA0 | ‘\0’, ‘s’, ‘l’ | “ls” | |
… | |||
0x50 | 0x0 | null | |
0x48 | 0xD0 | Type char * |
|
0x40 | 0xA0 | Type char * |
|
… | |||
commandB | 0x20 | 0x40 | Type char ** |
Predict what you think the output will be after these 3 statements in the program:
[1] = "uh oh";
commandA("A: %s %s\n", commandA[0], commandA[1]);
printf("B: %s %s\n", commandB[0], commandB[1]); printf
Both commandA and commandB will have their second part updated, since commandB is likely allocated right on top of the freed memory for commandA.
Run the program:
$ ./strings3.bin
What happens? Use valgrind
and re-examine the complete
program to explain the problem (but do not fix the error):
It prints:
A: ls uh oh B: ls uh oh
As predicted, the memory for commandB is corrupted because we accessed previously-freed memory which overlapped with it.
Use gdb to examine memory and understand the connection between the diagram and real memory.
$ gdb ./strings3.bin
Set a breakpoint at the beginning of the program:
(gdb) break main
What address in memory does main begin (you should be able to read that as a result of setting the breakpoint)?
0x4005cc (much earlier in memory than the stack)
Note that your value may differ slightly.
Run the program within gdb
:
(gdb) run
You will hit the breakpoint, meaning you are at the beginning of the main program.
List the instructions:
(gdb) list
7 int main(int argc, char** argv) {
8 char** commandA = (char**)malloc( 3 * sizeof(char*) );
9
10 commandA[0] = "emacs";
11 commandA[1] = "strings.c";
12 commandA[2] = NULL;
13 14 free(commandA);
Examine the local variables at this point:
(gdb) info locals
There should be two local variables listed, commandA
and
commandB
.
You can also display the value of a variable directly:
(gdb) print commandA
What is the value of commandA
at this point? Why?
gdb
should show something like:
$2 = (char **) 0x0
The code hasn’t started to run yet, so the variable does not yet have a value set. It’s not necessarily the case, but it’s a decent bet that variables which have not yet been initialized will be 0.
Execute one step (the first line in main, which allocates space for
commandA
):
(gdb) step 1 10 commandA[0] = "emacs";
Note: after you execute a step, gdb will display the next line to be executed.
Now print commandA
again:
(gdb) print commandA
What has changed and why?
Now we see something like:
$3 = (char **) 0x602010
Now commandA
points to some space on the heap, because
we’ve called malloc
.
Now, examine the contents of memory at commandA
(using
the /3a
argument to the x
command to mean
“mean display the 3 addresses stored there”):
(gdb) x /3a commandA
You should see something very similar to this (although the actual addresses on your machine may be different):
0x602010: 0x0 0x0 0x602020: 0x0
Why are we interested in 3 addresses – what are you looking at? Why are they all 0 right now?
Example
answer: We’re looking at the pointers in the
commandA
array-of-pointers. We hope they’re all 0 because
they’re uninitialized, although they could have any values, because we
did not call calloc
to allocate + clear the memory, we
called malloc
to just allocate it.
Execute the next instruction:
(gdb) step 1
Examine commandA
again:
(gdb) x /3a commandA
0x602010: 0x400740 0x0 0x602020: 0x0
Note: Remember that gdb
displays
lower address values first, followed by higher values, which is the
opposite of the memory diagrams we just drew.
You could also examine the value with a print
command,
although it will only display the single value pointed to by
commandA1
(not all three pointers/addresses in the
array).
What has changed, and why? (Remember that the instruction you just
executed is: commandA[0] = "emacs";
)
As you might expect, the first pointer in the array of pointers has been assigned a value. It now points to the program data where “emacs” was stored as part of the program’s code.
To examine memory at the address specified by the new value you see, and display a byte in hexadecimal notation, run:
(gdb) x /bx *commandA
You can also do this with:
(gdb) print /x *commandA[0]
Or,
(gdb) print /x **commandA
Record what you see:
0x400740: 0x65
Again, your exact address may be different.
What does this value represent (remember that commandA
is an array of pointers to strings, and that each string can be thought
of as an array of characters)?
Example answer: This is the ‘e’ of the string “emacs”
Repeat the command, but display the contents of memory as characters (in this case, the 6 characters in memory starting at the specified address):
(gdb) x /6c *commandA
Examine memory at that same address again, this time formatting the data as a string:
(gdb) x /s *commandA
Or, to do the same thing:
(gdb) x /s commandA[0]
Or:
(gdb) print commandA[0]
Now, execute another step:
(gdb) step 1
Print the new value:
(gdb) print /a commandA[1]
What has changed? Explain what the new value/address you see represents:
You will see something like:
$5 = 0x400746
This is the memory address of the string “strings.c”, also allocated with the program data.
Examine memory at the new address, formatting the data as a string:
(gdb) print /s commandA[1]
Is it what you expected?
Example answer: Indeed this shows “strings.c”
Execute another step and print the third entry in the
commandA
array:
(gdb) step 1 (gdb) print commandA[2]
Has anything changed? What did the last instruction accomplish?
(commandA[2] = NULL;
)
You will see:
$7 = 0x0
This sets the last part of the commandA array to NULL. If we had junk data in there before, it will be cleared away, but it’s reasonably likely the value was already zero (as shown above). Now it will reliably be zero.
After the next step, you will have executed the free()
call:
(gdb) step 1
What does the free do?
Tells malloc
that it can re-use the memory we had
previously reserved. The first part of that memory may be immediately
overwritten by a new value (it was in my testing).
Execute the next step, which initializes commandB
, then
print commandA
and commandB
again:
(gdb) step 1 (gdb) info locals
Can you now explain why the program prints “uh oh” twice?
commandB
was allocated on top of the free memory that
used to be commandA
, so changes using the
commandA
pointer will affect commandB
. It’s
invalid to use the commandA
pointer at all after it’s been
freed.
Exit gdb:
(gdb) quit
Now, at the command prompt, run the program using
valgrind
, and examine the errors produced:
valgrind ./strings3.bin
The output should look like this:
==7698== Memcheck, a memory error detector
==7698== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7698== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==7698== Command: ./strings3.bin
==7698==
==7698== Invalid write of size 8
==7698== at 0x40064E: main (strings3.c:22)
==7698== Address 0x5205048 is 8 bytes inside a block of size 24 free'd
==7698== at 0x4C2B06D: free (vg_replace_malloc.c:540)
==7698== by 0x40060E: main (strings3.c:14)
==7698== Block was alloc'd at
==7698== at 0x4C29F73: malloc (vg_replace_malloc.c:309)
==7698== by 0x4005D5: main (strings3.c:8)
==7698==
==7698== Invalid read of size 8
==7698== at 0x40065D: main (strings3.c:23)
==7698== Address 0x5205048 is 8 bytes inside a block of size 24 free'd
==7698== at 0x4C2B06D: free (vg_replace_malloc.c:540)
==7698== by 0x40060E: main (strings3.c:14)
==7698== Block was alloc'd at
==7698== at 0x4C29F73: malloc (vg_replace_malloc.c:309)
==7698== by 0x4005D5: main (strings3.c:8)
==7698==
==7698== Invalid read of size 8
==7698== at 0x400664: main (strings3.c:23)
==7698== Address 0x5205040 is 0 bytes inside a block of size 24 free'd
==7698== at 0x4C2B06D: free (vg_replace_malloc.c:540)
==7698== by 0x40060E: main (strings3.c:14)
==7698== Block was alloc'd at
==7698== at 0x4C29F73: malloc (vg_replace_malloc.c:309)
==7698== by 0x4005D5: main (strings3.c:8)
==7698==
A: emacs uh oh
B: ls cs240-pointers
==7698==
==7698== HEAP SUMMARY:
==7698== in use at exit: 0 bytes in 0 blocks
==7698== total heap usage: 2 allocs, 2 frees, 48 bytes allocated
==7698==
==7698== All heap blocks were freed -- no leaks are possible
==7698==
==7698== For lists of detected and suppressed errors, rerun with: -s ==7698== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
Explain the meaning of each of the three errors shown above:
The first error is the write to commandA[1]
which was
already freed.
The second two errors are for reading commandA[0]
and
commandA[1]
for the print.
Explain how to fix these errors.
Example answer: Move the free call for commandA down to where the free for commandB is.
Extra Exercises: More Pointer Code
If you want extra practice, here are solutions for some of the more
complex functions in the practice.c
file:
Remember that you can compile and run tests with:
make practice.bin
./practice.bin
You may wish to add | less
to the
second command to paginate the output. If you do, use ‘q’ to quit
pagination when you’re done.
substring
/**
* Return a pointer to a newly allocated string holding the characters
* in haystack starting with the character at index start and ending
* just before the character at index end.
*
* Preconditions:
* - haystack is a valid pointer to a well-formed string.
* - To begin, assume start and end respect the bounds of haystack:
* - start > 0
* - start < (length of string) - 1
* - end < (length of string) - 1
* - end - start >= 0
*
* Do not use any other functions besides malloc.
*/
char* substring(char* haystack, int start, int end) {
int len = end - start; // abusable
char *result = (char*) malloc((len + 1) * sizeof(char));
int i;
for (i = 0; i < len; ++i) {
[i] = haystack[start + i];
result}
[i] = '\0';
resultreturn result;
}
contains_string
/**
* Return 1 if the string given by haystack contains the string given
* by needle as a substring.
* Return 0 otherwise.
*
* Precondition: haystack and needle are both valid pointers to
* well-formed strings.
*
* Do not use any other functions. There are a few ways to implement
* this, some more efficient than others. Start with a simple
* approach. Optimize if you have time left over at the end of lab.
*/
int contains_string(char* haystack, char* needle) {
int matchfrom;
int matchto;
int targetlen = string_length_a(needle);
if (targetlen == 0) {
return 1;
}
int index = 0;
char this;
do {
= haystack[index];
this += 1;
index if (this == needle[0]) {
// Attempt to match the whole string here before proceeding
char cmp = 'A'; // just to get into the loop
= 1;
matchfrom = 0; // because we've already added 1 to index
matchto while ((matchfrom < targetlen) && cmp) {
= haystack[index + matchto];
cmp if (cmp == needle[matchfrom]) {
+= 1;
matchfrom += 1;
matchto } else {
break;
}
}
if (matchfrom == targetlen) {
return 1;
}
}
} while (this);
return 0;
}