CS 240 Lab 10

Data Structures in Memory

CS 240 Lab 10

You have seen how primitive data like integers, floating point numbers, and characters from C programs are stored in memory (as contiguous numerical values, accessed by location/address in memory). Today, we will investigate how arrays in C are stored in memory at the machine level, and we will take this opportunity to practice writing assembly code by hand.

Writing Assembly Code

When writing assembly code, we first identify segments to store data and/or instructions in, typically using just the .text segment for code and the .data segment for global variable values. Then we use directives to tell the assembler how to handle things, and symbols to mark the start of functions or data values so that we can refer to their addresses without knowing ahead of time what those addresses will be. The rest is just assembly code instructions, which the assembler will convert into machine instructions for us.

Exercise 1

To get used to writing assembly code, we’ll write a “Hello world” program. Create a lab10 folder in your cs140-repos directory, and within that folder, create a file named hello.s (.s is the extension for assembly code files). In that file, write:

.data

.text

to define our two segments. In the .data segment, we’ll need to store the text that the program will print out. Define a symbol message that stores the address of a string that spells out “Hello world.”.

Which directive do you need to use to tell the assembler you want to include a string value?

Correct answer: .string

Check your code against the answer below:

Example answer:

.data
message: .string "Hello world.\n"

.text

Now we’ll add a “global” symbol named “main,” using the .global directive within the .text segment. The assembler will by default build an executable file which starts by jumping to that label. Check that your code now looks like this:

Example answer:

.data
message: .string "Hello world.\n"

.text
.global main
main:

After the “main” symbol, write assembly code to accomplish the following steps, checking your code along the way:

  1. Subtract 8 from the stack pointer: Correct answer: sub $8, %rsp Explanation: This is necessary because we are going to call a function, and the stack pointer must always be a multiple of 16 when a function is called. Since the return address from main will be added to the stack when we call it, and the stack pointer would have been a multiple of 16 before that, we need to subtract another 8 for things to line up. We will not use the slot that we get this way, and we will have to give it back before we return.
  2. Put the message address into the register used for the first argument to a function call: Correct answer: mov $message, %rdi Explanation: Since message will be replaced with a memory address, we need to use $ to put that address into %rdi, instead of loading from memory at that address into %rdi.
  3. Call printf: Correct answer: call printf Explanation: Note that certain basic I/O functions are made available as symbols even without specifying any includes. When writing C code, we should technically use #include <stdio.h> whenever we use printf, but we’re skipping that step here.
  4. Put 0 into %rax as our return value: Correct answer: mov $0, %rax
  5. Add 8 to the stack pointer: Correct answer: add $8, %rsp
  6. Return: Correct answer: ret

Now compile your program using gcc:

gcc -Wall -std=c99 -g -O -o hello.bin hello.s

Run your program, and it should print out “Hello world.” If it doesn’t, check your program against this solution:

Example answer:

.data
.global message
message: .string "Hello world.\n"

.text
.global main
main:
    sub $8,%rsp
    mov $message,%rdi
    call printf
    mov $0,%rax
    add $8,%rsp
    ret

Finally, replace your message: line with the following line of code:

message: .byte 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21, 0x0a, 0x00

What happens when you recompile and run the program again. Why?

Example answer:

It still works, and now it prints out “Hello world!” The bytes listed above are the ASCII values for the letters in that message, so we’ve just assembled the same binary data (with the period changed to an exclamation point) as before, using a different directive.

One-dimensional arrays

Let’s begin with perhaps the simplest data structure, a one-dimensional array. When all the elements of an array are the same size (such as an array of integer values), the elements of the array can be stored in their indexed order in memory, and accessed using:

base address of array + (index * size of element)

where the size of the element is in bytes (so, an integer has a size of 4).

Exercise 2

Here is a simple C program to return a single element from an array of integers:

#include <stdio.h>

int elements[] = {0x1,0x3,0x5,0x7,0x9,0xb,0xd,0xf};
long aindex;

int getElement(int* baseAddress,int index) {
    return *(baseAddress + index);
}

int main() {
    printf("Enter an array index: ");
    scanf("%lx",&aindex);
    printf(
        "The value of the array element is: %x\n",
        getElement(elements, aindex)
    );
    return 0;
}

Save this code into a file called array.c (you can put it into a folder called lab10 within your cs240-repos folder if you want to keep it organized).

Including our usual -W and -std= flags, plus the flag for including debug information and for basic optimization, what gcc command would be used to compile this file into an array.bin file?

Correct answer: gcc -Wall -std=c99 -g -O -o array.bin array.c

Once it’s compiled, what command would be used to run the program?

Correct answer: ./array.bin

Run it several times, and fill in the table below by entering the indices and seeing what value is printed (answer with what is printed, not what the decimal value is):

Index Value
0 Correct answer: 1
1 Correct answer: 3
4 Correct answer: 9
7 Correct answer: f

Exercise 3

Below is a version of X86 code that does the same thing as the main function in the C program:

.text

.global main
main:
    # setup
    sub    $0x8,%rsp

    # prompt for index of element in array
    mov    $prompt_string,%edi
    mov    $0x0,%eax
    call     printf

    # accept input
    mov    $aindex,%esi
    mov    $format_string1,%edi
    mov    $0x0,%eax
    call     scanf

    # call function to return specified element of array
    mov    aindex,%rsi
    mov    $elements,%edi
    call    getElement

    # print specified array element
    mov    %eax,%esi
    mov    $format_string2,%edi
    mov    $0x0,%eax
    call     printf

    # cleanup and return from main
    mov    $0x0,%eax
    add    $0x8,%rsp
    ret

Create a new file called arrayASM.s (files containing X86 code should end with the .s prefix), and paste the above X86 code for main into the file.

The code above is missing two things: data declarations for the elements, aindex, and prompt/format string variables, as well as the code of the getElement function. In your code, add the declaration for the elements global variable into a .data section. Check your result against the answer below:

Example answer:

.data

elements: .int 0x1, 0x3, 0x5, 0x7, 0x9, 0xb, 0xd, 0xf

Now in the same segment, add aindex, including a .global specification so that it’s identified as a symbol that can be linked externally. Check your answer below:

Example answer:

.global aindex
aindex: .quad 0

Now add declarations for the prompt_string, format_string1, and format_string2:

Example answer:

prompt_string: .string "Enter an array index: "
format_string1: .string "%lx"
format_string2: .string "The value of the array element is: %x\n"

Now, add a getElement() function that accomplishes the same task as the C version of the program (remember to put it in your .text section). Your function only needs to use two assembly instructions.

Example answer:

.global getElement
getElement:
    mov (%rdi,%rsi,4), %eax
    ret

Note: Remember the register saving conventions for callee if you modify registers %rbx, %rbp, or %r12 - %r15 in getElement. You should not have to do that though.

Assemble and then run your code with the following shell commands:

gcc -Wall --std=c99 -g -O -o arrayASM.bin arrayASM.s
./arrayASM.bin

Test by using various values in the range 0 – 7 for the index, and verify that you get the correct values by filling in this table again:

Index Value
0 Correct answer: 1
1 Correct answer: 3
4 Correct answer: 9
7 Correct answer: f

Now try using the number ‘8’ as an index in your assembly program. What value is printed? Why?

Example answer:

8 is printed, because there are only 7 elements in the array, so 8 ends up retrieving data from what is after the array. In your assembly file, you declared aindex right after elements, so the first 4 bytes of aindex are read as elements[8]. We entered 8, so that’s what’s stored there.

Exercise 4

Add two new x86 array declarations in arrayASM.s , including an array of words (2 bytes each) and an array of bytes, equivalent to this C code:

int elements[] = {0x1,0x3,0x5,0x7,0x9,0xb,0xd,0xf};
short short_elements[] = {0x23, 0x25, 0x27, 0x29, 0x31, 0x33, 0x35, 0x37};
char byte_elements[] = {0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80, 0x90};

Which GNU x86 assembly directive (reference for assembly directives is here) should be used to assemble 16-bit (= 2-byte) values?

Correct answer: .short Feedback: .word Explanation: Both .short and .word work. Note that “word” here is DIFFERENT than “word” in other contexts, which often means 4 byes, not 2.

Which directive can be used for 1-byte (= 8-bit) values?

Correct answer: .byte

Note: In x86, you can use the .byte directive to specify byte-sized values.

Place your new declarations directly after the declaration of the elements array. Double-check against this example:

Example answer:

elements: .int 0x1, 0x3, 0x5, 0x7, 0x9, 0xb, 0xd, 0xf
short_elements: .short 0x23, 0x25, 0x27, 0x29, 0x31, 0x33, 0x35, 0x37
byte_elements: .byte 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80, 0x90
.global aindex
aindex: .quad 0

Re-assemble your file and start gdb:

gcc -Wall --std=c99 -g -O -o arrayASM.bin arrayASM.s
gdb arrayASM.bin

Disassemble the main function:

(gdb) start
(gdb) disas main

Examine the 14 words of memory at the address elements:

(gdb) x /14wx &elements

How many 4-byte slots does the short_elements array take up?

Correct answer: 4 Explanation: Each entry is written as 2 hex digits, but the .short directive stores them as 2 bytes, with a 00 byte in addition to what we wrote, so 8 2-byte entries is 16 bytes or 4 4-byte slots.

How many 4-byte slots does the byte_elements array take up?

Correct answer: 2 Explanation: Here each entry is 1 byte, so 8 entries = 2 4-byte slots.

In the gdb output, does 0x23 appear to the right or to the left of 0x25? Why?

Example answer: It appears to the right of 0x25, because since we asked to treat each 4-byte slot as a single number, those two numbers (along with their 00 bytes) are together in one slot. Since 0x23 is first in memory and we’re on a little-endian system, where the little address is the end of the number, 0x23 is treated as the end of the number and 0x25 as the middle, so when writing with least-significant digits to the right as we do in English, 0x23 appears to the right of 0x25.

How much of the elements, short_elements, and byte_elements arrays did our x command display?

Example answer: It displayed the contents of all 3 arrays.

Continue execution, and enter an index of 8. What value is printed? Explain.

Example answer: The value printed is 250023. This is printed in hex format, based on the first int-sized (i.e., 4-byte) slot after the end of elements. That’s the beginning of short_elements, and as explained above, the value is the 0x0023 followed by 0x0025 taken together as one little-endian integer value.

For each of the values in the table below, enter the index you need to input to get the program to print it out:

Index Printed Value
Correct answer: 9 290027
Correct answer: a 330031
Correct answer: c 50403020
Correct answer: d 90807060

Exercise 5

Based on your getElement function, write a getElementByte function in assembly code (again, only two instructions are needed). It should take an array and an index and, assuming the array is an array of 1-byte items, it should return the item at the given index. Check your answer below:

Example answer:

.global getElementByte
getElementByte:
    mov (%rdi,%rsi,1), %al
    ret

Note the destination registers are appropriately sized for the amount of data we want to move, which is how mov knows how much to deliver.

If we wanted to declare the original elements array using the .byte directive instead of the .int directive, but we wanted it to retain the same value as an array of integers, how many byte values would we need to specify?

Correct answer: 32 Explanation: The elements array contains 8 integers, each of which is 4 bytes, so in total that’s 32 bytes.

When we declared elements we wrote:

elements: .int 0x1, 0x3, 0x5, 0x7, 0x9, 0xb, 0xd, 0xf

We only used 8 hex digits on that line of code. How were we able to specify 32 bytes of data using only enough hex digits to specify 32 bits of ones and zeroes?

Example answer: The .int directive specifies that each value should be interpreted as a 4-byte = 32-bit value. Since the values we’re using are small, they consist of mostly zeroes, and we don’t have to write out extra zeroes on the left of the most-significant digit.

What would the .byte directive version of this data look like for the first two entries of the array?

Example answer:

elements: .byte 0x01, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00

Remember: all data stored in a computer is stored in binary. We assign meanings to it, and view it as hexadecimal, decimal, or even via a character encoding as letters in a string. But if we get mixed up, we may accidentally interpret one kind of data as another. Whenever a computer shows you a surprising value that looks garbled, consider whether using a different interpretation might make sense of it.

Two-dimensional arrays

In lecture, you have learned that two-dimensional arrays in C can be implemented as contiguously stored elements in memory using row-major format. The address of an element can be calculated by the following formula:

address of element[row][col] =

    base address of array +
    (row * size_of_row * size_of_element) +
    (col * size_of_element)

or, more succinctly:

    base address of array +
    (row*(size_of_row) + col)*size_of_element

The number of bytes per element is the “size of element,” and the number of elements in a row is the number of columns, also called the “size of row.”

Below is a C program which accesses an element of a 2-dimensional array:

#include <stdio.h>

int grid[4][4] = {
    {0x1, 0x2, 0x3, 0x4},
    {0x5, 0x6, 0x7, 0x8},
    {0x9, 0x10,0x11,0x12},
    {0x13,0x14,0x15,0x16}
};

int row;
int col;

int getElement(int* baseAddress, int size_of_row, int row, int col) {
    return *(baseAddress+ (row*size_of_row  + col));
    //pointer arithmetic determines num of bytes per element
}

int main() {
    printf("Enter row: ");
    scanf("%x",&row);
    printf("Enter column: ");
    scanf("%x",&col);
    printf("The indexed element is: %x\n",getElement((int*)grid,4,row,col));
    return 0;
}

And the x86 equivalent for declaring the required data structures and main:

.data
grid: .long 0x1,0x2,0x3,0x4
      .long 0x5,0x6,0x7,0x8
      .long 0x9,0x10,0x11,0x12
      .long 0x13,0x14,0x15,0x16

row:     .long 0
col:     .long 0
sizerow: .long 4

prompt1: .string "Enter row: "
prompt2: .string "Enter column: "
result:  .string "The indexed element is: %x\n"
fstr:    .string "%x"

.text
.global main
main:
    sub $0x8, %rsp
    # prompt and read in row index
    mov $prompt1,%edi
    mov $0,%eax
    call printf

    mov $row,%esi
    mov $fstr,%edi
    mov $0,%eax
    call scanf

    #prompt and read in column index
    mov $prompt2,%edi
    mov $0,%eax
    call printf

    mov $col,%esi
    movl $fstr,%edi

    call scanf

    # getElement parameters = base address of array, size of row,row
    # index, and column index
    mov col,%ecx
    mov  row,%edx
    mov  sizerow,%esi
    mov $grid,%edi
    #call getElement   # uncomment this line after you write getElement

    mov   %eax,%esi
    mov   $result,%edi
    mov $0,%eax
    call    printf

    mov $0,%eax
    add $0x8,%rsp
    ret

Copy this code into a file named grid.s. What command will we use to compile grid.s?

Example answer:

gcc -Wall --std=c99 -g -O -o grid.bin grid.s

Run that command and then start up the binary file in gdb. What gdb command can we use to display the contents of the grid as 16 individual 4-byte hexadecimal numbers, using &grid as the target address?

Example answer: x /16wx &grid Explanation: If we were using the C code, we would be able to write just grid, as it would be a pointer. However, when you create a symbol in assembly code, gcc assumes that the symbol will be used as a regular global variable, not an array. So we need to use & to get that variable’s address so that we can use x to examine the memory there.

Note: Symbols in assembly code do NOT have types like C variables do. The directives like .int or .string specify how to interpret the data we type afterwards and convert it to binary for storage in the executable, but they don’t give a type to the symbol. You can see this if you ask gdb to do print grid: it will complain that it does not know the type to use, and invite you to add a cast to tell it how that data should be treated.

Based on the output of your command, if we interpret the four bytes stored starting at the address 16 bytes after the start of grid as an integer, what value would it have (in decimal)?

Correct answer: 5

Now exit gdb and edit your code to add a getElement definition, matching the definition in the C code above. Compare against this code:

Example answer:

.global getElement
getElement:
    imul %edx,%esi            # 2rd param (size of row) *= 3rd param (row)
    add %ecx,%esi             # row result += 4th param (col)
    mov (%rdi,%rsi,4),%eax    # read 4 bytes from array, scaling index by 4
    ret

Note: You must also uncomment the call to getElement in main.

Run the program with getElement included, and fill in the following table of results (fill in either the value at that row/col, or the row/col that should be used to get the specified value):

Row Col Value
0 0 Correct answer: 1
0 1 Correct answer: 2
1 1 Correct answer: 6
3 3 Correct answer: 16
Correct answer: 2 Correct answer: 0 9
Correct answer: 1 Correct answer: 3 8
0 4 Correct answer: 5
0 5 Correct answer: 6
1 a Correct answer: 15

Explain what is happening in the last three rows of the table above:

Example answer:

All three of the last rows represent invalid indices: their rows are reasonable, but their column values are out-of-bounds for a 4x4 array. However, based on the index math, the results are still values from the array: as our column extends past the end of a row, we just wrap around into the next row, because adding 4 via adding one to the rows value and adding 4 via adding 4 to the columns value gives the same result.

For example, in the last line row 1, column a (10 in decimal) is 1*4 + 10 = 14 as a linear index, which is the same linear index you’d get if you did 3*4 + 2 = 14. So we’re seeing the value from row 3, column 2 as the result.

What values do you see if you access row 2, column 8, row 3, column 4, or row 4, column 0? Explain why you see these values, based on the structure of variables declared in the .data segment:

Example answer:

The values you see will be 2, 3, and 4. In each case, it matches the row value you entered. This is because the index math in all 3 examples accesses the 17th element of the 16-element array, reading 4 bytes beyond the end. In our .data segment, the row global variable happens to be allocated next, so that’s the data we read.

x86 Assignment

With the remaining time in lab, you should work on the x86 assignment. Note that if your partner for that assignment is in another lab section, you will have to work alone and share notes back with them later. Make sure to explain carefully whatever you figure out in lab so that they’re not disadvantaged on the exam (and they should explain whatever they figure out when they work on it in their lab).

If you haven’t started the assignment yet, yikes, and also, consider asking your instructor about finding a partner since others may be in the same boat.