CS 240 Lab 7

Disassembly and Reverse Engineering

In lab today, you will practice using disassembly tools to recreate the assembly language instructions for a C program for which you have been given only the object (executable) code. This will help you learn about the X86 instruction set architecture, teach you debugging skills, and also help you get started with the next assignment.

Open VSCode and a Terminal.

To begin, you will use the cmemory repository from lab 6/7. If you don’t have a copy, you can go into your cs240-repos folder and do cs240 start cmemory now. Either way, cd into your cmemory folder.

Disassembly Tools

We can examine the assembly version of a C file in several different ways:

  1. by compiling to produce assembly code,
  2. by using objdump to dump the assembly code from an executable file, or
  3. by disassembling parts of an executable while debugging with gdb.

Compiling to produce assembly code

Up to now, you have compiled by using make, which uses a Makefile that contains the actual command to compile your code using gcc (GCC originally stood for “GNU C Compiler” and now stands for “GNU Compiler Collection”).

You can also compile directly by using gcc at the command line, and you can use various options to control the type of output produced by the command.

Exercise 1:

Create an X86 assembly language file by using gcc with the -S option on your practice.c source file.

Note that you can run gcc --help to print out basic help for gcc (most programs accept a --help and/or -h option to do this) and you can run man gcc to open the manual for gcc (most programs have a manual page). Sadly, the complexity of these help pages is sometimes limiting. Using the ‘/’ key to search within manual pages can help somewhat.

Use the following arguments:

  • -S tells gcc that it should stop at the assembly code stage, rather than continuing to assemble the assembly code into machine code and then link it into an executable file.
  • -Wall tells gcc to enable “all” warnings (there are some that are still excluded from -Wall but most of them are enabled by it).
  • -std=c99 says to compile using the 1999 C standard, as opposed to a more modern standard. All programs we compile in this class will use this standard, because many of the exploits we use are designed with it in mind.
  • -m64 controls the sizes of some data types. Notably it sets int to 32 bits and pointer types to 64 bits.
  • -Og turns on optimization-for-debugging. This greatly simplifies the assembly code in some ways, although it makes it more complicated in others.
  • -g instructs gcc to produce extra debugging information that gdb can use. We will almost always use this flag.
  • -o practice.s tells the compiler to output its results into a file named practice.s. .s is the standard extension for assembly code.
  • practice.c tells it what file to compile.

Run your command and make sure it executes without errors, then paste the full command here to double-check it:

Correct answer: gcc -S -Wall -std=c99 -m64 -Og -g -o practice.s practice.c Explanation: Note that the order of the arguments usually isn’t important, although some programs are exceptions to that rule.

Now open practice.s to get an idea of the instructions produced. See if you can find:

  1. A label, starting with a . and ending with a :
  2. The name of a function.
  3. An actual assembly instruction, like movl or ret.

Using objdump and strings

The Linux objdump command can be used as a disassembler to view an executable in assembly form.

Use the objdump tool to display the disassembled executable file by running:

objdump -d practice.bin

Note: If you don’t have a practice.bin file, use gcc to create it. The same command as above, without the -S, and using -o practice.bin instead of -o practice.s to specify the output file, should work.

You will see output that looks something like this for each user-defined and library function used by the program. The following code is part of the test_string_length function:

00000000004011e3 <test_string_length>:
  4011e3:       55                      push   %rbp
  4011e4:       48 89 e5                mov    %rsp,%rbp
  4011e7:       48 83 ec 30             sub    $0x30,%rsp
  4011eb:       48 89 7d d8             mov    %rdi,-0x28(%rbp)
  4011ef:       48 89 75 d0             mov    %rsi,-0x30(%rbp)
  4011f3:       48 8b 45 d8             mov    -0x28(%rbp),%rax
  4011f7:       48 89 c6                mov    %rax,%rsi
  4011fa:       bf a2 20 40 00          mov    $0x4020a2,%edi
  4011ff:       b8 00 00 00 00          mov    $0x0,%eax
  401204:       e8 47 fe ff ff          callq  401050 <printf@plt>
  401209:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  401210:       c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
  401217:       c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%rbp)
  40121e:       e9 9b 00 00 00          jmpq   4012be <test_string_length+0xdb>
  401223:       83 45 f8 01             addl   $0x1,-0x8(%rbp)
  401227:       8b 45 f4                mov    -0xc(%rbp),%eax

Note: Don’t worry if your objdump output doesn’t match this exactly. This exercise doesn’t depend on the exact instructions involved, and things like different optimization flags can generate very different assembly code from the same C code.

Exercise 2:

For a particular instruction, we see a line like this:

  4011e7:       48 83 ec 30             sub    $0x30,%rsp

Try to guess what each part means:

4011e7 is the:

Address of the entire instruction.
Address of the first byte of the instruction.
Assembly instruction.
Machine instruction.
Hex for the ASCII encoding of the assembly instruction.

48 83 ec 30 is the:

Address of the entire instruction.
Address of the first byte of the instruction.
Assembly instruction.
Machine instruction.
Hex for the ASCII encoding of the assembly instruction.

sub $0x30,%rsp is the:

Address of the entire instruction.
Address of the first byte of the instruction.
Assembly instruction.
Machine instruction.
Hex for the ASCII encoding of the assembly instruction.

Also, try the strings command, which prints the sequences of printable characters found in a file; for an executable, these include string constants and the names of functions and other symbols. What command can be used to access more information about how strings works?

Correct answer: man strings

Given that information, how can you invoke strings to show you the strings for the practice program?

Correct answer: strings practice.bin

That should produce output like the following:

compaction
actual
face
face the action
face the faction
factual
facet
facetious
face facet
face facts facetiously
effacing
efface
aaabb
aabb
Testing %s
PASS
FAIL
[%s]  %s("%s") = %d; expected %d
Of %4d tests of %s
   %4d PASS
   %4d FAIL
[%s]  %s("%s", '%c') = %d; expected %d
[%s]  %s("%s", "%s") = %d; expected %d
substring

You may recognize many of the strings that were used in the program for various purposes, such as test values, user output, or function names.

You may find strings to be a helpful utility for debugging or for the reverse engineering assignment.
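To get a feel for what strings is looking for, here is a minimal sketch in C of the underlying idea: scan a buffer of bytes and count the runs of printable characters that are at least a minimum length (strings uses 4 by default). The function name count_printable_runs is made up for this illustration, and the real tool differs in its details (for example, in exactly which characters it treats as printable).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of the lab files): count the runs of at
 * least min_len consecutive printable bytes in buf. This is roughly the
 * kind of sequence that the strings utility reports. */
size_t count_printable_runs(const unsigned char *buf, size_t len, size_t min_len) {
    size_t runs = 0;      /* number of qualifying runs found so far */
    size_t current = 0;   /* length of the run we are currently inside */

    for (size_t i = 0; i < len; i++) {
        if (isprint(buf[i])) {
            current++;
        } else {
            /* A non-printable byte ends the current run. */
            if (current >= min_len) {
                runs++;
            }
            current = 0;
        }
    }
    /* The buffer may end in the middle of a run. */
    if (current >= min_len) {
        runs++;
    }
    return runs;
}
```

For a buffer holding "PASS", a NUL byte, a control byte, "hi", another NUL byte, and "FAIL", this counts 2 runs with a minimum length of 4, because "hi" is too short to be reported.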

Disassembly using gdb

The gdb debugger can be used to examine the X86 assembly language version of a C program (the disassembled version of the program).

For the Pointer practice problems, you were asked to write a function which returned the length of a string:

int string_length_a(char str[])

Here is a possible solution (which is probably quite similar to the one that you wrote):

int string_length_a(char str[]) {
    // initialize count to 0, the index of the first character in the string
    int count = 0;

    // step through the array of characters until the character is 0 (null)
    while (str[count]) {
        count++;
    }

    //  count should contain the number of characters
    return count;
}

Exercise 3:

  1. Edit practice.c and replace your own code for string_length_a with the above program.

  2. Re-compile practice.c (make sure to include -Og to “optimize for debugging”):

    gcc   -Wall   -std=c99   -m64  -g  -Og  -o  practice.bin  practice.c
  3. Start gdb, and disassemble using the disas command:

    gdb  ./practice.bin
    (gdb) start
    (gdb) disas string_length_a

Examine the x86 code CAREFULLY to understand how it represents the C code.

Note: In this example as in most examples, actual addresses in memory on your machine may be different than shown, because of address space randomization.


Address of         Offset from
Instruction        beginning    Instruction
-----------        -----------  -------------------

0x0000000000401156 <+0>:        mov    $0x0,%eax
0x000000000040115b <+5>:        jmp    0x401160 <string_length_a+10>
0x000000000040115d <+7>:        add    $0x1,%eax
0x0000000000401160 <+10>:       movslq %eax,%rdx
0x0000000000401163 <+13>:       cmpb   $0x0,(%rdi,%rdx,1)
0x0000000000401167 <+17>:       jne    0x40115d <string_length_a+7>
0x0000000000401169 <+19>:       ret

Here is a version of the code with some explanatory comments:

Address   Offset  Instruction                            Comment
--------- ------- -------------------                    -----------

0x401156  <+0>:   mov    $0x0,%eax                       ; Set return value to 0
0x40115b  <+5>:   jmp    0x401160 <string_length_a+10>   ; Jump into loop past update
0x40115d  <+7>:   add    $0x1,%eax                       ; Increment count
0x401160  <+10>:  movslq %eax,%rdx                       ; Copy count into %rdx
0x401163  <+13>:  cmpb   $0x0,(%rdi,%rdx,1)              ; Compare one byte to 0
0x401167  <+17>:  jne    0x40115d <string_length_a+7>    ; Continue loop if not equal
0x401169  <+19>:  ret                                    ; Return if byte is NUL

Answer the following questions:

  1. What is the starting address of the function?

    Correct answer: 0x401156 Explanation: It’s also fine to include more leading zeroes.

  2. What is the significance of the red and orange highlights above (on the addresses of the jump instructions at addresses +5 and +17, and on the address parts of the lines with addresses +7 and +10)?

    Example answer: These highlights are showing where the jump instructions go to: The highlighted addresses match the value specified for each jump, since line +5 jumps to line +10, and line +17 jumps to line +7.

  3. The blue, purple, and teal highlights above help track which register is used where. For each register, select whether this function reads from it and/or writes to it:

    • %rdi: Only reads / Only writes / Reads and writes
    • %rdx: Only reads / Only writes / Reads and writes
    • %eax: Only reads / Only writes / Reads and writes
  4. Which register(s) are used as memory addresses and/or offsets in memory?

    %rdx
    %rdi and %rdx
    %rdi
    %eax and %rdx

    How can you tell when a register is being used as a memory address?

    Example answer: Parentheses in the assembly code denote memory access, so registers that appear within parentheses are used as an address (in the first spot) or an offset (in the second spot).

  5. You have learned that by default, certain registers are used to pass parameters to a function. When you begin execution of the function, which register contains the parameter?

    Correct answer: %rdi

  6. By examining practice.c, how many parameters does this function have?

    Correct answer: 1 Explanation: We know this because parameters are always stored starting in %rdi, then in %rsi, etc., and this function uses %rdi but does NOT use %rsi.

  7. Which register is used to represent the value of count?

    Correct answer: %eax Explanation: Note that the name for the full-length version of this register is %rax. Only the low-order 4 bytes of the register are used when we refer to it as %eax, which is consistent with the int type for count. The value is also copied into %rdx, which is then used to index the string. %rdx holds an 8-byte, sign-extended version of the same value, because the index must be 8 bytes wide to be combined with the 8-byte string address in %rdi. In movslq, the ‘s’ stands for sign-extend, and ‘l’ and ‘q’ name the source and destination sizes: long (4 bytes) to quad (8 bytes).

  8. You have learned that a particular register is used for returning the value from a fruitful function. When you return from the function, which register will hold the return value?

    Correct answer: %eax Explanation: Again %rax is the name for the full register.

  9. Which four instructions constitute the while loop from the C program? List their relative addresses, separated by commas:

    Correct answer: 7, 10, 13, 17

  10. Check the commented version of the code above. Are there any instructions that don’t make sense to you?

    Example answer: Ask the instructor to explain things that don’t make sense at this stage. You will be asked to read and understand a lot of assembly code on the upcoming assignments.
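One way to check your reading of the disassembly is to translate it back into C one instruction at a time. The sketch below does that for the seven instructions shown above. It is an illustration rather than anything the compiler produced: the function name string_length_goto is made up, the variables eax, rdx, and rdi simply stand in for the registers of the same names, and it assumes a system (like ours) where long is 8 bytes.

```c
/* Hypothetical instruction-by-instruction translation of the disassembly
 * of string_length_a. Local names mirror the registers they stand for. */
int string_length_goto(const char *rdi) {
    int eax;     /* %eax: count, and eventually the return value */
    long rdx;    /* %rdx: 8-byte copy of count, used as the index */

    eax = 0;                   /* +0:  mov    $0x0,%eax            */
    goto check;                /* +5:  jmp    <string_length_a+10> */
update:
    eax = eax + 1;             /* +7:  add    $0x1,%eax            */
check:
    rdx = (long)eax;           /* +10: movslq %eax,%rdx            */
    if (rdi[rdx] != 0)         /* +13: cmpb   $0x0,(%rdi,%rdx,1)   */
        goto update;           /* +17: jne    <string_length_a+7>  */
    return eax;                /* +19: ret                         */
}
```

Calling string_length_goto("act") returns 3, just like the while-loop version. Notice that the loop test sits at the bottom and the first jmp skips over the update, which is why the assembly does not look like the C loop at first glance.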

Exercise 4:

Now, execute the program in gdb, using stepwise execution, and use gdb commands to understand the contents of registers and memory, by following these instructions:

  1. Start by setting a breakpoint at the beginning of string_length_a and then run to execute the program until it hits the breakpoint:

    (gdb) break string_length_a
    (gdb) run
  2. At this point, the program is paused at the beginning of the function, and %rdi should contain the address of the string whose length is being calculated. Show the value of %rdi using:

    (gdb) info reg rdi
    rdi        0x4009d5    4196821

    Note that the value of the register is shown in both hexadecimal and decimal notation. Your value may be different, since it’s a memory address that gets randomized.

  3. The first test string when running the program is "act". Let’s examine memory to confirm that:

    To display the value pointed to by %rdi as a string:

    (gdb) x /s $rdi
    0x4009d5:    "act"

    To display as hex values:

    (gdb) x /4bx $rdi
    0x4009d5:    0x61    0x63    0x74    0x00

    You are seeing the 4 bytes in hexadecimal notation which are the ASCII values for the characters of the string, plus a null character to terminate the string.

    Now, use the command help x in gdb to display help for the x command and figure out how we could display the string (including the NUL byte) as characters:

    Correct answer: x /4bc $rdi Explanation: Note that x /4c $rdi will also work, since ‘c’ for character implies that only a single byte is being read.

    Run that command and you should see:

    0x4009d5:    97 'a'    99 'c'    116 't'    0 '\000'

    You are seeing the characters (and the decimal notation for their ASCII values) in the string.

  4. Now, single-step several times, executing one line of C code at a time and pausing after each:

    Note: when stepping in gdb, step will complete one line of C code while stepi will complete one assembly instruction.

    (gdb) step
    24      int count = 0;
    (gdb) step
    26        count++;
    (gdb) step
    25      while (str[count])
    (gdb) step
    26        count++;
    (gdb) step
    25      while (str[count])
    (gdb) step
    26        count++;
  5. Examine the disassembled code again, and find the address of the instruction which will return from the function (on line +19 below).

    
     (gdb) disas
     Dump of assembler code for function string_length_a:
     0x0000000000401156 <+0>:        mov    $0x0,%eax
     0x000000000040115b <+5>:        jmp    0x401160 <string_length_a+10>
     0x000000000040115d <+7>:        add    $0x1,%eax
     0x0000000000401160 <+10>:       movslq %eax,%rdx
     0x0000000000401163 <+13>:       cmpb   $0x0,(%rdi,%rdx,1)
     0x0000000000401167 <+17>:       jne    0x40115d <string_length_a+7>
     0x0000000000401169 <+19>:       ret
     
  6. Use help break in gdb to find the command to set a breakpoint at the return instruction:

    Correct answer: break *string_length_a + 19 Explanation: Note: You could also use * followed by the specific memory address of that line, which varies between users. If your code doesn’t match this assembly code, use the address of the ret (or rep ret) instruction in your code.

  7. Continue execution (your program will run until it hits the breakpoint).

    (gdb) continue
    Breakpoint 2, 0x0000000000401169 in string_length_a (str=0x4009d5 "act")
        at practice.c:33
  8. What gdb command will show us the contents of the %eax register that is holding the return value?

    Correct answer: info reg eax Explanation: Note that print $eax would also work.

    Run that command and you should see the number 3 (in hex). This is because the string "act" has 3 characters.

  9. Exit from gdb using quit.
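If you want to double-check the bytes that x /4bx displayed without using gdb, a small C helper can produce the same view. This is only a sketch: hex_bytes is a made-up name, and the caller is assumed to supply a buffer large enough for the result (5 characters per byte is enough).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of str, including the terminating
 * NUL byte, into out as space-separated hex values. Returns out. */
char *hex_bytes(const char *str, char *out, size_t out_size) {
    size_t n = strlen(str) + 1;   /* +1 to include the NUL terminator */
    size_t used = 0;

    if (out_size == 0) {
        return out;
    }
    out[0] = '\0';
    /* Each byte needs at most 5 characters (" 0x61") plus a final NUL. */
    for (size_t i = 0; i < n && used + 6 <= out_size; i++) {
        used += (size_t)snprintf(out + used, out_size - used, "%s0x%02x",
                                 i == 0 ? "" : " ", (unsigned char)str[i]);
    }
    return out;
}
```

With a 64-byte buffer, hex_bytes("act", buf, sizeof buf) fills buf with "0x61 0x63 0x74 0x00", matching the gdb output above.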

Reverse Engineering/Deciphering Adventure

In this part of the lab we will get some practice on the x86 assignment, although we won’t actually start the real assignment. To create a directory and get the starter files, use these commands in the shell, from within your cs240-repos directory:

mkdir x86practice
cd x86practice
wget https://cs.wellesley.edu/~cs240/lab/lab07/starter/main.c
wget https://cs.wellesley.edu/~cs240/lab/lab07/starter/sample.bin
chmod 755 sample.bin
cp sample.bin backup.bin

The wget program can download files into the current directory based on a URL. We’re fetching two files: main.c and sample.bin. The first file is part of the C code that was used to compile an “adventure” program which requires the user to input a series of obscure codes to solve puzzles. Unfortunately, the main.c file does not contain the C code for any of the puzzles (you can see at the top that it does #include phases.h, but we don’t have a copy of either phases.h or phases.c).

The sample.bin file is the entire compiled program, and since it’s in executable format, it contains all of the machine instructions. Your job for the assignment will be to reverse engineer those machine instructions, using gdb and/or objdump to convert them to assembly and then reading through the assembly code to figure out what the program does, and thus what inputs it needs to receive. You can write those inputs into a file named inputs.txt and feed them into the program so that you don’t have to re-type them each time. One caveat though: the program is set up to self-destruct, and will erase itself if you give it the wrong passwords!

Note: In the code above, we included commands to make a backup copy of sample.bin in case it self-destructs. You can use cp backup.bin sample.bin to restore it if it erases itself. If you’re quick, you can also hit control-C as it’s counting down to interrupt the self-destruct process, and in gdb, you can set a breakpoint to stop the self-destruct process from triggering. In the worst case, you can always re-download the file.

Exercise 5:

Look at the code in main.c, and answer these questions:

  1. How can the program be asked to read input from a file?

    Example answer: By giving it the file name as an argument. For example, instead of running ./sample.bin you could run ./sample.bin inputs.txt.

  2. How many phases are there in the program? Correct answer: 6

  3. Which functions does it call from code that you do not have access to?

    Example answer: initialize_obstacles, read_line, phase_1 through phase_6, and phase_disarmed. Note that printf and fopen are standard library functions.

  4. What argument is passed to each phase function?

    Example answer: Each of the phases gets the result of a read_line call which presumably gets either user input or input from the specified file.

  5. Since arguments are passed to functions via registers, and registers hold at most 64 bits on our system, what value is actually put in a register when the assembly code jumps to one of the phase functions?

    Example answer: A pointer: the address of a location where the user’s input was stored as a string.

Each phase uses a string input by the user to perform some task. The phases have to do with (1) comparison, (2) loops, (3) switch statements, (4) recursion, (5) pointers and arrays, (6) sorting linked lists.

If you input the incorrect string for a phase, you get an error message and the program will attempt to delete itself. You must complete a phase by entering the correct string in order to move on to solving the next phase.

How can we figure out what string is required if we don’t have access to the C code?

Example answer: Disassemble the code using objdump and read through the assembly code to understand what it does. Also, debug the program with gdb and use disas to inspect the assembly code plus breakpoints, stepping, and print/info reg/examine to see what the program is actually doing step by step.

Exercise 6:

Run sample.bin with gdb and display the disassembled version of the main function:

gdb ./sample.bin     
(gdb) start  //runs the program and pauses at the beginning of the main function
(gdb) disas  // shows disassembly of current function
Dump of assembler code for function main:
=> 0x0000000000400e2d <+0>:    push   %rbx
   0x0000000000400e2e <+1>:    mov    %rsi,%rbx
   0x0000000000400e31 <+4>:    cmp    $0x1,%edi
   0x0000000000400e34 <+7>:    jne    0x400e46 <main+25>
   0x0000000000400e36 <+9>:    mov    0x202ccb(%rip),%rax        # 0x603b08 <stdin@@GLIBC_2.2.5>
   0x0000000000400e3d <+16>:    mov    %rax,0x202cdc(%rip)        # 0x603b20 <infile>
   0x0000000000400e44 <+23>:    jmp    0x400e9c <main+111>
   0x0000000000400e46 <+25>:    cmp    $0x2,%edi
   0x0000000000400e49 <+28>:    jne    0x400e80 <main+83>
   0x0000000000400e4b <+30>:    mov    0x8(%rsi),%rdi
   0x0000000000400e4f <+34>:    mov    $0x4027ed,%esi
   0x0000000000400e54 <+39>:    callq  0x400cb0 <fopen@plt>

Hit -return- several times to display the complete disassembled code for main. Read through the code, and notice that you can recognize the calls to functions that you saw in the C source (such as calls to printf, phase_1, etc.).

You have learned that up to 6 arguments for a function are passed in the following registers:

  1. arg1: %rdi
  2. arg2: %rsi
  3. arg3: %rdx
  4. arg4: %rcx
  5. arg5: %r8
  6. arg6: %r9

Therefore, for main (which has 2 arguments, argc, the number of command-line arguments, and argv, the array of strings representing the command-line arguments), you would expect argc to be stored in %rdi, and argv to be stored in %rsi.

Display those registers and their current values:

(gdb) info reg rdi
rdi            0x1    1
(gdb) info reg rsi
rsi            0x7fffffffe168    140737488347496

Remember: your machine may have somewhat different values than shown here when an address in memory is displayed, so %rsi may show a different value on your machine than the one displayed here.

  1. From the value of %rdi, how many command-line arguments are there for the current invocation of main?

    Correct answer: 1

  2. Which of the following gdb commands can be used to display the string pointed to by argv (use help in gdb and test these out if you need to)?

    print argv
    print *argv
    print **argv
    x /s argv
    x /s *argv
    x /s *$rsi
    print *(char**)$rsi
    Explanation: Note that x /s *$rsi does not work, because the system knows the type of argv but does not know the type of $rsi, and assumes that it points to 4 bytes instead of 8.

  3. What is the meaning of the first command-line argument to the program?

    Example answer: It’s the path to the executable file that we ran.

    You can view the arguments for any function call by examining the registers which hold the parameters at the beginning of the function. This will be useful to you when deciphering your adventure.
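In C terms, the gdb expression *(char**)$rsi is doing the same thing as *argv (or argv[0]) inside main. The sketch below shows that in a small helper; the name first_argument is made up for this illustration and is not part of main.c.

```c
#include <stddef.h>

/* Hypothetical helper: given main's two parameters, return the first
 * command-line string, or NULL if there isn't one. */
const char *first_argument(int argc, char **argv) {
    if (argc < 1) {
        return NULL;
    }
    /* argv points to an array of char pointers; dereferencing it once
     * yields the first pointer, which is the same value as argv[0]. */
    return *argv;
}
```

If main called first_argument(argc, argv) for the invocation above, it would get back a pointer to the path of the executable, since argc is 1 and the only string in argv is the program's own name.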

Continue execution of the program at this point:

(gdb) continue
You have just ridden the elevator to the hidden 7th floor
of the Science Center.  A fantastic adventure awaits!

First, you must pass a *stringy* spider's web beyond *compare*, guarding
a trap door!

The program is waiting for you to enter a string at this point (which is the solution for phase_1).

You don’t know the answer yet, so just guess and enter a string. A message similar to the following will be displayed:

!!!! WOMP WOMP !!!!


This adventure will self-destruct in 3 seconds...
This adventure will self-destruct in 2 seconds...
This adventure will self-destruct in 1 seconds...
This adventure will self-destruct in 0 seconds...

**** POOF! ****

Refer to the note above on how to recover the file if you didn’t hit control-C in time.

We’d like to avoid the self-destruct mechanism without having to press control-C quickly. Let’s look at why that happened, and how we can avoid it. First, go back into gdb (possibly after restoring the file):

gdb ./sample.bin

The main function calls phase_1. You need to understand what phase_1 does in order to figure out the correct string.

What is the gdb command to set a breakpoint at the phase_1 function?

Correct answer: break phase_1

Set a breakpoint at phase_1 and run the program. The program will not immediately reach the phase_1 function: first it needs user input for read_line in main.

Type a single character, and hit -return-.

Note: You are entering an incorrect string again here, but don’t worry.

You should now stop at the breakpoint for phase_1:

Breakpoint 2, 0x0000000000400f77 in phase_1 ()

Disassemble to display the assembly code for the function:


(gdb) disas
Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>:     sub    $0x8,%rsp
   0x0000000000401084 <+4>:     mov    $0x4021b3,%esi
   0x0000000000401089 <+9>:     callq  0x40143e <strings_not_equal>
   0x000000000040108e <+14>:    test   %eax,%eax
   0x0000000000401090 <+16>:    je     0x401097 <phase_1+23>
   0x0000000000401092 <+18>:    callq  0x40181c <trip_alarm>
   0x0000000000401097 <+23>:    mov    $0x1,%eax
   0x000000000040109c <+28>:    add    $0x8,%rsp
   0x00000000004010a0 <+32>:    retq
End of assembler dump.

Notice that there is a call to a function called trip_alarm (and a jump that can skip that call). trip_alarm is responsible for deleting your sample when an incorrect string is entered.

To avoid executing trip_alarm, set a breakpoint at trip_alarm:

(gdb) break trip_alarm

Since you have entered an incorrect string, you will expect to hit the breakpoint when you continue execution of the program:

(gdb) continue
Breakpoint 2, 0x0000000000401392 in trip_alarm ()

You are paused at the beginning of trip_alarm at this point. To avoid executing it, simply re-run your program from the beginning:

(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Answer ‘y’, and you are again executing from the beginning of the program (and it has not been deleted).

Note: to avoid tripping the alarm, ALWAYS set a breakpoint at trip_alarm as soon as you start gdb! There is a mechanism to do this automatically: create a file called .gdbinit and write ‘break trip_alarm’ in it, then follow the instructions about enabling auto-load that GDB prints when it starts.

Exercise 7:

Now, examine phase_1 more closely.

Once again, enter an incorrect string when you are prompted, and you will again hit the breakpoint for phase_1.

Disassemble to display the assembly code for the function:


(gdb) disas
Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>:     sub    $0x8,%rsp
   0x0000000000401084 <+4>:     mov    $0x4021b3,%esi
   0x0000000000401089 <+9>:     callq  0x40143e <strings_not_equal>
   0x000000000040108e <+14>:    test   %eax,%eax
   0x0000000000401090 <+16>:    je     0x401097 <phase_1+23>
   0x0000000000401092 <+18>:    callq  0x40181c <trip_alarm>
   0x0000000000401097 <+23>:    mov    $0x1,%eax
   0x000000000040109c <+28>:    add    $0x8,%rsp
   0x00000000004010a0 <+32>:    retq
End of assembler dump.
  1. From earlier, do you remember how many parameters phase_1 has?

    Correct answer: 1

    What register holds the parameter?

    Correct answer: %rdi

  2. What command will show the value of the parameter at this point?

    Correct answer: info reg rdi Explanation: print $rdi would also work, but only shows the decimal value unless you add /x.

    Run that command and you should see something like:

    rdi            0x603640    6305344
  3. What command can be used to examine the string at that address?

    Correct answer: x /s $rdi

    When you run this command, you should see something like:

    0x603640 <input_strings>:    "a"

    This assumes the incorrect string entered earlier is an “a.” You should see whatever you typed in as an input.

    The main function puts a pointer to the string you typed into %rdi before calling phase_1.

    In general, if a function is called, the parameters needed must be put in the proper registers before the call. Examining the contents of the registers holding arguments before the call is often a good way to understand what the function being called will do.

So, let’s look at what functions are called from phase_1:


Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>:     sub    $0x8,%rsp
   0x0000000000401084 <+4>:     mov    $0x4021b3,%esi
   0x0000000000401089 <+9>:     callq  0x40143e <strings_not_equal>
   0x000000000040108e <+14>:    test   %eax,%eax
   0x0000000000401090 <+16>:    je     0x401097 <phase_1+23>
   0x0000000000401092 <+18>:    callq  0x40181c <trip_alarm>
   0x0000000000401097 <+23>:    mov    $0x1,%eax
   0x000000000040109c <+28>:    add    $0x8,%rsp
   0x00000000004010a0 <+32>:    retq
End of assembler dump.

Here we have highlighted each value listed in the assembly code which is a memory address.

  1. Of the highlighted addresses, which are not the address of a function?

    0x4021b3
    0x40143e
    0x401097
    0x40181c
    Explanation: The first address is put into %esi, and since that’s the low 32 bits of %rsi, which holds the second argument to a function, it’s being used as a function argument. The second and fourth addresses are followed by angle brackets containing a bare function name, showing that they are the targets of calls to those functions. The third address also has angle brackets, but the offset +23 indicates it’s a jump within a function body, not a function call.

  2. What do you think the function called strings_not_equal does?

    Example answer: Probably returns 1 (i.e., sets %rax to 1) if the two strings it gets as arguments are not the same, and returns 0 (i.e., sets %rax to 0) if they are the same.

  3. How many parameters do you think this function would need?

    Correct answer: 2

  4. Before strings_not_equal is called, the registers which pass the parameters to it must be set up with the parameter values, which we assume must be pointers to strings. Which two registers would need to have pointers in them?

    Example answer: %rdi and %rsi are the registers used for the first two arguments to a function, so these need to have pointers to strings in them.

  5. Does the code of phase_1 modify either of these registers before calling strings_not_equal?

    Example answer: Yes, the instruction at +4 modifies %esi, which is the low 32 bits of %rsi (writing a 32-bit register also zeroes the upper 32 bits of the full register).

  6. Why doesn’t the code of phase_1 need to modify the value of %rdi as well as %rsi?

    Example answer: When phase_1 is called, the user’s input string is already placed into %rdi. So to compare that string against another string, it just needs to set %rsi to the second string and leave %rdi alone.

  7. What gdb command can show us what value %esi points to?

    Correct answer: x /s $esi Explanation: In this case we use /s because we know it should be a string. We might use something like /xg if we expected a pointer, /dw if we expected an integer, etc. (see help x in gdb)

    Run this command and note down what it shows.

  8. The value in %esi may not have been what you expected. But wait. Where are we in the code right now? Use disas and look for the arrow to see which instruction is up next. If you’re still before the line that sets up %esi, then you can either use x to examine the raw address 0x4021b3, or use stepi to step forward until the mov command executes and then check the value of %esi. What is the string that your input will be compared against?

    Correct answer: Alexandria Botanic Garden

    What do you think the significance of this string is?

    Example answer: It’s the correct answer to the first phase!

Now restart the program by entering run. Enter the string you just discovered, and when your phase_1 breakpoint triggers, enter c to continue. If all goes well, you should be on to phase 2.

You can now add that string to the inputs.txt file. Then run your program with ./sample.bin inputs.txt (or in gdb use run inputs.txt) and it will automatically enter that input for you each time you re-run it.

Note: Once you pass command-line arguments to run within gdb, it will remember them, and you can subsequently just type run without arguments. It will print “Starting program:” and then show you the full set of arguments each time you run.

Exercise 8:

You have seen how to find the answer for phase_1 using a bit of guesswork based on the strings_not_equal function name, but we didn’t look into the details of the rest of the assembly code.

Notice the instructions following the call to strings_not_equal:


Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>:     sub    $0x8,%rsp
   0x0000000000401084 <+4>:     mov    $0x4021b3,%esi
   0x0000000000401089 <+9>:     callq  0x40143e <strings_not_equal>
   0x000000000040108e <+14>:    test   %eax,%eax
   0x0000000000401090 <+16>:    je     0x401097 <phase_1+23>
   0x0000000000401092 <+18>:    callq  0x40181c <trip_alarm>
   0x0000000000401097 <+23>:    mov    $0x1,%eax
   0x000000000040109c <+28>:    add    $0x8,%rsp
   0x00000000004010a0 <+32>:    retq
End of assembler dump.

Remember that %eax is just the lower half (the low 32 bits) of %rax, which is the register used for returning a value from a function. So, the value of %eax in the test instruction at +14 must be the value returned from strings_not_equal.

  1. What is the effect of test %eax,%eax (look up the x86 test instruction)?

    Example answer: It performs bitwise AND and discards the result, but sets flags based on that result. In particular, in our code the “ZF” or “zero flag” will be important.

  2. How does the result of the test instruction affect the following instruction (look up the x86 je instruction)?

    Example answer: The je instruction jumps “if equal” but what this really means is “if the zero flag is set” because “equal” is discovered through subtraction just like in the HW ISA. So if the %eax value is 0, the test instruction will set the zero flag and the jump will happen. This makes the jump happen when the strings are NOT “not equal” (since the function is called strings_not_equal) or in other words, when they’re equal. This x86 jumps reference is pretty useful for all of the different jump instructions.

  3. What would the C code that generated this assembly code look like (just give a vague sketch and then check the answer)?

    Note: The sub $0x8, %rsp and corresponding add $0x8, %rsp instructions are modifying %rsp the “stack pointer register” which we will learn about later. They are not terribly meaningful within this code although they are necessary to maintain stack alignment.

    Example answer:

    int phase_1(char* input_string) {
        if (strings_not_equal(input_string,correct_string)) {
            trip_alarm();
        }
        return 1;
    }

Stretch Exercises: Read Six Numbers

Next, let’s look at the function read_six_numbers (a function called from phase_2).

Disassemble the function phase_2:


(gdb) disas phase_2
Dump of assembler code for function phase_2:
   0x00000000004010a1 <+0>:    push   %rbp
   0x00000000004010a2 <+1>:    push   %rbx
   0x00000000004010a3 <+2>:    sub    $0x28,%rsp
   0x00000000004010a7 <+6>:    mov    %rsp,%rsi
   0x00000000004010aa <+9>:    callq  0x4018b7 <read_six_numbers>

Note: Like phase_1, phase_2 has one parameter, which is the pointer to the string that is input by the user. Therefore, %rdi will contain the pointer to the input when the function begins execution.

Notice the line at +6 above (mov %rsp,%rsi). The second parameter to read_six_numbers is being set up by copying the current %rsp to %rsi before the call to read_six_numbers.

So, the first parameter to read_six_numbers is the string input by the user, and the second parameter is a location in the stack area of memory.

Exercise 9:

What do you think the purpose of read_six_numbers is?

Example answer: Probably it will split up the input into 6 different numbers?

Where do you think the results of executing read_six_numbers will be stored?

In the code segment of the executable.
On the computer’s hard drive.
On the heap, after allocating space with malloc.
On the stack, within the stack frame for phase_2 so that phase_2 can make use of them.
Explanation: The key clue here is that the stack pointer during the phase_2 call was passed as an argument to read_six_numbers. It’s possible that that function calls malloc and returns a pointer to numbers allocated on the heap, but giving it a pointer to the stack makes it likely that it will store the numbers there.

Next, disassemble read_six_numbers and examine the code (here with address values highlighted):


(gdb) disas read_six_numbers
Dump of assembler code for function read_six_numbers:
   0x00000000004018b7 <+0>:     sub    $0x18,%rsp
   0x00000000004018bb <+4>:     mov    %rsi,%rdx
   0x00000000004018be <+7>:     lea    0x4(%rsi),%rcx
   0x00000000004018c2 <+11>:    lea    0x14(%rsi),%rax
   0x00000000004018c6 <+15>:    mov    %rax,0x8(%rsp)
   0x00000000004018cb <+20>:    lea    0x10(%rsi),%rax
   0x00000000004018cf <+24>:    mov    %rax,(%rsp)
   0x00000000004018d3 <+28>:    lea    0xc(%rsi),%r9
   0x00000000004018d7 <+32>:    lea    0x8(%rsi),%r8
   0x00000000004018db <+36>:    mov    $0x40277d,%esi
   0x00000000004018e0 <+41>:    mov    $0x0,%eax
   0x00000000004018e5 <+46>:    callq  0x400c80 <__isoc99_sscanf@plt>
   0x00000000004018ea <+51>:    cmp    $0x5,%eax
   0x00000000004018ed <+54>:    jg     0x4018f4 <read_six_numbers+61>
   0x00000000004018ef <+56>:    callq  0x40181c <trip_alarm>
   0x00000000004018f4 <+61>:    add    $0x18,%rsp
   0x00000000004018f8 <+65>:    retq
End of assembler dump.

Notice that this function calls a C library function called sscanf (the callq at +46 above).

Search online for a definition of the function sscanf and answer the following questions:

What does sscanf do?

Example answer: It reads formatted values from a string, and stores them into variables pointed to by pointers it is given in addition to the string to scan.

How many parameters does sscanf have?

Example answer: It has a variable number of parameters, but always at least 2: the first parameter is the string to scan and the second is the format to recognize. Any further parameters are pointers to variables where the scan results should be placed. For each % conversion in the format string that stores a value, the result will be placed into the variable pointed to by the next argument in order.

What are the first and second parameter of sscanf, and how are they used?

Example answer: The first parameter is the string to scan, and the second is the format string. The format string determines what kind(s) of values sscanf will try to read from the string being scanned, using % escapes like printf.

What are the remaining parameters of sscanf, and how are they used?

Example answer: Each additional parameter is a pointer to a variable into which the next scan result should be placed.

What value is returned by sscanf?

Example answer: It returns the number of items successfully scanned.

Notice that a large immediate value is loaded into %esi at +36, just before the call to sscanf. This is setting up the second parameter for sscanf (the format string). Examine memory at that address to understand the meaning of the large constant:

Note: Your own computer may load a different value than 0x40277d into %esi; if so, use the value from your own computer in the command below. If you step far enough, you could also use $esi.

(gdb) x /s 0x40277d
0x40277d:    "%d %d %d %d %d %d"

Explain what this string tells you about the format of the expected input string:

Example answer: It means that the input string will be read as 6 decimal numbers in base 10, separated by whitespace.

Now that you understand the format of the string expected, again examine the code in phase_2 after the call to read_six_numbers (starting at the +14 line right after the call to read_six_numbers):


(gdb) disas phase_2
Dump of assembler code for function phase_2:
   0x00000000004010a1 <+0>:     push   %rbp
   0x00000000004010a2 <+1>:     push   %rbx
   0x00000000004010a3 <+2>:     sub    $0x28,%rsp
   0x00000000004010a7 <+6>:     mov    %rsp,%rsi
   0x00000000004010aa <+9>:     callq  0x4018b7 <read_six_numbers>
   0x00000000004010af <+14>:    cmpl   $0x1,(%rsp)
   0x00000000004010b3 <+18>:    je     0x4010da <phase_2+57>
   0x00000000004010b5 <+20>:    callq  0x40181c <trip_alarm>
   0x00000000004010ba <+25>:    jmp    0x4010da <phase_2+57>
   0x00000000004010bc <+27>:    add    $0x1,%ebx
   0x00000000004010bf <+30>:    mov    %ebx,%eax
   0x00000000004010c1 <+32>:    imul   -0x4(%rbp),%eax
   0x00000000004010c5 <+36>:    cmp    %eax,0x0(%rbp)
   0x00000000004010c8 <+39>:    je     0x4010cf <phase_2+46>
   0x00000000004010ca <+41>:    callq  0x40181c <trip_alarm>
   0x00000000004010cf <+46>:    add    $0x4,%rbp
   0x00000000004010d3 <+50>:    cmp    $0x6,%ebx
   0x00000000004010d6 <+53>:    jne    0x4010bc <phase_2+27>
   0x00000000004010d8 <+55>:    jmp    0x4010e6 <phase_2+69>
   0x00000000004010da <+57>:    lea    0x4(%rsp),%rbp
   0x00000000004010df <+62>:    mov    $0x1,%ebx
   0x00000000004010e4 <+67>:    jmp    0x4010bc <phase_2+27>
   0x00000000004010e6 <+69>:    add    $0x28,%rsp
   0x00000000004010ea <+73>:    pop    %rbx
   0x00000000004010eb <+74>:    pop    %rbp
   0x00000000004010ec <+75>:    retq
End of assembler dump.

Based on lines +14, +18, and +20, what do you think the first value in the string should be?

Example answer: It should be a 1, because otherwise line +18 will not jump and line +20 will call trip_alarm.

Where is that first value stored?

Example answer: In memory, where %rsp (the stack pointer) is pointing. This is the top of the stack, which is its lowest address because the stack grows downwards.

If the first value is correct, what line of code is jumped to/executed next?

Example answer: Line +57, highlighted in red above.

The lea instruction can be used to do a variety of math operations, although its original purpose is to do address calculations. To understand how it works, you need to understand the addressing syntax in x86 assembly. The syntax for memory address computations is:

offset ( base, index, stride)

with defaults of offset = 0, base = 0, index = 0, and stride = 1. The formula for the address is:

base + offset + (index × stride)

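So for example, the assembly expression 2(%rdi,%rsi,4), assuming %rdi holds 100 and %rsi holds 10, would compute the address 142, which is %rdi + 2 + %rsi * 4 = 100 + 2 + 40. (Unlike immediate operands, the offset and stride inside an address expression are written without a $ prefix.) The lea instruction just computes that value and stores it in the destination register, without accessing memory.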

Given this information, explain the instruction lea 0x4(%rsp),%rbp, and what value you expect %rbp to contain (this one is complicated):

Example answer: Here the offset is 4, and there is no index or stride specified. So we just have %rsp + 4 being stored in %rbp: it’s serving as an alternate add instruction.

What’s the significance of the number 4 in the above instruction, given that we are storing integers on the stack in this function?

Example answer: 4 is the size of one integer, so we’re setting %rbp to point to the next integer beyond the stack pointer.

Note that the next instruction initializes %ebx to a value. Record the initial value of %ebx:

Correct answer: 1

What address does the program jump to next?

Example answer: +27, the line highlighted in blue.

Hand-execute the next four instructions (+27, +30, +32, and +36), keeping track of the value of any registers that change. Explain what you think is happening during the execution of these instructions:

Example answer: First, we add 1 to %ebx, so the value is now 2. Next we copy %ebx into %eax. Then we multiply whatever is in memory at 4 bytes below %rbp with %eax. In this case, the result will be 2, since we’re multiplying %eax (currently 2) with the first input number (must have been 1). In future iterations, this will always be multiplying the previous input number by the current %ebx value. Finally, we’re comparing the value we just got from multiplication with the second/current input number.

Can you understand the relationship that needs to hold between each value and the one following it in the string for the password to be accepted?

Example answer: Since %ebx starts at 2 for the second input value (first input value must always be a 1), and %ebx increases by one each time, each input value must be a multiple of the previous value, but that multiple keeps increasing by 1. If we consider the indices of the inputs starting from zero as ‘i’, each number has to be (i + 1) times the previous one. So the correct sequence would be 1 2 6 24 120 720.

Hints for Reverse Engineering

  1. Read through each new function and highlight all of the address values, including jump targets within the function, calls to other functions, and any other addresses that might be used. Use a separate document for this, and/or print stuff out on paper and use highlighters.

  2. Focus on how to avoid executing calls to trip_alarm. Is there a jump that jumps around a trip_alarm call? What comparison or test decides whether it jumps? What data are inspected by the comparison? What code generates that data? Can you change that data to steer around the trip_alarm?

  3. Zoom out. You do not need to understand what every instruction does. In fact, you could spend a lot of time deciphering code which will not help you that much in solving the problem. Certainly, details of many instructions are key, but many are not. Stumped on one instruction? Ignore it for now and come back later only if you cannot figure things out without it.

  4. Function names are useful information. Sometimes it’s worth assuming a function does what it says it does and double-checking. You can often double-check by using gdb to probe the values of arguments to a function and then check the return value in %rax.

    Some functions are not well-named, but you can still extract some useful information. For example, if you stumble into a function with a wacky name like this:

    __isoc99_sscanf@plt:
    => 0x0804873c <+0>:    jmp    *0x804a100
       0x08048742 <+6>:    push   $0x28
       0x08048747 <+11>:    jmp    0x80486dc

    don’t try to figure out what it does by examining its instructions. There are two markers in the name of the function that suggest it is something to look up:

    __isoc99_: stands for ISO C99, the 1999 version of the C standard published by the International Organization for Standardization (ISO), which is the version of C we are using. This means it is a function in the standard C library.

    @plt: stands for Procedure Linkage Table, a table that the linker and loader use to connect code compiled from a C program with pre-compiled external code. Again, this suggests a function in the standard C library.

    The nice thing about standard library functions is that, instead of figuring out what they do instruction by instruction, you can just look them up.

    I find the name of this one by removing the __isoc99_ and the @plt, to see sscanf. Then I can:

    • Read the manual page about this function with the command man sscanf, focusing on the function headers, the beginning of the DESCRIPTION section and the RETURN VALUE section; or
    • Search to find the same info online.

    Now I can learn:

    • What arguments the function takes.
    • What it does with them.
    • What it returns.

When you are ready to start the x86 assignment, read the setup instructions carefully. You will need to repeat phase_1 for your own version of the assignment. The code for phase_1 will be the same as in the lab sample, but the solution will differ for your assignment, since each team has unique solutions for each phase of the adventure.

If you have time left in lab today, you can work on the pointers assignment, and/or look back at unfinished work from last week’s lab, which will help with that. If you are finished with that assignment, you should begin reading the x86 assignment and start work on it.