🔬 Lab
CS 240 Lab 7
Disassembly and Reverse Engineering
In lab today, you will practice using disassembly tools to recreate the assembly language instructions for a C program for which you have been given only the object (executable) code. This will help you learn about the X86 instruction set architecture, teach you debugging skills, and also help you get started with the next assignment.
Open VSCode and a Terminal.
To begin, you will use the cmemory repository from lab 6/7. If you don’t have a copy, you can go into your cs240-repos folder and do cs240 start cmemory now. Either way, cd into your cmemory folder.
Disassembly Tools
We can examine the assembly version of a C file in several different ways:
- by compiling to produce assembly code,
- by using objdump to dump the assembly code from an executable file, or
- by disassembling parts of an executable while debugging with gdb.
Compiling to produce assembly code
Up to now, you have compiled by using make, which uses a Makefile that contains the actual command to compile your code using gcc (GCC originally stood for “GNU C Compiler” and now stands for “GNU Compiler Collection”).
You can also compile directly by using gcc at the command line, and you can use various options to control the type of output produced by the command.
Exercise 1:
Create an X86 assembly language file by using gcc with the -S option on your practice.c source file.
Note that you can run gcc --help to print out basic help for gcc (most programs accept a --help and/or -h option to do this) and you can run man gcc to open the manual for gcc (most programs have a manual page). Sadly, the complexity of these help pages is sometimes limiting. Using the ‘/’ key to search within manual pages can help somewhat.
Use the following arguments:
- -S tells gcc that it should stop at the assembly code stage, rather than continuing to assemble the assembly code into machine code and then link it into an executable file.
- -Wall tells gcc to enable “all” warnings (there are some that are still excluded from -Wall but most of them are enabled by it).
- -std=c99 says to compile using the 1999 C standard, as opposed to a more modern standard. All programs we compile in this class will use this standard, because many of the exploits we use are designed with it in mind.
- -m64 controls the sizes of some data types. Notably it sets int to 32 bits and pointer types to 64 bits.
- -Og turns on optimization-for-debugging. This greatly simplifies the assembly code in some ways, although it makes it more complicated in others.
- -g instructs gcc to produce extra debugging information that gdb can use. We will almost always use this flag.
- -o practice.s tells the compiler to output its results into a file named practice.s. .s is the standard extension for assembly code.
- practice.c tells it what file to compile.
Run your command and make sure it executes without errors, then paste the full command here to double-check it:
Correct answer: gcc -S -Wall -std=c99 -m64 -Og -g -o practice.s practice.c Explanation: Note that the order of the arguments usually isn’t important, although some programs are exceptions to that rule.
Now open practice.s to get an idea of the instructions produced. See if you can find:
- A label, starting with a . and ending with a :
- The name of a function.
- An actual assembly instruction, like movl or ret.
Using objdump and strings
The Linux objdump command can be used as a disassembler to view an executable in assembly form.
Use the objdump tool to display the disassembled executable file by running:
objdump -d practice.bin
Note: If you don’t have a practice.bin file, use gcc to create it. The same command as above, without the -S, and using -o practice.bin instead of -o practice.s to specify the output file, should work.
You will see output that looks something like this for each user-defined and library function used by the program. The following code is part of the test_string_length function:
00000000004011e3 <test_string_length>:
4011e3: 55 push %rbp
4011e4: 48 89 e5 mov %rsp,%rbp
4011e7: 48 83 ec 30 sub $0x30,%rsp
4011eb: 48 89 7d d8 mov %rdi,-0x28(%rbp)
4011ef: 48 89 75 d0 mov %rsi,-0x30(%rbp)
4011f3: 48 8b 45 d8 mov -0x28(%rbp),%rax
4011f7: 48 89 c6 mov %rax,%rsi
4011fa: bf a2 20 40 00 mov $0x4020a2,%edi
4011ff: b8 00 00 00 00 mov $0x0,%eax
401204: e8 47 fe ff ff callq 401050 <printf@plt>
401209: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
401210: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
401217: c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp)
40121e: e9 9b 00 00 00 jmpq 4012be <test_string_length+0xdb>
401223: 83 45 f8 01 addl $0x1,-0x8(%rbp)
401227: 8b 45 f4 mov -0xc(%rbp),%eax
Note: Don’t worry if your objdump output doesn’t match this exactly. This exercise doesn’t depend on the exact instructions involved, and things like different optimization flags can generate very different assembly code from the same C code.
Exercise 2:
For a particular instruction, we see a line like this:
4011e7: 48 83 ec 30 sub $0x30,%rsp
Try to guess what each part means:
4011e7 is the:
- Address of the entire instruction.
- Address of the first byte of the instruction.
- Assembly instruction.
- Machine instruction.
- Hex for the ASCII encoding of the assembly instruction.
48 83 ec 30 is the:
- Address of the entire instruction.
- Address of the first byte of the instruction.
- Assembly instruction.
- Machine instruction.
- Hex for the ASCII encoding of the assembly instruction.
sub $0x30,%rsp is the:
- Address of the entire instruction.
- Address of the first byte of the instruction.
- Assembly instruction.
- Machine instruction.
- Hex for the ASCII encoding of the assembly instruction.
Also, try the strings command, which lists information about all the labels and strings defined in your program. What command can be used to access more information about how strings works?
Correct answer: man strings
Given that information, how can you invoke strings to show you the strings for the practice program?
Correct answer: strings practice.bin
That should produce output like the following:
compaction
actual
face
face the action
face the faction
factual
facet
facetious
face facet
face facts facetiously
effacing
efface
aaabb
aabb
Testing %s
PASS
FAIL
[%s] %s("%s") = %d; expected %d
Of %4d tests of %s
%4d PASS
%4d FAIL
[%s] %s("%s", '%c') = %d; expected %d
[%s] %s("%s", "%s") = %d; expected %d
substring
You may recognize many of the strings that were used in the program for various purposes, such as test values, user output, or function names.
You may find strings to be a helpful utility for debugging or for the reverse engineering assignment.
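To build some intuition for where that output comes from, here is a minimal sketch of the core idea behind strings: scan raw bytes and print every run of printable characters that is long enough. The function name print_printable_runs and its parameters are made up for this illustration, and the real tool handles file formats and options that this sketch ignores.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: print every run of at least min_len printable
 * characters found in buf, one run per line.
 * Returns the number of runs printed. */
size_t print_printable_runs(const unsigned char *buf, size_t len, size_t min_len)
{
    size_t found = 0;
    size_t start = 0;   /* index where the current run began */

    for (size_t i = 0; i <= len; i++) {
        /* Treat the end of the buffer like a non-printable byte. */
        int printable = (i < len) && isprint(buf[i]);
        if (printable)
            continue;

        /* A run [start, i) just ended; print it if it is long enough. */
        if (i > start && i - start >= min_len) {
            printf("%.*s\n", (int)(i - start), (const char *)(buf + start));
            found++;
        }
        start = i + 1;
    }
    return found;
}
```

Calling it on the bytes of an executable with min_len set to 4 would print format strings such as "Testing %s" and skip short or non-text byte sequences, which is why both test values and printf formats appear in the listing above.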
Disassembly using gdb
The gdb debugger can be used to examine the X86 assembly language version of a C program (the disassembled version of the program).
For the Pointer practice problems, you were asked to write a function which returned the length of a string:
int string_length_a(char str[])
Here is a possible solution (which is probably quite similar to the one that you wrote):
int string_length_a(char str[]) {
    // initialize count to 0 and to refer to first character in the string
    int count = 0;
    // step through the array of characters until the character is 0 (null)
    while (str[count]) {
        count++;
    }
    // count should contain the number of characters
    return count;
}
Exercise 3:
Edit practice.c and replace your own code for string_length_a with the above program.
Re-compile practice.c (make sure to include -Og to “optimize for debugging”):
gcc -Wall -std=c99 -m64 -g -Og -o practice.bin practice.c
Start gdb, and disassemble using the disas command:
gdb ./practice.bin
(gdb) start
(gdb) disas string_length_a
Examine the x86 code CAREFULLY to understand how it represents the C code.
Note: In this example as in most examples, actual addresses in memory on your machine may be different than shown, because of address space randomization.
Address of Offset from
Instruction beginning Instruction
----------- ----------- -------------------
0x0000000000401156 <+0>: mov $0x0,%eax
0x000000000040115b <+5>: jmp 0x401160 <string_length_a+10>
0x000000000040115d <+7>: add $0x1,%eax
0x0000000000401160 <+10>: movslq %eax,%rdx
0x0000000000401163 <+13>: cmpb $0x0,(%rdi,%rdx,1)
0x0000000000401167 <+17>: jne 0x40115d <string_length_a+7>
0x0000000000401169 <+19>: ret
Here is a version of the code with some explanatory comments:
Address Offset Instruction Comment
--------- ------- ------------------- -----------
0x401156 <+0>: mov $0x0,%eax ; Set return value to 0
0x40115b <+5>: jmp 0x401160 <string_length_a+10> ; Jump into loop past update
0x40115d <+7>: add $0x1,%eax ; Increment count
0x401160 <+10>: movslq %eax,%rdx ; Copy count into %rdx
0x401163 <+13>: cmpb $0x0,(%rdi,%rdx,1) ; Compare one byte to 0
0x401167 <+17>: jne 0x40115d <string_length_a+7> ; Continue loop if not equal
0x401169 <+19>: ret ; Return if byte is NUL
Answer the following questions:
What is the starting address of the function?
Correct answer: 0x401156 Explanation: It’s also fine to include more leading zeroes.
What is the significance of the red and orange highlights above (on the addresses of the jump instructions at addresses +5 and +17, and on the address parts of the lines with addresses +7 and +10)?
Example answer: These highlights are showing where the jump instructions go to: The highlighted addresses match the value specified for each jump, since line +5 jumps to line +10, and line +17 jumps to line +7.
The blue, purple, and teal highlights above help track which register is used where. For each register, click to select whether this function reads from it and/or writes to it:
%rdi
- Only reads
- Only writes
- Reads and writes
%rdx
- Only reads
- Only writes
- Reads and writes
%eax
- Only reads
- Only writes
- Reads and writes
Which register(s) are used as memory addresses and/or offsets in memory?
- %rdx
- %rdi and %rdx
- %rdi
- %eax and %rdx
How can you tell when a register is being used as a memory address?
Example answer: Parentheses in the assembly code denote memory access, so registers that appear within parentheses are used as an address (in the first spot) or an offset (in the second spot).
You have learned that by default, certain registers are used to pass parameters to a function. When you begin execution of the function, which register contains the parameter?
Correct answer: %rdi
By examining practice.c, how many parameters does this function have?
Correct answer: 1 Explanation: We know this because parameters are always stored starting in %rdi, then in %rsi, etc., and this function uses %rdi but does NOT use %rsi.
Which register is used to represent the value of count?
Correct answer: %eax Explanation: Note that the name for the full-length version of this register is %rax. Only the low 4 bytes of the register are used when we refer to it as %eax, which is consistent with the int type for count. The value is also moved into %rdx, which is then used to index the string. %rdx holds an 8-byte version of the same value (sign-extended) because the offset needs to be 8 bytes, since the string address in %rdi is 8 bytes. The ‘slq’ in movslq stands for ‘sign-extend long to quad’, meaning we sign-extend 4 bytes into 8.
You have learned that a particular register is used for returning the value from a fruitful function. When you return from the function, which register will hold the return value?
Correct answer: %eax Explanation: Again, %rax is the name for the full register.
Which four instructions constitute the while loop from the C program? List their relative addresses, separated by commas:
Correct answer: 7, 10, 13, 17
Check the commented version of the code above. Are there any instructions that don’t make sense to you?
Example answer: Ask the instructor to explain things that don’t make sense at this stage. You will be asked to read and understand a lot of assembly code on the upcoming assignments.
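If the jump into the middle of the loop looks odd, it may help to see the same control flow written back in C. The sketch below mirrors the disassembly of string_length_a one instruction per line using goto; the name string_length_goto and the labels body and test are made up for this illustration, and the register comments assume the listing shown above.

```c
/* Hypothetical rewrite of string_length_a that follows the assembly
 * listing instruction by instruction. */
int string_length_goto(const char *str)   /* str arrives in %rdi */
{
    int count;      /* lives in %eax */
    long index;     /* lives in %rdx */

    count = 0;                  /* +0:  mov $0x0,%eax            */
    goto test;                  /* +5:  jmp <+10>                */
body:
    count += 1;                 /* +7:  add $0x1,%eax            */
test:
    index = (long)count;        /* +10: movslq %eax,%rdx         */
    if (str[index] != 0)        /* +13: cmpb $0x0,(%rdi,%rdx,1)  */
        goto body;              /* +17: jne <+7>                 */
    return count;               /* +19: ret                      */
}
```

Compilers often place the loop test at the bottom like this so that each iteration needs only one conditional jump; the initial jmp exists only to reach the test before the first increment.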
Exercise 4:
Now, execute the program in gdb, using stepwise execution, and use gdb commands to understand the contents of registers and memory, by following these instructions:
Start by setting a breakpoint at the beginning of string_length_a and then run to execute the program until it hits the breakpoint:
(gdb) break string_length_a
(gdb) run
At this point, the program is paused at the beginning of the function, and %rdi should contain the address of the string whose length is being calculated. Show the value of %rdi using:
(gdb) info reg rdi
rdi 0x4009d5 4196821
Note that the value of the register is shown in both hexadecimal and decimal notation. Your value may be different, since it’s a memory address that gets randomized.
The first test string when running the program is "act". Let’s examine memory to confirm that.
To display the value pointed to by %rdi as a string:
(gdb) x /s $rdi
0x4009d5: "act"
To display as hex values:
(gdb) x /4bx $rdi
0x4009d5: 0x61 0x63 0x74 0x00
You are seeing the 4 bytes in hexadecimal notation which are the ASCII values for the characters of the string, plus a null character to terminate the string.
Now, use the command help x in gdb to display help for the x command and figure out how we could display the string (including the NUL byte) as characters:
Correct answer: x /4bc $rdi Explanation: Note that x /4c $rdi will also work, since ‘c’ for character implies that only a single byte is being read.
Run that command and you should see:
0x4009d5: 97 'a' 99 'c' 116 't' 0 '\000'
You are seeing the characters (and the decimal notation for their ASCII values) in the string.
Now, single-step several times, executing one line of C code at a time and pausing after each:
Note: when stepping in gdb, step will complete one line of C code while stepi will complete one assembly instruction.
(gdb) step
24 int count = 0;
(gdb) step
26 count++;
(gdb) step
25 while (str[count])
(gdb) step
26 count++;
(gdb) step
25 while (str[count])
(gdb) step
26 count++;
Examine the disassembled code again, and find the address of the instruction which will return from the function (on line +19 below).
(gdb) disas
Dump of assembler code for function string_length_a:
0x0000000000401156 <+0>: mov $0x0,%eax
0x000000000040115b <+5>: jmp 0x401160 <string_length_a+10>
0x000000000040115d <+7>: add $0x1,%eax
0x0000000000401160 <+10>: movslq %eax,%rdx
0x0000000000401163 <+13>: cmpb $0x0,(%rdi,%rdx,1)
0x0000000000401167 <+17>: jne 0x40115d <string_length_a+7>
0x0000000000401169 <+19>: ret
Use help break in gdb to find the command to set a breakpoint at the return instruction:
Correct answer: break *string_length_a + 19 Explanation: Note: You could also use * followed by the specific memory address of that line, which varies between users. If your code doesn’t match this assembly code, use the address of the ret (or rep ret) instruction in your code.
Continue execution (your program will run until it hits the breakpoint).
(gdb) continue
Breakpoint 2, 0x0000000000401169 in string_length_a (str=0x4009d5 "act") at practice.c:33
What gdb command will show us the contents of the %eax register that is holding the return value?
Correct answer: info reg eax Explanation: Note that print $eax would also work.
Run that command and you should see the number 3 (in hex). This is because the string "act" has 3 characters.
Exit from gdb using quit.
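The x /4bx output earlier in this exercise is just the raw bytes of the string, and the same view can be produced from C. The following is a rough sketch, not part of the lab code: the helper name format_bytes_hex and its parameters are invented here, and it only imitates the byte formatting, not the address column that gdb prints.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write n bytes starting at p into out as
 * space-separated hex values, similar to gdb's x /Nbx.
 * Returns the number of characters written (excluding the NUL). */
int format_bytes_hex(const void *p, size_t n, char *out, size_t out_size)
{
    const unsigned char *bytes = p;
    size_t used = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, out_size - used, "%s0x%02x",
                         i ? " " : "", bytes[i]);
        if (w < 0 || (size_t)w >= out_size - used) {
            out[used] = '\0';   /* drop the partial entry if out is full */
            break;
        }
        used += (size_t)w;
    }
    return (int)used;
}
```

For example, passing the string "act" with n set to 4 fills out with 0x61 0x63 0x74 0x00, the same four bytes gdb showed at the address held in %rdi.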
Reverse Engineering/Deciphering Adventure
In this part of the lab we will get some practice on the x86 assignment, although we won’t actually start the real assignment. To create a directory and get the starter files, use these commands in the shell, from within your cs240-repos directory:
mkdir x86practice
cd x86practice
wget https://cs.wellesley.edu/~cs240/lab/lab07/starter/main.c
wget https://cs.wellesley.edu/~cs240/lab/lab07/starter/sample.bin
chmod 755 sample.bin
cp sample.bin backup.bin
The wget program can download files into the current directory based on a URL. We’re fetching two files: main.c and sample.bin. The first file is part of the C code that was used to compile an “adventure” program which requires the user to input a series of obscure codes to solve puzzles. Unfortunately, the main.c file does not contain the C code for any of the puzzles (you can see at the top it does #include phases.h, but we don’t have a copy of either phases.h or phases.c). The sample.bin file is the entire compiled program, and since it’s in executable format, that means it contains all of the machine instructions. Your job for the assignment will be to reverse engineer those machine instructions, using gdb and/or objdump to convert them to assembly and then reading through the assembly code to figure out what the program does, and thus what inputs it needs to receive. You can write those inputs into a file named inputs.txt and feed them into the program so that you don’t have to re-type them each time. One caveat though: the program is set up to self-destruct, and will erase itself if you give it the wrong passwords!
Note: In the code above, we included commands to make a backup copy of sample.bin in case it self-destructs. You can use cp backup.bin sample.bin to restore it if it erases itself. If you’re quick, you can also hit control-C as it’s counting down to interrupt the self-destruct process, and in gdb, you can set a breakpoint to stop the self-destruct process from triggering. In the worst case, you can always re-download the file.
Exercise 5:
Look at the code in main.c, and answer these questions:
How can the program be asked to read input from a file?
Example answer: By giving it the file name as an argument. For example, instead of running ./sample.bin you could run ./sample.bin inputs.txt.
How many phases are there in the program?
Correct answer: 6
Which functions does it call from code that you do not have access to?
Example answer: initialize_obstacles, read_line, phase_1 through phase_6, and phase_disarmed. Note that printf and fopen are standard library functions.
What argument is passed to each phase function?
Example answer: Each of the phases gets the result of a read_line call which presumably gets either user input or input from the specified file.
Since arguments are passed to functions via registers, and registers hold at most 64 bits on our system, what value is actually put in a register when the assembly code jumps to one of the phase functions?
Example answer: A pointer: the address of a location where the user’s input was stored as a string.
Each phase uses a string input by the user to perform some task. The phases have to do with (1) comparison, (2) loops, (3) switch statements, (4) recursion, (5) pointers and arrays, (6) sorting linked lists.
If you input the incorrect string for a phase, you get an error message and the program will attempt to delete itself. You must complete a phase by entering the correct string in order to move on to solving the next phase.
How can we figure out what string is required if we don’t have access to the C code?
Example answer: Disassemble the code using objdump and read through the assembly code to understand what it does. Also, debug the program with gdb and use disas to inspect the assembly code plus breakpoints, stepping, and print/info reg/examine to see what the program is actually doing step by step.
Exercise 6:
Run sample.bin with gdb and display the disassembled version of the main function:
gdb ./sample.bin
(gdb) start //runs the program and pauses at the beginning of the main function
(gdb) disas // shows disassembly of current function
Dump of assembler code for function main:
=> 0x0000000000400e2d <+0>: push %rbx
0x0000000000400e2e <+1>: mov %rsi,%rbx
0x0000000000400e31 <+4>: cmp $0x1,%edi
0x0000000000400e34 <+7>: jne 0x400e46 <main+25>
0x0000000000400e36 <+9>: mov 0x202ccb(%rip),%rax # 0x603b08 <stdin@@GLIBC_2.2.5>
0x0000000000400e3d <+16>: mov %rax,0x202cdc(%rip) # 0x603b20 <infile>
0x0000000000400e44 <+23>: jmp 0x400e9c <main+111>
0x0000000000400e46 <+25>: cmp $0x2,%edi
0x0000000000400e49 <+28>: jne 0x400e80 <main+83>
0x0000000000400e4b <+30>: mov 0x8(%rsi),%rdi
0x0000000000400e4f <+34>: mov $0x4027ed,%esi
0x0000000000400e54 <+39>: callq 0x400cb0 <fopen@plt>
Hit -return- several times to display the complete disassembled code for main. Read through the code, and notice that you can recognize the calls to functions that you saw in the C source (such as calls to printf, phase_1, etc.).
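As a reading aid, the first few instructions of main shown above can be guessed back into C. This is only a sketch reconstructed from the partial listing, not the actual main.c: the function name choose_input is invented, the "r" mode string is an assumption (the listing only shows that some address is passed as the second argument to fopen), and the behavior for other argument counts is not visible in the excerpt.

```c
#include <stdio.h>

/* Hypothetical reconstruction of the argument check at the top of main.
 * argc arrives in %edi and argv in %rsi. */
FILE *choose_input(int argc, char **argv)
{
    if (argc == 1)                      /* +4:  cmp $0x1,%edi           */
        return stdin;                   /* +9:  stdin copied to infile  */

    if (argc == 2)                      /* +25: cmp $0x2,%edi           */
        return fopen(argv[1], "r");     /* +30: mov 0x8(%rsi),%rdi loads argv[1];
                                           the "r" mode is an assumption */

    /* Any other count: the listing jumps to <main+83>, which is not
     * shown in the excerpt, so this sketch just reports failure. */
    return NULL;
}
```

Note how 0x8(%rsi) corresponds to argv[1]: argv is an array of 8-byte pointers, so the second element sits 8 bytes past the address in %rsi.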
You have learned that up to 6 arguments for a function are stored in the following registers:
- arg1: %rdi
- arg2: %rsi
- arg3: %rdx
- arg4: %rcx
- arg5: %r8
- arg6: %r9
Therefore, for main (which has 2 arguments: argc, the number of command-line arguments, and argv, the array of strings representing the command-line arguments), you would expect argc to be stored in %rdi, and argv to be stored in %rsi.
Display those registers and their current values:
(gdb) info reg rdi
rdi 0x1 1
(gdb) info reg rsi
rsi 0x7fffffffe168 140737488347496
Remember: your machine may have somewhat different values than shown here when an address in memory is displayed, so %rsi may show a different value on your machine than the one displayed here.
From the value of %rdi, how many command-line arguments are there for the current invocation of main?
Correct answer: 1
Which of the following gdb commands can be used to display the string pointed to by argv (use help in gdb and test these out if you need to)?
- print argv
- print *argv
- print **argv
- x /s argv
- x /s *argv
- x /s *$rsi
- print *(char**)$rsi
Explanation: Note that x /s *$rsi does not work, because the system knows the type of argv but does not know the type of $rsi, and assumes that it points to 4 bytes instead of 8.
What is the meaning of the first command-line argument to the program?
Example answer: It’s the path to the executable file that we ran.
You can view the arguments for any function call by examining the registers which hold the parameters at the beginning of the function. This will be useful to you when deciphering your adventure.
Continue execution of the program at this point:
(gdb) continue
You have just ridden the elevator to the hidden 7th floor
of the Science Center. A fantastic adventure awaits!
First, you must pass a *stringy* spider's web beyond *compare*, guarding a trap door!
The program is waiting for you to enter a string at this point (which is the solution for phase_1).
You don’t know the answer yet, so just guess and enter a string. A message similar to the following will be displayed:
!!!! WOMP WOMP !!!!
This adventure will self-destruct in 3 seconds...
This adventure will self-destruct in 2 seconds...
This adventure will self-destruct in 1 seconds...
This adventure will self-destruct in 0 seconds...
**** POOF! ****
Refer to the note above on how to recover the file if you didn’t hit control-C in time.
We’d like to avoid the self-destruct mechanism without having to press control-C quickly. Let’s look at why that happened, and how we can avoid it. First, go back into gdb (possibly after restoring the file):
gdb ./sample.bin
The main function calls phase_1. You need to understand what phase_1 does in order to figure out the correct string.
What is the gdb command to set a breakpoint at the phase_1 function?
Correct answer: break phase_1
Set a breakpoint at phase_1 and run the program. The program will not immediately reach the phase_1 function: first it needs user input for read_line in main.
Type a single character, and hit -return-.
Note: You are entering an incorrect string again here, but don’t worry.
You should now stop at the breakpoint for phase_1:
Breakpoint 2, 0x0000000000400f77 in phase_1 ()
Disassemble to display the assembly code for the function:
(gdb) disas
Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>: sub $0x8,%rsp
0x0000000000401084 <+4>: mov $0x4021b3,%esi
0x0000000000401089 <+9>: callq 0x40143e <strings_not_equal>
0x000000000040108e <+14>: test %eax,%eax
0x0000000000401090 <+16>: je 0x401097 <phase_1+23>
0x0000000000401092 <+18>: callq 0x40181c <trip_alarm>
0x0000000000401097 <+23>: mov $0x1,%eax
0x000000000040109c <+28>: add $0x8,%rsp
0x00000000004010a0 <+32>: retq
End of assembler dump.
Notice that there is a call to a function called trip_alarm (and a jump that can skip that call). trip_alarm is responsible for deleting your sample when an incorrect string is entered.
To avoid executing trip_alarm, set a breakpoint at trip_alarm:
(gdb) break trip_alarm
Since you have entered an incorrect string, you will expect to hit the breakpoint when you continue execution of the program:
(gdb) continue
Breakpoint 2, 0x0000000000401392 in trip_alarm ()
You are paused at the beginning of trip_alarm at this point. To avoid executing it, simply re-run your program from the beginning:
(gdb) run
The program being debugged has been started already. Start it from the beginning? (y or n) y
Answer ‘y’, and you are again executing from the beginning of the program (and it has not been deleted).
Note: to avoid tripping the alarm, ALWAYS set a breakpoint at trip_alarm as soon as you start gdb! There is a mechanism to do this automatically: create a file called .gdbinit and write ‘break trip_alarm’ in it, then follow the instructions about enabling auto-load that GDB prints when it starts.
Exercise 7:
Now, examine phase_1 more closely.
Once again, enter an incorrect string when you are prompted, and you will again hit the breakpoint for phase_1.
Disassemble to display the assembly code for the function:
(gdb) disas
Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>: sub $0x8,%rsp
0x0000000000401084 <+4>: mov $0x4021b3,%esi
0x0000000000401089 <+9>: callq 0x40143e <strings_not_equal>
0x000000000040108e <+14>: test %eax,%eax
0x0000000000401090 <+16>: je 0x401097 <phase_1+23>
0x0000000000401092 <+18>: callq 0x40181c <trip_alarm>
0x0000000000401097 <+23>: mov $0x1,%eax
0x000000000040109c <+28>: add $0x8,%rsp
0x00000000004010a0 <+32>: retq
End of assembler dump.
From earlier, do you remember how many parameters phase_1 has?
Correct answer: 1
What register holds the parameter?
Correct answer: %rdi
What command will show the value of the parameter at this point?
Correct answer: info reg rdi Explanation: print $rdi would also work, but only shows the decimal value unless you add /x.
Run that command and you should see something like:
rdi 0x603640 6305344
What command can be used to examine the string at that address?
Correct answer: x /s $rdi
When you run this command, you should see something like:
0x603640 <input_strings>: "a"
This assumes the incorrect string entered earlier is an “a.” You should see whatever you typed in as an input.
The main function puts a pointer to the string you typed into %rdi before calling phase_1.
In general, if a function is called, the parameters needed must be put in the proper registers before the call. Examining the contents of the registers holding arguments before the call is often a good way to understand what the function being called will do.
So, let’s look at what functions are called from phase_1:
Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>: sub $0x8,%rsp
0x0000000000401084 <+4>: mov $0x4021b3,%esi
0x0000000000401089 <+9>: callq 0x40143e <strings_not_equal>
0x000000000040108e <+14>: test %eax,%eax
0x0000000000401090 <+16>: je 0x401097 <phase_1+23>
0x0000000000401092 <+18>: callq 0x40181c <trip_alarm>
0x0000000000401097 <+23>: mov $0x1,%eax
0x000000000040109c <+28>: add $0x8,%rsp
0x00000000004010a0 <+32>: retq
End of assembler dump.
Here we have highlighted each value listed in the assembly code which is a memory address.
Of the highlighted addresses, which are not the address of a function?
- 0x4021b3
- 0x40143e
- 0x401097
- 0x40181c
Explanation: The first address is put into %esi, and since that’s the ‘e’ part of %rsi that holds the second argument to a function, it’s being used as a function parameter. The second and fourth addresses are listed with angle brackets afterwards showing that they are jumps to labeled addresses, and those labels are the function names for their functions. The third address has angle brackets, but the offset +23 indicates it’s a jump within a function body, not a jump to call a function.
What do you think the function called strings_not_equal does?
Example answer: Probably returns 1 (i.e., sets %rax to 1) if the two strings it gets as arguments are not the same, and returns 0 (i.e., sets %rax to 0) if they are the same.
How many parameters do you think this function would need?
Correct answer: 2
Before strings_not_equal is called, the registers which pass the parameters to it must be set up with the parameter values, which we assume must be pointers to strings. Which two registers would need to have pointers in them?
Example answer: %rdi and %rsi are the registers used for the first two arguments to a function, so these need to have strings in them.
Does the code of phase_1 modify either of these registers before calling strings_not_equal?
Example answer: Yes, the instruction at +4 modifies %esi which is the ‘e’ part of %rsi.
Why doesn’t the code of phase_1 need to modify the value of %rdi as well as %rsi?
Example answer: When phase_1 is called, the user’s input string is already placed into %rdi. So to compare that string against another string, it just needs to set %rsi to the second string and leave %rdi alone.
What gdb command can show us what value %esi points to?
Correct answer: x /s $esi Explanation: In this case we use /s because we know it should be a string. We might use something like /xg if we expected a pointer, /dw if we expected an integer, etc. (see help x in gdb)
Run this command and note down what it shows.
The value in %esi may not have been what you expected. But wait. Where are we in the code right now? Use disas and look for the arrow to see which instruction is up next. If you’re still before the line that sets up %esi, then you can either use x to examine the raw address 0x4021b3, or use stepi to step forward until the mov command executes and then check the value of %esi. What is the string that your input will be compared against?
Correct answer: Alexandria Botanic Garden
What do you think the significance of this string is?
Example answer: It’s the correct answer to the first phase!
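The guess about what strings_not_equal does can be written down as a tiny C function. This is purely a hypothetical stand-in based on the function’s name and how phase_1 uses its return value; the real function inside sample.bin may be implemented differently (for example, with its own loop rather than strcmp).

```c
#include <string.h>

/* Hypothetical stand-in for strings_not_equal: returns 1 when the two
 * strings differ and 0 when they are identical. The first argument
 * would arrive in %rdi and the second in %rsi. */
int strings_not_equal_sketch(const char *a, const char *b)
{
    return strcmp(a, b) != 0;
}
```

With this model, an input of "a" compared against the string stored at 0x4021b3 gives 1, which is the case that leads phase_1 to call trip_alarm.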
Now restart the program by entering run. Enter the string you just discovered, and when your phase_1 breakpoint triggers, enter c to continue. If all goes well, you should be on to phase 2.
You can now add that string to the inputs.txt file. Then run your program with ./sample.bin inputs.txt (or in gdb use run inputs.txt) and it will automatically enter that input for you each time you re-run it.
Note: Once you pass command-line arguments to run within gdb once, it will remember them and you can subsequently just type run without arguments. It will print “Starting program:” and then show you the full set of arguments each time you run.
Exercise 8:
You have seen how to find the answer for phase_1 using a bit of guesswork based on the strings_not_equal function name, but we didn’t look into the details of the rest of the assembly code.
Notice the instructions following the call to strings_not_equal:
Dump of assembler code for function phase_1:
=> 0x0000000000401080 <+0>: sub $0x8,%rsp
0x0000000000401084 <+4>: mov $0x4021b3,%esi
0x0000000000401089 <+9>: callq 0x40143e <strings_not_equal>
0x000000000040108e <+14>: test %eax,%eax
0x0000000000401090 <+16>: je 0x401097 <phase_1+23>
0x0000000000401092 <+18>: callq 0x40181c <trip_alarm>
0x0000000000401097 <+23>: mov $0x1,%eax
0x000000000040109c <+28>: add $0x8,%rsp
0x00000000004010a0 <+32>: retq
End of assembler dump.
Remember that %eax is just the lower half of %rax, which is the register used for returning a value from a function. So, the value of %eax in this instruction must be the value returned from strings_not_equal.
What is the effect of
test %eax,%eax
(look up thex86
test
instruction)?Example answer: It performs bitwise AND and discards the result, but sets flags based on that result. In particular, in our code the “ZF” or “zero flag” will be important.
How does the result of the
test
instruction affect the following instruction (look up the x86
je
instruction)?
Example answer: The
je
instruction jumps “if equal”, but what this really means is “if the zero flag is set”, because “equal” is discovered through subtraction just like in the HW ISA. So if the %eax
value is 0, the test
instruction will set the zero flag and the jump will happen. This makes the jump happen when the strings are NOT “not equal” (since the function is called strings_not_equal
) or in other words, when they’re equal. This x86 jumps reference is pretty useful for all of the different jump instructions.
What would the C code that generated this assembly code look like (just give a vague sketch and then check the answer)?
Note: The
sub $0x8, %rsp
and corresponding add $0x8, %rsp
instructions are modifying %rsp
, the “stack pointer register”, which we will learn about later. They are not terribly meaningful within this code, although they are necessary to maintain stack alignment.
Example answer:
int phase_1(char* input_string) {
    if (strings_not_equal(input_string, correct_string)) {
        trip_alarm();
    }
    return 1; /* matches mov $0x1,%eax at +23 */
}
Stretch Exercises: Read Six Numbers
Next, let’s look at the function read_six_numbers
(a
function called from phase_2
).
Disassemble the function phase_2
:
(gdb) disas phase_2
Dump of assembler code for function phase_2:
0x00000000004010a1 <+0>: push %rbp
0x00000000004010a2 <+1>: push %rbx
0x00000000004010a3 <+2>: sub $0x28,%rsp
0x00000000004010a7 <+6>: mov %rsp,%rsi
0x00000000004010aa <+9>: callq 0x4018b7 <read_six_numbers>
Note: Like phase_1
,
phase_2
has one parameter, which is the pointer to the
string that is input by the user. Therefore, %rdi
will
contain the pointer to the input when the function begins
execution.
Notice the highlighted line above (the mov %rsp,%rsi at +6).
read_six_numbers is being set up by copying the current
%rsp
to %rsi
before the call to
read_six_numbers
.
So, the first parameter to read_six_numbers
is the
string input by the user, and the second parameter is a location in the
stack area of memory.
Exercise 9:
What do you think the purpose of read_six_numbers
is?
Example answer: Probably it will split up the input into 6 different numbers?
Where do you think the results of executing
read_six_numbers
will be stored?
- In the code segment of the executable.
- On the computer’s hard drive.
- On the heap, after allocating space with malloc.
- On the stack, within the stack frame for phase_2 so that phase_2 can make use of them.
Explanation: The key clue here is that the stack pointer during
the phase_2
call was passed as an argument to
read_six_numbers
. It’s possible that that function calls
malloc
and returns a pointer to numbers allocated on the
heap, but giving it a pointer to the stack makes it likely that it will
store the numbers there.
Next, disassemble read_six_numbers
and examine the code
(here with address values highlighted):
(gdb) disas read_six_numbers
Dump of assembler code for function read_six_numbers:
0x00000000004018b7 <+0>: sub $0x18,%rsp
0x00000000004018bb <+4>: mov %rsi,%rdx
0x00000000004018be <+7>: lea 0x4(%rsi),%rcx
0x00000000004018c2 <+11>: lea 0x14(%rsi),%rax
0x00000000004018c6 <+15>: mov %rax,0x8(%rsp)
0x00000000004018cb <+20>: lea 0x10(%rsi),%rax
0x00000000004018cf <+24>: mov %rax,(%rsp)
0x00000000004018d3 <+28>: lea 0xc(%rsi),%r9
0x00000000004018d7 <+32>: lea 0x8(%rsi),%r8
0x00000000004018db <+36>: mov $0x40277d,%esi
0x00000000004018e0 <+41>: mov $0x0,%eax
0x00000000004018e5 <+46>: callq 0x400c80 <__isoc99_sscanf@plt>
0x00000000004018ea <+51>: cmp $0x5,%eax
0x00000000004018ed <+54>: jg 0x4018f4 <read_six_numbers+61>
0x00000000004018ef <+56>: callq 0x40181c <trip_alarm>
0x00000000004018f4 <+61>: add $0x18,%rsp
0x00000000004018f8 <+65>: retq
End of assembler dump.
Notice that this function calls a C library function called
sscanf
(the callq at +46, highlighted in blue above).
Search online for a definition of the function sscanf
and answer the following questions:
What does sscanf
do?
Example answer: It reads formatted values from a string, and stores them into variables pointed to by pointers it is given in addition to the string to scan.
How many parameters does sscanf
have?
Example answer: It has a variable number of parameters, but always at least 2: the first parameter is the string to scan and the second is the format to recognize. The third and subsequent parameters are pointers to variables where the scan results should be placed. For each % escape in the format string, the result will be placed into the variable pointed to by the next argument in order.
What are the first and second parameter of sscanf
, and
how are they used?
Example
answer: The first parameter is the string to
scan, and the second is the format string. The format string determines
what kind(s) of values sscanf
will try to read from the
string being scanned, using % escapes like printf
.
What are the remaining parameters of sscanf
, and how are
they used?
Example answer: Each additional parameter is a pointer to a variable into which the next scan result should be placed.
What value is returned by sscanf?
Example answer: It returns the number of items successfully scanned.
Notice that a large immediate value is loaded into %esi
before the call to sscanf
(the mov at +36, highlighted in red
above). This is setting up the second parameter for the function.
Examine memory at that address to understand the meaning of the large
constant:
Note: Your own computer may load a different value
than 0x40277d
into %esi
; if so, use the value from your own computer in the command
below. If you step far enough, you
could also use $esi
.
(gdb) x /s 0x40277d
0x40277d: "%d %d %d %d %d %d"
Explain what this string tells you about the format of the expected input string:
Example answer: It means that the input string will be read as 6 decimal numbers in base 10, separated by whitespace.
Now that you understand the format of the string expected, again
examine the code in phase_2
after the call to
read_six_numbers
(starting at the +14 line right after the
call to read_six_numbers
):
(gdb) disas phase_2
Dump of assembler code for function phase_2:
0x00000000004010a1 <+0>: push %rbp
0x00000000004010a2 <+1>: push %rbx
0x00000000004010a3 <+2>: sub $0x28,%rsp
0x00000000004010a7 <+6>: mov %rsp,%rsi
0x00000000004010aa <+9>: callq 0x4018b7 <read_six_numbers>
0x00000000004010af <+14>: cmpl $0x1,(%rsp)
0x00000000004010b3 <+18>: je 0x4010da <phase_2+57>
0x00000000004010b5 <+20>: callq 0x40181c <trip_alarm>
0x00000000004010ba <+25>: jmp 0x4010da <phase_2+57>
0x00000000004010bc <+27>: add $0x1,%ebx
0x00000000004010bf <+30>: mov %ebx,%eax
0x00000000004010c1 <+32>: imul -0x4(%rbp),%eax
0x00000000004010c5 <+36>: cmp %eax,0x0(%rbp)
0x00000000004010c8 <+39>: je 0x4010cf <phase_2+46>
0x00000000004010ca <+41>: callq 0x40181c <trip_alarm>
0x00000000004010cf <+46>: add $0x4,%rbp
0x00000000004010d3 <+50>: cmp $0x6,%ebx
0x00000000004010d6 <+53>: jne 0x4010bc <phase_2+27>
0x00000000004010d8 <+55>: jmp 0x4010e6 <phase_2+69>
0x00000000004010da <+57>: lea 0x4(%rsp),%rbp
0x00000000004010df <+62>: mov $0x1,%ebx
0x00000000004010e4 <+67>: jmp 0x4010bc <phase_2+27>
0x00000000004010e6 <+69>: add $0x28,%rsp
0x00000000004010ea <+73>: pop %rbx
0x00000000004010eb <+74>: pop %rbp
0x00000000004010ec <+75>: retq
End of assembler dump.
Based on lines +14, +18, and +20, what do you think the first value in the string should be?
Example
answer: It should be a 1, because otherwise
line +18 will not jump and line +20 will call trip_alarm
.
Where is that first value stored?
Example
answer: In memory, where %rsp
the stack pointer is pointing. This is the lowest-addressed end of the stack, which is usually called the top of the stack (the
stack grows downwards).
If the first value is correct, what line of code is jumped to/executed next?
Example answer: Line +57, highlighted in red above.
The lea
instruction can be used to do a variety of math
operations, although its original purpose is to do address calculations.
To understand how it works, you need to understand the addressing syntax
in x86 assembly. The syntax for memory address computations is:
offset ( base, index, stride)
with defaults of offset = 0, base = 0, index = 0, and stride = 1. The formula for the address is:
base + offset + (index × stride)
So for example, the assembly expression
2(%rdi, %rsi, 4)
, assuming %rdi
holds 100
and %rsi
holds 10, would compute the address 142, which is
%rdi
+ 2 + %rsi
* 4. The lea
instruction just computes that value and stores it in the destination
register.
Given this information, explain the instruction
lea 0x4(%rsp),%rbp
, and what value you expect
%rbp
to contain (this one is complicated):
Example
answer: Here the offset is 4, and there is
no index or stride specified. So we just have %rsp + 4
being stored in %rbp
: it’s serving as an alternate
add
instruction.
What’s the significance of the number 4 in the above instruction, given that we are storing integers on the stack in this function?
Example
answer: 4 is the size of one integer, so
we’re setting %rbp
to point to the next integer beyond the
stack pointer.
Note that the next instruction initializes %ebx
to a
value. Record the initial value of %ebx
:
Correct answer: 1
What address does the program jump to next?
Example answer: +27, the line highlighted in blue.
Hand-execute the next four instructions (+27, +30, +32, and +36), keeping track of the value of any registers that change. Explain what you think is happening during the execution of these instructions:
Example answer: First, we add 1 to %ebx, so the value is now 2. Next we copy %ebx into %eax. Then we multiply whatever is in memory at 4 bytes below %rbp with %eax. In this case, the result will be 2, since we’re multiplying %eax (currently 2) with the first input number (must have been 1). In future iterations, this will always be multiplying the previous input number by the current %ebx value. Finally, we’re comparing the value we just got from multiplication with the second/current input number.
Can you understand the relationship that needs to hold between each value and the one following it in the string for the password to be accepted?
Example
answer: Since %ebx
starts at 2
for the second input value (first input value must always be a 1), and
%ebx
increases by one each time, each input value must be a
multiple of the previous value, but that multiple keeps increasing by 1.
If we consider the indices of the inputs starting from zero as ‘i’, each
number has to be (i + 1) times the previous one. So the correct sequence
would be 1 2 6 24 120 720
.
Hints for Reverse Engineering
Read through each new function and highlight all of the address values, including jump targets within the function, calls to other functions, and any other addresses that might be used. Use a separate document for this, and/or print stuff out on paper and use highlighters.
Focus on how to avoid executing calls to
trip_alarm
. Is there a jump that jumps around a trip_alarm
call? What comparison or test decides whether it jumps? What data are inspected by the comparison? What code generates that data? Can you change that data to steer around the trip_alarm
?
Zoom out. You do not need to understand what every instruction does. In fact, you could spend a lot of time deciphering code which will not help you that much in solving the problem. Certainly, details of many instructions are key, but many are not. Stumped on one instruction? Ignore it for now and come back later only if you cannot figure things out without it.
Function names are useful information. Sometimes it’s worth assuming a function does what it says it does and double-checking. You can often double-check by using
gdb
to probe the values of arguments to a function and then check the return value in %rax
.
Some functions are not well-named, but you can still extract some useful information. For example, if you stumble into a function with a wacky name like this:
__isoc99_sscanf@plt:
=> 0x0804873c <+0>: jmp *0x804a100
0x08048742 <+6>: push $0x28
0x08048747 <+11>: jmp 0x80486dc
don’t try to figure out what it does by examining its instructions. There are two markers in the name of the function that suggest it is something to look up:
- __isoc99_: stands for ISO C99, the name of the International Organization for Standardization (ISO) standard for the version of C we are using. This means it is a function in the standard C library.
- @plt: stands for Procedure Linkage Table, a table that the linker and loader use to connect code compiled from a C program with pre-compiled external code. Again, this suggests a function in the standard C library.
The nice thing about standard library functions is that, instead of figuring out what they do instruction by instruction, you can just look them up.
I find the name of this one by removing the
__isoc99_
and the @plt
, to see sscanf
. Then I can:
- Read the manual page about this function with the command man sscanf, focusing on the function headers, the beginning of the DESCRIPTION section and the RETURN VALUE section; or
- Search to find the same info online.
Now I can learn:
- What arguments the function takes.
- What it does with them.
- What it returns.
When you are ready to start the x86 assignment, read the setup
instructions carefully. You will need to repeat phase_1
for
your own version of the assignment. The code for phase_1
will be the same as in the lab sample, but the solution will differ for
your assignment, since each team has unique solutions for each phase of
the adventure.
If you have time left in lab today, you can work on the pointers assignment, and/or look back at unfinished work from last week’s lab, which will help with that. If you are finished with that assignment, you should begin reading the x86 assignment and start work on it.