🔬 Lab
Frequently Asked Questions
This page holds answers to some frequently-asked questions from the lab. Use your browser’s find-in-page to search for words you’re interested in and then click on a question to expand the answer. This page only has answers for some topics, and will be updated throughout the class; feel free to email your lab instructor if you have a question that you’d like to see here.
Lab 5
How does a Register File work?
Check the lab 4 notebook for the details of the register file circuit. Basically, you can use a decoder on an address bus to send a signal to one of N registers, so that only the clock for that register turns on. Then you can send the same address bits into a multiplexer, so that you only get output from the register that you want. The clock is used when writing to a register is needed (and is ANDed with a “write-enable” line for the entire file, or equivalently, the control unit that sends the “write enable” signal can AND that with a clock to ensure that there’s a falling edge between two cycles where “write-enable” is on) and the multiplexer is used to get output. Since the register file has two read addresses/outputs and one write address, there are 3 address inputs, one hooked up to a decoder for deciding where to write, and the other two hooked up to two multiplexers for deciding which register(s) to read from. The same write data is routed simultaneously to all registers, with the clock signal making sure that only the appropriate register is written.How are clock signals used in the CPU?
Within register file and RAM chips, clock signals are only needed when we want to actually write into a register. So the global clock signal gets directed into these devices, and ANDed with a “write enable” line to determine whether or not to actually write data. This can happen either inside or outside the device; in our LogicWorks simulations, we manually turn write enable on and then off (or vice-versa if it’s active-low) to simulate a clock pulse. The ALU doesn’t need a clock signal at all, because it’s a combinational circuit: it produces outputs based on its current inputs all the time. The main place where the clock signal is used every cycle is the program counter: that register has an address update feedback mechanism that we saw in lab either adds 2 or performs BEQ or JMP logic. The clock signal then controls when this feedback mechanism actually updates the register to move to a new address. As soon as that happens, the outputs from instruction memory change, since its input from the PC has changed. When they change, their new value is the next instruction to execute, and so the control unit and register file now receive new data which changes their outputs, and after a short delay necessary to let signals propagate through gates, their outputs have settled and the instruction is complete. As long as the length of one clock cycle is longer than this gate propagation delay, the instruction effects are in place before the next cycle.How can I figure out what a series of HW ISA instructions will do?
To understand what the result will be from a series of HW ISA instructions, you need to mentally (or on paper) keep track of: What is the current program counter and what will it be for the next instruction, what is the content of each of the 16 registers, and what is the content of all 256 memory slots. Of course, you really only need to keep track of those registers and memory slots that the code actually reads from or writes to. Importantly, you really only have 14 registers to track, since R0 is always 0 and R1 is always 1. You can think of each instruction at two different levels of abstraction: as control signals from the control unit (which you should understand the truth table for if you want to deal with stuff at this level) OR as abstract logical/arithmetic operations, listed in the “meaning” column of the main HW ISA slide.
If you want to think about it at the bits/circuit level, trace through the full CPU architecture diagram for each instruction: from the opcode part, figure out the outputs that the control unit will send. From the Rs/Rt/Rd parts, figure out what inputs the register file, ALU, and/or BEQ mechanism will receive. From the control unit outputs, figure out which signals get dropped or forwarded by the multiplexers, which influences these inputs. Then, figure out what the ALU output is, and possibly, what address input the data memory will receive and what its output will be. Finally, figure out what value will be sent back to the register file (or written into the data memory, at what address, or what the BEQ/JMP address result will be). I think that this process is worth doing a few times so you really understand how the circuitry is connected and operating together. However, it’s a slow process. If we think more abstractly, we can reason through assembly code much faster…
So if you want to think at the “meaning” level of abstraction, just read off the “meaning” line for the opcode of the instruction being processed, fill in the s/t/d/offset values in that meaning from the other 12 bits of the instruction, and note down the update to either the PC, a register, or a memory address (again, using paper or a whiteboard here is probably good to start with; once you’re more comfortable you can start to just hold this info in your head). At this level of abstraction, you can also look at the Rd (or Rt for LS/SW) values to understand which register is being updated (except for BEQ/JMP where it’s the PC). This means that if you lose your place in terms of what a register value is, you can look back through the code to the last time where that register is listed as a Rd value (or as Rt for a LW instruction), since that’s the instruction that would have set the current value for the register in question.What is the purpose of the Control Unit?
The purpose of the Control unit is to send all the little extra signals that the other parts of the CPU need to perform the correct function for each different opcode. For example, the ALU needs to be told which operation to perform, and memory and the register file need to be told whether to accept incoming data and update a value or to ignore it (the “write enable” lines). This is why the opcode bits of each instruction go straight into the control unit.
How can we figure out what the control unit should do for a particular opcode?
To come up with the Boolean algebra expressions for the control unit, it’s easiest to start from a truth table: you already know how to use sum-of-products and/or simplification to get form there to an expression. To derive the truth table for the control unit, you need to understand what the other parts of the CPU circuit are doing, and then be able to deduce what the individual control wires need to be set to to make that happen. All of that is in terms of the “meaning” column of the ISA slide. So for example, if the “meaning” for a particular opcode says “R[d] <- R[s] + R[t]” you know that the steps for making that happen are: 1. read data from R[s], 2. read data from R[t], 3. Add those two numbers together, and 3. Store the result in R[d]. If that’s the work the circuit needs to do, then you know that the address inputs to the register file must be s and t, and the read data outputs from the register file must make it into the ALU. That implies that the “Mem” control line will be off for this operation, since otherwise, read data 2 is based on d, not t, and the t data as an offset would get put into the ALU instead of the read data 2 output from the register file. Additionally, it means that the ALU opcode must be the opcode for addition. Further, you can also deduce that Mem must be 0 because the ALU result needs to make it back to the register file data input, and Mem also controls the multiplexer that can swap that for a data memory result. Then you can deduce that the “write enable” bit for the register file should be on, since the result needs to be written into a register. Similarly, the result from the ALU and data value passed to the data memory won’t be sensible, so we need to make sure to set the “mem write” line to 0. Also, we are NOT doing any kind of jump or branch with this meaning, so the relevant control lines for those things should be off. That’s just a simple example, but the process is the same for the other opcodes.
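To make that concrete, here is a hypothetical sketch in C of the settings derived above for an “R[d] <- R[s] + R[t]” style ADD opcode. The struct and the field names are mine, for illustration only, not the official control-unit truth table:
struct ControlSignals {
    int mem;       /* 0 here: second ALU input is R[t] (not the offset), and the ALU result routes back to the register file */
    int memWrite;  /* 0 here: do not write anything into data memory */
    int regWrite;  /* 1 here: do write the ALU result into R[d] */
    int aluOp;     /* whatever bit pattern tells the ALU to perform addition */
};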
If you don’t really understand what the purpose of a particular control line is, look at what it’s connected to and what part that plays in the circuit. If it still seems mysterious, read more about the relevant operation and make sure you understand what has to happen for that operation to take effect.
How is a decoder chip used in a RAM circuit?
(Note: lab 4 has an example RAM circuit where you can see these details.)
The decoder in the RAM circuit gets input from the address input of the whole chip. Remember that a decoder converts things into a “1-hot” encoding, meaning that 1 of N wires is on and the rest are off (or vice versa if it’s active-low). So we take the address input (say, 8 bits) and convert it to a 1-hot encoding (which for 8 bit of address would be 256 wires, one of them active). These wires connect to the individual registers within the RAM: for 8 bits of address we have 256 registers. This activates that particular register, both activating its tri-state buffers to dump its data onto the output bus, and if the write-enable signal is also on (using an AND gate) activating its clock inputs so that data coming in on the data-in lines actually updates that register (the data-in line connects to all registers simultaneously, but only one will receive clock input).How are tri-state buffers used in a RAM circuit?
(Note: lab 4 has an example RAM circuit where you can see these details.)
The RAM circuit needs to produce output from one of its internal registers at a time: whichever one was indicated by the address input. Each register has tri-state buffers for each of its bits, and the address logic takes care of activating only the tri-state buffers for the specific register that the address bits indicate. The tri-state buffer acts as a wire that passes a signal through when activated, but when deactivated, acts as a gap in the circuit. So effectively, one of the registers gets connected to the data outputs at a time, while the rest are disconnected. You can think of a tri-state buffer as a 2x1 mux, where the input signal is on the ‘1’ input to the mux (i.e., gets forwarded through when the select signal is a 1) and the second input to the mux is a disconnected wire (so that when ‘0’ is selected, the output is neither high nor low, nor in-between: there is no output connection). That’s not exactly how it works, since a traditional MUX wouldn’t work that way; check out this circuit (a 2x1 mux implemented using 2 tri-state buffers) if you want to see the details of how a tri-state buffer is implemented.What is RAM actually used for in a computer?
RAM gets used to store most data for a running application. For example, right now, the text that you’re reading is stored in the RAM in an area reserved for your web browser, along with all of the data from all of your open tabs. Now, some of that may in fact be stored on the hard drive if RAM is too small to hold it, and if it is, when you go back to that tab, it will need to be loaded back into RAM, which will take a bit longer than switching to a more recent tab that’s still loaded in RAM. There’s also a special part of RAM (or in some cases a separate RAM chip) that stores the machine instructions for the programs that you’re running: the CPU needs to access that data as well to know what to do. In our HW Arch, this is the “Instruction Memory” that’s separate from the “Data Memory.”
What’s the difference between RAM and the hard drive, and what are they used for?
RAM is general-purpose memory, there’s a lot of it, and it’s much faster to access than the hard drive, so it’s used for all of the data that programs and your operating system are actively using at any given moment. Unlike data stored on the hard drive, data stored in RAM will vanish if you turn off the computer though. So when you save something in a file, that gets written to the hard drive.
Some systems that need to remember things, like an oven that remembers settings, just stay on all the time, because they’re plugged into a wall outlet. If you unplug them, they’ll forget their settings. Other devices have a battery that just keeps the RAM powered when the device is off; SRAM as opposed to DRAM needs very little power to stay active over time. Still other devices might use flash memory or another kind of persistent storage that can retain data even when power is off, although these are typically more expensive than RAM, so there’s a tradeoff there.
What instructions does the computer actually load when it boots up?
When you power on the computer, there’s a specialized chip that contains a fixed set of instructions for the computer to start with. This is the firmware and/or BIOS. This in turn invokes a “boot loader” which loads up the operating system. The operating system itself has a mechanism for loading instructions from a file into instruction memory, but it also does some high-level management of multiple programs to swap them in and out.
How is a bit shift implemented in a circuit?
In hardware, a bit shift is super simple: you just plug in the wires to their next destination shifted left or right by one. The bits that get “shifted off” are just wires that you leave unconnected, while if you need extra zeroes, you make extra ground connections at the destination. If you need to do an arithmetic shift, you can connect a bunch of inputs at the destination up to the shifted top bit of the source, so that whether that single wire is 0 or 1 controls all of the inputs together. None of these things require any transistors, which is great.
What would happen if you accessed instruction memory at an odd address?
If you have byte-addressable memory but the word size is more than 1 byte, you can ask for the computer to load data starting in the middle of an instruction. It will happily grab part of that instruction and then part of the neighboring instruction, and load those two halves together as one word. If you’re lucky (which is quite likely) then that instruction will just be something nonsensical and the program will have a slight glitch or crash (especially in cases where reading one unaligned instruction means you will end up reading more and more as the PC advances by 2 automatically to another unaligned address). If you’re unlucky, it will do something worse, like go into an infinite loop or overwrite some critical data in memory that it shouldn’t have. After all, there’s a chance that the two parts of an instruction together make up an instruction like LW, BEQ, or JMP. These risks are part of why it’s actually not possible to do this in our ISA: the BEQ and JMP instructions have built-in multiply-by-two steps so that the actual address they reference will always be a multiple of 2 if the PC starts at a multiple of 2, which it always does.How does the BEQ instruction work?
The BEQ instruction stands for “branch if equal.” It’s a very important instruction because without it, there’s no way for the computer to change its behavior based on the data it is processing. You can create quite complex programs with just JMP, but you can’t achieve anything nearly as rich as you can with BEQ available as well. BEQ will serve as our “if” statement, as well as “for” and “while” loops when combined with JMP. How does it work? First, “branch” in this case means “go somewhere else in the code, instead of continuing down as normal with PC <- PC + 2.” It could also have been called “JEQ” for “Jump if equal” because it is a kind of jump, but the word “branch” implies a splitting-off which is relevant to BEQ because there are two possible destinations: BEQ examines two registers (or possibly the same register twice) and if they’re equal, it performs the jump indicated by the offset, relative to where the PC would have ended up normally. On the other hand, if the two registers’ values are not equal, it skips the jump and just does PC <- PC + 2 as usual, continuing to the next instruction after it.
As an example of how it is used, consider a while loop: we want to “continue the loop if the continuation condition is true.” If we assume true = 1 and false = 0, you could use the following assembly code at the bottom of your loop:
BEQ Rn, R0, 1
JMP <top of loop>
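For comparison, here is roughly the same control flow written in C (a sketch; keepGoing is my stand-in for whatever value Rn holds):
do {
    /* ... loop body ... */
} while (keepGoing != 0);  /* BEQ Rn, R0, 1 skips the JMP when Rn == 0 (false); otherwise the JMP runs the loop again */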
What inputs does the ALU have?
Ainv, Bneg, op1, and op0 are the control inputs to an ALU (alongside its two data inputs, which carry the values to operate on); we saw these in detail in lab 4. Together, they can be used to get the ALU to perform not only AND, OR, and +, but also other operations like subtraction, NAND, NOR, and even NOT. The Control Unit is responsible for creating these inputs based on the opcode, so there’s a strict relationship between opcodes in the ISA, and the ALU inputs.
What is the role of the Program Counter (or PC) within the CPU?
The Program Counter register stores the “current instruction address,” and while running a program it’s directly connected to the Instruction Memory (one of the two RAM chips in the HW Arch CPU, the other being Data Memory). By connecting to the instruction memory address bus, the PC determines the data output from the instruction memory in each clock cycle, and as it keeps increasing by 2 each cycle it reads out subsequent instructions from memory one by one (or in some cases its value changes to follow the jump dictated by a BEQ or JMP instruction).
What’s the difference between the “Mem” and “Mem Store” lines?
The “Mem” line from the control unit controls three multiplexers, which change the read address 2 input of the register file, the second ALU input, and whether the ALU output or Data Memory output is routed back around to the register file data input. For both LW and SW instructions, this line is set to 1, and the first two multiplexers allow it to perform the R[s] + d (where d is an offset) calculation that both LW and SW require to compute their memory address. However, when doing a LW operation, we do NOT want to write data into the RAM (the data available at the RAM data input for that instruction is the contents of the register Rt that we are trying to write to, not read from). So in this case, we set Mem Store to 0. On the other hand, for an SW operation, we want to do exactly that (store Rs data into memory) and we set Mem Store to 1. In this case, the RAM still outputs a value which will change to match Rs as the instruction completes. We ignore this value by setting the write enable line for the register file to 0.What would happen if you loaded program data as text?
Most likely, you’d see a garbled mess of random characters, possibly with some runs of similar characters where similar instructions happen in a row, and very likely with many unprintable characters where the necessary opcodes or register specifications of machine code, when translated through a symbol table, map to symbols that aren’t normally used. You can try this yourself: right-click on an executable file and choose “open with” and then select a text editor, like Notepad on Windows or TextEdit on a Mac. Alternatively, in the command line if you have a compiled program (like the program you wrote for the bits assignment) type “cat” and then the filename. Note that on the command line this may cause issues with the terminal if it happens to end up printing special output control characters; in that case using the terminal “reset” command (or just opening a new terminal) might be helpful.
Of course, with an expressive enough ISA and machine code, it might be possible to write a program that also looks something like intelligible text. Indeed, any strings used in the program will appear somewhere among the machine code gibberish as well. But it’s very likely that you’d just see weird and nonsensical text.
What would happen if you loaded text as a program?
If you load non-program data, like a text file, as if it were a program, the CPU will interpret that data as instructions and try to run them. Most likely, your “program” will just crash, but if you are supremely unlucky, it may do something else like enter an infinite loop, or even start erasing data from your hard drive. It just depends on how the text was encoded, and what those bit patterns mean when interpreted as machine instructions. Quite likely this will be a bunch of nonsense operations doing arithmetic and hopefully eventually either a lucky HALT instruction, or some combination of instructions when run in an operating system that makes the OS stop the program for bad behavior. But if the bits happen to line up as a bunch of SW instructions, for example, you could end up overwriting important values in RAM.
What’s the difference between a 32-bit and a 64-bit computer?
These terms basically mean that the CPU, including the ALU and the register file, have that many bits in each slot. Of course, if you want to add two 64-bit numbers on a 32-bit machine, it’s not impossible: You issue one add instruction for the first 32 bits of each, then issue an instruction to recover the carry-out and store it somewhere, then issue another instruction to add that carry-out to one of the numbers (again capturing carry-out) and then add that result to the second half of the other number, finally add the carry-out from that to your second carry-out value (if you even care about final carry-out) and finally, issue instructions to write both half-results back into memory next to each other so that they’re still a 64-bit number. Obviously, this is much slower and more painful than just doing a normal + operation, so we tend to end up using 32-bit integers on a 32-bit machine, and 64-bit integers on a 64-bit machine. Code that uses 64-bit integers can handle much much much higher values before overflow is an issue, and so it has a lot of advantages over code using 32-bit integers. Other parts of the computer (like RAM addresses) may also follow the bit width of the CPU, and there can be advantages there (with a 32-bit address space, you can address about 4 billion bytes, so about 4 gigabytes. But many modern computers have 8, 16, or even larger RAMs, which may require a 64-bit address space, or workarounds that slow down 32-bit computers even more).What are the different connections a ALU could have with a RAM?
The question in the lab was mostly to provoke you to think a bit more about what kinds of connections could be made and how pieces of a CPU might fit together, in preface to you actually looking at the (mostly) full CPU circuit diagram later in the lab. In that diagram, the ALU takes inputs from the register file (or possibly Instruction Memory directly for the offset part of a memory address) and sends output both to a register file and to the address input of Data Memory (another RAM). However, note that the data memory output goes into the register file, and the data memory input comes from a register file, Since the ALU can take input from register files and write output to them, it’s indirectly hooked up to the data input and output of the data memory, since values can go into a register for a cycle and in the next cycle go in to the ALU, or vice versa.
On why the ALU is used as an address input in the CPU: The HW ISA states that for LW and SW, the memory address is going to be “R[s] + offset.” For the CPU to implement this, it needs an adder circuit. Instead of adding a dedicated adder for memory address computations, it simply re-uses the existing ALU, which would otherwise not be involved in the LW and SW instructions at all. It routes the last 4 instruction bits which are serving as the offset into the ALU, along with the R[s] result from the register file, and sets the operation bits for the ALU to addition. Then the ALU result is hooked into the data memory address input.How are the “sign extend” and “shift left” for the BEQ computation implemented?
The BEQ “meaning” column states: “PC <- PC + 2 + offset * 2” if the registers are equal. So the circuit needs to take offset, multiply it by 2, and then add it to the normal PC result. The left-shift by 1 bit is a multiply-by-2, and the sign extend ensures that the 4-bit offset value (which is signed 2’s-complement) can be turned into a 16-bit value to be added to the 16-bit PC value. Where are these components in the LogicWorks circuit? They actually aren’t circuit components, but are instead achieved by clever wiring of the offset wires where they connect to the adder input in the BEQ mechanism: the wires are shifted up by one spot, with an extra ground connection below them for the 0, and then the 4th wire is connected to ALL of the inputs above the 5th input as well as the 5th input itself, so that you get a “sign extend” effect. It’s very efficient that we can perform these two operations using just clever wiring and without any transistors involved, but they still need to show up as boxes in the circuit diagram so that you can understand what operations are being performed.How do we know how to connect wires in LogicWorks based on a circuit diagram?
When translating a schematic diagram (like the CPU diagrams we show in various places) into LogicWorks or the Falstad Simulator, you need to first identify which component in the simulator corresponds to which part of the diagram (or what to search for and add to your diagram to get the right component). This is complicated by the fact that some components in the schematic diagram, like a sign extend or a shift, are created just using alternate wiring connections without there being an actual component.
Next, try to make the same connections as are present in the diagram, using buses and breakouts to route multiple wires when the diagram has a slash with a number above it indicating a multi-wire line. Use “show values” in LogicWorks to help double-check that things are connected and that their behavior is correct: you can try to verify each chip one at a time for a simple example where you know what each value should be. (If you can’t put together a simple example and deduce the values for each part of the circuit diagram, it’s best to back up and work that out first, before attempting to simulate the circuit.)
What’s the difference between the HW Instruction Set Architecture (ISA) and a “real” assembly language?
The HW ISA is intentionally kept very simple, so that we can teach it to you in 5 weeks and you can understand how the hardware side of things implements the software specification. In the next part of the course, we’ll be reading and writing programs in Intel x86 assembly; other “popular” (at least, among hardware hacking people I follow on social media) assembly languages include ARM (a much simpler assembly language with a different design philosophy, found in some Macs and many embedded devices like TVs) and 6502 (an older CPU popular among retro tech enthusiasts and used in systems like the Super Nintendo). Each will implement a lot of the same basic instructions as the HW ISA, just with more different flavors. They might also use more than 16 bits, and some have instructions that are different sizes.
The basic operations of the HW ISA are the same operations we’ll use for about 70-80% of our assembly programming, just with slightly different syntax. The most common other instructions we’ll use will be specialized jump instructions for function calls and returns, and some specialized memory access instructions for dealing with the stack (which we’ll be learning about soon).
We will NOT ask you to learn anything about the hardware implementations for these other ISAs: we’re going to draw a firm abstraction boundary there since otherwise we could spend the entire class getting into the details of modern hardware design.
What’s the difference between an ISA (Instruction Set Architecture) and a Microarchitecture?
An Instruction Set Architecture is a specification that says “here’s the meaning of each of these assembly instructions, and here’s how it will be encoded in binary.” It’s possible to build multiple different CPUs that all implement the same ISA, and one might even be faster or slower than another (or consume less power, or generate less heat, etc.). The Microarchitecture is a specific hardware design (and/or actual physical chip) which implements a particular ISA. So the specific layout of the CPU circuit diagrams we’ve shown you is a microarchitecture, while the “HW ISA” slide is pretty much the full specification for the associated ISA.
What does the “R” in RAM mean?
The “R” in “RAM” stands for “Random” which may feel a bit weird, given that its operation is entirely deterministic. But it’s a historical artifact: prior to RAM there was “sequential access memory” where values were always read out one after another, with no address input. You’d either set up your code to take advantage of this, or if you couldn’t, you’d have to wait and let values cycle around again to the spot you needed, which was inefficient. RAM is much more expensive to build in terms of transistors, but the efficiency is typically well worth the investment.
What does the LW instruction do?
The LW instruction stands for “Load Word” and its purpose is to take data stored in memory and load it into a register. This is necessary because the HW ISA does not contain any instructions for performing arithmetic on values stored in memory: only values stored in registers can be added, ANDed, etc. So a typical program will need to load values from RAM into registers, perform whatever computation it needs to do with them, and then store them back into RAM so that it can use the registers for the next batch of operations.
How does the Program Counter (PC) get updated?
The Program Counter has three update mechanisms: the normal PC <- PC + 2 advance, the BEQ mechanism, and the JMP
mechanism. A series of multiplexers controlled by signals from the
control unit and the ALU’s zero output select whichever route is
appropriate for the current instruction. By the end of the clock cycle,
as the falling edge approaches, the correct next PC value will have
propagated through these mechanisms, and be waiting on the input side of
the PC register. During that clock cycle, right from the beginning, the
current PC value will have been used as the Instruction Memory address
input, meaning that the corresponding instruction will be sent out from
the instruction memory data output, which feeds an opcode into the
control unit and which feeds addresses and/or offsets into the register
file and/or ALU. When the falling edge of the clock arrives, the PC’s D
flip-flops will each accept their current input as their new value, and
a new instruction will begin to execute based on the “next PC” result
from the previous instruction. Once again, at some point during that
cycle, the PC update mechanisms will place yet another new PC value at
the PC input side, ready for the next swap.
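Here is a sketch (hypothetical C, not the actual LogicWorks wiring) of the value the PC’s update mechanisms compute each cycle; the multiplexers pick which one reaches the PC register’s input before the falling clock edge latches it:
unsigned short nextPC(unsigned short pc, int isJMP, int isBEQ, int aluZero,
                      short beqOffset, unsigned short jmpOffset) {
    if (isJMP) {
        return jmpOffset * 2;           /* JMP: absolute offset, times 2 to stay word-aligned */
    }
    if (isBEQ && aluZero) {
        return pc + 2 + beqOffset * 2;  /* BEQ taken: relative to where the PC would normally go next */
    }
    return pc + 2;                      /* normal advance to the next 16-bit instruction */
}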
Why does the HW ISA use a 16-bit word size in RAM?
The choice of 16 bits as the word size in RAM is based on the fact that the HW ISA specifies a 16-bit word size for each instruction. If RAM used a different number of bits per word, you’d have to load either multiple words per instruction or only part of a word per instruction, vastly complicating the design either way. But you can easily build RAMs with any word size you want: you just need that many data input and output wires, and then you need to group the internal D flip-flops together in groups of that size.
This choice of 16 bits affects the rest of the CPU as well: the register file has 16-bit registers so that we can load or store one word at a time. This also needs to match up to the ALU, so that it can operate on two register file values and send back a result that fits into a register file. In each case, there are 16 wires that connect from one chip to another. There are other systems which organize data on a wire as pulses over time which might need some kind of synchronization mechanism to tell the next chip “I’m done now” but using parallel wires like our designs do, the number of bits is built into the physical design of the chip.
How is the number of address bits related to the total size of a RAM chip?
The number of address bits for a RAM chip places an absolute limit on the number of data items it can contain, since with N bits, there are 2^N different possible bit patterns, so you can’t distinguish any more than that number of data items. This is also the formula for deducing either the number of address bits from the RAM size, or the RAM size from the number of address bits. Almost without exception, a RAM will always choose to have exactly that many data items, although in principle it could have fewer. In the simplest setup, each “data item” is a “word” (16 bits for the HW ISA), and so the total size of the RAM is the word size times the number of distinct addresses.
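As a quick worked example for that simple word-addressable case (the numbers here are mine, just for illustration):
unsigned addressBits = 8;
unsigned numAddresses = 1u << addressBits;          /* 2^8 = 256 distinct addresses */
unsigned bytesPerWord = 2;                          /* one 16-bit word per address  */
unsigned totalBytes = numAddresses * bytesPerWord;  /* 256 * 2 = 512 bytes of storage */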
In the HW Arch, we actually have a more complicated case, where the RAM is “byte addressable” meaning that each set of 8 bits has its own address, even though we read and write 16 bits at a time. There are both advantages and disadvantages to this approach that I won’t go into detail on, except to mention that when representing English text, 8 bits is a common amount of data used to represent a single letter. The effect that this has on RAM size is that the full size of the RAM is now the addressable size times the number of addresses, rather than being the word size times the number of addresses. In our RAM, with 8-bit addressing and 16-bit words, this means that the size of RAM is only half as large as it would be if it were strictly word-addressable, but we have more flexibility in being able to read starting in the middle of a word.How can I practice drawing circuit diagrams?
There are multiple things you might be asked to draw a diagram of: you might be asked to draw a diagram based on a truth table, a boolean algebra expression, and/or an English description of how some components fit together. For the first two, you can convert between truth tables and boolean algebra expressions. I think if you’re given a truth table, converting to an expression is best because the expression can be directly translated into AND, OR, and NOT gates: Just use order-of-operations to figure out which operations happen in what order, and draw a gate for each one.
For an English description, just try to follow the description as best you can.
To practice drawing from a description, you can go to any of the diagrams in the lab pages and right-click to access the alt text (your browser should have some means of doing this). The alt text will be a description of how that circuit is connected, so you can try drawing from that description without looking at the image, and then compare your result with the image when you’re done (do let me know if you find an inconsistency!).
To practice with drawing circuits from boolean algebra expressions, you can do the following: Draw out a 4-row truth table with two inputs A and B and an output Q. Fill in all four input possibilities, and then randomly put ones and zeros in the output column. Then convert to a boolean algebra expression and try to draw the circuit. You can build it in the Falstad simulator according to your drawing, and then use the simulator to test against your truth table to see if you got it right. You can also do this with 3 inputs (and so 8 rows), which will be a bit better practice.
What does the JMP instruction do?
The JMP instruction lets us move from one part of the code to another. The BEQ instruction can also do this, but the BEQ instruction has a limited range, because the offset is a 4-bit signed value, so it can jump at most 8 instructions backwards (or 6, if you don’t count BEQ and the instruction after it) or 7 instructions forwards (8 if you count the instruction after BEQ). So JMP allows us to jump to a very different part of the code, since it can accept a 12-bit offset, and that offset is absolute, rather than relative. Remember that we always multiply the offset by 2, to ensure that the resulting PC address is even, because each instruction word (16 bits) takes up two address slots (8 bits each), so the address of the start of each instruction is a multiple of 2.
We’ll see some other related instructions that are fancier in the next part of the course, but key uses of the JMP and related instructions are to do function calls and returns: If you think of each individual function being written as assembly code somewhere in the instruction memory, a JMP can take us to the start of one of those blocks to call the function, and (with a bit of an extra twist) a JMP at the end of that function can serve as a return to go back to where the function was called from. The twist is that the same function needs to return to multiple different locations, based on where it was called from, so we can’t just use a normal JMP instruction (and the HW ISA can’t actually be used to implement full function calls like you’re familiar with from Python or Java).How can we make sure that the program will halt instead of going into an infinite loop?
With BEQ and JMP instructions, it’s possible to create infinite loops. Of course, sometimes that’s what you want, like a program that’s always waiting to respond to input and only stops when you shut off the computer.
But to ensure that it does halt, you need to ensure that the sequence of BEQ and JMP instructions will always eventually lead to a HALT instruction. As it turns out, in the abstract this is a tricky theoretical problem, and if you are given a program supplied by an untrusted source, it can be prohibitively resource-intensive to prove whether or not it will eventually halt.
CS 235 “Theory of Computation” goes much deeper into this question than we will in this class.
Why is the register file faster than memory?
I don’t know all the specifics of the answer to this question, but some reasons include:
We ignore this detail in our circuits in this class, but a 1 or 0 indicated on a single wire may be insufficient to actually trigger all of the places that wire connects to if the wire connects to many different places. If you think of each output end of a wire as needing to “do some work” to assert either a 0 or a 1, and the input end as “doing some work” to assert 0 or 1, if there’s one input and thousands of outputs, the work of the input might not be sufficient to trigger each output as it is supposed to. This can be fixed with extra repeater gates, but…
We mostly ignore this detail in this class, but each gate in a circuit (really, each transistor) takes some time to operate. And the next gate in the circuit won’t be receiving valid inputs until the previous gate settles, so in a circuit with hundreds or thousands of gates, this propagation delay makes the whole circuit take more time to reach a valid state. This is called “gate delay.” A register file is small, and thus has a limited gate delay cost. A much larger RAM doesn’t necessarily seem to have a longer longest-single-path (which is what matters for gate delay) but in practice because of the repeaters from point 1, it may have a much higher delay.
A third consideration (which again we’ll ignore in this class) is that while the register file is built directly into the CPU, the RAM chip is actually a separate circuit on another board (this is why you can add more RAM to your computer by plugging in a new chip). There is some delay related to the communication between the two chips, which happens over a PCI bus that also has other things using it.
There are more issues, like the difference between SRAM and cheaper-but-slower DRAM, which I don’t fully know the details of. You should now be equipped to understand a lot of them though if you want to dig deeper on Wikipedia and/or in our textbooks.
In practice, modern CPUs use multiple layers of “caches” which are separate RAM chips that are closer to and further away from the CPU, usually with smaller sizes and more expensive/faster construction closer to the CPU. These act as temporary storage for data from the main RAM, in (common) cases where you need to access the same memory over and over again, so that you don’t have to go all the way to main RAM each time. How these work is pretty complex, and we won’t cover it in too much depth in this course, but it is often relevant to the question of “why does my program execute so slowly compared to that other program.”What is the big picture in terms of how things are connected in the CPU?
The full CPU schematic diagram from the lecture slides is pretty much the big picture in terms of how things are connected: The PC connects to the instruction memory, which feeds instructions to the control unit and the register file, the register file outputs go into the ALU which may feed an address to data memory, and the ALU result and/or RAM data output feeds back into the register file. On a separate path, the BEQ and/or JMP logic modifies the PC for those instructions. The control unit feeds control signals to each of these systems to get them to perform their part of the current operation.
How are numbers represented using voltage patterns?
The registers in RAM and in the register file store voltage patterns: they have a series of 16 D flip-flops, and those D flip-flops are each remembering a value in the sense that their output remains at +5V or ground, depending on their previous inputs. These can be updated by pulsing their clock lines as necessary. How do those voltage patterns (a series of wires either charged to +5V or discharged to 0V) represent numbers? We treat each wire as a single bit, with high voltage (above +2.5V) meaning 1, and low voltage (below ~2V) meaning 0. Then we read those 16 1’s and 0’s in sequence to produce a binary number, and use the conventions of binary notation to translate that into a familiar decimal number. So for example, if all of the latches have low-voltage output except the 3rd, that’s the number 0000 0000 0000 0100,
which in decimal is a 4. The circuitry is hard-wired to interpret some
of those numbers (so 4 might correspond to the 4th register in the
register file because it contains a decoder that makes that happen). In
other cases, we use symbol tables to translate those binary values into
shapes on a monitor with the corresponding meaning. Ultimately though,
the form that the data takes is these electrical charges on wires. So
for example, if we disrupt those by introducing a stronger electric
charge or in some cases a magnetic field, we will corrupt the data.
Does a computer usually have just one CPU?
Most modern CPUs have what are called “multiple cores” which effectively means several mini-CPUs bundled together, and these cores execute things in parallel to each other (most often, they’re simply working on different programs, but sometimes a single program can be using more than one core). In addition, a chip from the Intel 386 line loads more than one instruction at a time, and executes these instructions in parallel using multiple ALUs. This is extremely tricky, since this means that the instructions you wrote are NOT being executed in order, and the CPU is responsible for ensuring that everything will appear to have happened in order at the end of the day. Look up “pipeline stall” if you want to dive deep down that rabbit hole. For this class, we’ll just pretend we’ve got a single CPU executing one instruction at a time, and that it’s only working on the program we wrote.
Lab 6/7
How does modifying a value differ from modifying the dereferenced version of a pointer pointing to that value?
In code, we are comparing situations like this:
int x = 0;
x += 1;
vs. modifying by dereferencing a pointer, like:
int x = 0;
int *xAddr = &x;
*xAddr += 1;
As far as I know, there are no differences whatsoever between these
two cases. The compiler should produce almost identical assembly code
for each. One caveat is that the second code may end up forcing
x
to be stored on the stack, since we need its address. But
that shouldn’t change the effects of the code at the C level.
Note that if the pointer type of the pointer doesn’t match the type
of the variable, there may be differences. The following code uses a
char*
and so when we modify the dereferenced value, we
modify only 1 byte (notice that a cast is necessary to create the
differently-typed pointer):
int x = 0xffff;
char *xAddr = (char*) &x;
*xAddr = 0;
x will be 0xff00 rather than just 0.
What are malloc and free used for? Why do we have to cast the result of malloc? What arguments do malloc and free take and what do they return?
malloc
is short for “memory allocate” and it does just
that: allocates (reserves for our use) some memory. To build a metaphor
here, imagine you’re in a review session with a bunch of other students,
and you each have a different problem to solve. You could just solve it
in your head, but there are whiteboards in the room, and if solving it
in your head means using registers only, there are times when there’s
too much information and you need to write stuff down (i.e., use memory
instead of a register). But with so many students (functions) in the
same room, you need a way to share the whiteboard. While
malloc
uses the “heap”, it’s worth considering for a moment
the other memory pool: the stack. Each function gets its own stack
frame, so in our metaphor each student gets some reserved whiteboard
space. However, after a student is done solving their problem, we erase
that space so that other students can use it. Anything stored on the
stack is theoretically invalid after the function returns. That’s where
the heap comes in: In our metaphor, it’s an extra part of the board
where students can write info from their solution that’s useful for
other students. If you’re all solving related problems, sometimes part
of the solution to one problem is useful in solving another. For this
part of the board, not every student needs to use it, so we don’t
allocate space for each student. Instead, when a student needs to use
it, they can just walk up and draw a box to reserve some space, then put
the information they want to share in that box. Later, if someone who
needed that information decides it’s no longer necessary, they can erase
the box, so that others can use that space. So this extra whiteboard
space that anyone can use is called the heap, and it’s where we store
values that need to be shared between functions.
malloc
in our metaphor is walking up to the whiteboard
and drawing a box. When we do it, we need to determine the amount of
space we need, which is why we need to specify that when we call
malloc
. malloc
returns a pointer to the
allocated space; we usually want to cast that pointer to a specific
type, based on what we plan to put in that space.
free
in our metaphor is erasing the box so others can
use it. If we forget to call free
, the heap will fill up
with junk values, and someone who needs to use space will end up unable
to run because there’s no space left. This is called a “memory leak,”
and valgrind
is the tool that helps us detect memory leaks.
The complicating factor is that since we use the heap to share
information between functions, the function that calls free
is often a different function than the one that calls
malloc
. This makes it extra easy to forget to call
free
when necessary, or to do something like accidentally
calling free too early, before someone else comes along and tries to
read the value we just erased. While in the whiteboard metaphor freeing
actually erases the data, in the computer, freeing just marks the memory
as fair game for re-use, and it may or may not be erased.
In general, since using malloc
is complicated, if you
can avoid it you should. But any time you need a value that’s bigger
than a single return can carry to pass from one function to another,
you’ll need to use it, or use other techniques that are almost as
complex (like passing a pointer into the second function which it uses
to write its answer into pre-allocated memory).
The specific usage of malloc
is that you give it a
number of bytes, and it returns a pointer to allocated space where that
many bytes are reserved. Because it doesn’t know what you wanted the
bytes for, it always returns a void*
generic pointer. This
is why you almost always cast the result to indicate what type it is,
and use sizeof
in the argument to compute the correct
number of bytes for a certain number of values, like:
int *data = (int*) malloc(sizeof(int)*10);
For free
, you just give it the pointer that
malloc
returned and it marks that memory as
no-longer-in-use. In this example you’d do:
free(data);
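Putting the whole lifecycle together, here is a minimal sketch (error handling kept deliberately simple):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *data = (int*) malloc(sizeof(int) * 10);  /* reserve space for 10 ints on the heap */
    if (data == NULL) {                           /* malloc returns NULL if no space is left */
        return 1;
    }
    for (int i = 0; i < 10; i++) {
        data[i] = i * i;
    }
    printf("%d\n", data[9]);                      /* prints 81 */
    free(data);                                   /* mark the space as reusable; data is now invalid */
    return 0;
}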
How does C remember types? Does C allow us to modify the type of a declaration?
The compiler keeps tables that it fills in as it reads your code with
the memory address (or register used) for each variable, as well as the
types that they have. When producing assembly code, it uses these tables
to pick which registers/addresses to use. It also uses the type
information to pick which variants of assembly instructions to use, and
whether to use full-size or part-size registers (e.g., using
%eax
for an int but %rax
for a pointer). We
won’t dig into that process in this class, but a compilers class
would.
Why is the ptrdiff_t type useful?
When computing the result of subtracting two pointers, we get a value of type ptrdiff_t rather than the same type as the pointers. Why not just have char* - char* → char*?
One reason is that pointers are unsigned, since there’s no such thing as negative addresses. But when we subtract, we care about direction, not just difference, so the value is signed.
But why not just use int if we want a signed integer difference? The problem is that the int type is often 4 bytes, while a pointer might be 8 (this is the case for code we’ll write in this class). We’d like a signed integer type that is at least as big as a pointer itself, and that’s what ptrdiff_t is.
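Here is a small example (mine, not from the lab) showing where the type shows up; printf’s %td specifier is the one meant for ptrdiff_t values:
#include <stddef.h>
#include <stdio.h>

int main(void) {
    int data[5] = {1, 2, 3, 4, 5};
    int *start = &data[0];
    int *end = &data[4];
    ptrdiff_t forward = end - start;   /* 4: measured in int-sized slots */
    ptrdiff_t backward = start - end;  /* -4: direction matters, so the type is signed */
    printf("%td %td\n", forward, backward);
    return 0;
}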
How can we address a whole block of memory with a single number?
The short answer is: “we can’t.” Each address specifies a particular byte in memory. For anything bigger than a byte, we can’t really “point to” the whole thing, without using two numbers: one for the start, and another for the length (or one for the start, and another for the end). In practice, pointers themselves always point to the starting byte of whatever they’re trying to indicate. Then the compiler keeps meta-information about the size of that value, so that it knows how many bytes to access at once (e.g., 4 bytes together for an int).
In the case of arrays, the compiler isn’t helpful, so we need to keep around extra variables holding their lengths. Alternatively, as with strings, we can have a “sentinel” value that we place at the end, and if we want to know the length we just iterate until we find that value.
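For instance, here is a sketch of the sentinel approach as used by C strings, where '\0' marks the end:
int stringLength(const char *s) {
    int len = 0;
    while (s[len] != '\0') {  /* '\0' is the sentinel that marks the end */
        len++;
    }
    return len;
}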
Why does writing past the end of an array (e.g., replacing ptr[20] with 'B' after declaring char ptr[3];) cause a segmentation fault?
Segmentation faults can happen for several different reasons:
- Code tries to write data to a memory address that’s read-only.
- Code tries to read data from an unreadable address (like NULL).
- Code tries to execute instructions from an address that’s not marked as executable (e.g., trying to jump to a location that’s not part of the program’s code).
In the case from this question, the function’s return address was stored on the stack near its local variables, and the spot where we wrote 'B' relative to
where the array was stored on the stack coincided with that return
address. By corrupting the return address, we cause the code to jump to
the wrong place when it returns, and if the place it jumps to isn’t even
part of the code any more, we get a segmentation fault.
When do we apply scaling to pointer operations?
The best way to think about this in my opinion is to consider all addition and subtraction operations with a pointer value to take place in units of “slots,” and then based on the size of the thing pointed to, that determines the size of a slot. So if we have:
int array[] = {10, 11, 12, 13, 14};
int *ptr = array + 2;
In this case, the pointer will point to the third slot in the array,
or to the number 12. In raw address terms, that’s 8 bytes beyond the
address of the base array, and that’s where pointer scaling comes in:
the actual value of 2 was multiplied by the stride of 4 to get +8 for
the raw byte address. However, when doing subtraction, we effectively
apply scaling in reverse. If we look at numerical values, ptr is 8 more
than array. But if we ask C to print out ptr - array
the
subtraction happens in terms of “array slots” and the result will be 2,
not 8. The reason for this is that we’d like the following equation to
be true: array + (ptr - array) == ptr. In C, that
. In C, that
expression does the (ptr - array)
calculation in terms of
“int-sized slots” to get 2 as the value, so that when it adds 2 to array
which it also does in terms of “int-sized slots” it gets the correct
value for ptr
.
Note also that &array[2] == ptr is true, because that’s the same as &*(array + 2) == ptr, and the & and the * cancel out.
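If you want to see the scaling yourself, here is a quick check you can compile (it assumes 4-byte ints, which is what we’ll see in this class):
#include <stdio.h>

int main(void) {
    int array[] = {10, 11, 12, 13, 14};
    int *ptr = array + 2;
    printf("%d\n", *ptr);                        /* 12 */
    printf("%td\n", ptr - array);                /* 2: subtraction happens in int-sized slots */
    printf("%td\n", (char*)ptr - (char*)array);  /* 8: the same difference measured in raw bytes */
    return 0;
}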
What happens if I assign a pointer to a local variable on the stack and then dereference that pointer after the function has returned?
When a function returns, we move the stack pointer back up above its stack frame, but as soon as another function gets called, the stack pointer moves back down, across the same memory region. The new function then will write its own data into that region, erasing the data written by the old function. At that point, anyone who uses a pointer to the old data will see the new data instead. For example:
int* readThree() {
    int results[3] = {0, 0, 0};
    scanf("%d %d %d", &results[0], &results[1], &results[2]);
    int *r = results; // otherwise the compiler substitutes NULL for return
    return r;
}

int main(int argc, char** argv) {
    int* first3 = readThree(); // creates frame; then returns
    // first3 is technically invalid here, but the data is still there
    // ...until another function is called
    int* next3 = readThree(); // same frame position as first call
    // (printf itself doesn't use enough of the stack to pose a problem)
    printf("First 3 address: %p\n", first3);
    printf("Next 3 address: %p\n", next3); // prints same address
    printf("First 3: %d, %d, %d\n", first3[0], first3[1], first3[2]);
    printf("Next 3: %d, %d, %d\n", next3[0], next3[1], next3[2]);
    // same numbers from second call are printed twice
    return 0;
}
"1 2 3" the first time and "4 5 6" the
If we were to run this and enter , the prints would print out "4 5 6" twice, and it would show
second timefor the two return values. Without the extra pointer
the same address , compiling it will also produce a complier warning about
variable. When it prints that
returning a pointer to stuff allocated on the stack, our currrent compiler also sets the return value to NULL,
warning"undefined
becuase since returning a pointer to a stack variable is " the compiler can do whatever it wants in that case. So if we
behavior, we'll just get a segmentation
run this program without the `r` variable. fault
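One way to avoid the problem (a sketch, connecting back to the malloc discussion above) is to put the results on the heap so they survive the return; the caller then owns the memory and must eventually free it:
#include <stdio.h>
#include <stdlib.h>

int* readThree(void) {
    int *results = (int*) malloc(sizeof(int) * 3);  /* heap memory outlives this function call */
    if (results == NULL) {
        return NULL;
    }
    scanf("%d %d %d", &results[0], &results[1], &results[2]);
    return results;  /* returning a heap pointer is fine; returning the address of a local array is not */
}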
How big are the stack and the heap?
The stack usually has a fixed maximum size in the megabytes range. Given that most function call frames take a few dozen bytes, this is usually sufficient for thousands to hundreds of thousands of layered function calls, and since each function gives back space when it returns, the only case where you might be worried about the stack size limit is when using recursion and layering lots of stack frames. The heap on the other hand is usually only limited by the physical memory (and on 32-bit systems by the address space size). A 32-bit system is usually limited to ~3GB of heap, while a 64-bit system can use up to all of the physical RAM available (until we develop systems with exabytes of RAM…).
There’s actually a virtualization layer that allocates real physical memory between different programs on your computer. So if you have 8GB of RAM and Firefox is using 4GB, another program you run may encounter an out-of-memory error if it tries to allocate 6GB.
See this StackOverflow answer for more details.
Why do we have to allocate memory? Why don’t we do this in other languages like Java or Python?
At a most basic level, if we want to use memory to store things, we
need to allocate it, so that two different functions or programs don’t
try to use the same memory for different purposes. The stack conventions
provide an easy way to allocate memory for a particular function call:
Just declare a local variable. However, that’s not good enough if we
need to create values that will survive beyond the return from a
function, like strings, arrays, or other objects that need to be passed
around between functions. In those cases, we use malloc
to
ask for a permanent allocation, and then when we’re done with that
memory, we use free
to release it so that someone else
could use that memory later.
In languages like Python, Java, or Javascript, we have a garbage collector and an automatic memory allocation system. Allocation is handled for us automatically, but to free it, we need to spend some time on “garbage collection” now and then. Those languages are in many cases 2-100 times slower than equivalent C code (with garbage collection being one part of that). Now, most of the time, we simply don’t care, and it’s better to use the garbage-collected language, because there are all sorts of errors that we can make when allocating memory, and for many programs, they spend most of their time waiting for user input or network data, not running their own code (so speedup doesn’t matter).
For some real-time applications, like videogames, performance actually does matter, and these are more typically written in C++ or another low-level language. Always try to judge whether a performance improvement actually impacts the user experience before reaching for the “fastest” language.
That said, understanding how memory is allocated by having to do it manually is super useful in understanding various things about the performance of automatic allocation systems. For example, knowing how strings are allocated can help you understand why certain Python programs would be faster than others. For example, these two functions:
def f(n):
    return 'a'*n # string multiply can allocate once

def g(n):
    result = ''
    for i in range(n):
        result += 'a' # needs to copy + reallocate each time
    return result
Both functions produce the same string of n ‘a’ characters, even if one is faster or slower. If we understand how strings are represented in memory, we can see that the second function, each time it adds an ‘a’, is going to have to copy the entire result string to a new memory location that has room for an extra ‘a’. Since it does that in each iteration of the loop, the actual performance of g is proportional to n squared!
Is using pointers more efficient than using array indexing?
No. They will become exactly the same assembly code when compiled, so there’s absolutely no difference in terms of what the CPU will do.
We’re having you use pointers instead of indexing in order to give you more practice using pointers, so that you can actually understand them.
Can we implement more complicated data structures like heaps or sets in C using just pointers?
Yes. No matter what language you’re working with, ultimately memory is used to hold data, and any pattern of memory access can be written up using pointers.
Sets are somewhat complex, but there’s actually a very nice and simple heap and/or binary tree data structure that uses a linear array of memory and which can be implemented using pointers quite easily.
What is a segmentation fault?
The computer’s memory is divided up into different “segments.” One of these holds the actual machine instructions to be executed, another holds the heap, another holds the stack, and other segments provide space for things like system calls. Each segment has certain permissions: can we read from it, can we write to it, can we execute code from it? A segmentation fault happens when we attempt to do something that’s forbidden, like write to a read-only segment. Most often, we will see:
Attempts to read from or write to the base segment, which is a small area of memory at the very low end of the address space that’s neither readable nor writeable. This segment exists to protect us from accidentally using a number value as if it were a pointer. If we say something like:
int x = 3; int y = *(int*)x; // the cast is needed to even get this to compile
This would attempt to read memory at address 3. You’ll get a segmentation fault because of that protected base segment. Most commonly, this will actually happen because you are using NULL as a default value for a pointer and accidentally tried to use the pointer before it received its correct value.
Attempts to execute code from the stack or heap. The stack is marked as non-executable, so trying to interpret data stored there as machine instructions will be a segfault. This usually happens when return addresses stored on the stack get corrupted, which shouldn’t happen much in practice, but which will happen in this class because we’ll be teaching you about how to corrupt the stack.
Why do pointers have to know what kind of value they point to? Aren’t they all just memory addresses?
Yes, as values, they’re just unsigned integers (64-bit on a 64-bit machine, or 32-bit on a 32-bit machine). However, the compiler treats them differently when adding or subtracting, according to the kind of value they point to. This is a good thing as it makes array formulas work smoothly, but it does require us to distinguish the different pointer types. These types also affect how much memory is read when a pointer is dereferenced, even if we’re not using arrays. If I have a “pointer to an integer” and I want to read “what it points to” I need to read 4 bytes. However, if I had a “pointer to a character” with the same address value, I’d only read 1 byte when reading “what it points to.”
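To make that concrete, here’s a small added example (not from the lab materials) showing the same address treated as an int* versus a char*:
#include <stdio.h>

int main(void) {
    int arr[3] = {10, 20, 30};
    int* ip = arr;
    char* cp = (char*)arr;
    // Same starting address, but "+1" means "one element", not "one byte":
    printf("ip: %p, ip+1: %p\n", (void*)ip, (void*)(ip + 1)); // addresses differ by 4
    printf("cp: %p, cp+1: %p\n", (void*)cp, (void*)(cp + 1)); // addresses differ by 1
    printf("*ip reads 4 bytes: %d\n", *ip); // 10
    printf("*cp reads 1 byte: %d\n", *cp);  // the lowest byte of 10 on a little-endian machine
    return 0;
}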
How can you tell the difference between the *
type modifier
and the *
operator?
The main thing to check is whether it’s part of a type declaration for a new variable. Casts (which
also use the type definition syntax) are always in parentheses,
and the unary *
operator can never be the last thing before
a closing parenthesis. It’s extra confusing because the *
“operator” can appear on the left-hand side of an equals sign, which is
not usually a place that an operator can appear, and type declarations
can also appear there. But a *
for a type declaration will
be directly after a type word, or directly before a variable name.
Unless you’re declaring multiple variables on one line, just look to see
if there’s a type name before the *
. With multiple variable
declarations (or even a single declaration), you cannot also include a
*
operator on the left hand side of the equals sign.
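Here’s a small added example (not from the lab materials) with comments marking which role each * is playing:
#include <stdio.h>

int main(void) {
    int x = 7;
    int *p = &x;        // type modifier: declares p as "pointer to int"
    *p = 8;             // operator (even on the left of =): writes through p, so x is now 8
    int y = *p + 1;     // operator: reads through p, so y is 9
    char *c = (char*)p; // the * in (char*) is part of a cast (type syntax): it's the last thing before ')'
    printf("%d %d %d\n", x, y, *c); // this last * is the operator again
    return 0;
}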
What are “little-endian” and “big-endian?”
When storing an integer, which is 4 bytes, we have to decide how to
store those bytes in memory. Let’s say we want to store the value
0x12ab34cd
at address 0x40000
. Do we store
0x12
at address 0x40000
, 0xab
at
address 0x40001
, 0x34
at address
0x40002
, and 0xcd
at address
0x40003
? Or do we do the opposite, and store
0x12
at address 0x40003, 0xab
at
0x40002
, 0x34
at 0x40001
, and
0xcd
at 0x40000
? This decision doesn’t really
make much difference in terms of performance: either way we’ll load all
4 bytes at once; either way we’ll do things like carry between them in
an adder, just in a different order. However, if two different systems
make different choices, they won’t be able to directly transfer data to
each other in binary: They’ll have to convert that data to swap around
the bytes first. So even though within a particular design it doesn’t
have big performance implications, differences between designs do.
The word “endian” comes from Gulliver’s Travels, in which there is a society deeply divided about whether eggs should be cracked from the little end or the big end. In other words: it’s a reference to making a big deal out of an ultimately unimportant distinction. There aren’t really good arguments for why one way is “better” than another, but people are still sharply divided into two camps about it.
Remembering which one is which is hard, but a mnemonic that you can
try is that “little endian” means that the “little” part (i.e., the
least significant byte, which is 0xcd
in our example above)
is stored at the “end” (i.e., the lowest address). Equivalently, you can
remember that the end of the number (still 0xcd
is stored
at the little address. Big endian is the opposite: the end of the number
0xcd
is stored in the byte with the biggest address, or
equivalently, the biggest (most significant) part of the number
(0x12
in our example) is stored at the end (at the smallest
address).
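If you want to see which kind of machine you’re on, here’s a small added sketch (not from the lab materials) that prints the bytes of an int in address order:
#include <stdio.h>

int main(void) {
    unsigned int value = 0x12ab34cd;
    unsigned char* bytes = (unsigned char*)&value; // view the same 4 bytes one at a time
    // On a little-endian machine this prints: cd 34 ab 12 (least significant byte first).
    // On a big-endian machine it would print: 12 ab 34 cd.
    for (int i = 0; i < 4; i++) {
        printf("byte %d: %02x\n", i, bytes[i]);
    }
    return 0;
}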
Why does C use a sentinel value for strings, but require a separate variable to store the lengths of arrays? Why not just use one mechanism or the other for both?
I don’t actually know the history behind why different choices were made in these cases, but there are certainly tradeoffs between them. Note that if you want, you can create a character array that isn’t NUL-terminated, and store its length separately, although this may cause headaches if others don’t realize what’s going on.
For sentinel values, not having a separate length variable simplifies the code that deals with them. However, you pay a cost: you need to allocate one extra byte per string, and strings are no longer allowed to contain an all-0 byte as part of their data. That second cost is not really acceptable for arbitrary arrays, since not being able to store the number 0 in an integer array would be pretty silly. Of course, you could pick another data value as the sentinel, but things get complicated if each different kind of array needed to use a different kind of sentinel value. Also, because we often do arithmetic on numbers (but rarely on characters) the chance of accidentally producing the sentinel value would be greater for arrays of integers.
In contrast, while arrays do require more complicated programming to pass around extra variables, they don’t come with any space overhead or value restrictions. Especially if you have lots of very small arrays, avoiding the space overhead can be meaningful (imagine 1,000,000 2-item arrays representing x, y coordinates).
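As a small added illustration (not from the lab materials) of the two conventions side by side:
#include <stdio.h>

// Strings: the NUL byte is the sentinel, so no length variable is needed.
int stringLength(const char* s) {
    int n = 0;
    while (s[n] != '\0') { n++; } // stop at the sentinel
    return n;
}

// Integer arrays: 0 is legitimate data, so we pass the length separately instead.
int arraySum(const int* a, int length) {
    int total = 0;
    for (int i = 0; i < length; i++) { total += a[i]; }
    return total;
}

int main(void) {
    int nums[4] = {3, 0, -2, 5}; // a 0 in the middle is fine because we track the length
    printf("%d %d\n", stringLength("hello"), arraySum(nums, 4));
    return 0;
}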
How do gdb
and valgrind
help in debugging C
programs?
gdb
is a debugger, so it lets us set breakpoints and get
lots of information about what’s going on. If your program crashes,
running it with gdb
can let you figure out exactly what was
going on when it crashed, and also inspect the variable values at that
moment that might have led to the crash.
valgrind
double-checks all of your memory usage, in
particular malloc
and free
. It will report
errors when you make mistakes, and it tells you a good amount of detail
about what happened and which parts of the code are involved. Since
memory mistakes are incredibly easy to make, but often only cause
problems a small percentage of the time, valgrind
is pretty
important!
Lab 8
What’s the difference between info reg
, print
,
and examine
(or x
) in gdb
?
info reg
tells you the value of a register. You only need
to give it the register name, like info reg rdx
.
print
can print many kinds of values, from C variables to
register values. You need to prefix a register name with $
for print
because it’s not designed to work with only
registers (you might have a C variable that shares the name of a
register, for example). In contrast to info reg
and
print
, examine
interprets its argument as a
memory address, and shows you what’s in memory there. It accepts a wide
range of expressions just like print
does, including C
variable names, so to use it with a register, use $
followed by the register name, as in x $rdx
. For
x
you usually want to have some idea of what kind of values
will be in memory so that you can tell it how to format them. For
example, x /s $rdx
will interpret the data as a string,
translating bytes into characters via the ASCII table and printing those
characters until it hits a NUL byte. On the other hand,
x /2xg $rdx
will print two “giant” (8-byte) values as
hexadecimal numbers, starting at the target address. Use
help x
within gdb
to get details on the
formatting options.
How do you see what’s on the stack in gdb
?
Lab 9 has several examples of this, but the basic idea is: the stack
is in memory, so we can use examine
(or just
x
) to view that memory. Of course, we have to know where in
memory we want to look, but since %rsp
stores the stack
pointer, we can use $rsp
as our target address to see
values starting from the bottom of the stack.
x /20gx $rsp
will show you 20 8-byte slots on the stack,
and many things stored on the stack are 8 bytes (any pointer, including
return addresses, will be). But other things like integers or characters
might be less than 8 bytes, so they’ll be jumbled together as one big
hex number. You can practice reading out those details from looking at
hex numbers (although just from the number you can never be 100% sure),
or you can consult the assembly code to figure out what values it stores
on the stack in what order, and thus know what to expect. For example,
if you know that 4 integers are stored on the stack starting 8 bytes
above the stack pointer, you could do x /4wd $rsp+8
to
print them out in decimal format. Use help x
in
gdb
to see the details of the formatting commands.
How can we protect vulnerable information like a password from being stolen by someone who has access to our assembly code? Can hackers take over any program they want by disassembling code?
There are two main avenues of approach here (although a computer security class can go into a lot more detail). The first one is instead of giving people code, run the code on your own computer and sell access to a web interface where they can send messages to the code and get messages back. Most modern apps are trending this way, and almost anything with a login these days involves at least one server besides the computer or phone actually using the app. This is complicated, but you get a lot more control over things. Sadly, this includes the ability to do a lot of nefarious stuff, like sell off the user’s data to make some extra profit, or force them to enable extra “features” they weren’t expecting and didn’t want.
The second approach is to use encryption to protect key information, even though you’re giving out a program people can download. For something like a password, there are encryption routines that can compute a value called a “hash” that’s based on the password, using a function that takes only a few seconds to run, but where running the reverse function to compute the password based on the hash would take millions of years. The math behind this is pretty cool (and out-of-scope for this class) but as one example, consider factoring a 100-digit number that was created by multiplying together 3 prime numbers, one of which was 50 digits long. If I tell you what the numbers are, it’s easy to multiply them and verify that they do indeed multiply to the correct value. But if you use a disassembler to look at the code and see that the correct password involves entering three numbers that are each prime and which multiply together to create the stored target value, you’re out of luck: actually finding what those numbers are requires searching through so many different possibilities it will take effectively forever to do it.
Now, there are still some attacks against this method, typically involving making smart guesses about what the factors could be based on knowledge of what kinds of passwords people typically use. And of course, if the password is trying to prevent users from executing some code and the code itself isn’t properly encrypted, you could edit the assembly code to jump directly to the code you want to run and skip the password check, so getting it right is pretty tricky.
How do you understand which lines of assembly code and/or registers are used as part of a loop?
One of the first things to do when looking at a function in assembly code is to map out the jumps by identifying which ones go where, especially those that are internal to the function rather than jumps to call other code (although those are important too). From that, you’ll be able to identify loops because they involve one or more jump instructions where given the right condition results, we can jump in a loop from one instruction back up and eventually down again to the same instruction. In the simplest case, there’s just one conditional jump that goes upwards in the code, and the code will then flow back down to that jump until eventually its condition becomes false.
Once you’ve identified any loops in the code, look at the comparisons (typically cmp
or test
instructions) right
before the jumps to figure out which registers and/or memory addresses
are used to determine when they stop or continue. Remember that
especially when test
is used, the literal name of the jump
instruction often doesn’t totally apply. Focus on which flags the
comparison sets, and which flags the particular jump instruction tests
(for example, je
tests the zero flag, so when following a
test
, it’s testing whether the result was zero, NOT whether
two things were equal).
If we have access to the full source code, what are the benefits of looking at the assembly code?
Typically, looking at the source code is much better in this case, and I wouldn’t recommend looking at the assembly unless it’s absolutely necessary. If there’s a bug in your compiler, it may be necessary since the assembly might not faithfully implement the source code in that case. Hope that you never run into such a bug: they’re very nasty since you typically try to rule out every possibility involving errors in your own code first, which is exhausting. Also, unless you’re very unlucky, the compiler bug will end up being fixed in a newer version of the compiler that’s already available, so simply upgrading your version will solve things. Discovering a new compiler bug is even more fun :)
Besides situations where the compiler’s buggy, the only other advantage of looking at assembly over source code I can think of is if you’re programming a performance-critical application. In some cases where one particular function runs many times inside a loop, hand-optimizing the assembly code could be enough to improve runtimes by 25-50%. Just be aware that oftentimes working at the source code level to improve things like cache coherence can be just as important, if not more so, so hand-optimizing assembly code should only very rarely be the tool you reach for to improve performance.
That said, there are a lot of situations where you don’t have the source code: most programs you run are produced by private companies that keep their source code private as one means of protecting their secrets. A few examples of practical places you might want to use a disassembler:
- If you’re working to understand and counteract a new computer virus, reading the assembly code of the virus is necessary since the hacker producing it isn’t likely to give you the source code.
- If you’ve just found a new method of exploiting or taking over programs and you want to check whether common programs are vulnerable to it, you may need to disassemble those programs to check.
- If you’re modding a videogame, you won’t have access to source code, but you need to figure out where key data and functions are so that you can change what they do. Some games might come with a programming API or even be open-source, but that’s not the norm.
Is there a list of commonly used assembly instructions?
Yes. This memory diagram sheet also includes a list of common instructions. You can find the link on our tools page, along with a few other references. The Compiler Explorer also lets you right-click on any instruction to look up the reference for it.
What does cmp
(or cmpb
, etc.) do?
The cmp
instruction is short for “compare” but it
doesn’t actually do that. It effectively does that in combination with
one of the jump instructions, however. What it does do is subtract the
two operands, throw away the result, but set flags based on that result.
The flags are things like Zero, Sign, Overflow, etc. which we saw in the
HW ISA. Since cmp
sets the flags and instructions like
je
or jg
check them, they work together.
However, instructions like add
that store results in
registers also set flags, so the cmp
instruction typically
needs to be right before the jump instruction.
Since cmp
subtracts and je
checks the zero
flag (jumping if it’s set), together cmp
followed by
je
will jump if the things that were compared are equal.
But je
following a different instruction may have a
different meaning. The test
instruction works like
cmp
, except the operation it performs is bitwise AND (still
setting flags based on the result of that operation). This means that
jump instructions following test
will have different
meanings than usual.
cmp
can be used to compare different sizes of
data. For example, cmpb
would compare a single byte. You
can also see this reflected in which register names are used, since
partial register names will be used when dealing with smaller data.
How can we identify how many arguments a function uses by looking at its assembly code? Which registers are used to store function arguments?
We know that certain registers are used for specific arguments: the
mnemonic “Diane’s silk dress costs $89.” helps name %rdi
,
%rsi
, %rdx
, %rcx
,
%r8
, and %r9
as the registers for the first 6
arguments (with further arguments getting pushed on the stack, which
complicates things). So, for example, we can guess that a function probably takes two arguments if %rdi
and
%rsi
get used in the assembly code, but not
%rdx
. This is complicated by the fact that if a function
simply passes arguments on to another function without using them for
any other purpose, and it passes them on in the same order it received
them at the same positions, it can directly make the calls in the
assembly code without ever mentioning the registers. We’re unlikely to
ask you to deduce the number of arguments from assembly code in such a
tricky case.
What tools have we covered for debugging C programs, and which ones should we use for what purposes?
We’ve talked about gdb
and valgrind
,
including both more and less advanced ways of using gdb
.
We’ve also supplied testing files and had you write some tests so that
if there are errors in your program you’re likely to catch them. You
also know how to use printf
to print out values after using
#include <stdio.h>
. My typical debugging flow would
probably be:
- Run tests, and look for errors either from
valgrind
or from the tests themselves. Note relevant line numbers in my code. If things are crashing badly, run with gdb
and use backtrace
to get line numbers of where the crash happens. - Do a quick spot-check of the relevant line to see if there’s something simple I can fix, like a missing semicolon. Note that the error may be on the line above or below in some cases, especially with missing semicolons or curly braces.
- In all likelihood, I won’t see the error immediately, so I’ll throw
a
printf
in there right before the line where the error is. If I don't have any line number because a test is failing but the code isn't crashing, I'll try to figure out based on which test it is what part of the code might be wrong, and add printf
s there. If I really have no idea, I'll try to add printf
s to the first loop in my program to see if I can validate its behavior, and then after doing that, I’ll move on to the next loop, etc., trying to isolate the source of the issue. - From the printed output, hopefully I can figure out what’s wrong and how to fix it. This would be a good time to ask on Zulip and/or during office hours if there’s something that you just don’t understand, although checking relevant references beforehand is also a good idea (e.g., is there a function you’re not 100% sure about that you might look up?).
- If things are really tricky, I might consider using
gdb
to set breakpoints and step through the code.gdb
lets me pause at any step and also print out any values that might be relevant. You can also use display
to set up automatic prints that will be shown every time a breakpoint triggers. Combining this with a breakpoint on the first line inside a loop can be very powerful, because unlike print statements in the code, it’s easier to change them around when you figure out the next thing you want to look at.
Why do we use lea
for arithmetic sometimes? How is it
different from mov
? How does lea
relate to the
&
operator?
Basically, the CPU needs to do a lot of memory accesses, so in
addition to general-purpose math circuits like an ALU, it has a
specialized circuit to do pointer math. Since pointer math involves a
base pointer, an index within that array, a scaling factor, and
potentially an additional offset, this special-purpose circuit performs
two additions and one multiplication in a single step: offset + base +
index * scale. Since it’s designed to do exactly that sequence of
operations, it’s faster to use that circuit than to perform three
separate steps with a general-purpose ALU (especially since the
multiplication is restricted to scales of 1, 2, 4, or 8 so it can be
implemented using bit shifting). That means that if we have a sequence
of operations that turns out to fit into the form A + B + C * D (where D
is 1, 2, 4, or 8), it’s better to use lea
to do them all at
once.
One more advantage: lea
does not set flags based on its
result. This means that if for some reason you want to do some
arithmetic between a compare and the associated jump, you can use
lea
for that (see the answer on the cmp
for
more details).
In terms of differences from mov
, both lea
and mov
update their second argument based on their first
argument. mov
directly copies between them, and can use a
memory address as one of the two: either the source or the
destination. When mov
receives a memory address, it uses
the specialized address computation unit to compute the address based on
the values it was given, and then accesses that value in memory, either
reading from it into a register, or writing from a register into that
memory location. lea
always has a memory address as the
first argument and a register as the second, and while it uses the
memory address unit to compute the final address, it then loads that
address value directly into the destination register, rather than
actually reading from memory at that address. Since variables are often
stored at particular memory addresses, lea
computes a
pointer: it stores the address of the data into a register, instead of
storing the data itself.
In the HW ISA, mov
corresponds to both SW and LW, plus
it can be used to copy data directly between two registers. It just
depends on what kind of arguments it gets. lea
doesn’t have
an equivalent in the HW ISA, although it wouldn’t be hard to add one. We
already have the ALU set up to perform addition between a register value
and a constant from the instruction bits for the LW and SW instructions.
If we direct that value back into a register instead of using it to read
from memory, that would be the lea
equivalent. Of course,
it doesn’t support an index or scaling like lea
does, but
the HW ISA can’t do that for normal memory reads either.
lea
is usually used when we use the &
operator in C. If we do this:
int x = 7;
int y = x;
int* z = &x;
The assignment to x
might be implemented using
mov $7, 8(%rsp)
, which moves a constant value (also called
an “immediate” value) into a memory location (since it’s based on an
offset from %rsp
, we know it’s on the stack).
The assignment to y
might then be done using one
instruction: mov 8(%rsp), %rbx
. That’s assuming we’re just
using a register (%rbx
in this case) to store
y
. Unlike x
, y
’s address isn’t
used in this code, so it’s possible we can just store it in a register.
If we wanted to store it in memory like x
, we’d need two
instructions, since we can’t move directly from memory to memory. We
could do: mov 8(%rsp), %rbx
followed by
mov %rbx, 12(%rsp)
.
z
here might use lea
, since
it wants to compute the address of x
, but store that
address instead of reading the value there. Assuming we want to store
z
in register r10
, we could do:
lea 8(%rsp), %r10
. In a mov
command,
8(%rsp)
would refer to the value at that address, which is
7. In lea
however, we skip the memory lookup, computing the
address by adding 8 to %rsp
but then storing that address
value directly into %r10
. Note that we can’t do this with
an add
instruction, because add
stores its
result into the second operand, which must be a register. We want to add
to %rsp
, but instead of changing its value, we want to
store that result in a different register, which
lea
can accomplish.
Why do we need a stack pointer? What is the purpose of
%rsp
?
Each function call needs a scratch pad to store local variables that are active during the call. Whenever possible it’s nice if it can just use registers for this purpose, but sometimes there’s too much data to fit in registers, or it needs to pass a pointer into a function it calls referring to one of those variables, so that variable needs a memory address. We use the stack to provide this. Since the list of active functions at any point in time can be represented as a stack, where we push when a function is called and pop when it returns, that’s the data structure we use.
%rsp
stores the pointer to the end of the stack (we use the
lowest address and grow the stack downwards). The compiler knows how
much space each function call frame uses, so it makes sure only to use
memory from within that region (or if it needs more memory, to insert
assembly instructions that move the stack pointer down). With that
restriction, no coordination is needed between functions about which
part of the stack to use: each function always uses whatever is at the
bottom of the stack. If it calls another function, that function will
use stack space even further down, but will clean all that up and
restore the stack pointer by the time it returns, so the calling
function won’t need to respect space below its own frame. Similarly,
that function will clean up the stack space it allocated before it
returns, so the function that called it won’t have to respect the memory
it uses: it will be done with that memory before the calling function
continues. This is why we only need a single register to keep track of
the entire stack, but it’s also why values stored on the stack are
invalidated after the function returns.
When do we use $
vs. %
vs no prefix for a
register in gdb
?
This is confusing because gdb
is a little inconsistent
about this! When using print
or examine
you
need the $
, when using info reg
you don’t need
any prefix, and in assembly code you’ll see %
as a prefix,
even though you never need to type that yourself in
gdb
.
If you’re not sure which to use, you can always check help for the relevant command to
figure it out.
How do we know whether a function reads from or writes to a register?
Each different assembly instruction has a different meaning, and
reads and/or writes registers in different positions. For example
mov
reads from its first argument and writes to its second
(although either could be a memory location, in which case we’re reading
from the register used to compute that memory location and writing to
memory).
We’ll expect you to be familiar with the instructions listed on this sheet which includes a list of common instructions.
Note that most of the math instructions like add
or
imul
are actually more like +=
and
*=
than +
or *
: they both read
from and then write into their second argument.
Do companies actually have engineers who program in assembly language?
Sure. Just look up “x86 assembly jobs” online and you’ll see a few. It’s not super common, but still very necessary here and there. In many cases, it’s probably not someone’s full job just to program in assembly, but it’s part of a larger job where they use other language as well. Here are some examples of people who would use at least some assembly in their day-to-day job:
- People maintaining and updating older systems where either they are directly programmed in assembly code or where programs in higher-level languages like FORTRAN or C might be checked for accuracy at the assembly level. For example, NASA recently sent a software update to the Voyager 2 spacecraft. You can read about its computer, which was originally programmed in FORTRAN, here. Systems that are old but still critical, like nuclear missile control & management systems, are probably still maintained by people who need to write assembly code.
- Many computer security workers who deal with viruses need to be able to read and understand assembly code, since that’s all they have to work with. They may use a decompiler to try to come up with approximate C code corresponding to the assembly they see, but since hackers could write some routines in custom assembly code, they can’t just rely on those tools for everything.
- The people who write compilers and develop operating systems need to understand assembly code, even if they might not write much by hand.
- People who develop new hardware, like new CPU designs, need to be intimately familiar with assembly so that they can design new assembly languages and/or maintain backwards compatibility with existing ones.
Which gdb
commands should we memorize?
The gdb
commands that we assume you will understand
include print
, info reg
, examine
,
backtrace
, disas
,
start
/run
, break
,
step
, stepi
, and continue
.
list
is another helpful command sometimes but not as
critical, especially for the x86
assignment where we don’t
give you source code. Thankfully, the help
command in
gdb
is pretty good, so you can get reminders easily as
you’re learning these. By the end of the x86
assignment,
you’ll know them, one way or another.
We don’t expect you to memorize all of the formatting options for x (although knowing /gx and /s is probably good). But you should
be able to explain what they do, and make inferences if shown the output
from one or more of them.
What happens if we write a function that has more than 6 arguments/parameters, since there are only 6 registers available for holding arguments?
Additional arguments get written onto the stack in the calling function’s stack frame, and the assembly code of the function that has lots of parameters will include code to read those values based on its starting stack pointer.
What general strategies should we use for reverse engineering?
There are several different things you can do to get your bearings in unfamiliar assembly code, so I’ll list them here:
- Highlight all of the memory addresses used for jumps, using a
different color for each unique address. This will help you trace how
control flows through the code, even before you figure out the
conditions of each jump. Also note function calls during this process
(especially calls to
trip_alarm
in the x86
assignment) and ret
instructions. - Highlight uses of the
rdi
,rsi
,rdx
,rcx
,r8
, and/orr9
registers. These will usually tell you how many arguments a function accepts and/or where it passes arguments to other functions it calls. - Look at places where
rax
(or a part of it like %eax
) is used, tracing backwards fromret
instructions and respecting the various possible jump paths. This will help you understand what goes into computing the function’s return value. - Look for where parentheses are used, since they indicate memory access. Is memory accessed based on the stack pointer, or another register? If it’s another register, where does that register get its value? How many different memory locations are accessed, and are we dealing with arrays and/or strings based on the use of an index register during memory access? Remember to account for return values from any functions that this function calls since those can be the source of memory addresses.
I’d pursue some or all of these strategies a bit before really diving into all the details of a function, since having context can help a lot in understanding details. I’d use the strategies above to answer questions like: does this function use conditionals? Loops? Recursion? What are the main data structures it uses? Just individual variables, or any arrays or strings? What other functions does it call, what arguments does it give them, and what does it do with their return values?
For the x86
assignment in particular, I’m always looking
to answer the question: what path can be taken through this function to
get to a return instruction without hitting any trip_alarm
calls? The conditions used for the jumps necessary to take that path are
the next points of inquiry for figuring out how to solve the phase.
Once I feel like I have some orientation in terms of the big ideas
for the function, the next step is to make some assumptions about
argument values and then run through the code line by line, executing
instructions in my head and/or writing down results on paper. This step
is tedious, but skipping it can waste a lot of time. Here you can use
stepi
in gdb
to run the actual code
instruction-by-instruction and check your results, although I wouldn’t
necessarily do that at every step. Running through the code step by step
doesn’t have to mean writing down every specific value. Instead I often
keep a sense of which variable each register or stack location is
storing, with an eye for which ones will be used by key jumps in the
function. Based on this rough trace-through, I’ll come up with
hypotheses, like “the first input number must be a 5 to get through the
first check.” It’s important to try to test these hypotheses
piece-by-piece. You don’t want to develop 17 hypotheses only to find out
that your first one was wrong and so are all the others. Use
gdb
to test out your hypotheses, and once you figure out
how to navigate one jump, start developing hypotheses about the next
one.
Sometimes you may need to step through the whole function in gdb to really figure out what’s going on. If it comes to that, use the
display
command to set up auto-printing for the important
registers and/or memory values so that you don’t get lost trying to
remember what you need to print at every step.
Lab 9
How do the push
and pop
instructions work?
The push
instruction is equivalent to the sequence:
sub $8, %rsp
mov <target>, (%rsp)
where <target>
is the argument to
push
. So it modifies the %rsp
register by
subtracting 8, and then uses that register as a memory address to save
the target value there. Note that if the size of the target value is not
8 bytes, the stack pointer may be moved by the size of the target value
instead, so push %eax
would subtract 4 bytes instead of 8,
since %eax
is only half of the %rax
register.
What this accomplishes is saving the target value onto the stack, and
growing the stack (downwards) to accommodate the new value. After a
push
, an instruction referring to (%rsp)
is
referring to the saved value (although if there are multiple pushes,
older values will need offsets since %rsp
will have moved
further).
pop
is the inverse of push
. It translates
to:
mov (%rsp), <target>
add $8, %rsp
As with push, the amount the stack actually moves varies with the
size of the target. If we’re popping an integer, we could use an
e
register as the destination and the stack pointer would
only move by 4. If a push
is immediately followed by a
pop
and both have the same target, they will cancel each
other out and nothing will have changed. On the other hand, assuming
that any code in between the two instructions leaves the stack pointer
in the same place by the time the pop
occurs as it was when
the push
was finished, the pop
will restore
the target value back to what it was when the push
occurred
(and will also restore the stack pointer to what it was before the
push
). Because of this, push
and
pop
are most often used to store & reset values from
registers which need to be saved while another function call uses them.
How do the call
and ret
instructions work?
call
and ret
work somewhat like
push
and pop
, except the thing they push and
pop is the “instruction pointer” which is the x86 equivalent of the
program counter, and there are a few other wrinkles:
call
doesn’t push the current instruction pointer (which
would be pointing at the call
instruction that’s being
executed) but instead it pushes the address of the next instruction
after that call
. This means that when ret
happens, we can continue the current function after the function call we
just made.
ret
pops the value written by call
(assuming there are no errors in your stack pointer maintenance) and
sets the instruction pointer to that value, rather than setting it to
point to the instruction after the ret
. This is how we
“jump back” from a function, and how we can jump back from the same
function into many different places depending on who called it.
call
and ret
move the stack pointer
%rsp
by 8, since that’s the size of an address (see above
for details on push
and pop
).
Lab 12
What aspects of the memory allocator are the most inefficient and how do people go about weighing whether they should sacrifice space or time?
For the basic allocator, the main inefficiency lies in fragmentation and in the time it takes to find free space starting from the beginning of the heap when many things have been allocated and/or freed. I’m not knowledgeable about it, but I’m sure that coming up with better allocation algorithms is still an active area of research. Different applications often call for different space/time tradeoffs, so in some cases you might use a custom memory allocator for special applications. For example, if you’ve got a scientific program to crunch a bunch of data and it’s running out of memory, you could buy more RAM, but you might also want to try out a different memory allocator, even if it is a bit less time efficient. At the same time, if you’re running this calculation for weeks, finding a more efficient allocator might speed things up in a meaningful way, so you might be willing to trade off some space (especially if you can afford to buy more RAM). Of course, you’d want to first ensure that memory allocation really is a bottleneck for the program, rather than some other issue.
Long story short: usually you should test your code and ensure that memory fragmentation and/or allocation overhead is actually a problem before thinking about changing how you allocate memory.
What memory does malloc
consider? All memory in the
computer, or just specific areas?
malloc
just manages memory on the heap, which is a specific
region of memory. It calls sbrk
to extend that region if it
needs to, but eventually, sbrk
may indicate that it can’t
allocate any more memory because the heap has grown too large and either
hit an arbitrary OS limit, or is in danger of running into another
memory region, like the stack.
Are there memory allocation functions in other languages that are faster than C?
I’m not knowledgeable about all the specifics, but I know that Rust has more memory management stuff built into its type system, which probably lets it allocate things more efficiently than C because more information is known at compile-time, although I don’t know that this is the case. There are also languages where the programmer does more of the work (like specifying which memory to allocate) which could be faster in one way, but more cumbersome in another.
What tips do you have for making a better malloc
implementation that’s more space/time efficient?
This is intentionally a pretty hard objective, because there are so many facets to it, and in particular, different kinds of allocation & freeing patterns might behave differently for different allocators, so in many cases there’s not an obvious “best” thing to do. We’re asking you to consider this problem a bit because it’s more like real un-curated problems you’ll face in the real world.
That said, the assignment itself has a lot of good hints here. The section on performance mentions next-fit allocation and an explicit free list as two things you can do to improve performance. Doing better than an explicit free list for throughput is probably pretty tough.
That part also links to the section on performance optimization and engineering, which points out that the use of a profiler is the correct first step and it has some instructions on using gprof
for this. Figuring out how to
use a profiler and actually using it is an incredibly valuable learning
experience :P
How does memory mapping work?
Memory mapping has two meanings. There’s an ‘mmap’ C function which allows you to directly map files (or regions of files) into memory, so that any alteration to that memory alters the file, and reading that memory reads from the file. There’s also the concept of virtual memory: the actual physical RAM of the computer is mapped into a virtual address space for each process, so that the different processes each think they have access to all of memory, but in reality, the parts they each use are mapped to separate regions of the real memory, so that they don’t have access to each others’ data.
The way that virtual memory works is quite complicated, involving some specialized hardware that’s part of the memory load/store process for looking up where a certain area of memory is actually stored (the data this hardware uses is called “page tables,” and this is where the term “page size” comes from).
In lieu of a longer and more complex explanation here, the relevant Wikipedia page on virtual memory is an interesting read.
What are the dangers of premature optimization in terms of both development time/effort and code correctness?
If you start trying to write “more efficient” code without understanding whether your code works yet, and/or whether it’s actually too slow or using too much memory, you might:
- Make your code needlessly complex, in a way that doesn’t actually speed it up.
- Relatedly, introduce bugs into your code because of that complexity, which you could have avoided with simpler code.
- Spend a lot more time building that complex code, when a simpler version that seemed less efficient might have been faster to build, giving you more time to actually measure its performance and fix any real issues that came up.
When we call malloc(n)
does n
include the
header size for the allocated block?
No, it does not. The person using malloc
shouldn’t need to
know how the allocator is implemented, they’re just requesting a certain
number of bytes. So since the header is an internal implementation
detail, they don’t have to know its size. It’s the responsibility of the
allocator to create a payload area with enough room for the requested
bytes, and then store any additional metadata it needs to track that.
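Internally, the allocator does something roughly like the sketch below (HEADER_SIZE, ALIGNMENT, and block_size_for_request are made-up names for illustration, not from the starter code):
#include <stddef.h>

// Made-up constants: 8 bytes of header metadata per block, blocks padded out
// to a multiple of 16 bytes. The caller's n never includes any of this.
#define HEADER_SIZE 8
#define ALIGNMENT 16

size_t block_size_for_request(size_t n) {
    size_t total = n + HEADER_SIZE;                          // payload + header
    return (total + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;  // round up to the alignment
}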
Why shouldn’t we use a struct
to represent a heap allocated
block?
A struct
is a fixed-size data grouping, where we know at
compile-time how big it will be. Because the heap blocks are all
different sizes, a struct
doesn’t make sense. We could
imagine using a struct
to store just the header
information, but struct
doesn’t have a facility to store
multiple pieces of information overlaid on top of each other like we’re
doing. Instead we use a few custom helper functions like
status_size
to let us access the information as individual
values.
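As a rough sketch of what such helpers can look like (this assumes one particular single-word header layout, which may not match the starter code exactly):
#include <stddef.h>
#include <stdint.h>

// Sketch: one 8-byte header word per block, with the block size in the upper
// bits and status flags packed into the two lowest bits (those bits are always
// 0 in the size itself, since block sizes are kept aligned).
typedef uint64_t header_t;

size_t status_size(header_t header)       { return header & ~(header_t)0x3; } // strip the flag bits
int    status_allocated(header_t header)  { return header & 0x1; }            // is this block allocated?
int    status_prev_alloc(header_t header) { return (header >> 1) & 0x1; }     // is the previous block allocated?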
What’s the purpose of the heap footer/epilogue?
The footer/epilogue of the heap is marked as a “size-0” “allocated”
block, and its previously-allocated bit is updated based on the
allocated status of the block before it. The size being 0 is a
clearly-invalid value that makes it easy to see if you end up getting a
pointer to that epilogue. The other information is useful because other
operations (particularly coalesce
) can treat it as if it’s
a normal block, and they don’t need any extra special cases to handle
the end of the heap.
coalesce
will see that it’s an “allocated”
block, and so it won’t try to merge with it.
What might happen if we don’t intentionally call free
on
the pointers that malloc
returns? Where in the code do we
deliberately call free
?
Calling free
is the responsibility of the person who
called malloc
, so in our mm.c
, we don’t need
to do that anywhere. In our testing code, we have a “trace” system set
up where we can read in a trace and call free
when the
trace says to do that. If you’re writing other code (like you did for
Pointers) then it’s your responsibility to call free
once
you’re done using some memory. Of course, there may be some pieces of
memory that you need to use for the entire time the program is running,
so they might only be freed at the end (and in any case, when the
process ends everything will get cleaned up automatically).
If you forget to call free when you’re done with something,
that’s called a “memory leak.” That memory stays allocated, and if your
program runs for a very long time (like a web browser does) even a small
memory leak can lead to a huge amount of wasted memory, potentially leaving
less memory for other programs to use and/or causing programs to fail
because there’s not enough memory available. That’s why we use
valgrind
to check for memory errors and ensure we don’t
have any leaks.
Why is code with Java primitive types faster than code with Object types?
In Java, Object
-type objects are all represented using
pointers, and their actual data is allocated on the heap. In contrast,
primitive types can be stored directly in Java’s equivalent of
registers, which are sometimes actual registers. So by using primitive
types, there are some situations where you can avoid heap allocation,
which we now know is potentially time-consuming as it involves a search
for a free block and a bunch of memory writes to update header
information.
What’s the relationship between “RAM” and the memory managed by
malloc
?
Malloc manages a segment of RAM called the “heap.” There are some
constants that define where this segment actually lives in the address
space, but it gets a starting size and we can call sbrk
to
grow it. The stack and text segments are also in RAM, and since both the
stack and heap can grow, if you allocate too much in the heap you’ll
limit the amount of stack stuff you can allocate and vice versa (mostly
these limits are way beyond what ordinary programs need).
Does the malloc
project use an implicit or explicit free
list?
The starter code provided has the skeleton of an implementation that
uses an implicit free list. If you want to score more points
for better performance, you can implement an explicit free list (easier
is to just implement a next-fit instead of first-fit search
policy, which will score higher than first-fit but not as high as an
explicit free list).
What is the “payload?”
In malloc
, we’re writing a system that others are going
to use to allocate memory. They may use that memory for whatever purpose
they want (like storing a command array :). From the perspective of the
malloc
code, we don’t care what the memory gets used for,
we just need to ensure that nobody else will write to that memory
(ourselves or other people who called malloc), and to obey the
size limit they requested. That region of memory the caller gets to use, as
opposed to the header metadata we keep for ourselves, is the “payload.”
Why is the explicit free list (doubly-linked list) better than an implicit free list for tracking free blocks?
It’s faster at finding free blocks when there are many allocated blocks. Imagine you have one small free block followed by 1000 allocated blocks of various sizes, and then a big free block. If you’re asked to find space that’s larger than the small block, with an implicit free list you’ll have to iterate through all of the allocated blocks in order to find the free block at the end, because each block only has enough information to find the start of the next block; there’s no way to “skip ahead” multiple blocks. In contrast, if each free block knows the location of the next free block, you can skip from one to the next ignoring the allocated blocks.
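The usual trick (sketched below; the starter code may organize this differently) is to store the list pointers inside the payload of each free block, since that space isn’t being used for anything else:
// Sketch with made-up names: a free block's payload isn't holding user data,
// so we can keep the list pointers right inside it.
typedef struct free_node {
    struct free_node* next; // the next free block, skipping all allocated blocks
    struct free_node* prev; // the previous free block, so removal is easy when we allocate it
} free_node;

// A search then only ever touches free blocks:
// for (free_node* cur = free_list_head; cur != NULL; cur = cur->next) { ... }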
How can we distinguish the four coalesce
cases in our
malloc
code?
For a code design question like this, I often start with idealized
variables and then work backwards to see how I can get those variables.
This helps break the problem into smaller pieces (one for each variable)
and it also makes it easier to test and verify each piece independently
when debugging. For the coalesce
cases, we need to know two
“bits” of information, which will distinguish our four cases:
- Is the previous block free?
- Is the next block free?
The combinations of no/no no/yes yes/no yes/yes will then be the four coalesce cases (although with some designs it’s possible to share code between the cases too). So I might start by defining:
int prev_free = 0;
int next_free = 0;
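From there, the overall structure might look something like this sketch (block_prev_free, block_next_free, prev_block, next_block, and merge_blocks are placeholder helpers for whatever your design uses, not functions from the starter code):
// Placeholder helpers: however your design locates neighbors and merges them.
int prev_free = block_prev_free(block); // e.g., read the "previous allocated" bit or a footer
int next_free = block_next_free(block); // e.g., look at the next block's header

if (prev_free && next_free) {
    block = merge_blocks(prev_block(block), block);  // merge all three into one free block
    block = merge_blocks(block, next_block(block));
} else if (prev_free) {
    block = merge_blocks(prev_block(block), block);  // merge with the block before us
} else if (next_free) {
    block = merge_blocks(block, next_block(block));  // merge with the block after us
} // else: neither neighbor is free, so there's nothing to merge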
Where do 0x11 and 0x13 come from for an allocated block of size 16 with a free/allocated block before it?
I think it’s clearest if we translate into decimal and binary:
0x11 = 16 + 1 = 17 = 0b10001
0x13 = 16 + 3 = 19 = 0b10011
In both cases, if we mask out the two lower-order bits which are actually used as free/allocated indicators, the numeric value is 16 to represent the block size. Then we either have a 0b01 (1 in decimal) to represent “previous block is free and this block is allocated” or a 0b11 (3 in decimal) to represent “previous block and this block are both allocated.”
Are there cases where the other two unused bits in a block header get used for something? If so, what? How about the least significant bits of the footer?
I’m not familiar with all the different malloc
variants
out there, but I’m sure someone has used those extra bits for something
in some implementation, I just don’t have an example that comes to
mind.
If I try to imagine what they could be used for, I can try to think of further optimizations that might be possible, but this is hard. For example, there might be some gains in marking which free block is the largest, because if you get a request for a larger allocation and you happened to find that block while searching, you could immediately skip to expanding the heap rather than continuing to check further free blocks. But that’s probably a terrible idea, because the time spent trying to maintain that “largest” bit correctly in each block would be a lot (storing a pointer to the largest block in the heap header would be cheaper). Also, how frequent is it really that we get an allocation request that’s larger than our largest free block (maybe not frequent) and what’s the cost of checking an extra bit in each free block (probably not worth it even ignoring the other issues).
So you can see it’s not easy to come up with other things that it’s useful to store, because anything you might store incurs costs to maintain that information and to check it.
One thing suggested in the assignment as an extra challenge would be to do some kind of valgrind
-like checking for certain kinds of
memory errors. These extra bits might be useful for that purpose, and if
you really cared about efficiency, you could put those checks behind
if (DEBUG)
or the equivalent so that they only happen when
debugging. That said I haven’t actually designed such an algorithm
myself, so this is just a guess, not a hint, and it might end up too
hard to go that route.
Lab 13
When looking at source code, what are some clues that a program is a fork bomb?
You’d want to look at where fork
is called, and how many
times that code will execute. It’s not always a guarantee, but a program
that calls fork
within a loop should raise some suspicions:
Why does it do that, how many times will the loop execute, and what
happens to each parent/child process before the next call to
fork
. If the child or parent processes immediately break
out of the loop, for example, then it may be fine. But if both of them
will reach the next iteration and call fork
again, that
will lead to an exponential explosion of processes.
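For example, here’s a sketch (not from the lab code) of a loop that forks several workers safely, because each child breaks out immediately:
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int n_workers = 4;
    for (int i = 0; i < n_workers; i++) {
        pid_t pid = fork();
        if (pid == 0) {  // we're in the child
            printf("worker %d (pid %d) starting\n", i, (int)getpid());
            break;       // crucial: the child does NOT keep looping and forking
        }
    }
    // Without that break, every child would also fork on each remaining
    // iteration, and the number of processes would grow exponentially.
    return 0;
}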
How can multiple processes run at the same time on the same CPU?
They don’t actually run at the same time: each process gets to run for a little bit, and then a “hardware interrupt” will stop it, giving control of the CPU back to the kernel (which is the core program that implements the operating system). The kernel may need to do some of its own business, but will pretty quickly hand control of the CPU back to another process that wants to run, and so processes share the CPU by taking turns, with turn-taking enforced by the kernel.
When is it safe/useful to use fork
? Why would you use
fork
?
Whenever you need to start a new background process, it’s easiest to
use fork
. Typically, if you use an if
statement to have only the parent process continue with the code for the
original program while using exec
in the child process to
replace that code with some other program, there’s much less risk of
forking too many times. Usually you also don’t need to place
fork
within a loop, which can increase the risk of forking
too many times.
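A minimal sketch of that pattern (not from the lab code; /bin/ls is just an example program to run):
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: replace this process's code with some other program.
        execl("/bin/ls", "ls", "-l", (char*)NULL);
        perror("execl failed"); // only reached if exec itself fails
        exit(1);
    }
    // Parent: continue with the original program's work, or wait for the child.
    int status;
    waitpid(pid, &status, 0);
    printf("child finished\n");
    return 0;
}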
Why would you want a “background process?” One common use case is when you need to pay attention to user input, but also process some data in the background. You could use one process that waits for user input and gives immediate feedback, while sending data inputs over to a second process that does something slow/expensive with those inputs. Even something as simple as showing a spinning loading icon while data is loading requires two things to happen at once: the spinning icon needs to be constantly updated, and the loading needs to happen. This kind of thing is more often handled by threads than processes, because threads can more easily share data with each other, but the same logic applies.
Another example of a background process is when you want to use an existing program that doesn’t have a C API, maybe one that’s written in another language. If you can be sure that that program is installed on your computer, you can use fork and exec to
run it from within a C program. Many languages have higher-level
libraries for doing this (like the subprocess
module in
Python, which also lets you easily manage input and output streams for
the child process). Most of the time, it will be more convenient to
write a multi-process program in a higher-level language. But knowing
how fork
works at the C level helps you understand why
those languages might have certain quirks (like how file descriptors are
shared between Python subprocesses in some situations).
How can information be shared between different forks of a program?
There are a few possible options: for example, you could use files that the programs read & write together, or an internet socket that’s just used to forward data from one fork to another. There’s also a C function equivalent to the kill terminal command that can send signals between processes if they know each others’ PIDs, which they can easily get from either the fork return value (in the parent) or a function for getting the parent process ID (in the child). So you can have one process signal the other when it’s done writing a file, for example. Besides the ‘interrupt’ and ‘kill’ signals, there are a number of different signals available to send.
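As a rough sketch of the signal option (using the standard kill and signal functions; in real code you’d want to be more careful about the case where the signal arrives before the handler is installed):

/* The child waits for SIGUSR1; the parent "finishes" some work, then
 * notifies the child by PID using kill(). */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_signal = 0;

static void on_usr1(int signum) {
    (void)signum;
    got_signal = 1;                       /* just record that it arrived */
}

int main(void) {
    pid_t child = fork();
    if (child == 0) {                     /* child */
        signal(SIGUSR1, on_usr1);
        while (!got_signal) {
            pause();                      /* sleep until some signal arrives */
        }
        printf("child: parent says the file is ready\n");
        exit(0);
    } else {                              /* parent */
        sleep(1);                         /* pretend to write a file */
        kill(child, SIGUSR1);             /* signal the child by its PID */
        waitpid(child, NULL, 0);
    }
    return 0;
}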
Why do we care what order things are printed when using fork? What causes things to print out-of-order? Is there a way to control the order that things happen in after forking?
We care about the order because if we want to coordinate work between multiple processes, there may be things that need to happen in a certain order, such as one process writing a result into a file before another process tries to read it.
The order becomes unpredictable because as soon as there are multiple processes, the operating system can choose to schedule them in any order. Depending on how many other processes there are and other uncontrollable factors, the OS may schedule different processes to run for different amounts of time, and in different orders.
There are ways to force the OS to schedule things in a certain way: You can have either the parent or the child wait using waitpid
for the other one to finish. You could also use signals as described in
the previous answer to coordinate which process will do things when, by
having one process wait for a certain signal before doing something.
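For example, a minimal waitpid sketch, where the parent refuses to continue until the child has finished:

/* waitpid forces an ordering: the parent blocks until the child exits. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        printf("child: doing some work first\n");
        exit(0);
    } else {
        waitpid(pid, NULL, 0);   /* block until the child is done */
        printf("parent: child finished, safe to continue\n");
    }
    return 0;
}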
What are the jobs, fg, and bg commands?
jobs
lists the running jobs in the current terminal. A
“job” consists of one process that was launched directly from the
terminal, plus all child processes of that parent. The fg
command brings a paused job back to the foreground to resume it. You can
use control-Z to pause a job, after which fg
can resume it.
The bg command also resumes a paused job, but lets it run in the background, so you can still use the terminal for other stuff while it’s running. Of course, background processes can still print stuff, which can be distracting if it happens to print out on top of output from the other programs you’ve started to run. There’s another command ‘disown’ which can cut a process off entirely from a particular terminal, and can be used once the process has been sent to the background with bg.
Why do we need to differentiate the parent and child processes based on
the return value from fork?
Typically, it’s not useful if the two forked processes do
exactly the same thing: we’re forking because we want a
secondary process to do some work in the background, while the main
process continues with the work it’s been doing. So we need to know
whether we’re the parent or the child to know whether to continue the
work we’ve been doing, or to switch to something else (often, the child
process will call exec
to start a new program, while the
parent process continues on).
How does fork
work in terms of coordinating variables in
the parent and child processes?
While threads (a related technique) can share some variables, processes do not share variables. Of course, they are created as clones of each other, so both the parent and child process have access to all the same variables when they start up. But they “don’t share” variables because when one of them changes a variable, the other one won’t see that change. Effectively, when we create the child process, we copy the entire state of memory over to the child, and any modifications in either the parent or the child are done on separate copies so they don’t affect each other.
As mentioned in a previous answer, you can explicitly use mechanisms like files or sockets to communicate between processes.
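Here’s a tiny demonstration of the “separate copies” behavior:

/* Parent and child do not share variables: the child's change to x is
 * invisible in the parent, even though both started from the same copy. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    int x = 1;
    pid_t pid = fork();
    if (pid == 0) {
        x = 42;                              /* changes only the child's copy */
        printf("child:  x = %d\n", x);       /* prints 42 */
        exit(0);
    } else {
        waitpid(pid, NULL, 0);
        printf("parent: x = %d\n", x);       /* still prints 1 */
    }
    return 0;
}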
What does sleep
do?
As mentioned above, the kernel/OS manages how processes share the CPU,
letting them run for a while, then interrupting them, and then letting
them continue later after other processes have had a turn.
sleep
tells the operating system: hey, I’m done with work
for now, and I actually don’t need to take any more turns until this
many seconds later. So sleep(1)
would tell the OS “wake me
back up in 1 second.” Typically, from the process’s perspective, it just
pauses for that long and then resumes. Of course, if many other
processes are demanding to run at the same time that we’re scheduled to
wake up, there might be a slight delay beyond 1 second before we run
again, so we can’t depend on the sleep lasting exactly 1 second
(typically the delay is on the order of microseconds, so not very
noticeable by humans).
What does exit
do and what’s the significance of its
argument?
exit
stops the current process. Each process has an “exit
code” that can be observed by other processes after it finishes, and is
used to indicate success/failure, with certain exit codes indicating
specific kinds of failure or other possibilities. Each process can
decide what its exit codes will mean, but universally, an exit code of 0
means “success with no errors” (while any other exit code usually means
some kind of error). So exit(0)
means “stop this process,
and notify anyone watching it that it succeeded.” In particular, the
shell can be set up to print an alternate prompt when a process
terminates with a non-0 exit code, and there are shell
built-ins/operators that can change behavior depending on the exit code
of a process, like if, ||, and &&. Using exit(1) is a common way to
indicate a general error, although many programs have sections in their
manual detailing many different possible error codes.
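As a small illustration (not from any particular assignment), here’s a program that exits with 1 when it can’t open the file you name, and 0 otherwise; in the shell you could then check $? or chain it with && or ||.

/* Exit codes: 0 on success, 1 on error. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <filename>\n", argv[0]);
        exit(1);                    /* general error: bad usage */
    }
    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        perror("fopen");
        exit(1);                    /* error: couldn't open the file */
    }
    fclose(f);
    exit(0);                        /* success */
}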
When should you use control-C vs. kill to stop a process?
Control-C sends a ‘SIGINT’ signal (interrupt) to the current terminal
process and all of its children. kill
sends that signal (by
default; it can send other signals instead) to one specific process
named by its PID. So in general control-C is much more convenient. But
if you need to kill a background process, you’ll need to use
kill, since control-C only applies to the current foreground process. You may also need kill if you want to send another signal, like SIGTERM. Control-Z (which pauses the foreground process with SIGTSTP) and control-backslash (which sends SIGQUIT) are also available as keyboard shortcuts; control-D doesn’t send a signal at all, it just sends end-of-file on the terminal’s input, which many programs treat as a cue to exit.
Are there things besides fork
that can cause there to be
multiple processes?
fork
is the main C library function for creating
processes, although there are probably custom higher-level
process-management libraries that you could use instead. However,
there’s another simpler answer: the user might intentionally launch
multiple processes. If you remember the Pointers Assignment, we made a
big deal of ‘&’. That’s because in the shell, you can add ‘&’
after a command to start it running in the background, not the
foreground. This lets you launch other programs immediately, so you can
launch multiple at once. Same would apply if you open multiple terminals
(each of which is its own process) and then launch multiple programs at
the same time.
How can we predict the output from and behavior of programs that use
fork, including when the next prompt will be printed by the
shell?
You “just” need to trace through the code and keep track of any forks
that happen, tracing them separately. Then account for all the possible
ways that those traces might be interleaved after the fork(s) happen(s). This is… hard. Drawing diagrams helps,
but fundamentally parallel programming is an incredibly challenging
thing, and often keeping forking patterns as simple as possible is a
good idea if you’re the one writing the code.
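As a tiny example of the unpredictability, both of these printf calls will run, but which line appears first can differ from run to run:

/* After fork, the OS may schedule the parent or the child first, so the
 * two lines can appear in either order. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        printf("hello from the child\n");
    } else {
        printf("hello from the parent\n");
    }
    return 0;      /* both processes reach this line */
}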
What are “threads” and when would you use them?
Threads are like processes, but instead of creating a clone of the process they share a lot of the same process state, meaning that they can each write to the same set of global variables, for example. This makes it easier to pass information between threads, but also much easier to screw things up. Threads are commonly used in situations where user input is happening alongside CPU-intensive processing, like in a videogame, and the processing needs to immediately access and respond to the user input.
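If you’re curious, here’s a minimal sketch using the POSIX pthreads library (which we don’t require in lab; compile with -pthread) showing that threads really do share global variables, unlike forked processes:

/* The worker thread writes to a global; after pthread_join, the main
 * thread sees that change, because threads share the same memory. */
#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;          /* visible to every thread */

static void *worker(void *arg) {
    (void)arg;
    shared_counter = 42;         /* change is visible to main */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);       /* wait for the worker to finish */
    printf("shared_counter = %d\n", shared_counter);   /* prints 42 */
    return 0;
}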
Why does fork
return 0 in the child process?
This gives the code calling fork a simple way to figure out whether it’s the parent or the child, since 0 is not a valid PID. There’s a getppid function (declared in the standard library header unistd.h) that can get your parent process’s PID, so the child still has access to that if it needs it. The parent, on the other hand, might have multiple children, so there’s no obvious way for it to get the PID of the new child except the return value from fork; that’s why the 0 is returned to the child, not the parent.
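Putting that together, here’s a small sketch of the PID bookkeeping on both sides of the fork:

/* The parent learns the child's PID from fork's return value; the child
 * gets 0 from fork and can use getpid()/getppid() for the actual PIDs. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        printf("child: my PID is %d, my parent's is %d\n",
               (int)getpid(), (int)getppid());
    } else {
        printf("parent: my PID is %d, my new child's is %d\n",
               (int)getpid(), (int)pid);
        waitpid(pid, NULL, 0);
    }
    return 0;
}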
Does parallel processing use one RAM or multiple RAMs?
All processes/threads share a single RAM. Sometimes this is actually multiple RAM chips if you’ve added extra RAM to your computer, but the operating system manages these as one unit to display a consistent pool of memory to running processes.
Of course, if that’s true, why don’t different processes overwrite each other’s data all the time? Well, there’s something called memory virtualization which means that where a process thinks things are stored is not where they’re actually stored. Every time memory is read or written, the address used by the process goes through a translation step that maps it to a different part of the actual hardware RAM. The OS makes sure to map memory used by different processes to different parts of the physical RAM, even while ensuring that they each think they have access to all available RAM. It also leaves unused sections of memory un-mapped until they’re requested, so that it can fit the active memory regions from many many processes together into the single actual RAM.
So in effect, each process gets its own memory to use, separate from all other processes, even though physically, there’s only one RAM chip.
What does atoi
from forkex7
do?
To find out the behavior of an unfamiliar function like atoi, a quick search for a reference site usually suffices,
although the reference sites are becoming harder to find and less
reliable these days… The man
program (short for manual)
also has references for most C library functions, so running
man atoi
in a terminal would work.
atoi converts a string of ASCII characters (that’s the ‘a’) to an integer (that’s the ‘i’), so the name reads “ASCII to integer.” If the string you give it can’t be converted, it returns 0.
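A quick demonstration:

/* atoi skips leading whitespace, reads an optional sign and digits, and
 * stops at the first non-digit; a non-convertible string gives 0. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%d\n", atoi("42"));      /* 42 */
    printf("%d\n", atoi("  -7x"));   /* -7: spaces and trailing junk ignored */
    printf("%d\n", atoi("hello"));   /* 0: not convertible */
    return 0;
}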
How does the operating system manage processes to give the illusion of simultaneous execution? What controls which process runs when?
The OS has a “scheduler” which decides which process to run next
whenever a process either voluntarily gives up control or is explicitly
interrupted. Any time a system call is made (like when calling
printf), this gives the OS a chance to smoothly interrupt
the process, since the process is asking the system to do some work for
it anyways. Once a process is selected, after it runs for a bit it will
be interrupted and another process will be swapped in. The scheduler is
responsible for making sure that each process gets a turn, while also
ideally letting processes that are more important or which have more
work to do get a bit more time than others. But since context swaps
happen hundreds to tens of thousands of times per second, from a human
perspective it looks like multiple processes are happening
simultaneously.