Frequently Asked Questions

This page holds answers to some frequently-asked questions from the lab. Use your browser’s find-in-page to search for words you’re interested in and then click on a question to expand the answer. This page only covers some topics, and will be updated throughout the class; feel free to email your lab instructor if you have a question that you’d like to see here.

Lab 5

How does a Register File work? Check the lab 4 notebook for the details of the register file circuit. Basically, you can use a decoder on an address bus to send a signal to one of N registers, so that only the clock for that register turns on. Then you can send the same address bits into a multiplexer, so that you only get output from the register that you want. The clock is used when writing to a register is needed (and is ANDed with a “write-enable” line for the entire file, or equivalently, the control unit that sends the “write enable” signal can AND that with a clock to ensure that there’s a falling edge between two cycles where “write-enable” is on) and the multiplexer is used to get output. Since the register file has two read addresses/outputs and one write address, there are 3 address inputs, one hooked up to a decoder for deciding where to write, and the other two hooked up to two multiplexers for deciding which register(s) to read from. The same write data is routed simultaneously to all registers, with the clock signal making sure that only the appropriate register is written.
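If it helps to see those roles written out, here is a minimal C sketch of what a register file does during one read and one write (the function names and sizes here are just illustrative; the real thing is the decoder/multiplexer circuit described above, not code):

#include <stdint.h>

uint16_t regs[16];                 // the 16 registers, 16 bits each

// Read side: a read address acts like the select input of a multiplexer,
// picking which register's stored value appears at the output.
uint16_t read_reg(int addr) {
    return regs[addr];
}

// Write side: the write address acts like the decoder input; only the
// selected register's clock is enabled, and only if write-enable is on.
void write_reg(int addr, uint16_t data, int write_enable) {
    if (write_enable) {
        regs[addr] = data;         // only this one register latches the data
    }
}
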
How are clock signals used in the CPU? Within register file and RAM chips, clock signals are only needed when we want to actually write into a register. So the global clock signal gets directed into these devices, and ANDed with a “write enable” line to determine whether or not to actually write data. This can happen either inside or outside the device; in our LogicWorks simulations, we manually turn write enable on and then off (or vice-versa if it’s active-low) to simulate a clock pulse. The ALU doesn’t need a clock signal at all, because it’s a combinational circuit: it produces outputs based on its current inputs all the time. The main place where the clock signal is used every cycle is the program counter: that register has an address-update feedback mechanism (which we saw in lab) that either adds 2 or performs the BEQ or JMP logic. The clock signal then controls when this feedback mechanism actually updates the register to move to a new address. As soon as that happens, the outputs from instruction memory change, since its input from the PC has changed. When they change, their new value is the next instruction to execute, and so the control unit and register file now receive new data which changes their outputs, and after a short delay necessary to let signals propagate through gates, their outputs have settled and the instruction is complete. As long as the length of one clock cycle is longer than this gate propagation delay, the instruction effects are in place before the next cycle.
How can I figure out what a series of HW ISA instructions will do?

To understand what the result will be from a series of HW ISA instructions, you need to mentally (or on paper) keep track of: What is the current program counter and what will it be for the next instruction, what is the content of each of the 16 registers, and what is the content of all 256 memory slots. Of course, you really only need to keep track of those registers and memory slots that the code actually reads from or writes to. Importantly, you really only have 14 registers to track, since R0 is always 0 and R1 is always 1. You can think of each instruction at two different levels of abstraction: as control signals from the control unit (which you should understand the truth table for if you want to deal with stuff at this level) OR as abstract logical/arithmetic operations, listed in the “meaning” column of the main HW ISA slide.

If you want to think about it at the bits/circuit level, trace through the full CPU architecture diagram for each instruction: from the opcode part, figure out the outputs that the control unit will send. From the Rs/Rt/Rd parts, figure out what inputs the register file, ALU, and/or BEQ mechanism will receive. From the control unit outputs, figure out which signals get dropped or forwarded by the multiplexers, which influences these inputs. Then, figure out what the ALU output is, and possibly, what address input the data memory will receive and what its output will be. Finally, figure out what value will be sent back to the register file (or written into the data memory, at what address, or what the BEQ/JMP address result will be). I think that this process is worth doing a few times so you really understand how the circuitry is connected and operating together. However, it’s a slow process. If we think more abstractly, we can reason through assembly code much faster…

So if you want to think at the “meaning” level of abstraction, just read off the “meaning” line for the opcode of the instruction being processed, fill in the s/t/d/offset values in that meaning from the other 12 bits of the instruction, and note down the update to either the PC, a register, or a memory address (again, using paper or a whiteboard here is probably good to start with; once you’re more comfortable you can start to just hold this info in your head). At this level of abstraction, you can also look at the Rd (or Rt for LW/SW) values to understand which register is being updated (except for BEQ/JMP where it’s the PC). This means that if you lose your place in terms of what a register value is, you can look back through the code to the last time where that register is listed as a Rd value (or as Rt for a LW instruction), since that’s the instruction that would have set the current value for the register in question.
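If you like, you can think of the bookkeeping for “meaning”-level tracing as a few variables; here’s a tiny C sketch of that state and of one ADD-style update (the names are made up, and doing the same tracing on paper works just as well):

#include <stdint.h>

uint16_t pc = 0;      // the program counter (always even)
uint16_t R[16];       // the 16 registers; by convention R[0] is 0 and R[1] is 1
uint8_t  mem[256];    // the 256 byte-sized data memory slots

// One "R[d] <- R[s] + R[t]" instruction, traced at the "meaning" level,
// is just a register update plus the normal PC advance:
void trace_add(int d, int s, int t) {
    R[d] = R[s] + R[t];
    pc = pc + 2;      // unless a BEQ or JMP changes the PC instead
}
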
What is the purpose of the Control Unit? The purpose of the Control unit is to send all the little extra signals that the other parts of the CPU need to perform the correct function for each different opcode. For example, the ALU needs to be told which operation to perform, and memory and the register file need to be told whether to accept incoming data and update a value or to ignore it (the “write enable” lines). This is why the opcode bits of each instruction go straight into the control unit.
How can we figure out what the control unit should do for a particular opcode?

To come up with the Boolean algebra expressions for the control unit, it’s easiest to start from a truth table: you already know how to use sum-of-products and/or simplification to get from there to an expression. To derive the truth table for the control unit, you need to understand what the other parts of the CPU circuit are doing, and then be able to deduce what each individual control wire needs to be set to in order to make that happen. All of that is in terms of the “meaning” column of the ISA slide. So for example, if the “meaning” for a particular opcode says “R[d] <- R[s] + R[t]” you know that the steps for making that happen are: 1. read data from R[s], 2. read data from R[t], 3. add those two numbers together, and 4. store the result in R[d]. If that’s the work the circuit needs to do, then you know that the address inputs to the register file must be s and t, and the read data outputs from the register file must make it into the ALU. That implies that the “Mem” control line will be off for this operation, since otherwise, read data 2 is based on d, not t, and the t data as an offset would get put into the ALU instead of the read data 2 output from the register file. Additionally, it means that the ALU opcode must be the opcode for addition. Further, you can also deduce that Mem must be 0 because the ALU result needs to make it back to the register file data input, and Mem also controls the multiplexer that can swap that for a data memory result. Then you can deduce that the “write enable” bit for the register file should be on, since the result needs to be written into a register. Similarly, the address and data values that reach the data memory during this instruction won’t be meaningful, so we need to make sure to set the “mem write” line to 0. Also, we are NOT doing any kind of jump or branch with this meaning, so the relevant control lines for those things should be off. That’s just a simple example, but the process is the same for the other opcodes.
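To make that concrete, here is one way you could jot down the conclusion of that reasoning for the ADD opcode as a C struct; the field names are made-up stand-ins for the control lines discussed above, and the exact ALU operation code depends on your ALU’s encoding:

// Hypothetical names for the control lines discussed above.
struct control_signals {
    int mem;        // the "Mem" line: 0 = register/ALU path, 1 = LW/SW address path
    int mem_write;  // the "mem write" line: write enable for data memory
    int reg_write;  // write enable for the register file
    int alu_op;     // the operation selection sent to the ALU
};

// Control outputs for the "R[d] <- R[s] + R[t]" meaning:
struct control_signals add_signals = {
    .mem = 0,        // use the register operands, route the ALU result back
    .mem_write = 0,  // do not write anything into data memory
    .reg_write = 1,  // do write the ALU result into R[d]
    .alu_op = 0,     // placeholder: whatever bit pattern selects addition on your ALU
};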

If you don’t really understand what the purpose of a particular control line is, look at what it’s connected to and what part that plays in the circuit. If it still seems mysterious, read more about the relevant operation and make sure you understand what has to happen for that operation to take effect.
How is a decoder chip used in a RAM circuit?

(Note: lab 4 has an example RAM circuit where you can see these details.)

The decoder in the RAM circuit gets input from the address input of the whole chip. Remember that a decoder converts things into a “1-hot” encoding, meaning that 1 of N wires is on and the rest are off (or vice versa if it’s active-low). So we take the address input (say, 8 bits) and convert it to a 1-hot encoding (which for 8 bits of address would be 256 wires, one of them active). These wires connect to the individual registers within the RAM: for 8 bits of address we have 256 registers. This activates that particular register, both activating its tri-state buffers to dump its data onto the output bus, and if the write-enable signal is also on (using an AND gate) activating its clock inputs so that data coming in on the data-in lines actually updates that register (the data-in line connects to all registers simultaneously, but only one will receive clock input).
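If the 1-hot idea is the sticking point, here is the same conversion written as a tiny C sketch (purely illustrative; in the chip it’s just a block of AND gates and inverters, not code):

#include <stdint.h>

// An 8-bit address selects exactly one of 256 output lines: the line whose
// index matches the address carries a 1, and every other line carries a 0.
void decode(uint8_t address, uint8_t one_hot[256]) {
    for (int i = 0; i < 256; i++) {
        one_hot[i] = (i == address);
    }
}
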
How are tri-state buffers used in a RAM circuit?

(Note: lab 4 has an example RAM circuit where you can see these details.)

The RAM circuit needs to produce output from one of its internal registers at a time: whichever one was indicated by the address input. Each register has tri-state buffers for each of its bits, and the address logic takes care of activating only the tri-state buffers for the specific register that the address bits indicate. The tri-state buffer acts as a wire that passes a signal through when activated, but when deactivated, acts as a gap in the circuit. So effectively, one of the registers gets connected to the data outputs at a time, while the rest are disconnected. You can think of a tri-state buffer as a 2x1 mux, where the input signal is on the ‘1’ input to the mux (i.e., gets forwarded through when the select signal is a 1) and the second input to the mux is a disconnected wire (so that when ‘0’ is selected, the output is neither high nor low, nor in-between: there is no output connection). That’s not exactly how it works, since a traditional MUX wouldn’t work that way; check out this circuit (a 2x1 mux implemented using 2 tri-state buffers) if you want to see the details of how a tri-state buffer is implemented.
What is RAM actually used for in a computer? RAM gets used to store most data for a running application. For example, right now, the text that you’re reading is stored in the RAM in an area reserved for your web browser, along with all of the data from all of your open tabs. Now, some of that may in fact be stored on the hard drive if RAM is too small to hold it, and if it is, when you go back to that tab, it will need to be loaded back into RAM, which will take a bit longer than switching to a more recent tab that’s still loaded in RAM. There’s also a special part of RAM (or in some cases a separate RAM chip) that stores the machine instructions for the programs that you’re running: the CPU needs to access that data as well to know what to do. In our HW Arch, this is the “Instruction Memory” that’s separate from the “Data Memory.”
What’s the difference between RAM and the hard drive, and what are they used for?

RAM is general-purpose memory, there’s a lot of it, and it’s much faster to access than the hard drive, so it’s used for all of the data that programs and your operating system are actively using at any given moment. Unlike data stored on the hard drive, data stored in RAM will vanish if you turn off the computer though. So when you save something in a file, that gets written to the hard drive.

Some systems that need to remember things, like an oven that remembers settings, just stay on all the time, because they’re plugged into a wall outlet. If you unplug them, they’ll forget their settings. Other devices have a battery that just keeps the RAM powered when the device is off; SRAM as opposed to DRAM needs very little power to stay active over time. Still other devices might use flash memory or another kind of persistent storage that can retain data even when power is off, although these are typically more expensive than RAM, so there’s a tradeoff there.
What instructions does the computer actually load when it boots up? When you power on the computer, there’s a specialized chip that contains a fixed set of instructions for the computer to start with. This is the firmware and/or BIOS. This in turn invokes a “boot loader” which loads up the operating system. The operating system itself has a mechanism for loading instructions from a file into instruction memory, but it also does some high-level management of multiple programs to swap them in and out.
How is a bit shift implemented in a circuit? In hardware, a bit shift is super simple: you just plug in the wires to their next destination shifted left or right by one. The bits that get “shifted off” are just wires that you leave unconnected, while if you need extra zeroes, you make extra ground connections at the destination. If you need to do an arithmetic shift, you can connect a bunch of inputs at the destination up to the shifted top bit of the source, so that whether that single wire is 0 or 1 controls all of the inputs together. None of these things require any transistors, which is great.
What would happen if you accessed instruction memory at an odd address? If you have byte-addressable memory but the word size is more than 1 byte, you can ask for the computer to load data starting in the middle of an instruction. It will happily grab part of that instruction and then part of the neighboring instruction, and load those two halves together as one word. If you’re lucky (which is quite likely) then that instruction will just be something nonsensical and the program will have a slight glitch or crash (especially in cases where reading one unaligned instruction means you will end up reading more and more as the PC advances by 2 automatically to another unaligned address). If you’re unlucky, it will do something worse, like go into an infinite loop or overwrite some critical data in memory that it shouldn’t have. After all, there’s a chance that the two parts of an instruction together make up an instruction like SW, BEQ, or JMP. These risks are part of why it’s actually not possible to do this in our ISA: the BEQ and JMP instructions have built-in multiply-by-two steps so that the actual address they reference will always be a multiple of 2 if the PC starts at a multiple of 2, which it always does.
How does the BEQ instruction work?

The BEQ instruction stands for “branch if equal.” It’s a very important instruction because without it, there’s no way for the computer to change its behavior based on the data it is processing. You can create quite complex programs with just JMP, but you can’t achieve anything nearly as rich as you can with BEQ available as well. BEQ will serve as our “if” statement, as well as “for” and “while” loops when combined with JMP. How does it work? First, “branch” in this case means “go somewhere else in the code, instead of continuing down as normal with PC <- PC + 2.” It could also have been called “JEQ” for “Jump if equal” because it is a kind of jump, but the word “branch” implies a splitting-off which is relevant to BEQ because there are two possible destinations: BEQ examines two registers (or possibly the same register twice) and if they’re equal, it performs the jump indicated by the offset, relative to where the PC would have ended up normally. On the other hand, if the two registers’ values are not equal, it skips the jump and just does PC <- PC + 2 as usual, continuing to the next instruction after it.

As an example of how it is used, consider a while loop: we want to “continue the loop if the continuation condition is true.” If we assume true = 1 and false = 0, you could use the following assembly code at the bottom of your loop:

BEQ Rn, R0, 1
JMP <top of loop>
With this setup, assuming that Rn holds the result of evaluating your continuation condition, we compare it to R0, which always holds the value 0. If they’re equal, meaning that our condition is false, we jump forward 1, which skips the JMP instruction below it and moves on to the code after the loop. On the other hand, if the value is NOT 0, then we assume the continuation condition is true, we don’t perform the BEQ jump, and we continue to the next instruction, which is a JMP that takes us up to the top of the loop to repeat it.
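In C terms, this BEQ/JMP pair is playing the role of the continuation test of a loop roughly like the one below (keepGoing is a made-up variable standing in for whatever Rn holds; in the assembly version, the test lives at the bottom of the loop):

int keepGoing = 1;       // the continuation condition: 1 = true, 0 = false
while (keepGoing) {
    // ...loop body, which eventually updates keepGoing...
    // at the bottom, BEQ exits the loop when keepGoing == 0,
    // and otherwise JMP goes back to the top to repeat
}
// the BEQ's forward jump lands here once keepGoing becomes 0
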
What inputs does the ALU have? Besides its two data operands (the values being combined), the ALU has the control inputs Ainv, Bneg, op1, and op0, which we saw in detail in lab 4. Together, they can be used to get the ALU to perform not only AND, OR, and +, but also other operations like subtraction, NAND, NOR, and even NOT. The Control Unit is responsible for creating these control inputs based on the opcode, so there’s a strict relationship between opcodes in the ISA and the ALU inputs.
What is the role of the Program Counter (or PC) within the CPU? The Program Counter register stores the “current instruction address,” and while running a program it’s directly connected to the Instruction Memory (one of the two RAM chips in the HW Arch CPU, the other being Data Memory). By connecting to the instruction memory address bus, the PC determines the data output from the instruction memory in each clock cycle, and as it keeps increasing by 2 each cycle it reads out subsequent instructions from memory one by one (or in some cases its value changes to follow the jump dictated by a BEQ or JMP instruction).
What’s the difference between the “Mem” and “Mem Store” lines? The “Mem” line from the control unit controls three multiplexers, which change the read address 2 input of the register file, the second ALU input, and whether the ALU output or Data Memory output is routed back around to the register file data input. For both LW and SW instructions, this line is set to 1, and the first two multiplexers allow it to perform the R[s] + d (where d is an offset) calculation that both LW and SW require to compute their memory address. However, when doing a LW operation, we do NOT want to write data into the RAM (the data available at the RAM data input for that instruction is the contents of the register Rt that we are trying to write to, not read from). So in this case, we set Mem Store to 0. On the other hand, for an SW operation, we want to do exactly that (store Rt data into memory) and we set Mem Store to 1. In this case, the RAM still outputs a value which will change to match the stored data as the instruction completes. We ignore this value by setting the write enable line for the register file to 0.
What would happen if you loaded program data as text?

Most likely, you’d see a garbled mess of random characters, possibly with some runs of similar characters where similar instructions happen in a row, and very likely with many unprintable characters where the necessary opcodes or register specifications of machine code, when translated through a symbol table, map to symbols that aren’t normally used. You can try this yourself: right-click on an executable file and choose “open with” and then select a text editor, like Notepad on Windows or TextEdit on a Mac. Alternatively, in the command line if you have a compiled program (like the program you wrote for the bits assignment) type “cat” and then the filename. Note that on the command line this may cause issues with the terminal if it happens to end up printing special output control characters; in that case using the terminal “reset” command (or just opening a new terminal) might be helpful.

Of course, with an expressive enough ISA and machine code, it might be possible to write a program that also looks something like intelligible text. Indeed, any strings used in the program will appear somewhere among the machine code gibberish as well. But it’s very likely that you’d just see weird and nonsensical text.
What would happen if you loaded text as a program? If you load non-program data, like a text file, as if it were a program, the CPU will interpret that data as instructions and try to run them. Most likely, your “program” will just crash, but if you are supremely unlucky, it may do something else like enter an infinite loop, or even start erasing data from your hard drive. It just depends on how the text was encoded, and what those bit patterns mean when interpreted as machine instructions. Quite likely this will be a bunch of nonsense operations doing arithmetic and hopefully eventually either a lucky HALT instruction, or some combination of instructions when run in an operating system that makes the OS stop the program for bad behavior. But if the bits happen to line up as a bunch of SW instructions, for example, you could end up overwriting important values in RAM.
What’s the difference between a 32-bit and a 64-bit computer? These terms basically mean that the CPU, including the ALU and the register file, uses that many bits in each slot. Of course, if you want to add two 64-bit numbers on a 32-bit machine, it’s not impossible: you issue one add instruction for the low 32 bits of each number, issue instructions to recover the carry-out from that addition and store it somewhere, then add the high 32 bits of each number plus that carry-out (again capturing carry-out, if you even care about the final carry), and finally issue instructions to write both half-results back into memory next to each other so that they’re still one 64-bit number. Obviously, this is much slower and more painful than just doing a normal + operation, so we tend to end up using 32-bit integers on a 32-bit machine, and 64-bit integers on a 64-bit machine. Code that uses 64-bit integers can handle much, much higher values before overflow is an issue, and so it has a lot of advantages over code using 32-bit integers. Other parts of the computer (like RAM addresses) may also follow the bit width of the CPU, and there can be advantages there (with a 32-bit address space, you can address about 4 billion bytes, so about 4 gigabytes. But many modern computers have 8, 16, or even more gigabytes of RAM, which may require a 64-bit address space, or workarounds that slow down 32-bit computers even more).
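Here’s a rough C sketch of that multi-step approach, adding two 64-bit numbers using only 32-bit pieces (the variable names are made up; real compilers generate something along these lines for you automatically):

#include <stdint.h>

// Each 64-bit number is stored as a low half and a high half.
void add64(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi,
           uint32_t *sum_lo, uint32_t *sum_hi) {
    uint32_t lo = a_lo + b_lo;      // add the low halves first
    uint32_t carry = (lo < a_lo);   // recover the carry-out from that addition
    *sum_lo = lo;
    *sum_hi = a_hi + b_hi + carry;  // add the high halves plus the carry
}
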
What are the different connections an ALU could have with a RAM?

The question in the lab was mostly to provoke you to think a bit more about what kinds of connections could be made and how pieces of a CPU might fit together, before you actually look at the (mostly) full CPU circuit diagram later in the lab. In that diagram, the ALU takes inputs from the register file (or possibly Instruction Memory directly for the offset part of a memory address) and sends output both to a register file and to the address input of Data Memory (another RAM). However, note that the data memory output goes into the register file, and the data memory input comes from a register file. Since the ALU can take input from register files and write output to them, it’s indirectly hooked up to the data input and output of the data memory, since values can go into a register for a cycle and in the next cycle go into the ALU, or vice versa.

On why the ALU is used as an address input in the CPU: The HW ISA states that for LW and SW, the memory address is going to be “R[s] + offset.” For the CPU to implement this, it needs an adder circuit. Instead of adding a dedicated adder for memory address computations, it simply re-uses the existing ALU, which would otherwise not be involved in the LW and SW instructions at all. It routes the last 4 instruction bits which are serving as the offset into the ALU, along with the R[s] result from the register file, and sets the operation bits for the ALU to addition. Then the ALU result is hooked into the data memory address input.
How are the “sign extend” and “shift left” for the BEQ computation implemented? The BEQ “meaning” column states: “PC <- PC + 2 + offset * 2” if the registers are equal. So the circuit needs to take offset, multiply it by 2, and then add it to the normal PC result. The left-shift by 1 bit is a multiply-by-2, and the sign extend ensures that the 4-bit offset value (which is signed 2’s-complement) can be turned into a 16-bit value to be added to the 16-bit PC value. Where are these components in the LogicWorks circuit? They actually aren’t circuit components, but are instead achieved by clever wiring of the offset wires where they connect to the adder input in the BEQ mechanism: the wires are shifted up by one spot, with an extra ground connection below them for the 0, and then the 4th wire is connected to ALL of the inputs above the 5th input as well as the 5th input itself, so that you get a “sign extend” effect. It’s very efficient that we can perform these two operations using just clever wiring and without any transistors involved, but they still need to show up as boxes in the circuit diagram so that you can understand what operations are being performed.
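If you’d rather see the arithmetic than the wiring, here is the same computation written as C (a sketch; in the actual circuit this is done purely by how the wires are connected, as described above):

// offset4 holds the 4-bit offset field from the instruction (a value from 0 to 15).
int beq_target(int pc, int offset4) {
    int offset = (offset4 & 0x8) ? offset4 - 16 : offset4;  // sign extend: treat the 4 bits as signed
    return pc + 2 + offset * 2;                             // shift left by 1 = multiply by 2
}
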
How do we know how to connect wires in LogicWorks based on a circuit diagram?

When translating a schematic diagram (like the CPU diagrams we show in various places) into LogicWorks or the Falstad Simulator, you need to first identify which component in the simulator corresponds to which part of the diagram (or what to search for and add to your diagram to get the right component). This is complicated by the fact that some components in the schematic diagram, like a sign extend or a shift, are created just using alternate wiring connections without there being an actual component.

Next, try to make the same connections as are present in the diagram, using buses and breakouts to route multiple wires when the diagram has a slash with a number above it indicating a multi-wire line. Use “show values” in LogicWorks to help double-check that things are connected and that their behavior is correct: you can try to verify each chip one at a time for a simple example where you know what each value should be. (If you can’t put together a simple example and deduce the values for each part of the circuit diagram, it’s best to back up and work that out first, before attempting to simulate the circuit.)
What’s the difference between the HW Instruction Set Architecture (ISA) and a “real” assembly language?

The HW ISA is intentionally kept very simple, so that we can teach it to you in 5 weeks and you can understand how the hardware side of things implements the software specification. In the next part of the course, we’ll be reading and writing programs in Intel x86 assembly; other “popular” (at least, among hardware hacking people I follow on social media) assembly languages include ARM (a much simpler assembly language with a different design philosophy, found in some Macs and many embedded devices like TVs) and 6502 (an older CPU popular among retro tech enthusiasts, used in systems like the Apple II and the original Nintendo Entertainment System). Each will implement a lot of the same basic instructions as the HW ISA, just with more variations. They might also use more than 16 bits, and some have instructions that are different sizes.

The basic operations of the HW ISA are the same operations we’ll use for about 70-80% of our assembly programming, just with slightly different syntax. The most common other instructions we’ll use will be specialized jump instructions for function calls and returns, and some specialized memory access instructions for dealing with the stack (which we’ll be learning about soon).

We will NOT ask you to learn anything about the hardware implementations for these other ISAs: we’re going to draw a firm abstraction boundary there since otherwise we could spend the entire class getting into the details of modern hardware design.
What’s the difference between an ISA (Instruction Set Architecture) and a Microarchitecture? An Instruction Set Architecture is a specification that says “here’s the meaning of each of these assembly instructions, and here’s how it will be encoded in binary.” It’s possible to build multiple different CPUs that all implement the same ISA, and one might even be faster or slower than another (or consume less power, or generate less heat, etc.). The Microarchitecture is a specific hardware design (and/or actual physical chip) which implements a particular ISA. So the specific layout of the CPU circuit diagrams we’ve shown you is a microarchitecture, while the “HW ISA” slide is pretty much the full specification for the associated ISA.
What does the “R” in RAM mean? The “R” in “RAM” stands for “Random” which may feel a bit weird, given that its operation is entirely deterministic. But it’s a historical artifact: prior to RAM there was “sequential access memory” where values were always read out one after another, with no address input. You’d either set up your code to take advantage of this, or if you couldn’t, you’d have to wait and let values cycle around again to the spot you needed, which was inefficient. RAM is much more expensive to build in terms of transistors, but the efficiency is typically well worth the investment.
What does the LW instruction do? The LW instruction stands for “Load Word” and its purpose is to take data stored in memory and load it into a register. This is necessary because the HW ISA does not contain any instructions for performing arithmetic on values stored in memory: only values stored in registers can be added, ANDed, etc. So a typical program will need to load values from RAM into registers, perform whatever computation it needs to do with them, and then store them back into RAM so that it can use the registers for the next batch of operations.
How does the Program Counter (PC) get updated? The Program Counter has three update mechanisms: the normal PC <- PC + 2 advance, the BEQ mechanism, and the JMP mechanism. A series of multiplexers controlled by signals from the control unit and the ALU’s zero output select whichever route is appropriate for the current instruction. By the end of the clock cycle, as the falling edge approaches, the correct next PC value will have propagated through these mechanisms, and be waiting on the input side of the PC register. During that clock cycle, right from the beginning, the current PC value will have been used as the Instruction Memory address input, meaning that the corresponding instruction will be sent out from the instruction memory data output, which feeds an opcode into the control unit and which feeds addresses and/or offsets into the register file and/or ALU. When the falling edge of the clock arrives, the PC’s D flip-flops will each accept their current input as their new value, and a new instruction will begin to execute based on the “next PC” result from the previous instruction. Once again, at some point during that cycle, the PC update mechanisms will place yet another new PC value at the PC input side, ready for the next swap.
Why does the HW ISA use a 16-bit word size in RAM?

The choice of 16 bits as the word size in RAM is based on the fact that the HW ISA specifies a 16-bit word size for each instruction. If RAM used a different number of bits per word, you’d have to load either multiple words per instruction or only part of a word per instruction, vastly complicating the design either way. But you can easily build RAMs with any word size you want: you just need that many data input and output wires, and then you need to group the internal D flip-flops together in groups of that size.

This choice of 16 bits affects the rest of the CPU as well: the register file has 16-bit registers so that we can load or store one word at a time. This also needs to match up to the ALU, so that it can operate on two register file values and send back a result that fits into a register file. In each case, there are 16 wires that connect from one chip to another. There are other systems which organize data on a wire as pulses over time which might need some kind of synchronization mechanism to tell the next chip “I’m done now” but using parallel wires like our designs do, the number of bits is built into the physical design of the chip.
How is the number of address bits related to the total size of a RAM chip?

The number of address bits for a RAM chip places an absolute limit on the number of data items it can contain, since with N address bits, there are 2^N different possible bit patterns, so you can’t distinguish any more than that number of data items. This is also the formula for deducing either the number of address bits from the RAM size, or the RAM size from the number of address bits. Almost without exception, a RAM will have exactly that many data items, although in principle it could have fewer. In the simplest setup, each “data item” is a “word” (16 bits for the HW ISA), and so the total size of the RAM is the word size times the number of distinct addresses.

In the HW Arch, we actually have a more complicated case, where the RAM is “byte addressable” meaning that each set of 8 bits has its own address, even though we read and write 16 bits at a time. There are both advantages and disadvantages to this approach that I won’t go into detail on, except to mention that when representing English text, 8 bits is a common amount of data used to represent a single letter. The effect that this has on RAM size is that the full size of the RAM is now the addressable size times the number of addresses, rather than being the word size times the number of addresses. In our RAM, with 8-bit addressing and 16-bit words, this means that the size of RAM is only half as large as it would be if it were strictly word-addressable, but we have more flexibility in being able to read starting in the middle of a word.
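As a quick worked example using the HW Arch numbers: 8 address bits gives 2^8 = 256 distinct addresses; since the memory is byte addressable, each address names one byte, so the RAM holds 256 bytes, which is 128 of our 16-bit words. If those same 8 address bits were used in a word-addressable design instead, each address would name a full 16-bit word, and the RAM would hold 256 × 2 = 512 bytes.
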
How can I practice drawing circuit diagrams?

There are multiple things you might be asked to draw a diagram of: you might be asked to draw a diagram based on a truth table, a boolean algebra expression, and/or an English description of how some components fit together. For the first two, you can convert between truth tables and boolean algebra expressions. I think if you’re given a truth table, converting to an expression is best because the expression can be directly translated into AND, OR, and NOT gates: Just use order-of-operations to figure out which operations happen in what order, and draw a gate for each one.

For an English description, just try to follow the description as best you can.

To practice drawing from a description, you can go to any of the diagrams in the lab pages and right-click to access the alt text (your browser should have some means of doing this). The alt text will be a description of how that circuit is connected, so you can try drawing from that description without looking at the image, and then compare your result with the image when you’re done (do let me know if you find an inconsistency!).

To practice with drawing circuits from boolean algebra expressions, you can do the following: Draw out a 4-row truth table with two inputs A and B and an output Q. Fill in all four input possibilities, and then randomly put ones and zeros in the output column. Then convert to a boolean algebra expression and try to draw the circuit. You can build it in the Falstad simulator according to your drawing, and then use the simulator to test against your truth table to see if you got it right. You can also do this with 3 inputs (and so 8 rows), which will be a bit better practice.
What does the JMP instruction do?

The JMP instruction lets us move from one part of the code to another. The BEQ instruction can also do this, but the BEQ instruction has a limited range, because the offset is a 4-bit signed value, so it can jump at most 8 instructions backwards (or 6, if you don’t count BEQ and the instruction after it) or 7 instructions forwards (8 if you count the instruction after BEQ). So JMP allows us to jump to a very different part of the code, since it can accept a 12-bit offset, and that offset is absolute, rather than relative. Remember that we always multiply the offset by 2, to ensure that the resulting PC address is even, because each instruction word (16 bits) takes up two address slots (8 bits each), so the address of the start of each instruction is a multiple of 2.

We’ll see some other related instructions that are fancier in the next part of the course, but key uses of the JMP and related instructions are to do function calls and returns: If you think of each individual function being written as assembly code somewhere in the instruction memory, a JMP can take us to the start of one of those blocks to call the function, and (with a bit of an extra twist) a JMP at the end of that function can serve as a return to go back to where the function was called from. The twist is that the same function needs to return to multiple different locations, based on where it was called from, so we can’t just use a normal JMP instruction (and the HW ISA can’t actually be used to implement full function calls like you’re familiar with from Python or Java).
How can we make sure that the program will halt instead of going into an infinite loop?

With BEQ and JMP instructions, it’s possible to create infinite loops. Of course, sometimes that’s what you want, like a program that’s always waiting to respond to input and only stops when you shut off the computer.

But to ensure that it does halt, you need to ensure that the sequence of BEQ and JMP instructions will always eventually lead to a HALT instruction. As it turns out, in the abstract this is a famously hard theoretical problem (the Halting Problem): if you are given a program supplied by an untrusted source, there is provably no general procedure that can always decide whether or not it will eventually halt.

CS 235 “Theory of Computation” goes much deeper into this question than we will in this class.
Why is the register file faster than memory?

I don’t know all the specifics of the answer to this question, but some reasons include:

  1. We ignore this detail in our circuits in this class, but a 1 or 0 indicated on a single wire may be insufficient to actually trigger all of the places that wire connects to if the wire connects to many different places. If you think of the driving end of a wire as needing to “do some work” to assert either a 0 or a 1, and each receiving end as using up some of that work to register the value, then if there’s one driver and thousands of receivers, the work of the driver might not be sufficient to trigger each receiver as it is supposed to. This can be fixed with extra repeater gates, but…

  2. We mostly ignore this detail in this class, but each gate in a circuit (really, each transistor) takes some time to operate. And the next gate in the circuit won’t be receiving valid inputs until the previous gate settles, so in a circuit with hundreds or thousands of gates, this propagation delay makes the whole circuit take more time to reach a valid state. This is called “gate delay.” A register file is small, and thus has a limited gate delay cost. A much larger RAM doesn’t necessarily have a longer longest-single-path (which is what matters for gate delay), but in practice, because of the repeaters from point 1, it may have a much higher delay.

  3. A third consideration (which again we’ll ignore in this class) is that while the register file is built directly into the CPU, the RAM chip is actually a separate circuit plugged into the motherboard (this is why you can add more RAM to your computer by plugging in a new module). There is some delay related to the communication between the two chips, which happens over a memory bus that also has other things using it.

There are more issues, like the difference between SRAM and cheaper-but-slower DRAM, which I don’t fully know the details of. You should now be equipped to understand a lot of them though if you want to dig deeper on Wikipedia and/or in our textbooks.

In practice, modern CPUs use multiple layers of “caches” which are separate RAM chips that are closer to and further away from the CPU, usually with smaller sizes and more expensive/faster construction closer to the CPU. These act as temporary storage for data from the main RAM, in (common) cases where you need to access the same memory over and over again, so that you don’t have to go all the way to main RAM each time. How these work is pretty complex, and we won’t cover it in too much depth in this course, but it is often relevant to the question of “why does my program execute so slowly compared to that other program.”
What is the big picture in terms of how things are connected in the CPU? The full CPU schematic diagram from the lecture slides is pretty much the big picture in terms of how things are connected: The PC connects to the instruction memory, which feeds instructions to the control unit and the register file, the register file outputs go into the ALU which may feed an address to data memory, and the ALU result and/or RAM data output feeds back into the register file. On a separate path, the BEQ and/or JMP logic modifies the PC for those instructions. The control unit feeds control signals to each of these systems to get them to perform their part of the current operation.
How are numbers represented using voltage patterns? The registers in RAM and in the register file store voltage patterns: they have a series of 16 D flip-flops, and those D flip-flops are each remembering a value in the sense that their output remains at +5V or ground, depending on their previous inputs. These can be updated by pulsing their clock lines as necessary. How do those voltage patterns (a series of wires either charged to +5V or discharged to 0V) represent numbers? We treat each wire as a single bit, with high voltage (above +2.5V) meaning 1, and low voltage (below ~2V) meaning 0. Then we read those 16 1’s and 0’s in sequence to produce a binary number, and use the conventions of binary notation to translate that into a familiar decimal number. So for example, if all of the latches have low-voltage output except the 3rd, that’s the number 0000 0000 0000 0100, which in decimal is a 4. The circuitry is hard-wired to interpret some of those numbers (so 4 might select register R4 in the register file, because the register file contains a decoder that makes that happen). In other cases, we use symbol tables to translate those binary values into shapes on a monitor with the corresponding meaning. Ultimately though, the form that the data takes is these electrical charges on wires. So for example, if we disrupt those by introducing a stronger electric charge or in some cases a magnetic field, we will corrupt the data.
Does a computer usually have just one CPU? Most modern CPUs have what are called “multiple cores” which effectively means several mini-CPUs bundled together, and these cores execute things in parallel to each other (most often, they’re simply working on different programs, but sometimes a single program can be using more than one core). In addition, modern chips in the Intel x86 line (the descendants of the 386) load more than one instruction at a time, and execute these instructions in parallel using multiple ALUs. This is extremely tricky, since this means that the instructions you wrote are NOT being executed in order, and the CPU is responsible for ensuring that everything will appear to have happened in order at the end of the day. Look up “pipeline stall” if you want to dive deep down that rabbit hole. For this class, we’ll just pretend we’ve got a single CPU executing one instruction at a time, and that it’s only working on the program we wrote.

Lab 6/7

How does modifying a value differ from modifying the dereferenced version of a pointer pointing to that value?

In code, we are comparing situations like this:

int x = 0;
x += 1;

vs. modifying by dereferencing a pointer, like:

int x = 0;
int *xAddr = &x;
*xAddr += 1;

As far as I know, there are no differences whatsoever between these two cases. The compiler should produce almost identical assembly code for each. One caveat is that the second code may end up forcing x to be stored on the stack, since we need its address. But that shouldn’t change the effects of the code at the C level.

Note that if the pointer type of the pointer doesn’t match the type of the variable, there may be differences. The following code uses a char* and so when we modify the dereferenced value, we modify only 1 byte (notice that a cast is necessary to create the differently-typed pointer):

int x = 0xffff;
char *xAddr = (char*) &x;
*xAddr = 0;
In this last example, the final value of x will be 0xff00 rather than just 0.
What are malloc and free used for? Why do we have to cast the result of malloc? What arguments do malloc and free take and what do they return?

malloc is short for “memory allocate” and it does just that: allocates (reserves for our use) some memory. To build a metaphor here, imagine you’re in a review session with a bunch of other students, and you each have a different problem to solve. You could just solve it in your head, but there are whiteboards in the room, and if solving it in your head means using registers only, there are times when there’s too much information and you need to write stuff down (i.e., use memory instead of a register). But with so many students (functions) in the same room, you need a way to share the whiteboard. While malloc uses the “heap”, it’s worth considering for a moment the other memory pool: the stack. Each function gets its own stack frame, so in our metaphor each student gets some reserved whiteboard space. However, after a student is done solving their problem, we erase that space so that other students can use it. Anything stored on the stack is theoretically invalid after the function returns. That’s where the heap comes in: In our metaphor, it’s an extra part of the board where students can write info from their solution that’s useful for other students. If you’re all solving related problems, sometimes part of the solution to one problem is useful in solving another. For this part of the board, not every student needs to use it, so we don’t allocate space for each student. Instead, when a student needs to use it, they can just walk up and draw a box to reserve some space, then put the information they want to share in that box. Later, if someone who needed that information decides it’s no longer necessary, they can erase the box, so that others can use that space. So this extra whiteboard space that anyone can use is called the heap, and it’s where we store values that need to be shared between functions.

malloc in our metaphor is walking up to the whiteboard and drawing a box. When we do it, we need to determine the amount of space we need, which is why we need to specify that when we call malloc. malloc returns a pointer to the allocated space; we usually want to cast that pointer to a specific type, based on what we plan to put in that space.

free in our metaphor is erasing the box so others can use it. If we forget to call free, the heap will fill up with junk values, and someone who needs to use space will end up unable to run because there’s no space left. This is called a “memory leak,” and valgrind is the tool that helps us detect memory leaks. The complicating factor is that since we use the heap to share information between functions, the function that calls free is often a different function than the one that calls malloc. This makes it extra easy to forget to call free when necessary, or to do something like accidentally calling free too early, before someone else comes along and tries to read the value we just erased. While in the whiteboard metaphor freeing actually erases the data, in the computer, freeing just marks the memory as fair game for re-use, and it may or may not be erased.

In general, since using malloc is complicated, if you can avoid it you should. But any time you need a value that’s bigger than a single return can carry to pass from one function to another, you’ll need to use it, or use other techniques that are almost as complex (like passing a pointer into the second function which it uses to write its answer into pre-allocated memory).

The specific usage of malloc is that you give it a number of bytes, and it returns a pointer to allocated space where that many bytes are reserved. Because it doesn’t know what you wanted the bytes for, it always returns a void* generic pointer. This is why you almost always cast the result to indicate what type it is, and use sizeof in the argument to compute the correct number of bytes for a certain number of values, like:

int *data = (int*) malloc(sizeof(int)*10);

For free, you just give it the pointer that malloc returned and it marks that memory as no-longer-in-use. In this example you’d do:

free(data);
How does C remember types? Does C allow us to modify the type of a declaration?

The compiler keeps tables that it fills in as it reads your code with the memory address (or register used) for each variable, as well as the types that they have. When producing assembly code, it uses these tables to pick which registers/addresses to use. It also uses the type information to pick which variants of assembly instructions to use, and whether to use full-size or part-size registers (e.g., using %eax for an int but %rax for a pointer). We won’t dig into that process in this class, but a compilers class would.

Although C does not allow you to change the type of a variable, you can always use a cast to create another variable that stores the same value. And creating a copy of a pointer with a different pointer type allows you to operate on the same data using two different variables, each of which treats it differently.
Why is the ptrdiff_t type useful?

When computing the result of subtracting two pointers, we get a value of type ptrdiff_t rather than the same type as the pointers. Why not just have char* - char* produce another char*?

One reason is that pointers are unsigned, since there’s no such thing as negative addresses. But when we subtract, we care about direction, not just difference, so the value is signed.

But why not just use int if we want a signed integer difference? The problem is that the int type is often 4 bytes, while a pointer might be 8 (this is the case for code we’ll write in this class). We’d like a signed integer type that is at least as big as a pointer itself, and that’s what ptrdiff_t is.
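Here’s a small example of the type in action (assuming 4-byte ints and 8-byte pointers, as in the code we write for this class; %td is printf’s format code for a ptrdiff_t):

#include <stddef.h>
#include <stdio.h>

int main(void) {
    int nums[5] = {10, 11, 12, 13, 14};
    int *a = &nums[1];
    int *b = &nums[4];
    ptrdiff_t forward = b - a;    // 3: measured in int-sized slots, not bytes
    ptrdiff_t backward = a - b;   // -3: this is why the result must be signed
    printf("%td %td\n", forward, backward);
    return 0;
}
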
How can we address a whole block of memory with a single number?

The short answer is: “we can’t.” Each address specifies a particular byte in memory. For anything bigger than a byte, we can’t really “point to” the whole thing, without using two numbers: one for the start, and another for the length (or one for the start, and another for the end). In practice, pointers themselves always point to the starting byte of whatever they’re trying to indicate. Then the compiler keeps meta-information about the size of that value, so that it knows how many bytes to access at once (e.g., 4 bytes together for an int).

In the case of arrays, the compiler isn’t helpful, so we need to keep around extra variables holding their lengths. Alternatively, as with strings, we can have a “sentinel” value that we place at the end, and if we want to know the length we just iterate until we find that value.
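For example, here is the sentinel approach for a string, which is essentially what the standard strlen function does (a sketch):

#include <stdio.h>

// Count characters until we hit the '\0' sentinel that ends every C string.
int my_length(const char *s) {
    int count = 0;
    while (s[count] != '\0') {
        count++;
    }
    return count;
}

int main(void) {
    printf("%d\n", my_length("hello"));  // prints 5
    return 0;
}
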
Why does writing past the end of an array (e.g., setting ptr[20] to 'B' after declaring char ptr[3];) cause a segmentation fault?

Segmentation faults can happen for several different reasons:

  1. Code tries to write data to a memory address that’s read-only.
  2. Code tries to read data from an unreadable address (like NULL).
  3. Code tries to execute instructions from an address that’s not marked as executable (e.g., trying to jump to a location that’s not part of the program’s code).
When we learn about the stack, we’ll see that return addresses are stored on the stack. In the particular example from lab, we set things up so that the address used to store the 'B' relative to where the array was stored on the stack coincided with that return address. By corrupting the return address, we cause the code to jump to the wrong place when it returns, and if the place it jumps to isn’t even part of the code any more, we get a segmentation fault.
When do we apply scaling to pointer operations?

The best way to think about this in my opinion is to consider all addition and subtraction operations with a pointer value to take place in units of “slots,” and then based on the size of the thing pointed to, that determines the size of a slot. So if we have:

int array[] = {10, 11, 12, 13, 14};
int *ptr = array + 2;

In this case, the pointer will point to the third slot in the array, or to the number 12. In raw address terms, that’s 8 bytes beyond the address of the base array, and that’s where pointer scaling comes in: the actual value of 2 was multiplied by the stride of 4 to get +8 for the raw byte address. However, when doing subtraction, we effectively apply scaling in reverse. If we look at numerical values, ptr is 8 more than array. But if we ask C to print out ptr - array the subtraction happens in terms of “array slots” and the result will be 2, not 8. The reason for this is that we’d like the following equation to be true: array + (ptr - array) == ptr. In C, that expression does the (ptr - array) calculation in terms of “int-sized slots” to get 2 as the value, so that when it adds 2 to array, which it also does in terms of “int-sized slots”, it gets the correct value for ptr.

The other tricky thing to remember: square brackets for indexing are also a form of addition. So &array[2] == ptr is true, because that’s the same as &*(array + 2) == ptr and the & and the * cancel out.
What happens if I assign a pointer to a local variable on the stack and then dereference that pointer after the function has returned?

When a function returns, we move the stack pointer back up above its stack frame, but as soon as another function gets called, the stack pointer moves back down, across the same memory region. The new function then will write its own data into that region, erasing the data written by the old function. At that point, anyone who uses a pointer to the old data will see the new data instead. For example:

int* readThree() {
    int results[3] = {0, 0, 0}; // local array: lives in this function's stack frame
    scanf("%d %d %d", &results[0], &results[1], &results[2]);
    int *r = results; // otherwise the compiler substitutes NULL for our return value
    return r;
}

int main(int argc, char** argv) {
    int* first3 = readThree(); // creates frame; then returns
    // first3 is technically invalid here, but the data is still there
    // ...until another function is called
    int* next3 = readThree(); // same frame position as first call
    // (printf itself doesn't use enough of the stack to pose a problem)
    printf("First 3 address: %p\n", first3);
    printf("Next 3 address: %p\n", next3);  // prints same address
    printf("First 3: %d, %d, %d\n", first3[0], first3[1], first3[2]);
    printf("Next 3: %d, %d, %d\n", next3[0], next3[1], next3[2]);
    // same numbers from second call are printed twice
    return 0;
}

If we were to run this and enter "1 2 3" the first time and "4 5 6" the second time, the prints would print out "4 5 6" twice, and it would show the same address for the two return values. Without the extra pointer variable, compiling it will also produce a compiler warning about returning a pointer to stuff allocated on the stack. When it prints that warning, our current compiler also sets the return value to NULL, because since returning a pointer to a stack variable is "undefined behavior" the compiler can do whatever it wants in that case. So if we run this program without the `r` variable, we'll just get a segmentation fault.
How big are the stack and the heap?

The stack usually has a fixed maximum size in the megabytes range. Given that most function call frames take a few dozen bytes, this is usually sufficient for thousands to hundreds of thousands of layered function calls, and since each function gives back space when it returns, the only case where you might be worried about the stack size limit is when using recursion and layering lots of stack frames. The heap on the other hand is usually only limited by the physical memory (and on 32-bit systems by the address space size). A 32-bit system is usually limited to ~3GB of heap, while a 64-bit system can use up to all of the physical RAM available (until we develop systems with exabytes of RAM…).

There’s actually a virtualization layer that shares the real physical memory among the different programs on your computer. So if you have 8GB of RAM and Firefox is using 4GB, another program you run may encounter an out-of-memory error if it tries to allocate 6GB.
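
In a C program, that failure typically shows up as malloc returning NULL; a minimal sketch of checking for it (assuming a 64-bit system, where this size fits in an unsigned long):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Ask for 6GB; if the system can't provide it, malloc may return NULL.
    char *big = malloc(6UL * 1024 * 1024 * 1024);
    if (big == NULL) {
        printf("Couldn't allocate 6GB\n");
        return 1;
    }
    free(big);
    return 0;
}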

See this StackOverflow answer for more details.
Why do we have to allocate memory? Why don’t we do this in other languages like Java or Python?

At the most basic level, if we want to use memory to store things, we need to allocate it, so that two different functions or programs don’t try to use the same memory for different purposes. The stack conventions provide an easy way to allocate memory for a particular function call: Just declare a local variable. However, that’s not good enough if we need to create values that will survive beyond the return from a function, like strings, arrays, or other objects that need to be passed around between functions. In those cases, we use malloc to ask for an allocation that lasts until we explicitly release it, and when we’re done with that memory, we use free to release it so that someone else can use that memory later.

In languages like Python, Java, or Javascript, we have a garbage collector and an automatic memory allocation system. Allocation is handled for us automatically, and memory gets freed when the runtime spends some time on “garbage collection” now and then. Those languages are in many cases 2-100 times slower than equivalent C code (with garbage collection being one part of that). Now, most of the time, we simply don’t care, and it’s better to use the garbage-collected language, because there are all sorts of errors that we can make when allocating memory manually, and many programs spend most of their time waiting for user input or network data, not running their own code (so the speedup doesn’t matter).

For some real-time applications, like videogames, performance actually does matter, and these are more typically written in C++ or another low-level language. Always try to judge whether a performance improvement actually impacts the user experience before reaching for the “fastest” language.

That said, understanding how memory is allocated by having to do it manually is super useful in understanding various things about the performance of automatic allocation systems. For example, knowing how strings are allocated can help you understand why certain Python programs are faster than others. Consider these two functions:

def f(n):
    return 'a'*n  # string multiply can allocate once

def g(n):
    result = ''
    for i in range(n):
        result += 'a'  # needs to copy + reallocate each time
    return result
Without understanding how memory is allocated, it might seem that each should take time proportional to n, even if one is a bit faster or slower than the other. If we understand how strings are represented in memory, we can see that the second function, each time it adds an ‘a’, has to copy the entire result string to a new memory location that has room for one extra ‘a’. Since it does that in each iteration of the loop, the actual performance of g is proportional to n squared!
Is using pointers more efficient than using array indexing?

No. They will become exactly the same assembly code when compiled, so there’s absolutely no difference in terms of what the CPU will do.

We’re having you use pointers instead of indexing in order to give you more practice using pointers, so that you can actually understand them.
Can we implement more complicated data structures like heaps or sets in C using just pointers?

Yes. No matter what language you’re working with, ultimately memory is used to hold data, and any pattern of memory access can be written up using pointers.

Sets are somewhat complex, but there’s actually a very nice and simple heap and/or binary tree data structure that uses a linear array of memory and which can be implemented using pointers quite easily.
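
As a sketch of that idea (the function names here are just for illustration, not from any assignment): in an array-based binary tree, the children of the element at index i live at indexes 2*i + 1 and 2*i + 2, and we can navigate with pointer arithmetic:

// Navigate a complete binary tree stored in a flat array.
int *left_child(int *base, int *node)  { return base + 2 * (node - base) + 1; }
int *right_child(int *base, int *node) { return base + 2 * (node - base) + 2; }
// Only meaningful for nodes other than the root (index 0).
int *parent(int *base, int *node)      { return base + (node - base - 1) / 2; }

A heap’s “bubble up” and “bubble down” operations then just follow these pointers up or down the tree.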
What is a segmentation fault?

The computer’s memory is divided up into different “segments.” One of these holds the actual machine instructions to be executed, another holds the heap, another holds the stack, and other segments provide space for things like system calls. Each segment has certain permissions: can we read from it, can we write to it, can we execute code from it? A segmentation fault happens when we attempt to do something that’s forbidden, like write to a read-only segment. Most often, we will see:

  1. Attempts to read from or write to the base segment, which is a small area of memory at the very low end of the address space that’s neither readable nor writeable. This segment exists to protect us from accidentally using a number value as if it were a pointer. If we say something like:

    int x = 3;
    int y = *(int*)x; // uses the number value 3 as if it were an address

    This would attempt to read memory at address 3. You’ll get a segmentation fault because of that protected base segment. Most commonly, this will actually happen because you are using NULL as a default value for a pointer and accidentally try to use the pointer before it has received its correct value.

  2. Attempts to execute code from the stack or heap. The stack is marked as non-executable, so trying to interpret data stored there as machine instructions will be a segfault. This usually happens when return addresses stored on the stack get corrupted, which shouldn’t happen much in practice, but which will happen in this class because we’ll be teaching you about how to corrupt the stack.

Why do pointers have to know what kind of value they point to? Aren’t they all just memory addresses? Yes, as values, they’re just unsigned integers (64-bit on a 64-bit machine, or 32-bit on a 32-bit machine). However, the compiler treats them differently when adding or subtracting, according to the kind of value they point to. This is a good thing as it makes array formulas work smoothly, but it does require us to distinguish the different pointer types. These types also affect how much memory is read when a pointer is dereferenced, even if we’re not using arrays. If I have a “pointer to an integer” and I want to read “what it points to” I need to read 4 bytes. However, if I had a “pointer to a character” with the same address value, I’d only read 1 byte when reading “what it points to.”
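
A short sketch of both effects, assuming 4-byte ints:

int x = 0x12ab34cd;
int *ip = &x;
char *cp = (char*)&x;  // same address value, different pointer type

int a = *ip;    // dereferencing an int* reads 4 bytes
char b = *cp;   // dereferencing a char* reads only 1 byte
// ip + 1 points 4 bytes past &x, while cp + 1 points only 1 byte past &x.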
How can you tell the difference between the * type modifier and the * operator? The key is whether it’s part of a type declaration for a new variable. Casts (which also use the type declaration syntax) are always in parentheses, and the unary * operator can never be the last thing before a closing parenthesis. It’s extra confusing because the * “operator” can appear on the left-hand side of an equals sign, which is not usually a place that an operator can appear, and type declarations can also appear there. But a * for a type declaration will be directly after a type word, or directly before a variable name. Unless you’re declaring multiple variables on one line, just look to see if there’s a type name before the *. With multiple variable declarations (or even a single declaration), you cannot also include a * operator on the left-hand side of the equals sign.
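
For example:

int x = 3;
int *p = &x;   // this * is part of the type: p is declared as a pointer to int
*p = 5;        // this * is the operator: write 5 into whatever p points to
int y = *p;    // operator again: read what p points to, so y gets 5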
What are “little-endian” and “big-endian?”

When storing an integer, which is 4 bytes, we have to decide how to store those bytes in memory. Let’s say we want to store the value 0x12ab34cd at address 0x40000. Do we store 0x12 at address 0x40000, 0xab at address 0x40001, 0x34 at address 0x40002, and 0xcd at address 0x40003? Or do we do the opposite, and store 0x12 at address 0x40003, 0xab at 0x40002, 0x34 at 0x40001, and 0xcd at 0x40000? This decision doesn’t really make much difference in terms of performance: either way we’ll load all 4 bytes at once; either way we’ll do things like carry between them in an adder, just in a different order. However, if two different systems make different choices, they won’t be able to directly transfer data to each other in binary: They’ll have to convert that data to swap around the bytes first. So even though within a particular design it doesn’t have big performance implications, differences between designs do.
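
One way to see which choice a particular machine makes is to look at an int one byte at a time through a char pointer; a small sketch:

#include <stdio.h>

int main(void) {
    unsigned int value = 0x12ab34cd;
    unsigned char *bytes = (unsigned char*)&value;
    for (int i = 0; i < 4; i++) {
        printf("%02x ", bytes[i]);  // a little-endian machine prints: cd 34 ab 12
    }
    printf("\n");
    return 0;
}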

The word “endian” comes from Gulliver’s Travels, in which there is a society deeply divided about whether eggs should be cracked from the little end or the big end. In other words: it’s a reference to making a big deal out of an ultimately unimportant distinction. There aren’t really good arguments for why one way is “better” than another, but people are still sharply divided into two camps about it.

Remembering which one is which is hard, but a mnemonic that you can try is that “little endian” means that the “little” part (i.e., the least significant byte, which is 0xcd in our example above) is stored at the “end” (i.e., the lowest address). Equivalently, you can remember that the end of the number (still 0xcd) is stored at the little address. Big endian is the opposite: the end of the number (0xcd) is stored in the byte with the biggest address, or equivalently, the biggest (most significant) part of the number (0x12 in our example) is stored at the end (at the smallest address).

The reason that we draw our memory diagrams with high addresses at the top, and label the bytes within each slot from right to left with increasing addresses, is that both conventions help us make sense of a little-endian system, which is what we’re using.
Why does C use a sentinel value for strings, but require a separate variable to store the lengths of arrays? Why not just use one mechanism or the other for both?

I don’t actually know the history behind why different choices were made in these cases, but there are certainly tradeoffs between them. Note that if you want, you can create a character array that isn’t NUL-terminated and store its length separately, although this may cause headaches if others don’t realize what’s going on.

For sentinel values, not having a separate length variable simplifies the code that deals with them. However, you pay a cost: you need to allocate one extra byte per string, and strings are no longer allowed to contain an all-0 byte as part of their data. That second cost is not really acceptable for arbitrary arrays, since not being able to store the number 0 in an integer array would be pretty silly. Of course, you could pick another data value as the sentinel, but things get complicated if each different kind of array needed to use a different kind of sentinel value. Also, because we often do arithmetic on numbers (but rarely on characters) the chance of accidentally producing the sentinel value would be greater for arrays of integers.

In contrast, while arrays do require more complicated programming to pass around extra variables, they don’t come with any space overhead or value restrictions. Especially if you have lots of very small arrays, avoiding the space overhead can be meaningful (imagine 1,000,000 2-item arrays representing x, y coordinates).
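
In code, the two conventions look something like this (a sketch; the function names are just for illustration):

// Sentinel convention: walk the string until the NUL terminator.
int string_length(const char *s) {
    int n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

// Separate-length convention: the caller passes the array length along.
int sum_array(const int *arr, int len) {
    int total = 0;
    for (int i = 0; i < len; i++) {
        total += arr[i];
    }
    return total;
}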
How do gdb and valgrind help in debugging C programs?

gdb is a debugger, so it lets us set breakpoints and get lots of information about what’s going on. If your program crashes, running it with gdb can let you figure out exactly what was going on when it crashed, and also inspect the variable values at that moment that might have led to the crash.

valgrind double-checks all of your memory usage, in particular malloc and free. It will report errors when you make mistakes, and it tells you a good amount of detail about what happened and which parts of the code are involved. Since memory mistakes are incredibly easy to make, but often only cause problems a small percentage of the time, valgrind is pretty important!

Lab 8

What’s the difference between info reg, print, and examine (or x) in gdb? info reg tells you the value of a register. You only need to give it the register name, like info reg rdx. print can print many kinds of values, from C variables to register values. You need to prefix a register name with $ for print because it’s not designed to work with only registers (you might have a C variable that shares the name of a register, for example). In contrast to info reg and print, examine interprets its argument as a memory address, and shows you what’s in memory there. It accepts a wide range of expressions just like print does, including C variable names, so to use it with a register, use $ followed by the register name, as in x $rdx. For x you usually want to have some idea of what kind of values will be in memory so that you can tell it how to format them. For example, x /s $rdx will interpret the data as a string, translating bytes into characters via the ASCII table and printing those characters until it hits a NUL byte. On the other hand, x /2xg $rdx will print two “giant” (8-byte) values as hexadecimal numbers, starting at the target address. Use help x within gdb to get details on the formatting options.
How do you see what’s on the stack in gdb?

Lab 9 has several examples of this, but the basic idea is: the stack is in memory, so we can use examine (or just x) to view that memory. Of course, we have to know where in memory we want to look, but since %rsp stores the stack pointer, we can use $rsp as our target address to see values starting from the bottom of the stack.

The other issue is interpreting what you’re seeing. Using x /20gx $rsp will show you 20 8-byte slots on the stack, and many things stored on the stack are 8 bytes (any pointer, including return addresses, will be). But other things like integers or characters might be less than 8 bytes, so they’ll be jumbled together as one big hex number. You can practice reading out those details from looking at hex numbers (although just from the number you can never be 100% sure), or you can consult the assembly code to figure out what values it stores on the stack in what order, and thus know what to expect. For example, if you know that 4 integers are stored on the stack starting 8 bytes above the stack pointer, you could do x /4wd $rsp+8 to print them out in decimal format. Use help x in gdb to see the details of the formatting commands.
How can we protect vulnerable information like a password from being stolen by someone who has access to our assembly code? Can hackers take over any program they want by disassembling code?

There are two main avenues of approach here (although a computer security class can go into a lot more detail). The first one is instead of giving people code, run the code on your own computer and sell access to a web interface where they can send messages to the code and get messages back. Most modern apps are trending this way, and almost anything with a login these days involves at least one server besides the computer or phone actually using the app. This is complicated, but you get a lot more control over things. Sadly, this includes the ability to do a lot of nefarious stuff, like sell off the user’s data to make some extra profit, or force them to enable extra “features” they weren’t expecting and didn’t want.

The second approach is to use encryption to protect key information, even though you’re giving out a program people can download. For something like a password, there are encryption routines that can compute a value called a “hash” that’s based on the password, using a function that takes only a few seconds to run, but where running the reverse function to compute the password based on the hash would take millions of years. The math behind this is pretty cool (and out-of-scope for this class) but as one example, consider factoring a 100-digit number that was created by multiplying together 3 prime numbers, one of which was 50 digits long. If I tell you what the numbers are, it’s easy to multiply them and verify that they do indeed multiply to the correct value. But if you use a disassembler to look at the code and see that the correct password involves entering three numbers that are each prime and which multiply together to create the stored target value, you’re out of luck: actually finding what those numbers are requires searching through so many different possibilities it will take effectively forever to do it.

Now, there are still some attacks against this method, typically involving making smart guesses about what the factors could be based on knowledge of what kinds of passwords people typically use. And of course, if the password is trying to prevent users from executing some code and the code itself isn’t properly encrypted, you could edit the assembly code to jump directly to the code you want to run and skip the password check, so getting it right is pretty tricky.
How do you understand which lines of assembly code and/or registers are used as part of a loop?

One of the first things to do when looking at a function in assembly code is to map out the jumps by identifying which ones go where, especially those that are internal to the function rather than jumps to call other code (although those are important too). From that, you’ll be able to identify loops because they involve one or more jump instructions where given the right condition results, we can jump in a loop from one instruction back up and eventually down again to the same instruction. In the simplest case, there’s just one conditional jump that goes upwards in the code, and the code will then flow back down to that jump until eventually its condition becomes false.

Once you’ve identified any loops in the code, look at the comparisons (typically cmp or test instructions) right before the jumps to figure out which registers and/or memory addresses are used to determine when they stop or continue. Remember that especially when test is used, the literal name of the jump instruction often doesn’t totally apply. Focus on which flags the comparison sets, and which flags the particular jump instruction tests (for example, je tests the zero flag, so when following a test, it’s testing whether the result was zero, NOT whether two things were equal).
If we have access to the full source code, what are the benefits of looking at the assembly code?

Typically, looking at the source code is much better in this case, and I wouldn’t recommend looking at the assembly unless it’s absolutely necessary. If there’s a bug in your compiler, it may be necessary since the assembly might not faithfully implement the source code in that case. Hope that you never run into such a bug: they’re very nasty since you typically try to rule out every possibility involving errors in your own code first, which is exhausting. Also, unless you’re very unlucky, the compiler bug will end up being fixed in a newer version of the compiler that’s already available, so simply upgrading your version will solve things. Discovering a new compiler bug is even more fun :)

Besides situations where the compiler’s buggy, the only other advantage of looking at assembly over source code I can think of is if you’re programming a performance-critical application. In some cases where one particular function runs many times inside a loop, hand-optimizing the assembly code could be enough to improve runtimes by 25-50%. Just be aware that oftentimes working at the source code level to improve things like cache friendliness can be just as important, if not more so, so hand-optimizing assembly code should only very rarely be the tool you reach for to improve performance.

That said, there are a lot of situations where you don’t have the source code: most programs you run are produced by private companies that keep their source code private as one means of protecting their secrets. A few examples of practical places you might want to use a disassembler:

  1. If you’re working to understand and counteract a new computer virus, reading the assembly code of the virus is necessary since the hacker producing it isn’t likely to give you the source code.
  2. If you’ve just found a new method of exploiting or taking over programs and you want to check whether common programs are vulnerable to it, you may need to disassemble those programs to check.
  3. If you’re modding a videogame, you won’t have access to source code, but you need to figure out where key data and functions are so that you can change what they do. Some games might come with a programming API or even be open-source, but that’s not the norm.
Is there a list of commonly used assembly instructions? Yes. This memory diagram sheet also includes a list of common instructions. You can find the link on our tools page, along with a few other references. The Compiler Explorer also lets you right-click on any instruction to look up the reference for it.
What does cmp (or cmpb, etc.) do?

The cmp instruction is short for “compare” but it doesn’t actually do that. It effectively does that in combination with one of the jump instructions, however. What it does do is subtract the two operands, throw away the result, but set flags based on that result. The flags are things like Zero, Sign, Overflow, etc. which we saw in the HW ISA. Since cmp sets the flags and instructions like je or jg check them, they work together. However, instructions like add that store results in registers also set flags, so the cmp instruction typically needs to be right before the jump instruction.

Since cmp subtracts and je checks the zero flag (jumping if it’s set), together cmp followed by je will jump if the things that were compared are equal. But je following a different instruction may have a different meaning. The test instruction works like cmp, except the operation it performs is bitwise AND (still setting flags based on the result of that operation). This means that jump instructions following test will have different meanings than usual.

Variants of cmp can be used to compare different sizes of data. For example, cmpb would compare a single byte. You can also see this reflected in which register names are used, since partial register names will be used when dealing with smaller data.
How can we identify how many arguments a function uses by looking at its assembly code? Which registers are used to store function arguments?

We know that certain registers are used for specific arguments: the mnemonic “Diane’s silk dress costs $89.” helps name %rdi, %rsi, %rdx, %rcx, %r8, and %r9 as the registers for the first 6 arguments (with further arguments getting pushed on the stack, which complicates things).

So if a function has 2 arguments, you’ll see %rdi and %rsi get used in the assembly code, but not %rdx. This is complicated by the fact that if a function simply passes arguments on to another function without using them for any other purpose, and it passes them on in the same order it received them at the same positions, it can directly make the calls in the assembly code without ever mentioning the registers. We’re unlikely to ask you to deduce the number of arguments from assembly code in such a tricky case.
What tools have we covered for debugging C programs, and which ones should we use for what purposes?

We’ve talked about gdb and valgrind, including both more and less advanced ways of using gdb. We’ve also supplied testing files and had you write some tests so that if there are errors in your program you’re likely to catch them. You also know how to use printf to print out values after using #include <stdio.h>. My typical debugging flow would probably be:

  1. Run tests, and look for errors either from valgrind or from the tests themselves. Note relevant line numbers in my code. If things are crashing badly, run with gdb and use backtrace to get line numbers of where the crash happens.
  2. Do a quick spot-check of the relevant line to see if there’s something simple I can fix, like a missing semicolon. Note that the error may be on the line above or below in some cases, especially with missing semicolons or curly braces.
  3. In all likelihood, I won’t see the error immediately, so I’ll throw a printf in there right before the line where the error is. If I don’t have any line number because a test is failing but the code isn’t crashing, I’ll try to figure out based on which test it is what part of the code might be wrong, and add printfs there. If I really have no idea, I’ll try to add printfs to the first loop in my program to see if I can validate its behavior, and then after doing that, I’ll move on to the next loop, etc., trying to isolate the source of the issue.
  4. From the printed output, hopefully I can figure out what’s wrong and how to fix it. This would be a good time to ask on Zulip and/or during office hours if there’s something that you just don’t understand, although checking relevant references beforehand is also a good idea (e.g., is there a function you’re not 100% sure about that you might look up?).
  5. If things are really tricky, I might consider using gdb to set breakpoints and step through the code. gdb lets me pause at any step and also print out any values that might be relevant. You can also use display to set up automatic prints that will be shown every time a breakpoint triggers. Combining this with a breakpoint on the first line inside a loop can be very powerful, because unlike print statements in the code, these displays are easier to change around when you figure out the next thing you want to look at.
As you’re going through this process, make sure to pick out the places where your code is using concepts you’re not totally sure about, and refer back to reference material to firm up your understanding, since those are likely to be places where things go wrong.
Why do we use lea for arithmetic sometimes? How is it different from mov? How does lea relate to the & operator?

Basically, the CPU needs to do a lot of memory accesses, so in addition to general-purpose math circuits like an ALU, it has a specialized circuit to do pointer math. Since pointer math involves a base pointer, an index within that array, a scaling factor, and potentially an additional offset, this special-purpose circuit performs two additions and one multiplication in a single step: offset + base + index * scale. Since it’s designed to do exactly that sequence of operations, it’s faster to use that circuit than to perform three separate steps with a general-purpose ALU (especially since the multiplication is restricted to scales of 1, 2, 4, or 8, so it can be implemented using bit shifting). That means that if we have a sequence of operations that turns out to fit the form A + B + C * D (where D is 1, 2, 4, or 8), it’s better to use lea to do them all at once.

One more advantage: lea does not set flags based on its result. This means that if for some reason you want to do some arithmetic between a compare and the associated jump, you can use lea for that (see the answer about cmp above for more details).

In terms of differences from mov, both lea and mov update their second argument based on their first argument. mov directly copies between them, and can use a memory address as one of the two: either the source or the destination. When mov receives a memory address, it uses the specialized address computation unit to compute the address based on the values it was given, and then accesses that value in memory, either reading from it into a register, or writing from a register into that memory location. lea always has a memory address as the first argument and a register as the second, and while it uses the memory address unit to compute the final address, it then loads that address value directly into the destination register, rather than actually reading from memory at that address. Since variables are often stored at particular memory addresses, lea computes a pointer: it stores the address of the data into a register, instead of storing the data itself.

In the HW ISA, mov corresponds to both SW and LW, plus it can be used to copy data directly between two registers. It just depends on what kind of arguments it gets. lea doesn’t have an equivalent in the HW ISA, although it wouldn’t be hard to add one. We already have the ALU set up to perform addition between a register value and a constant from the instruction bits for the LW and SW instructions. If we direct that value back into a register instead of using it to read from memory, that would be the lea equivalent. Of course, it doesn’t support an index or scaling like lea does, but the HW ISA can’t do that for normal memory reads either.

lea is usually used when we use the & operator in C. If we do this:

int x = 7;
int y = x;
int* z = &x;

The assignment to x might be implemented using mov $7, 8(%rsp), which moves a constant value (also called an “immediate” value) into a memory location (since it’s based on an offset from %rsp, we know it’s on the stack).

The assignment to y might then be done using one instruction: mov 8(%rsp), %rbx. That’s assuming we’re just using a register (%rbx in this case) to store y. Unlike x, y’s address isn’t used in this code, so it’s possible we can just store it in a register. If we wanted to store it in memory like x, we’d need two instructions, since we can’t move directly from memory to memory. We could do: mov 8(%rsp), %rbx followed by mov %rbx, 12(%rsp).

The assignment to z here might use lea, since it wants to compute the address of x, but store that address instead of reading the value there. Assuming we want to store z in register r10, we could do: lea 8(%rsp), %r10. In a mov command, 8(%rsp) would refer to the value at that address, which is 7. In lea however, we skip the memory lookup, computing the address by adding 8 to %rsp but then storing that address value directly into %r10. Note that we can’t do this with an add instruction, because add stores its result into the second operand, which must be a register. We want to add to %rsp, but instead of changing its value, we want to store that result in a different register, which lea can accomplish.
Why do we need a stack pointer? What is the purpose of %rsp?

Each function call needs a scratch pad to store local variables that are active during the call. Whenever possible it’s nice if it can just use registers for this purpose, but sometimes there’s too much data to fit in registers, or it needs to pass a pointer into a function it calls referring to one of those variables, so that variable needs a memory address. We use the stack to provide this. Since the list of active functions at any point in time can be represented as a stack, where we push when a function is called and pop when it returns, that’s the data structure we use.

%rsp stores the pointer to the end of the stack (we use the lowest address and grow the stack downwards). The compiler knows how much space each function call frame uses, so it makes sure only to use memory from within that region (or if it needs more memory, to insert assembly instructions that move the stack pointer down). With that restriction, no coordination is needed between functions about which part of the stack to use: each function always uses whatever is at the bottom of the stack. If it calls another function, that function will use stack space even further down, but will clean all that up and restore the stack pointer by the time it returns, so the calling function won’t need to respect space below its own frame. Similarly, that function will clean up the stack space it allocated before it returns, so the function that called it won’t have to respect the memory it uses: it will be done with that memory before the calling function continues. This is why we only need a single register to keep track of the entire stack, but it’s also why values stored on the stack are invalidated after the function returns.
When do we use $ vs. % vs no prefix for a register in gdb?

This is confusing because gdb is a little inconsistent about this! When using print or examine you need the $, when using info reg you don’t need any prefix, and in assembly code you’ll see % as a prefix, even though you never need to type that yourself in gdb.

If you’re in doubt, use help for the relevant command to figure it out.
How do we know whether a function reads from or writes to a register?

Each different assembly instruction has a different meaning, and reads and/or writes registers in different positions. For example mov reads from its first argument and writes to its second (although either could be a memory location, in which case we’re reading from the register used to compute that memory location and writing to memory).

We’ll expect you to be familiar with the instructions listed on this sheet which includes a list of common instructions.

Note that most of the math instructions like add or imul are actually more like += and *= than + or *: they both read from and then write into their second argument.
Do companies actually have engineers who program in assembly language?

Sure. Just look up “x86 assembly jobs” online and you’ll see a few. It’s not super common, but still very necessary here and there. In many cases, it’s probably not someone’s full job just to program in assembly, but it’s part of a larger job where they use other languages as well. Here are some examples of people who would use at least some assembly in their day-to-day job:

  1. People maintaining and updating older systems where either they are directly programmed in assembly code or where programs in higher-level languages like FORTRAN or C might be checked for accuracy at the assembly level. For example, NASA recently sent a software update to the Voyager 2 spacecraft. You can read about its computer, which was originally programmed in FORTRAN, here. Systems that are old but still critical, like nuclear missile control & management systems, are probably still maintained by people who need to write assembly code.
  2. Many computer security workers who deal with viruses need to be able to read and understand assembly code, since that’s all they have to work with. They may use a decompiler to try to come up with approximate C code corresponding to the assembly they see, but since hackers could write some routines in custom assembly code, they can’t just rely on those tools for everything.
  3. The people who write compilers and develop operating systems need to understand assembly code, even if they might not write much by hand.
  4. People who develop new hardware, like new CPU designs, need to be intimately familiar with assembly so that they can design new assembly languages and/or maintain backwards compatibility with existing ones.
Which gdb commands should we memorize?

The gdb commands that we assume you will understand include print, info reg, examine, backtrace, disas, start/run, break, step, stepi, and continue. list is another helpful command sometimes but not as critical, especially for the x86 assignment where we don’t give you source code. Thankfully, the help command in gdb is pretty good, so you can get reminders easily as you’re learning these. By the end of the x86 assignment, you’ll know them, one way or another.

We don’t expect you to remember the exact formatting for these commands in terms of their output, nor do we necessarily expect you to memorize all the format options for x (although knowing /gx and /s is probably good). But you should be able to explain what they do, and make inferences if shown the output from one or more of them.
What happens if we write a function that has more than 6 arguments/parameters, since there are only 6 registers available for holding arguments? Additional arguments get written onto the stack in the calling function’s stack frame, and the assembly code of the function that has lots of parameters will include code to read those values based on its starting stack pointer.
What general strategies should we use for reverse engineering?

There are several different things you can do to get your bearings in unfamiliar assembly code, so I’ll list them here:

  1. Highlight all of the memory addresses used for jumps, using a different color for each unique address. This will help you trace how control flows through the code, even before you figure out the conditions of each jump. Also note function calls during this process (especially calls to trip_alarm in the x86 assignment) and ret instructions.
  2. Highlight uses of the rdi, rsi, rdx, rcx, r8, and/or r9 registers. These will usually tell you how many arguments a function accepts and/or where it passes arguments to other functions it calls.
  3. Look at places where rax (or a part of it like %eax) is used, tracing backwards from ret instructions and respecting the various possible jump paths. This will help you understand what goes into computing the function’s return value.
  4. Look for where parentheses are used, since they indicate memory access. Is memory accessed based on the stack pointer, or another register? If it’s another register, where does that register get its value? How many different memory locations are accessed, and are we dealing with arrays and/or strings based on the use of an index register during memory access? Remember to account for return values from any functions that this function calls since those can be the source of memory addresses.

I’d pursue some or all of these strategies a bit before really diving into all the details of a function, since having context can help a lot in understanding details. I’d use the strategies above to answer questions like: does this function use conditionals? Loops? Recursion? What are the main data structures it uses? Just individual variables, or any arrays or strings? What other functions does it call, what arguments does it give them, and what does it do with their return values?

For the x86 assignment in particular, I’m always looking to answer the question: what path can be taken through this function to get to a return instruction without hitting any trip_alarm calls? The conditions used for the jumps necessary to take that path are the next points of inquiry for figuring out how to solve the phase.

Once I feel like I have some orientation in terms of the big ideas for the function, the next step is to make some assumptions about argument values and then run through the code line by line, executing instructions in my head and/or writing down results on paper. This step is tedious, but skipping it can waste a lot of time. Here you can use stepi in gdb to run the actual code instruction-by-instruction and check your results, although I wouldn’t necessarily do that at every step. Running through the code step by step doesn’t have to mean writing down every specific value. Instead I often keep a sense of which variable each register or stack location is storing, with an eye for which ones will be used by key jumps in the function. Based on this rough trace-through, I’ll come up with hypotheses, like “the first input number must be a 5 to get through the first check.” It’s important to try to test these hypotheses piece-by-piece. You don’t want to develop 17 hypotheses only to find out that your first one was wrong and so are all the others. Use gdb to test out your hypotheses, and once you figure out how to navigate one jump, start developing hypotheses about the next one.

Hopefully, you can eventually figure out the function piece by piece using these strategies, although sometimes you end up needing to step through the whole function one instruction at a time in gdb to really figure out what’s going on. If it comes to that, use the display command to set up auto-printing for the important registers and/or memory values so that you don’t get lost trying to remember what you need to print at every step.

Lab 9

How do the push and pop instructions work?

The push instruction is equivalent to the sequence:

sub $8, %rsp
mov <target>, (%rsp)

where <target> is the argument to push. So it modifies the %rsp register by subtracting 8, and then uses that register as a memory address to save the target value there. Note that if the size of the target value is not 8 bytes, the stack pointer is moved by the size of the target value instead; in 64-bit mode only 16-bit and 64-bit pushes are allowed, so push %ax would subtract 2 bytes instead of 8, since %ax is only a small part of the %rax register.

What this accomplishes is saving the target value onto the stack, and growing the stack (downwards) to accommodate the new value. After a push, an instruction referring to (%rsp) is referring to the saved value (although if there are multiple pushes, older values will need offsets since %rsp will have moved further).

pop is the inverse of push. It translates to:

mov (%rsp), <target>
add $8, %rsp

Here we read a value from the memory the stack points to into the target, and then add 8 to the stack pointer to move it up above the value we just read, implicitly relinquishing that stack space to be used by another function call if one were to occur. Just like with push, the amount the stack actually moves varies with the size of the target: popping into a 16-bit register like %ax would only move the stack pointer by 2 (as with push, 32-bit pops aren’t available in 64-bit mode).

If a push is immediately followed by a pop and both have the same target, they will cancel each other out and nothing will have changed. On the other hand, assuming that any code in between the two instructions leaves the stack pointer in the same place by the time the pop occurs as it was when the push finished, the pop will restore the target value back to what it was when the push occurred (and will also restore the stack pointer to what it was before the push). Because of this, push and pop are most often used to save & restore values from registers which need to be preserved while another function call uses them.
How do the call and ret instructions work?

call and ret work somewhat like push and pop, except the thing they push and pop is the “instruction pointer” which is the x86 equivalent of the program counter, and there are a few other wrinkles:

call doesn’t push the current instruction pointer (which would be pointing at the call instruction that’s being executed) but instead it pushes the address of the next instruction after that call. This means that when ret happens, we can continue the current function after the function call we just made.

ret pops the value written by call (assuming there are no errors in your stack pointer maintenance) and sets the instruction pointer to that value, rather than setting it to point to the instruction after the ret. This is how we “jump back” from a function, and how we can jump back from the same function into many different places depending on who called it.

Both call and ret move the stack pointer %rsp by 8, since that’s the size of an address (see above for details on push and pop).

Lab 12

What aspects of the memory allocator are the most inefficient and how do people go about weighing whether they should sacrifice space or time?

For the basic allocator, the main inefficiency lies in fragmentation and in the time it takes to find free space starting from the beginning of the heap when many things have been allocated and/or freed. I’m not knowledgeable about it, but I’m sure that coming up with better allocation algorithms is still an active area of research. Different applications often call for different space/time tradeoffs, so in some cases you might use a custom memory allocator for special applications. For example, if you’ve got a scientific program to crunch a bunch of data and it’s running out of memory, you could buy more RAM, but you might also want to try out a more space-efficient memory allocator, even if it is a bit less time efficient. At the same time, if you’re running this calculation for weeks, finding a more time-efficient allocator might speed things up in a meaningful way, so you might be willing to trade off some space (especially if you can afford to buy more RAM). Of course, you’d want to first ensure that memory allocation really is a bottleneck for the program, rather than some other issue.

Long story short: usually you should test your code and ensure that memory fragmentation and/or allocation overhead is actually a problem before thinking about changing how you allocate memory.
What memory does malloc consider? All memory in the computer, or just specific areas? malloc just manages memory on the heap, which is a specific region of memory. It calls sbrk to extend that region if it needs to, but eventually, sbrk may indicate that it can’t allocate any more memory because the heap has grown too large and either hit an arbitrary OS limit, or is in danger of running into another memory region, like the stack.
Are there memory allocation functions in other languages that are faster than C? I’m not knowledgeable about all the specifics, but I know that Rust has more memory management built into its type system, which might let it allocate things more efficiently than C because more information is known at compile time, although I don’t know for sure that this is the case. There are also languages where the programmer does more of the work (like specifying which memory to allocate), which could be faster in one way, but more cumbersome in another.
What tips do you have for making a better malloc implementation that’s more space/time efficient?

This is intentionally a pretty hard objective, because there are so many facets to it, and in particular, different kinds of allocation & freeing patterns might behave differently for different allocators, so in many cases there’s not an obvious “best” thing to do. We’re asking you to consider this problem a bit because it’s more like real un-curated problems you’ll face in the real world.

That said, the assignment itself has a lot of good hints here. The section on performance mentions next-fit allocation and an explicit free list as two things you can do to improve performance. Doing better than an explicit free list for throughput is probably pretty tough.

That part also links to the section on performance optimization and engineering, which points out that the use of a profiler is the correct first step and it has some instructions on using gprof for this. Figuring out how to use a profiler and actually using it is an incredibly valuable learning experience :P
How does memory mapping work?

Memory mapping has two meanings. There’s an ‘mmap’ C function which allows you to directly map files (or regions of files) into memory, so that any alteration to that memory alters the file, and reading that memory reads from the file. There’s also the concept of virtual memory: the actual physical RAM of the computer is mapped into a virtual address space for each process, so that the different processes each think they have access to all of memory, but in reality, the parts they each use are mapped to separate regions of the real memory, so that they don’t have access to each others’ data.

The way that virtual memory works is quite complicated, involving some specialized hardware that’s part of the memory load/store process for looking up where a certain area of memory is actually stored (the data this hardware uses is called “page tables,” and this is where the term “page size” comes from).

In lieu of a longer and more complex explanation here, the relevant Wikipedia page on virtual memory is an interesting read.
What are the dangers of premature optimization in terms of both development time/effort and code correctness?

If you start trying to write “more efficient” code without understanding whether your code works yet, and/or whether it’s actually too slow or using too much memory, you might:

  1. Make your code needlessly complex, in a way that doesn’t actually speed it up.
  2. Relatedly, introduce bugs into your code because of that complexity, which you could have avoided with simpler code.
  3. Spend a lot more time building that complex code, when a simpler version that seemed less efficient might have been faster to build, giving you more time to actually measure its performance and fix any real issues that came up.
In general, it’s much better to start by writing code in the simplest and most straightforward way, even if you’re worried it might not be efficient. Then measure it to see if it’s actually too slow or using too much memory. If you find that it is, measure in more detail to figure out why (often the thing that makes it slow is not what you would have expected) and then spend effort trying to make that specific thing faster.
When we call malloc(n) does n include the header size for the allocated block? No, it does not. The person using malloc shouldn’t need to know how the allocator is implemented; they’re just requesting a certain number of bytes. So since the header is an internal implementation detail, they don’t have to know its size. It’s the responsibility of the allocator to create a payload area with enough room for the requested bytes, and then store any additional metadata it needs to track that.
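
For example, a caller asking for room for 10 ints writes just this, and the header bookkeeping happens invisibly inside the allocator (a sketch):

#include <stdlib.h>

void example(void) {
    int *values = malloc(10 * sizeof(int));  // request 40 bytes of payload
    // The allocator privately adds space for its own header; the caller
    // never sees or counts that overhead.
    free(values);
}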
Why shouldn’t we use a struct to represent a heap allocated block? A struct is a fixed-size data grouping, where we know at compile-time how big it will be. Because the heap blocks are all different sizes, a struct doesn’t make sense. We could imagine using a struct to store just the header information, but a plain struct doesn’t give us a natural way to store multiple pieces of information overlaid within a single value like we’re doing. Instead we use a few custom helper functions like status_size to let us access the information as individual values.
What’s the purpose of the heap footer/epilogue?

The footer/epilogue of the heap is marked as a “size-0” “allocated” block, and its previously-allocated bit is updated based on the allocated status of the block before it. The size being 0 is a clearly-invalid value that makes it easy to see if you end up getting a pointer to that epilogue. The other information is useful because other operations (particularly coalesce) can treat it as if it’s a normal block, and they don’t need any extra special cases to handle the end of the heap.

In particular, coalesce will see that it’s an “allocated” block, and so it won’t try to merge with it.
What might happen if we don’t intentionally call free on the pointers that malloc returns? Where in the code do we deliberately call free?

Calling free is the responsibility of the person who called malloc, so in our mm.c, we don’t need to do that anywhere. In our testing code, we have a “trace” system set up where we can read in a trace and call free when the trace says to do that. If you’re writing other code (like you did for Pointers) then it’s your responsibility to call free once you’re done using some memory. Of course, there may be some pieces of memory that you need to use for the entire time the program is running, so they might only be freed at the end (and in any case, when the process ends everything will get cleaned up automatically).

If you forget to call free when you’re done with something, that’s called a “memory leak.” That memory stays allocated, and if your program runs for a very long time (like a web browser does), even a small memory leak can lead to a huge amount of wasted memory, potentially leaving less memory for other programs to use and/or causing programs to fail because there’s not enough memory available. That’s why we use valgrind to check for memory errors and ensure we don’t have any leaks.
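
A tiny sketch of the difference between leaking and freeing:

#include <stdlib.h>

void leaky(void) {
    int *data = malloc(100 * sizeof(int));
    // ... use data ...
    // Missing free(data): this block stays allocated until the process exits,
    // which is the classic "memory leak" valgrind will complain about.
}

void tidy(void) {
    int *data = malloc(100 * sizeof(int));
    // ... use data ...
    free(data);  // hand the memory back so it can be reused
}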
Why is code with Java primitive types faster than code with Object types? In Java, Object-type objects are all represented using pointers, and their actual data is allocated on the heap. In contrast, primitive types can be stored directly in Java’s equivalent of registers, which are sometimes actual registers. So by using primitive types, there are some situations where you can avoid heap allocation, which we now know is potentially time-consuming as it involves a search for a free block and a bunch of memory writes to update header information.
What’s the relationship between “RAM” and the memory managed by malloc? Malloc manages a segment of RAM called the “heap.” There are some constants that define where this segment actually lives in the address space, but it gets a starting size and we can call sbrk to grow it. The stack and text segments are also in RAM, and since both the stack and the heap can grow, if you allocate too much in the heap you’ll limit the amount of stack stuff you can allocate and vice versa (mostly these limits are way beyond what ordinary programs need).
Does the malloc project use an implicit or explicit free list? The starter code provided has the skeleton of an implementation that uses an implicit free list. If you want to score more points for better performance, you can implement an explicit free list (an easier option is to just implement a next-fit instead of first-fit search policy, which will score higher than first-fit but not as high as an explicit free list).
What is the “payload”?

In malloc, we’re writing a system that others are going to use to allocate memory. They may use that memory for whatever purpose they want (like storing a command array :). From the perspective of the malloc code, we don’t care what the memory gets used for; we just need to ensure that nobody else will write to that memory (neither our own allocator code nor anyone else who calls malloc), as long as the caller stays within the size they requested.

One restriction on the payloads: they need to start at an address that’s a multiple of 16, so that the people we give them to can use instructions which require that alignment without having to double-check first every time.
Why is the explicit free list (doubly-linked list) better than an implicit free list for tracking free blocks? It’s faster at finding free blocks when there are many allocated blocks. Imagine you have one small free block followed by 1000 allocated blocks of various sizes, and then a big free block. If you’re asked to find space that’s larger than the small block, with an implicit free list you’ll have to iterate through all of the allocated blocks in order to find the free block at the end, because each block only has enough information to find the start of the next block; there’s no way to “skip ahead” multiple blocks. In contrast, if each free block knows the location of the next free block, you can skip from one to the next ignoring the allocated blocks.
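
Conceptually, the “next free block” and “previous free block” pointers are stored inside the payload area of each free block (that space isn’t being used for anything else while the block is free); a rough sketch, not the assignment’s actual layout:

// Rough sketch of an explicit free list node: the payload of a free block
// is reused to hold links to the neighboring free blocks, so a search can
// hop directly from one free block to the next.
typedef struct free_block {
    struct free_block *next_free;
    struct free_block *prev_free;
} free_block;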
How can we distinguish the four coalesce cases in our malloc code?

For a code design question like this, I often start with idealized variables and then work backwards to see how I can get those variables. This helps break the problem into smaller pieces (one for each variable) and it also makes it easier to test and verify each piece independently when debugging. For the coalesce cases, we need to know two “bits” of information, which will distinguish our four cases:

  1. Is the previous block free?
  2. Is the next block free?

The combinations no/no, no/yes, yes/no, and yes/yes will then be the four coalesce cases (although with some designs it’s possible to share code between the cases too). So I might start by defining:

int prev_free = 0;
int next_free = 0;
Now the question becomes: how can I check whether the previous block is free, and how can I check whether the following block is free? If you look at the supplied helper functions and the preparatory exercises, you’ll find a lot of tools for solving those problems :)
Where do 0x11 and 0x13 come from for an allocated block of size 16 with a free/allocated block before it?

I think it’s clearest if we translate into decimal and binary:

0x11 = 16 + 1 = 17
0x13 = 16 + 3 = 19

0x11 = 0b10001
0x13 = 0b10011

In both cases, if we mask out the two lower-order bits, which are actually used as allocated/free indicators, the numeric value is 16, representing the block size. Then we either have a 0b01 (1 in decimal) to represent “previous block is free and this block is allocated” or a 0b11 (3 in decimal) to represent “previous block and this block are both allocated.”
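
In code, recovering those pieces from a header value is just bit masking; a sketch with hypothetical helper names (not necessarily the ones supplied in the assignment):

#include <stddef.h>

size_t block_size(size_t header)     { return header & ~(size_t)0x3; }  // clear the two flag bits
int    is_allocated(size_t header)   { return header & 0x1; }           // low bit: this block
int    prev_allocated(size_t header) { return (header >> 1) & 0x1; }    // next bit: previous block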
Are there cases where the other two unused bits in a block header get used for something? If so, what? How about the least significant bits of the footer?

I’m not familiar with all the different malloc variants out there, but I’m sure someone has used those extra bits for something in some implementation; I just don’t have an example that comes to mind.

If I try to imagine what they could be used for, I can try to think of further optimizations that might be possible, but this is hard. For example, there might be some gains in marking which free block is the largest, because if you get a request for a larger allocation and you happened to find that block while searching, you could immediately skip to expanding the heap rather than continuing to check further free blocks. But that’s probably a terrible idea, because the time spent trying to maintain that “largest” bit correctly in each block would be a lot (storing a pointer to the largest block in the heap header would be cheaper). Also, how frequent is it really that we get an allocation request that’s larger than our largest free block (maybe not frequent) and what’s the cost of checking an extra bit in each free block (probably not worth it even ignoring the other issues).

So you can see it’s not easy to come up with other things that it’s useful to store, because anything you might store incurs costs to maintain that information and to check it.

One thing suggested in the assignment as an extra challenge would be to do some kind of valgrind-like checking for certain kinds of memory errors. These extra bits might be useful for that purpose, and if you really cared about efficiency, you could put those checks behind if (DEBUG) or the equivalent so that they only happen when debugging. That said, I haven’t actually designed such an algorithm myself, so this is just a guess, not a hint, and it might end up being too hard to go that route.

Lab 13

When looking at source code, what are some clues that a program is a fork bomb? You’d want to look at where fork is called, and how many times that code will execute. It’s not always a guarantee, but a program that calls fork within a loop should raise some suspicions: Why does it do that? How many times will the loop execute? What happens to each parent/child process before the next call to fork? If the child or parent process immediately breaks out of the loop, for example, then it may be fine. But if both of them will reach the next iteration and call fork again, that will lead to an exponential explosion of processes.
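If you want to see that doubling effect safely, here’s a small sketch; the loop is deliberately bounded at three iterations so it only creates 8 processes rather than being an actual fork bomb.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    /* Deliberately bounded: with an unbounded loop where both the parent
     * and the child keep looping, this would be a fork bomb. */
    for (int i = 0; i < 3; i++) {
        fork();   /* after each iteration, the number of processes doubles */
    }
    printf("process %d finished the loop\n", (int)getpid());
    return 0;
}
/* Prints 8 lines total (2^3 processes), one per process, in an
 * unpredictable order. */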
How can multiple processes run at the same time on the same CPU? They don’t actually run at the same time: each process gets to run for a little bit, and then a “hardware interrupt” will stop it, giving control of the CPU back to the kernel (which is the core program that implements the operating system). The kernel may need to do some of its own business, but will pretty quickly hand control of the CPU back to another process that wants to run, and so processes share the CPU by taking turns, with turn-taking enforced by the kernel.
When is it safe/useful to use fork? Why would you use fork?

Whenever you need to start a new background process, it’s easiest to use fork. Typically, if you use an if statement to have only the parent process continue with the code for the original program, while using exec in the child process to replace that code with some other program, there’s much less risk of forking too many times. Usually you also don’t need to place fork within a loop, which is the pattern that most increases that risk.
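For example, the common pattern looks roughly like this sketch, with ls standing in for whatever program you actually want to run:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    } else if (pid == 0) {
        /* child: replace this program with another one */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("exec failed");   /* only reached if exec fails */
        exit(1);
    } else {
        /* parent: continue the original program's work, and eventually
         * wait for the child so it doesn't linger as a zombie */
        int status;
        waitpid(pid, &status, 0);
    }
    return 0;
}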

Why would you want a “background process?” One common use case is when you need to pay attention to user input, but also process some data in the background. You could use one process that waits for user input and gives immediate feedback, while sending data inputs over to a second process that does something slow/expensive with those inputs. Even something as simple as showing a spinning loading icon while data is loading requires two things to happen at once: the spinning icon needs to be constantly updated, and the loading needs to happen. This kind of thing is more often handled by threads than processes, because threads can more easily share data with each other, but the same logic applies.

Another example of a background process is when you want to use an existing program that doesn’t have a C API, maybe one that’s written in another language. If you can be sure that that program is installed on your computer, you can use fork and exec to run it from within a C program. Many languages have higher-level libraries for doing this (like the subprocess module in Python, which also lets you easily manage input and output streams for the child process). Most of the time, it will be more convenient to write a multi-process program in a higher-level language. But knowing how fork works at the C level helps you understand why those languages might have certain quirks (like how file descriptors are shared between Python subprocesses in some situations).
How can information be shared between different forks of a program? There are a few possible options: for example, you could use files that the programs read & write together, or an internet socket that’s just used to forward data from one fork to another. There’s also a C function equivalent to the kill terminal command that can send signals between processes if they know each others’ PIDs, which they can easily get from either the fork return value (in the parent) or a function for getting the parent process ID (in the child). So you can have one process signal the other when it’s done writing a file, for example. Besides the ‘interrupt’ and ‘kill’ signals, there are a number of different signals available to send.
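Here’s a minimal sketch of that signaling idea, assuming SIGUSR1 as the “I’m done” signal and sleep as a stand-in for the parent’s real work:

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

volatile sig_atomic_t got_signal = 0;

void handler(int sig) {
    (void)sig;
    got_signal = 1;              /* just record that SIGUSR1 arrived */
}

int main(void) {
    signal(SIGUSR1, handler);    /* install before fork so the child inherits it */
    pid_t pid = fork();
    if (pid == 0) {
        /* child: wait until the parent signals that its part is done */
        while (!got_signal) {
            pause();             /* sleep until some signal arrives */
        }
        printf("child: got the go-ahead from the parent\n");
        return 0;
    }
    /* parent: do some work first (sleep stands in for it), then notify the child */
    sleep(1);
    kill(pid, SIGUSR1);
    waitpid(pid, NULL, 0);       /* wait for the child to finish */
    return 0;
}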
Why do we care what order things are printed when using fork? What causes things to print out-of-order? Is there a way to control the order that things happen in after forking?

We care about the order because if we want to coordinate work between multiple processes, there may be things that need to happen in a certain order, such as one process writing a result into a file before another process tries to read it.

The order becomes unpredictable because as soon as there are multiple processes, the operating system can choose to schedule them in any order. Depending on how many other processes there are and other uncontrollable factors, the OS may schedule different processes to run for different amounts of time, and in different orders.

There are ways to force the OS to schedule things in a certain way: the parent can use waitpid to wait for the child to finish (the reverse doesn’t work, since waitpid only waits for the calling process’s own children). You could also use signals as described in the previous answer to coordinate which process will do things when, by having one process wait for a certain signal before doing something.
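As a tiny illustration of forcing an order with waitpid: without that call, the two lines below could print in either order; with it, the child’s line is guaranteed to come first.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        printf("child: writing the result\n");
        return 0;
    }
    waitpid(pid, NULL, 0);       /* parent blocks until the child exits */
    printf("parent: safe to read the result now\n");
    return 0;
}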
What are the jobs, fg, and bg commands?

jobs lists the jobs in the current terminal (both running and paused). A “job” consists of one process that was launched directly from the terminal, plus all child processes of that parent. The fg command brings a job back to the foreground and resumes it if it was paused; you can use control-Z to pause the current foreground job. The bg command also resumes a paused job, but lets it keep running in the background, so you can still use the terminal for other stuff while it’s running.

If you start a long-running process (like the malloc tests on all traces) you can use control-Z followed by bg to let it run in the background while you do other stuff. Of course, a background process can still print, which can be distracting if its output lands on top of output from the other programs you’re running. There’s another command, ‘disown’, which cuts a process off entirely from a particular terminal, and can be used once the process has been sent to the background with bg.
Why do we need to differentiate the parent and child processes based on the return value from fork? Typically, it’s not useful if the two forked processes do exactly the same thing: we’re forking because we want a secondary process to do some work in the background, while the main process continues with the work it’s been doing. So we need to know whether we’re the parent or the child to know whether to continue the work we’ve been doing, or to switch to something else (often, the child process will call exec to start a new program, while the parent process continues on).
How does fork work in terms of coordinating variables in the parent and child processes?

While threads (a related technique) can share some variables, processes do not share variables. Of course, they are created as clones of each other, so both the parent and child process have access to all the same variables when they start up. But they “don’t share” variables because when one of them changes a variable, the other one won’t see that change. Effectively, when we create the child process, we copy the entire state of memory over to the child, and any modifications in either the parent or the child are done on separate copies so they don’t affect each other.
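Here’s a small demonstration of the “separate copies” behavior (the particular values are arbitrary):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 10;                  /* both processes start with their own copy of x */
    pid_t pid = fork();
    if (pid == 0) {
        x = 99;                  /* only changes the child's copy */
        printf("child:  x = %d\n", x);   /* prints 99 */
        return 0;
    }
    waitpid(pid, NULL, 0);
    printf("parent: x = %d\n", x);       /* still prints 10 */
    return 0;
}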

As mentioned in a previous answer, you can explicitly use mechanisms like files or sockets to communicate between processes.
What does sleep do? As mentioned above, the kernel/OS manages how processes share the CPU, letting them run for a while, then interrupting them, and then letting them continue later after other processes have had a turn. sleep tells the operating system: hey, I’m done with work for now, and I actually don’t need to take any more turns until this many seconds later. So sleep(1) would tell the OS “wake me back up in 1 second.” Typically, from the process’s perspective, it just pauses for that long and then resumes. Of course, if many other processes are demanding to run at the same time that we’re scheduled to wake up, there might be a slight delay beyond 1 second before we run again, so we can’t depend on the sleep lasting exactly 1 second (typically the delay is on the order of microseconds, so not very noticeable by humans).
What does exit do and what’s the significance of its argument? exit stops the current process. Each process has an “exit code” that can be observed by other processes after it finishes, and is used to indicate success/failure, with certain exit codes indicating specific kinds of failure or other possibilities. Each process can decide what its exit codes will mean, but universally, an exit code of 0 means “success with no errors” (while any other exit code usually means some kind of error). So exit(0) means “stop this process, and notify anyone watching it that it succeeded.” In particular, the shell can be set up to print an alternate prompt when a process terminates with a non-0 exit code, and there are shell built-ins/operators that can change behavior depending on the exit code of a process like if, ||, and &&. Using exit(1) is a common way to indicate a general error, although many programs have sections in their manual detailing many different possible error codes.
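For example, here’s a sketch of how one process can observe another’s exit code; the child’s exit value of 3 is just a made-up error code for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        exit(3);                         /* child reports a made-up error code */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status)) {             /* true if the child exited normally */
        printf("child exited with code %d\n", WEXITSTATUS(status));  /* 3 */
    }
    return 0;
}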
When should you use control-C vs. kill to stop a process? Control-C sends a ‘SIGINT’ signal (interrupt) to the current foreground process in the terminal and its children. kill sends that signal (by default; it can send other signals instead) to one specific process named by its PID. So in general control-C is much more convenient. But if you need to kill a background process, you’ll need to use kill, since control-C only applies to the current foreground process. You may also need kill if you want to send another signal, like SIGTERM, although control-Z (to pause a process with SIGTSTP) and control-backslash (to send SIGQUIT) are also available as keyboard shortcuts. (Control-D doesn’t send a signal: it marks end-of-input on the terminal, which many programs treat as their cue to exit.)
Are there things besides fork that can cause there to be multiple processes?

fork is the main C library function for creating processes, although there are probably custom higher-level process-management libraries that you could use instead. However, there’s another, simpler answer: the user might intentionally launch multiple processes. If you remember the Pointers Assignment, we made a big deal of ‘&’. That’s because in the shell, you can add ‘&’ after a command to start it running in the background, not the foreground. This lets you launch other programs immediately, so you can have several running at once. The same applies if you open multiple terminals (each of which is its own process) and then launch multiple programs at the same time.

We actually saw a sort-of example of this with a few people in lab: they tried to launch Firefox, but got a popup saying that their profile was already in use! This wasn’t a true example of multiple processes, but it was related: Firefox was trying to prevent multiple Firefox processes from using the same profile, because it doesn’t have proper synchronization code for reading & writing profile information simultaneously from multiple processes. In this case, it wasn’t that Firefox was actually already running on that computer: instead, because the lab computers use a network file system to share files between computers, they had opened Firefox on another computer, which created a “lock” file to keep track of the fact that the profile was in use. Since the lock file was shared between computers, the other computer knew not to touch that profile. The solution was to find the other computer and shut down Firefox there, to properly remove the lock file & also eliminate the possibility of activity on that machine corrupting the profile.
How can we predict the output from and behavior of programs that use fork, including when the next prompt will be printed by the shell?

You “just” need to trace through the code and keep track of any forks that happen, tracing them separately. Then account for all the possible ways that those traces might be interleaved after the fork(s) happen(s). This is… hard. Drawing diagrams helps, but fundamentally parallel programming is an incredibly challenging thing, and often keeping forking patterns as simple as possible is a good idea if you’re the one writing the code.

To know when the shell will print the next prompt, look for where the initial parent process exits. That’s the process that the shell launched originally, so when it completes (even if it leaves children around in the background) the shell will print the next prompt and be ready to accept new input.
What are “threads” and when would you use them? Threads are like processes, but instead of creating a clone of the process they share a lot of the same process state, meaning that they can each write to the same set of global variables, for example. This makes it easier to pass information between threads, but also much easier to screw things up. Threads are commonly used in situations where user input is happening alongside CPU-intensive processing, like in a videogame, and the processing needs to immediately access and respond to the user input.
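Threads aren’t the focus of this lab, but as a rough taste, a minimal POSIX threads sketch might look like the following; note that worker and main really do share the same counter variable, unlike a forked child. (Real multi-threaded code normally needs a mutex or similar when several threads write the same variable.)

#include <pthread.h>
#include <stdio.h>

int counter = 0;                 /* shared by every thread in this process */

void *worker(void *arg) {
    (void)arg;
    counter += 1;                /* the change is visible to main, unlike in a forked child */
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);  /* start the second thread */
    pthread_join(tid, NULL);                   /* wait for it to finish */
    printf("counter = %d\n", counter);         /* prints 1 */
    return 0;
}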
Why does fork return 0 in the child process? This gives the code calling fork a simple way to figure out whether it’s running in the parent or the child, since 0 is not a valid PID. There’s a getppid function in the standard library header unistd.h that can get your parent process’ PID, so the child still has access to that if it needs it. The parent, on the other hand, might have multiple children, so there’s no obvious way for it to get the new child’s PID other than the return value from fork; that’s why the child’s PID is returned to the parent and 0 is returned to the child.
Does parallel processing use one RAM or multiple RAMs?

All processes/threads share a single RAM. Sometimes this is actually multiple RAM chips if you’ve added extra RAM to your computer, but the operating system manages these as one unit to display a consistent pool of memory to running processes.

Of course, if that’s true, why don’t different processes overwrite each other’s data all the time? Well, there’s something called memory virtualization which means that where a process thinks things are stored is not where they’re actually stored. Every time memory is read or written, the address used by the process goes through a translation step that maps it to a different part of the actual hardware RAM. The OS makes sure to map memory used by different processes to different parts of the physical RAM, even while ensuring that they each think they have access to all available RAM. It also leaves unused sections of memory un-mapped until they’re requested, so that it can fit the active memory regions from many many processes together into the single actual RAM.

So in effect, each process gets its own memory to use, separate from all other processes, even though physically, there’s only one RAM chip.
What does atoi from forkex7 do?

To find out the behavior of an unfamiliar function like atoi, a quick search for a reference site usually suffices, although the reference sites are becoming harder to find and less reliable these days… The man program (short for manual) also has references for most C library functions, so running man atoi in a terminal would work.

atoi converts a string (the ‘a’ is short for ASCII) to an integer, so the name reads “ASCII to integer.” If the string you give it isn’t convert-able, it returns 0.
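For example:

#include <stdio.h>
#include <stdlib.h>   /* atoi lives here */

int main(void) {
    printf("%d\n", atoi("42"));         /* 42 */
    printf("%d\n", atoi("7 dwarves"));  /* 7: parsing stops at the first non-digit */
    printf("%d\n", atoi("hello"));      /* 0: not convertible */
    return 0;
}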
How does the operating system manage processes to give the illusion of simultaneous execution? What controls which process runs when?

The OS has a “scheduler” which decides which process to run next whenever a process either voluntarily gives up control or is explicitly interrupted. Any time a system call is made (like when calling printf) this gives the OS a chance to smoothly interrupt the process, since the process is asking the system to do some work for it anyways. Once a process is selected, after it runs for a bit it will be interrupted and another process will be swapped in. The scheduler is responsible for making sure that each process gets a turn, while also ideally letting processes that are more important or which have more work to do get a bit more time than others. But since context swaps happen hundreds to tens of thousands of times per second, from a human perspective it looks like multiple processes are happening simultaneously.

The OS has a notion of process priority that can be used to give some processes more priority during scheduling, but the overall scheduling algorithm is quite complex and has to take many different things into account (see this article for an overview and history of the Linux scheduler).