# Memory Hierarchy: Cache

- Memory hierarchy
- Cache basics
- Locality
- Cache organization
- Cache-aware programming

### Cache

### **English:**

- n. a hidden storage space for provisions, weapons, or treasures
- v. to store away in hiding for future use

### **Computer Science:**

- n. a computer memory with short access time used to store frequently or recently used instructions or data
- v. to store [data/instructions] temporarily for later quick retrieval

Also used more broadly in CS: software caches, file caches, etc.

## **Locality Example #2**

Row-major M x N 2D array in C:

```
int sum_array_cols(int a[M][N]) {
    int i, j, sum = 0;

    /* Inner loop varies the row index i, so consecutive
       accesses are N ints apart in memory. */
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            sum += a[i][j];
        }
    }
    return sum;
}
```

## **Locality Example #3**

```
int sum_array_3d(int a[M][N][N]) {
    int i, j, k, sum = 0;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < M; k++) {
                sum += a[k][i][j];
            }
        }
    }
    return sum;
}
```

What is "wrong" with this code? How can it be fixed?

### **Cost of Cache Misses**

Huge difference between a hit and a miss

- Could be 100x, if just L1 and main memory

99% hits could be twice as good as 97%. How?

- Cache hit time of 1 cycle, miss penalty of 100 cycles
- Mean access time = hit time + miss rate × miss penalty
  - 97% hits: 1 cycle + 0.03 × 100 cycles = 4 cycles
  - 99% hits: 1 cycle + 0.01 × 100 cycles = 2 cycles

This is why "miss rate" is used instead of "hit rate"

### **Cache Performance Metrics**

### Miss Rate

Fraction of memory accesses to data not in cache (misses / accesses)

Typically: 3% - 10% for L1; maybe < 1% for L2, depending on size, etc.

### Hit Time

Time to find and deliver a block in the cache to the processor.

Typically: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2

### Miss Penalty

Additional time required on cache miss = main memory access time

Typically **50 - 200 cycles** for L2 (trend: increasing!)
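The hit/miss arithmetic under "Cost of Cache Misses" can be written as a small helper. This is a minimal sketch for a single cache level; the function name `mean_access_time` and its parameter names are introduced here for illustration and do not come from the slides.

```c
#include <assert.h>

/* Mean access time, in cycles, for one cache level (hypothetical helper).
 * hit_time and miss_penalty are in cycles; miss_rate is a fraction in [0, 1].
 * Every access pays hit_time; a miss additionally pays miss_penalty. */
double mean_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

Calling `mean_access_time(1.0, 0.03, 100.0)` and `mean_access_time(1.0, 0.01, 100.0)` reproduces the 97% and 99% cases above: a two-point change in hit rate roughly halves the mean access time.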
# Cache Organization: Key Points

### Block

Fixed-size unit of data in memory/cache

### Placement Policy

Where should a given block be stored in the cache?

- direct-mapped, set associative

### Replacement Policy

What if there is no room in the cache for requested data?

- least recently used, most recently used

### Write Policy

When should writes update lower levels of memory hierarchy?

- write back, write through, write allocate, no write allocate

# Example: Tag, Index, Offset?

4-bit address, divided into Tag | Index | Offset. Direct-mapped cache, 4 slots, 2-byte blocks.

| Quantity          | Value |
|-------------------|-------|
| tag bits          |       |
| set index bits    |       |
| block offset bits |       |
| index(1101) =     |       |

# Example: Tag, Index, Offset?

16-bit address, divided into Tag | Index | Offset. E-way set-associative cache, S sets, 16-byte blocks.

| Quantity          | E = 1-way, S = 8 sets | E = 2-way, S = 4 sets | E = 4-way, S = 2 sets |
|-------------------|-----------------------|-----------------------|-----------------------|
| tag bits          |                       |                       |                       |
| set index bits    |                       |                       |                       |
| block offset bits |                       |                       |                       |
| index(0x1833)     |                       |                       |                       |

Cache starts *empty*, uses LRU. Access (address, hit/miss) stream:

(10, miss); (12, miss); (10, miss)

Associativity of cache?

# Example (E = 2)

```
float dotprod(float x[8], float y[8]) {
    float sum = 0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}
```

If x and y are aligned, e.g. &x[0] = 0, &y[0] = 128, the cache can still fit both because each set has space for two blocks/lines.

# **Types of Cache Misses**

### Cold (compulsory) miss

- first access to a block

### **Conflict miss**

- cache has space for all needed blocks, but multiple blocks map to same slot
- e.g., referencing blocks 0, 8, 0, 8, ... would miss every time
- increasing associativity can reduce conflict misses

### Capacity miss

- working set of active cache blocks is larger than the cache

### What about writes?

### Multiple copies of data exist:

L1, L2, possibly L3, main memory

### Write-hit policy

- Write-through: write immediately to memory, all caches in between.
- Write-back: defer write to memory until line is evicted (replaced)
  - Need a dirty bit to indicate if line is different from memory or not

### Write-miss policy

- Write-allocate: load into cache, update line in cache.
  - Good if more nearby writes or reads follow
- No-write-allocate: just write immediately to memory.

### Typical caches:

- Write-back + Write-allocate, usually
- Write-through + No-write-allocate, occasionally

# Aside: software caches

### Examples

File system buffer caches, web browser caches, database caches, network CDN caches, etc.

### Some design differences

- Almost always fully-associative
  - so, no placement restrictions
  - index structures like hash tables are common (for placement)
- Often use complex replacement policies
  - misses are very expensive when disk or network involved
  - worth thousands of cycles to avoid them
- Not necessarily constrained to single "block" transfers
  - may fetch or write-back in larger units, opportunistically

# **Cache-Friendly Code**

Locality, locality, locality.

### Programmer can optimize for cache performance

- Data structure organization
- Data access patterns
  - Nested loops
  - Blocking (see CSAPP 6.5)

### All systems favor "cache-friendly code"

- Performance is hardware-specific
  - Cache size, line size, associativity, etc.
- Generic rules still capture most of the advantages
  - Keep working set small (temporal locality)
  - Use small strides (spatial locality)
  - Focus on inner loop code
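To illustrate the "use small strides" rule, here is a minimal sketch that sums the same row-major array in two loop orders. The dimensions `M = 4` and `N = 8` and the name `sum_array_rows` are assumptions made for this example; `sum_array_cols` mirrors Locality Example #2. Both functions return the same sum and differ only in the order in which memory is touched.

```c
enum { M = 4, N = 8 };  /* hypothetical dimensions, chosen only for illustration */

/* Stride-1: the inner loop varies the column index j, so consecutive
 * accesses are adjacent in memory under C's row-major layout. */
int sum_array_rows(int a[M][N])
{
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-N: the inner loop varies the row index i, so consecutive
 * accesses are N ints apart and each may land in a different block. */
int sum_array_cols(int a[M][N])
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

With arrays this small the difference is unlikely to be measurable. The gap is expected to show up once a row no longer fits in a single cache block and the array exceeds the cache size, and the exact effect depends on the hardware parameters listed above.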