# **Memory Hierarchy: Cache**

Memory hierarchy

Cache basics

Locality

Cache organization

Cache-aware programming

## How does execution time grow with SIZE?

```
int[] array = new int[SIZE];
fillArrayRandomly(array);
int s = 0;
for (int i = 0; i < 200000; i++) {
  for (int j = 0; j < SIZE; j++) {
    s += array[j];
  }
}
```

# reality beyond O(...)



## **Processor-Memory Bottleneck**



Solution: caches

## Cache

#### **English:**

- n. a hidden storage space for provisions, weapons, or treasures
- v. to store away in hiding for future use

#### **Computer Science:**

- n. a computer memory with short access time used to store frequently or recently used instructions or data
- v. to store [data/instructions] temporarily for later quick retrieval

Also used more broadly in CS: software caches, file caches, etc.

## **General Cache Mechanics**



## **Cache Hit**



- 1. Request data in block b.
- **2. Cache hit:** Block b is in cache.

## **Cache Miss**

- 1. Request data in block b.
- **2. Cache miss:** Block b is not in cache.
- **3. Cache eviction:** Evict a block to make room, maybe store to memory.
- **4. Cache fill:** Fetch block from memory, store in cache.

**Placement Policy:** where to put block in cache

**Replacement Policy:** which block to evict

## Locality: why caches work

Programs tend to use data and instructions at addresses near or equal to those they have used recently.

#### **Temporal locality:**

Recently referenced items are *likely* to be referenced again in the near future.



#### **Spatial locality:**

Items with nearby addresses are *likely* to be referenced close together in time.



How do caches exploit temporal and spatial locality?

```
sum = 0;
for (i = 0; i < n; i++) {
   sum += a[i];
}
return sum;
```

What is stored in memory?

#### Data:

#### **Instructions:**

#### row-major M x N 2D array in C

```
int sum_array_rows(int a[M][N]) {
    int sum = 0;

    // Row-major layout in memory:
    // a[0][0] a[0][1] a[0][2] a[0][3] a[1][0] ... a[1][3] a[2][0] ... a[2][3] ...
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            sum += a[i][j];
        }
    }
    return sum;
}
```

#### row-major M x N 2D array in C

```
int sum_array_cols(int a[M][N]) {
    int sum = 0;

    // Row-major layout in memory:
    // a[0][0] a[0][1] a[0][2] a[0][3] a[1][0] ... a[1][3] a[2][0] ... a[2][3] ...
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            sum += a[i][j];
        }
    }
    return sum;
}
```

```
int sum_array_3d(int a[M][N][N]) {
   int sum = 0;

   for (int i = 0; i < N; i++) {
      for (int j = 0; j < N; j++) {
         for (int k = 0; k < M; k++) {
            sum += a[k][i][j];
         }
      }
   }
   return sum;
}
```

What is "wrong" with this code? How can it be fixed?
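One possible answer, sketched under assumptions: in a row-major array the last index should vary fastest, so reordering the loops to `k`, `i`, `j` gives stride-1 accesses. The sizes `M` and `N` and the names `sum_array_3d_reordered` and `demo_sum` are hypothetical, chosen only for illustration.

```c
#include <assert.h>

#define M 3  /* assumed sizes, for illustration only */
#define N 4

/* Original order: k varies fastest, so consecutive accesses are N*N ints apart. */
int sum_array_3d(int a[M][N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

/* Reordered: j (the last index) varies fastest, so accesses are stride-1. */
int sum_array_3d_reordered(int a[M][N][N]) {
    int sum = 0;
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}

/* Fills a small array with 0, 1, 2, ... and checks that both orders agree. */
int demo_sum(void) {
    static int a[M][N][N];
    int v = 0;
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[k][i][j] = v++;
    int s = sum_array_3d_reordered(a);
    assert(s == sum_array_3d(a));
    return s;
}
```

Both functions compute the same sum; only the order in which memory is touched differs, which is what affects the miss rate.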

## **Cost of Cache Misses**

#### Huge difference between a hit and a miss

Could be 100x, if just L1 and main memory

#### 99% hits could be twice as good as 97%. How?

Assume cache hit time of 1 cycle, miss penalty of 100 cycles
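A minimal sketch of the arithmetic, assuming the usual model in which every access pays the hit time and a miss additionally pays the miss penalty. The function name `average_access_cycles` is hypothetical.

```c
#include <assert.h>

/* Average cycles per access = hit time + miss rate * miss penalty. */
double average_access_cycles(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With the numbers above, 97% hits gives 1 + 0.03 * 100 = 4 cycles per access on average, while 99% hits gives 1 + 0.01 * 100 = 2 cycles, which is half as many.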

## **Cache Performance Metrics**

#### **Miss Rate**

Fraction of memory accesses to data not in cache (misses / accesses)

Typically: 3% - 10% for L1; maybe < 1% for L2, depending on size, etc.

#### **Hit Time**

Time to find and deliver a block in the cache to the processor.

Typically: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2

#### **Miss Penalty**

Additional time required on cache miss = main memory access time

Typically **50 - 200 cycles** for L2 (*trend: increasing!*)



## **Cache Organization: Key Points**

#### **Block**

Fixed-size unit of data in memory/cache

#### **Placement Policy**

Where should a given block be stored in the cache?

direct-mapped, set associative

#### **Replacement Policy**

What if there is no room in the cache for requested data?

least recently used, most recently used

#### **Write Policy**

When should writes update lower levels of memory hierarchy?

write back, write through, write allocate, no write allocate



# **Placement Policy**



Large, fixed number of block slots.

## Placement: Direct-Mapped



## Placement: mapping ambiguity



## Placement: Tags resolve ambiguity



## Address = Tag, Index, Offset





# Placement: Direct-Mapped



# Why not this mapping? index(Block ID) = Block ID / S

(still easy for power-of-2 block sizes...)

| Index | Cache |
|-------|-------|
| 00    |       |
| 01    |       |
| 10    |       |
| 11    |       |

# A puzzle.

Cache starts empty.

Access (address, hit/miss) stream:

(10, miss), (11, hit), (12, miss)



What could the block size be?

## Placement: direct mapping conflicts



What happens when accessing in repeated pattern:

0010, 0110, 0010, 0110, 0010...?

### cache conflict

Every access suffers a miss, evicts cache line needed by next access.

## Placement: Set Associative

S = # sets in cache

Index per *set* of block slots.
Store block in *any* slot within set.

#### Mapping:

index(Block ID) = Block ID mod S



**Replacement policy:** if set is full, what block should be replaced?

Common: least recently used (LRU)

but hardware usually implements "not most recently used"
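To make the mapping and replacement rules concrete, here is a small simulator sketch that works on block IDs. It assumes true LRU (real hardware often approximates it, as noted above), and the `Cache` type and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SETS 16
#define MAX_WAYS 8

/* Hypothetical simulator state: a tag and a last-use time per slot. */
typedef struct {
    int num_sets;                 /* S */
    int ways;                     /* E: slots per set */
    unsigned long clock;          /* counts accesses, used for LRU */
    int valid[MAX_SETS][MAX_WAYS];
    unsigned tag[MAX_SETS][MAX_WAYS];
    unsigned long last_used[MAX_SETS][MAX_WAYS];
} Cache;

void cache_init(Cache *c, int num_sets, int ways) {
    assert(num_sets >= 1 && num_sets <= MAX_SETS);
    assert(ways >= 1 && ways <= MAX_WAYS);
    c->num_sets = num_sets;
    c->ways = ways;
    c->clock = 0;
    for (int s = 0; s < MAX_SETS; s++)
        for (int w = 0; w < MAX_WAYS; w++)
            c->valid[s][w] = 0;
}

/* Returns 1 on a hit, 0 on a miss (the block is filled on a miss). */
int cache_access(Cache *c, unsigned block_id) {
    unsigned set = block_id % (unsigned)c->num_sets;  /* index = Block ID mod S */
    unsigned tag = block_id / (unsigned)c->num_sets;
    int victim = 0;

    c->clock++;
    for (int w = 0; w < c->ways; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->last_used[set][w] = c->clock;
            return 1;
        }
    }
    /* Miss: use an empty slot if there is one, otherwise evict the LRU slot. */
    for (int w = 0; w < c->ways; w++) {
        if (!c->valid[set][w]) { victim = w; break; }
        if (c->last_used[set][w] < c->last_used[set][victim]) victim = w;
    }
    c->valid[set][victim] = 1;
    c->tag[set][victim] = tag;
    c->last_used[set][victim] = c->clock;
    return 0;
}

/* Counts misses for a stream of block IDs on an initially empty cache. */
int count_misses(int num_sets, int ways, const unsigned *block_ids, size_t n) {
    Cache c;
    int misses = 0;
    cache_init(&c, num_sets, ways);
    for (size_t i = 0; i < n; i++)
        if (!cache_access(&c, block_ids[i])) misses++;
    return misses;
}
```

For the repeated pattern 0010, 0110, 0010, 0110, ... a direct-mapped cache with 4 slots (`num_sets = 4, ways = 1`) misses on every access, while a 2-way cache with the same total capacity (`num_sets = 2, ways = 2`) misses only on the first two.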

## **Example: Tag, Index, Offset?**

4-bit address, split into Tag | Index | Offset

Direct-mapped, 4 slots, 2-byte blocks

- tag bits \_\_\_\_
- set index bits \_\_\_\_
- block offset bits \_\_\_\_

index(1101) = \_\_\_\_
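The split can be sketched in code, assuming the number of sets and the block size are both powers of two. The `AddressParts` type and the function names are hypothetical helpers, not part of any library.

```c
#include <assert.h>

typedef struct {
    unsigned tag;
    unsigned index;
    unsigned offset;
} AddressParts;

/* Number of bits needed to select among x items; x must be a power of two. */
static int log2_exact(unsigned x) {
    int bits = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Splits an address into tag, set index, and block offset fields. */
AddressParts split_address(unsigned addr, unsigned num_sets, unsigned block_size) {
    int offset_bits = log2_exact(block_size);
    int index_bits = log2_exact(num_sets);
    AddressParts p;
    p.offset = addr & (block_size - 1);               /* low bits */
    p.index = (addr >> offset_bits) & (num_sets - 1); /* middle bits */
    p.tag = addr >> (offset_bits + index_bits);       /* remaining high bits */
    return p;
}
```

For the 4-bit example, `split_address(0xD, 4, 2)` (address 1101) yields offset 1, index 2 (binary 10), and tag 1.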

## **Example: Tag, Index, Offset?**

E-way set-associative, S sets, 16-byte blocks

**16**-bit Address

| Tag | Index | Offset |
|-----|-------|--------|
|     |       |        |

|                   | $E = 1$-way, $S = 8$ sets | $E = 2$-way, $S = 4$ sets |
|-------------------|---------------------------|---------------------------|
| tag bits          |                           |                           |
| set index bits    |                           |                           |
| block offset bits |                           |                           |
| index(0x1833)     |                           |                           |

## Replacement Policy

If set is full, what block should be replaced?

Common: least recently used (LRU)

(but hardware usually implements "not most recently used")

Another puzzle: Cache starts empty, uses LRU.

Access (address, hit/miss) stream

(10, miss); (12, miss); (10, miss)

associativity of cache?

## General Cache Organization (S, E, B)



## **Cache Read**



## **Cache Read: Direct-Mapped** (E = 1)

#### This cache:

Block size: 8 bytes

Associativity: 1 block per set (direct mapped)



If no match: old line is evicted and replaced

## **Direct-Mapped Cache Practice**

12-bit address, 16 lines, 4-byte block size, direct mapped

Offset bits? Index bits? Tag bits?

Example addresses: 0x354, 0xA20

| 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|----|----|---|---|---|---|---|---|---|---|---|---|
|    |    |   |   |   |   |   |   |   |   |   |   |
|    |    |   |   |   |   |   |   |   |   |   |   |

| Index | Tag | Valid | B0 | B1 | B2 | B3 |
|-------|-----|-------|----|----|----|----|
| 0     | 19  | 1     | 99 | 11 | 23 | 11 |
| 1     | 15  | 0     | -  | -  | -  | -  |
| 2     | 1B  | 1     | 00 | 02 | 04 | 08 |
| 3     | 36  | 0     | -  | -  | -  | -  |
| 4     | 32  | 1     | 43 | 6D | 8F | 09 |
| 5     | 0D  | 1     | 36 | 72 | F0 | 1D |
| 6     | 31  | 0     | -  | -  | -  | -  |
| 7     | 16  | 1     | 11 | C2 | DF | 03 |

| Index | Tag | Valid | B0 | B1 | B2 | B3 |
|-------|-----|-------|----|----|----|----|
| 8     | 24  | 1     | 3A | 00 | 51 | 89 |
| 9     | 2D  | 0     | -  | -  | -  | -  |
| A     | 2D  | 1     | 93 | 15 | DA | 3B |
| B     | 0B  | 0     | -  | -  | -  | -  |
| C     | 12  | 0     | -  | -  | -  | -  |
| D     | 16  | 1     | 04 | 96 | 34 | 15 |
| E     | 13  | 1     | 83 | 77 | 1B | D3 |
| F     | 14  | 0     | -  | -  | -  | -  |

## Example (E = 1)

Locals in registers.

Assume **a** is aligned such that

&a[r][c] is aa...a rrrr cccc 000

```
double sum_array_rows(double a[16][16]) {
    double sum = 0;

    for (int r = 0; r < 16; r++) {
        for (int c = 0; c < 16; c++) {
            sum += a[r][c];
        }
    }
    return sum;
}
```

```
double sum_array_cols(double a[16][16]) {
    double sum = 0;

    for (int c = 0; c < 16; c++) {
        for (int r = 0; r < 16; r++) {
            sum += a[r][c];
        }
    }
    return sum;
}
```

Assume: cold (empty) cache, **3-bit set index, 5-bit offset**

Address bits: aa...arrr rcc cc000

**0,0**: aa...a000 <u>000</u> <u>00000</u>

- Row-wise (sum_array_rows): 32 bytes = 4 doubles per block, so 4 misses per row of array; 4\*16 = 64 misses.
- Column-wise (sum_array_cols): 32 bytes = 4 doubles per block, but every access is a miss; 16\*16 = 256 misses.
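The 64 and 256 miss counts can be checked with a small direct-mapped simulation. This is a sketch under stated assumptions: the array starts at address 0, doubles are 8 bytes, and only accesses to `a` touch the cache (locals live in registers). The `DirectMapped` type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SETS   8   /* 3-bit set index */
#define BLOCK_SIZE 32  /* 5-bit block offset */
#define DIM        16

/* Direct-mapped cache state: one tag per set. */
typedef struct {
    int valid[NUM_SETS];
    size_t tag[NUM_SETS];
} DirectMapped;

/* Returns 1 if the access misses; on a miss the old line is replaced. */
static int access_misses(DirectMapped *c, size_t addr) {
    size_t block = addr / BLOCK_SIZE;
    size_t set = block % NUM_SETS;
    size_t tag = block / NUM_SETS;
    if (c->valid[set] && c->tag[set] == tag) return 0;
    c->valid[set] = 1;
    c->tag[set] = tag;
    return 1;
}

/* Counts misses for a row-wise (by_rows != 0) or column-wise traversal. */
int count_misses(int by_rows) {
    DirectMapped c = {{0}, {0}};
    int misses = 0;
    assert(sizeof(double) == 8);  /* the example assumes 8-byte doubles */
    for (int outer = 0; outer < DIM; outer++) {
        for (int inner = 0; inner < DIM; inner++) {
            int r = by_rows ? outer : inner;
            int col = by_rows ? inner : outer;
            size_t addr = ((size_t)r * DIM + (size_t)col) * sizeof(double);
            misses += access_misses(&c, addr);
        }
    }
    return misses;
}
```

In the column-wise case, consecutive accesses are 128 bytes apart and alternate between two sets, so each line is evicted before any of its other three doubles is used.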



# Example (E = 1)

```
int dotprod(int x[8], int y[8]) {
   int sum = 0;

   for (int i = 0; i < 8; i++) {
      sum += x[i]*y[i];
   }
   return sum;
}
```

block = 16 bytes; 8 sets in cache

How many block offset bits? How many set index bits?

Address bits:

B =

S =

Addresses as bits

0x00000000:

0x00000080:

0x000000**A0**:

16 bytes = 4 ints



if x and y are mutually unaligned, e.g., 0x00, 0xA0

| x[0] | x[1] | x[2] | x[3] |
|------|------|------|------|
| x[4] | x[5] | x[6] | x[7] |
| y[0] | y[1] | y[2] | y[3] |
| y[4] | y[5] | y[6] | y[7] |
|      |      |      |      |
|      |      |      |      |
|      |      |      |      |
|      |      |      |      |

if x and y are mutually aligned, e.g., 0x00, 0x80

### Cache Read: Set-Associative (Example: E = 2)

#### This cache:

Block size: 8 bytes





If no match: Evict and replace one line in set.

# Example (E = 2)

```
float dotprod(float x[8], float y[8]) {
    float sum = 0;

    for (int i = 0; i < 8; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}
```

If x and y aligned, e.g. &x[0] = 0, &y[0] = 128, can still fit both because each set has space for two blocks/lines

#### 2 blocks/lines per set

| x[0] | x[1] | x[2] | x[3] | y[0] | y[1] | y[2] | y[3] |
|------|------|------|------|------|------|------|------|
| x[4] | x[5] | x[6] | x[7] | y[4] | y[5] | y[6] | y[7] |
|      |      |      |      |      |      |      |      |
|      |      |      |      |      |      |      |      |

4 sets
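A sketch that counts misses for the `dotprod` access pattern under both organizations, assuming 16-byte blocks, 4-byte elements, LRU replacement, and that only `x` and `y` touch the cache. The `SetCache` type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 16
#define MAX_SETS   8
#define MAX_WAYS   2

typedef struct {
    int num_sets;
    int ways;
    int count[MAX_SETS];              /* valid lines currently in each set */
    size_t tags[MAX_SETS][MAX_WAYS];  /* ordered most recently used first */
} SetCache;

/* Accesses one address; returns 1 on a miss, 0 on a hit. */
static int touch(SetCache *c, size_t addr) {
    size_t block = addr / BLOCK_SIZE;
    size_t set = block % (size_t)c->num_sets;
    size_t tag = block / (size_t)c->num_sets;
    int pos = -1;

    for (int w = 0; w < c->count[set]; w++)
        if (c->tags[set][w] == tag) pos = w;

    int miss = (pos < 0);
    if (miss) {
        if (c->count[set] < c->ways) c->count[set]++;
        pos = c->count[set] - 1;  /* a free slot, or the LRU slot if the set is full */
    }
    /* Move the accessed tag to the front (most recently used position). */
    for (int w = pos; w > 0; w--)
        c->tags[set][w] = c->tags[set][w - 1];
    c->tags[set][0] = tag;
    return miss;
}

/* Counts misses for sum += x[i]*y[i], i = 0..7, starting from an empty cache. */
int dotprod_misses(int num_sets, int ways, size_t x_addr, size_t y_addr) {
    SetCache c = {0};
    int misses = 0;
    assert(num_sets >= 1 && num_sets <= MAX_SETS);
    assert(ways >= 1 && ways <= MAX_WAYS);
    c.num_sets = num_sets;
    c.ways = ways;
    for (size_t i = 0; i < 8; i++) {
        misses += touch(&c, x_addr + i * 4);
        misses += touch(&c, y_addr + i * 4);
    }
    return misses;
}
```

With 8 sets and E = 1, placing `x` at 0x00 and `y` at 0x80 makes every access miss (16 misses), while `y` at 0xA0 leaves only the 4 cold misses. With 4 sets and E = 2, `x` at 0 and `y` at 128 also incurs only 4 misses, because both blocks fit in the same set.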

### **Types of Cache Misses**

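**Cold (compulsory) miss**: the first access to a block, before it has ever been loaded into the cache.

**Conflict miss**: the cache has room overall, but multiple blocks map to the same slot or set and evict each other.

**Capacity miss**: the set of active blocks (the working set) is larger than the cache.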

Which ones can we mitigate/eliminate? How?

### Writing to cache

Multiple copies of data exist, must be kept in sync.

#### Write-hit policy

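Write-through: write immediately to the next level of the hierarchy.

Write-back: defer the write to the next level until the line is evicted; needs a dirty bit.

#### Write-miss policy

Write-allocate: load the block into the cache, then update the line in cache.

No-write-allocate: write directly to the next level without loading the block into the cache.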

#### **Typical caches:**

Write-back + Write-allocate, usually

Write-through + No-write-allocate, occasionally

## Write-back, write-allocate example

Initial memory contents: T = 0xFACE, U = 0xCAFE.

The first two instructions (steps 1 and 2 in the trace) only load addresses into registers; the cache and memory are not involved.


Final memory contents (after the write-back in step 4b): T = 0xFEED, U = 0xCAFE.

- 1. mov \$T, %ecx
- 2. mov \$U, %edx
- 3. mov \$0xFEED, (%ecx)
  - a. Miss on T.
  - b. Evict U (clean: discard).
  - c. Fill T (write-allocate).
  - d. Write T in cache (dirty).
- 4. mov (%edx), %eax
  - a. Miss on U.
  - b. Evict T (dirty: write back).
  - c. Fill U.
  - d. Set %eax.
- 5. DONE.
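The trace above can be replayed with a toy model: a cache holding a single one-word line in front of a two-word memory. This is only a sketch, and it assumes the cache initially holds a clean copy of U (inferred from step 3b). The `Machine` type, the `ADDR_T`/`ADDR_U` constants, and the function names are hypothetical.

```c
#include <assert.h>

enum { ADDR_T = 0, ADDR_U = 1, NUM_WORDS = 2 };

/* Toy machine: two words of memory and one cache line holding one word. */
typedef struct {
    unsigned memory[NUM_WORDS];
    int valid;
    int dirty;
    int addr;        /* which word the cache line currently holds */
    unsigned data;   /* cached copy of that word */
    int write_backs; /* number of dirty evictions so far */
} Machine;

/* On a miss: evict (writing back if dirty), then fill from memory. */
static void ensure_cached(Machine *m, int addr) {
    if (m->valid && m->addr == addr) return;  /* hit */
    if (m->valid && m->dirty) {               /* dirty: write back */
        m->memory[m->addr] = m->data;
        m->write_backs++;
    }                                         /* clean: just discard */
    m->addr = addr;
    m->data = m->memory[addr];
    m->valid = 1;
    m->dirty = 0;
}

/* Write-allocate + write-back: fill on a miss, update only the cache. */
void store(Machine *m, int addr, unsigned value) {
    assert(addr >= 0 && addr < NUM_WORDS);
    ensure_cached(m, addr);
    m->data = value;
    m->dirty = 1;
}

unsigned load(Machine *m, int addr) {
    assert(addr >= 0 && addr < NUM_WORDS);
    ensure_cached(m, addr);
    return m->data;
}
```

Starting from memory T = 0xFACE, U = 0xCAFE with U cached, `store(&m, ADDR_T, 0xFEED)` leaves memory unchanged and the line dirty; the following `load(&m, ADDR_U)` writes 0xFEED back to T before filling U.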

### **Example Memory Hierarchy**

#### Typical laptop/desktop processor

(always changing)

Processor package



- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 11 cycles
- L3 unified cache: 8 MB, 16-way, access: 30-40 cycles

Block size: 64 bytes for all caches.

Each successive level is slower, but more likely to hit.

### **Aside: software caches**

#### **Examples**

File system buffer caches, web browser caches, database caches, network CDN caches, etc.

#### Some design differences

Almost always fully-associative

Often use complex replacement policies

Not necessarily constrained to single "block" transfers

## Cache-Friendly Code

Locality, locality, locality.

#### Programmer can optimize for cache performance

Data structure layout

Data access patterns

Nested loops

Blocking (see CSAPP 6.5)

#### All systems favor "cache-friendly code"

Performance is hardware-specific

Generic rules capture most advantages

Keep working set small (temporal locality)

Use small strides (spatial locality)

Focus on inner loop code
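As an illustration of blocking, here is a sketch of a tiled matrix multiply next to the naive version. The matrix size `N`, the tile size `BLOCK`, and the function names are assumptions chosen for the example; a good tile size depends on the actual cache.

```c
#include <assert.h>

#define N     8  /* assumed matrix dimension */
#define BLOCK 4  /* assumed tile size; must divide N */

/* Naive i-j-k multiply: the inner loop walks b column-wise (stride N). */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

/* Blocked multiply: works on BLOCK x BLOCK tiles so that the tiles of
 * a, b, and c in use at one time form a small working set. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
    assert(N % BLOCK == 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                /* Multiply one tile of a by one tile of b into a tile of c. */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BLOCK; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}
```

Both versions do the same arithmetic; the blocked one reuses each tile many times while it is still likely to be cached, which is the temporal-locality rule above applied to nested loops.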