Memory and Cache Flashcards

(20 cards)

1
Q

Why is memory access a performance bottleneck in modern computers?

A

Main memory access is much slower than CPU operations.

A single access can take ~100 clock cycles

2
Q

What is the purpose of the memory hierarchy?

A

To balance:
- Capacity
- Cost
- Access time

It also ensures frequently accessed data is available in faster memory levels.

3
Q

What is cache?

A

A small amount of fast memory that holds copies of data recently fetched from, or written to, main memory

4
Q

How is cache structured in modern CPUs?

A

CPUs have multiple levels of cache (L1, L2, L3), each progressively larger but slower.

5
Q

What are the two main types of locality that caches exploit?

A
  1. Spatial Locality – Data near recently accessed memory is likely to be used.
  2. Temporal Locality – Recently accessed data is likely to be reused soon.
6
Q

How does spatial locality work?

A
  1. Data is fetched from main memory in fixed-size blocks (cache lines)
  2. When we access adjacent memory, it is likely already in the cache (a cache hit)
  3. If it is not in the cache, we have a cache miss and the containing cache line is fetched from main memory
7
Q

Why is loop order important in C for efficient memory access?

A
  • Accessing memory sequentially improves cache efficiency.
  • C stores 2D arrays in row-major order, so the outer loop should run over rows and the inner loop over columns; successive accesses then touch adjacent memory and hit the cache (see the sketch below).
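
A minimal sketch of the two loop orders over a 2D array (the array size N and the work inside the loops are illustrative assumptions):

#include <stdio.h>

#define N 1024                      /* illustrative size */
static double a[N][N];

int main(void)
{
    /* Good order for C (row-major): the inner loop varies the last
       index, so consecutive iterations touch adjacent memory and
       reuse the same cache lines. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    /* Poor order: the inner loop strides N doubles between accesses,
       so for large N almost every access misses the cache. */
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
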
8
Q

What is cache blocking? What are the benefits?

A
  • Computations are split into blocks that fit in cache (see the sketch below)

Benefits:
  • Maximises data reuse
  • Much faster data access
  • Exploits temporal locality
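
A minimal sketch of cache blocking applied to a matrix transpose (the matrix size N and block size B are illustrative assumptions; B would be tuned so that a block fits in cache):

#include <stdio.h>

#define N 1024                      /* illustrative size */
#define B 64                        /* block size, assumed to fit in cache */
static double in[N][N], out[N][N];

int main(void)
{
    /* Walk the matrix in B x B tiles; each tile stays resident in
       cache while it is read and written, maximising data reuse. */
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j][i] = in[i][j];

    printf("%f\n", out[0][0]);
    return 0;
}
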
9
Q

How can reducing problem size affect performance testing?

A

If a test case is too small, memory behavior (e.g., cache misses) may not be representative of real workloads.

10
Q

What can compilers do to optimise memory access?

A

Techniques like loop interchange and cache blocking can improve cache efficiency.

11
Q

What is arithmetic intensity?

A

The ratio of floating-point operations to data movement, measured in FLOPs/byte.
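
As a rough worked example (assuming 8-byte doubles and no cache reuse): the update y[i] = a*x[i] + y[i] performs 2 FLOPs per element while moving 24 bytes (read x[i], read y[i], write y[i]), giving an arithmetic intensity of about 2/24 ≈ 0.08 FLOPs/byte.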

12
Q

What does the roofline model describe?

A

The maximum floating-point performance of an application based on:
1. Peak performance
2. Memory bandwidth
3. Arithmetic intensity
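
The model is usually written as:

Attainable FLOP/s = min(Peak FLOP/s, Memory bandwidth × Arithmetic intensity)

For example, with an assumed peak of 1000 GFLOP/s and 100 GB/s of memory bandwidth, a kernel with 0.08 FLOPs/byte can reach at most about 8 GFLOP/s.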

13
Q

When is an application compute-bound vs. memory-bound?

A
  • Compute-bound: High arithmetic intensity, limited by CPU performance.
  • Memory-bound: Low arithmetic intensity, limited by memory bandwidth.
14
Q

What does NUMA stand for? What is it?

A
  • Non-Uniform Memory Access
  • NUMA describes the situation where memory at different points in a processor's address space has different performance characteristics (different access latency and bandwidth).
15
Q

Why does NUMA exist?

A

Multi-socket CPUs have separate memory controllers, making remote memory accesses slower than local ones.

16
Q

How does memory allocation work in a NUMA system?

A

First Touch Policy - a memory page is physically allocated on the NUMA node (memory controller) of the thread that first touches it, not when it is requested with malloc.

17
Q

How does NUMA affect OpenMP programs?

A

If one thread initialises (first-touches) all the memory, it is placed on that thread's NUMA node; threads on other sockets that access it later incur slower remote memory accesses (see the sketch below).
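
A minimal sketch of NUMA-aware first-touch initialisation with OpenMP (the array size and the use of a static schedule are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 100000000;       /* illustrative size */
    double *x = malloc(n * sizeof(double));
    if (!x) return 1;

    /* Initialise in parallel with the same schedule as the compute
       loop, so each thread first-touches (and therefore places on
       its own NUMA node) the pages it will later work on. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        x[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i];

    printf("%f\n", sum);
    free(x);
    return 0;
}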

18
Q

What is hybrid parallelism?

A

Using both MPI (for distributed memory) and OpenMP (for shared memory) together.
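
A minimal MPI+OpenMP skeleton (a sketch only; it matches the compile line shown on card 20):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request threaded MPI support, since OpenMP threads run inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}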

19
Q

How can hybrid parallelism help with NUMA effects?

A
  • Assign one MPI process per socket
  • Each MPI process starts its OpenMP threads within that socket, so memory accesses stay local (an example run command is shown below)
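
As a sketch, with Open MPI the mapping can be requested on the command line (option names vary between MPI implementations and versions):

mpirun -np 2 --map-by socket --bind-to socket ./lissajous
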
20
Q

How do you compile and run an MPI+OpenMP program (if the source file is named lissajous.c)?

A

mpicc -std=c99 -o lissajous -fopenmp lissajous.c -lm
export OMP_NUM_THREADS=2
mpirun -np 2 ./lissajous
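
With these settings, mpirun launches 2 MPI ranks and each rank runs 2 OpenMP threads, so 4 threads execute in total.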