Memory and Cache Flashcards
(20 cards)
Why is memory access a performance bottleneck in modern computers?
Main memory access is much slower than CPU operations; a single access can take ~100 clock cycles.
What is the purpose of the memory hierarchy?
To balance:
- Capacity
- Cost
- Access time
It also ensures frequently accessed data is available in faster memory levels.
What is cache?
A small amount of fast memory that holds copies of data recently fetched from, or written to, main memory
How is cache structured in modern CPUs?
CPUs have multiple levels of cache (L1, L2, L3), each progressively larger but slower.
What are the two main types of locality that caches exploit?
- Spatial Locality – Data near recently accessed memory is likely to be used.
- Temporal Locality – Recently accessed data is likely to be reused soon.
How does spatial locality work?
- Data is fetched in fixed size blocks (cache lines)
- When we access adjacent memory, it is likely to be in cache (cache hit)
- If it’s not in the cache we have a cache miss, and the cache line containing the data is fetched from main memory
Why is loop order important in C for efficient memory access?
- Accessing memory sequentially improves cache efficiency.
- C stores 2D arrays in row-major order, so looping over rows in the outer loop and columns in the inner loop accesses memory sequentially and produces more cache hits.
What is cache blocking? What are the benefits?
- Computations are split into blocks (tiles) that fit in cache
Benefits:
- Maximises data reuse
- Much faster data access
- Exploits temporal locality
How can reducing problem size affect performance testing?
If a test case is too small, memory behavior (e.g., cache misses) may not be representative of real workloads.
What can compilers do to optimise memory access?
Techniques like loop interchange and cache blocking can improve cache efficiency.
What is arithmetic intensity?
The ratio of floating-point operations to data movement, measured in FLOPs/byte.
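As a worked example (assuming 8-byte doubles and no cache reuse): the loop y[i] = a*x[i] + y[i] performs 2 floating-point operations per iteration while moving 24 bytes (read x[i], read y[i], write y[i]), giving an arithmetic intensity of 2/24 ≈ 0.083 FLOPs/byte.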
What does the roofline model describe?
The maximum floating-point performance of an application based on:
1. Peak performance
2. Memory bandwidth
3. Arithmetic intensity
When is an application compute-bound vs. memory-bound?
- Compute-bound: High arithmetic intensity, limited by CPU performance.
- Memory-bound: Low arithmetic intensity, limited by memory bandwidth.
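Putting the roofline pieces together, attainable performance is min(peak, bandwidth × intensity). With illustrative numbers (100 GFLOP/s peak, 50 GB/s bandwidth), a kernel with an arithmetic intensity of 0.1 FLOPs/byte attains at most 50 × 0.1 = 5 GFLOP/s and is memory-bound, while one with 4 FLOPs/byte could reach 50 × 4 = 200 GFLOP/s and is therefore capped at the 100 GFLOP/s peak, i.e. compute-bound.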
What does NUMA stand for? What is it?
- Non-Uniform Memory Access
- NUMA is the phenomenon that memory at different points in a processor's address space has different performance characteristics.
Why does NUMA exist?
Multi-socket CPUs have separate memory controllers, making remote memory accesses slower than local ones.
How does memory allocation work in a NUMA system?
First Touch Policy - Memory pages are allocated on the NUMA node of the first thread that touches (writes to) them.
How does NUMA affect OpenMP programs?
If one thread initialises memory and other threads access it later, remote memory accesses slow performance.
What is hybrid parallelism?
Using both MPI (for distributed memory) and OpenMP (for shared memory) together.
How can hybrid parallelism help with NUMA effects?
- Assign one MPI process per socket
- Each MPI process starts OpenMP threads within a single socket
How do you compile and run an MPI+OpenMP program (if it was titled lissajous.c)?
mpicc -std=c99 -o lissajous -fopenmp lissajous.c -lm
export OMP_NUM_THREADS=2
mpirun -np 2 ./lissajous