General Purpose GPUs Flashcards

Week 2.10 (24 cards)

1
Q

what 3 things are CPUs limited by

A
  1. small number of threads
  2. high overhead per thread
  3. power and heat constraints
2
Q

describe GPU

A
  • originally designed for fast 3D graphics rendering
  • now used for AI, simulation, medical imaging & finance
  • massive parallelism - thousands of lightweight threads
  • ideal for compute-heavy, data-parallel tasks
  • enabled by programmer-friendly APIs
3
Q

what is CUDA

A
  • Compute Unified Device Architecture is NVIDIA’s platform for GPGPU
  • provides a parallel programming model and APIs to harness the GPU’s massive core count
4
Q

what is the program structure in CUDA

A
  • programs run partly on the CPU (host) and partly on the GPU (device)
  • a CUDA C/C++ program includes:
    • host code
    • device code (kernel)
    • data transfer code
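A minimal sketch of how these three parts fit together (the `scale` kernel, the array size, and the launch configuration are all illustrative, not from this card):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// device code (kernel): runs on the GPU
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // data transfer code: host -> device
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // host code: launch the kernel on the device
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

    // data transfer code: device -> host
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[10] = %f\n", host[10]); // expect 20.0
    return 0;
}
```

Compiling with `nvcc` separates the host and device parts automatically.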
5
Q

describe kernel

A
  • function written by the programmer to run on the GPU
  • when launched, the kernel runs in parallel across many threads
  • each thread executes the same kernel code but on different data
  • threads must mostly be independent - divergent branching within a warp is serialised and slows execution
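A sketch of a typical kernel (saxpy is a textbook example, not taken from this card), showing one thread per element and where divergence would creep in:

```cuda
// every thread executes this same function on its own element
__global__ void saxpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    if (i < n)                  // guard: surplus threads past the end do nothing
        y[i] = a * x[i] + y[i];
    // a data-dependent branch here, e.g. if (x[i] > 0) ... else ...,
    // would make threads within one warp diverge and serialise both paths
}
```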
6
Q

how are threads launched

A

blocks and grids
- threads -> blocks
- blocks -> grids
- threads in blocks can share fast memory & sync
- threads execute in groups of 32 = warps
- threads, blocks & grids = software abstraction
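A sketch making the hierarchy visible (the `whoAmI` kernel and the 4-block × 128-thread configuration are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void whoAmI() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    int warp = threadIdx.x / 32;                      // threads execute in warps of 32
    if (threadIdx.x % 32 == 0)                        // print one line per warp
        printf("block %d, warp %d, first thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    dim3 grid(4);    // 4 blocks in the grid
    dim3 block(128); // 128 threads per block = 4 warps of 32
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize(); // wait so device printf output is flushed
    return 0;
}
```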

7
Q

how do we choose grid & block dimensions

A
  • each block runs on a streaming multiprocessor (SM)
  • each SM has a max number of threads per block
  • too few blocks = idle SMs = wasted performance
  • choose based on:
    • problem size
    • hardware constraints
    • maximising GPU usage
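One way the choice looks in practice, a sketch assuming a 1-D problem (the size `n` and block size of 256 are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1000000; // problem size (illustrative)

    // hardware constraints: query the device rather than hard-coding limits
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max threads/block: %d, SMs: %d\n",
           prop.maxThreadsPerBlock, prop.multiProcessorCount);

    int block = 256;                     // well inside the per-block limit
    int grid  = (n + block - 1) / block; // round up so every element is covered

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    work<<<grid, block>>>(dev, n);
    cudaDeviceSynchronize();
    cudaFree(dev);
    return 0;
}
```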
8
Q

what is a CPU optimised for

A

sequential code
- chip area is mostly cache & control logic
- handles complex logic

9
Q

what is a GPU optimised for

A

data-parallel tasks
- chip area is mostly processing logic (SIMD)
- minimal control logic & cache
- hides memory latency by oversubscribing threads to cores
- designed to maximise FLOPs - graphics & scientific applications

10
Q
A
11
Q

how does memory & thread scheduling work with NVIDIA Fermi

A
  • global DRAM interfaces surround the chip & connect SMs to external VRAM
  • VRAM is the GPU’s dedicated memory
  • host interface: connects GPU to CPU through PCIe
  • GigaThread scheduler assigns thread blocks to SMs
12
Q

why should we maximise warp throughput in GPUs

A
  • GPUs are most effective when many warps are active, maximising CUDA core utilisation
  • Fermi architecture: issues 2 warps every 2 clock cycles
  • structural hazards - can block warps
  • memory latency - can be hidden by having other ready warps available
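One way to check whether enough warps can be resident is the CUDA occupancy API, sketched below (the `work` kernel and block size of 256 are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;
    // ask the runtime how many blocks of this kernel fit on one SM at once
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work, blockSize, 0);
    // active warps per SM = resident blocks * warps per block
    printf("resident blocks/SM: %d, active warps/SM: %d\n",
           blocksPerSM, blocksPerSM * blockSize / 32);
    return 0;
}
```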
13
Q

what is the warp execution flow

A
  • each warp executes one instruction at a time
  • threads use different execution units:
    • ALU ops
    • loads/stores
    • special ops
14
Q

what are the 4 items in the CUDA memory hierarchy

A
  1. registers
  2. L1 cache
  3. L2 cache
  4. global memory (VRAM)
15
Q

what are registers used for in CUDA

A
  • private to each thread, extremely fast
  • used for temporary variables during execution
16
Q

what is L1 cache used for in CUDA memory hierarchy

A
  • shared among threads in SM
  • caches global memory to hide latency
17
Q

what is L2 cache used for in CUDA memory hierarchy

A
  • shared across all SMs
  • backup if data misses in L1
18
Q

what is VRAM used for in CUDA

A
  • largest but slowest
  • shared across the whole GPU
19
Q

how does CUDA use shared memory for thread communication

A
  • registers are private - inaccessible by others
  • shared memory is per block - threads in the same block can share data via shared memory
  • avoid write hazards: assign specific threads to write to specific shared memory locations
  • synchronise threads: use barriers to ensure all writes finish before reads begin
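A minimal sketch of both rules: each thread writes only its own slot, and a barrier separates all writes from the reads (the block-reversal kernel is illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// reverse a block-sized chunk of data via shared memory
__global__ void reverseBlock(int *data) {
    __shared__ int buf[256];  // per-block shared memory; matches the 256-thread launch below
    int t = threadIdx.x;
    buf[t] = data[blockIdx.x * blockDim.x + t]; // each thread writes its own slot: no write hazard
    __syncthreads();          // barrier: every write finishes before any read begins
    data[blockIdx.x * blockDim.x + t] = buf[blockDim.x - 1 - t];
}

int main() {
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);
    reverseBlock<<<1, 256>>>(dev);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %d\n", host[0]); // expect 255
    return 0;
}
```

Without the `__syncthreads()` barrier, a thread could read a slot before the thread responsible for it has written.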
20
Q

describe Intel’s Gen8 GPU execution unit

A
  • basic compute unit
  • SMT with 7 hardware threads
  • contains two SIMD floating point units
  • includes branch and send units for control flow & memory access
21
Q

describe instruction flow in Intel’s Gen8 GPU

A
  • each cycle begins with 1 instruction fetched per active thread
  • each thread has its own superscalar pipeline
22
Q

describe subslices in Intel’s Gen8 GPU

A
  • includes up to 8 EUs and shared resources
  • thread dispatcher assigns threads to EUs for execution
  • instruction cache is shared - stores program code for active threads
23
Q

describe slice-level organisation in Intel’s Gen8 GPU

A
  • slices group multiple subslices - e.g. 3 subslices × 8 EUs = 24 EUs
  • enables sharing of temporary variables across EUs for efficient processing
24
Q

when does a program benefit from GPU acceleration

A
  • GPUs are SIMD processors with hundreds/thousands of lightweight cores
  • best suited for tasks with large-scale parallelism
  • ideal for processing large data sets concurrently using many lightweight threads