General Purpose GPUs Flashcards
Week 2.10 (24 cards)
what 3 things are CPUs limited by
- small number of threads
- high overhead per thread
- power and heat constraints
describe GPU
- originally designed for fast 3D graphics rendering
- now used for AI, simulation, medical imaging & finance
- massive parallelism - thousands of lightweight threads
- ideal for compute-heavy, data-parallel tasks
- enabled by programmer-friendly APIs
what is CUDA
- Compute Unified Device Architecture is NVIDIA’s platform for GPGPU
- provides a parallel programming model and APIs to harness the GPU’s massive core count
what is the program structure in CUDA
- programs run partly on the CPU (host) and partly on the GPU (device)
- C/C++ includes:
- host code
- device code (kernel)
- data transfer code
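A minimal sketch of this three-part split, using a hypothetical addOne kernel (error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// device code (kernel): runs on the GPU, one thread per element
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // data transfer code: allocate device memory & copy input host -> device
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // host code: launch the kernel on the device (4 blocks of 256 threads)
    addOne<<<(n + 255) / 256, 256>>>(dev, n);

    // data transfer code: copy the result device -> host
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[42] = %.1f\n", host[42]); // expect 43.0
    return 0;
}
```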
describe kernel
- function written by the programmer to run on the GPU
- when launched, the kernel runs in parallel across many threads
- each thread executes the same kernel code but on different data
- threads must mostly be independent - divergent branching within a warp forces slower serial execution (sketched below)
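A hypothetical kernel sketching that divergence: even and odd threads in the same warp take different branches, so the warp executes both paths one after the other:

```cuda
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)      // half the threads in every warp go this way...
        x[i] *= 2.0f;
    else                 // ...the other half wait, then run this path
        x[i] += 1.0f;
}
```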
how are threads launched
blocks and grids
- threads -> blocks
- blocks -> grids
- threads in blocks can share fast memory & sync
- threads execute in groups of 32 = warps
- threads, blocks & grids = software abstraction
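A sketch of the launch hierarchy, using a hypothetical whoAmI kernel (the warp breakdown assumes the fixed warp size of 32):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void whoAmI(void) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    int warp = threadIdx.x / 32;                   // threads 0-31 = warp 0, 32-63 = warp 1, ...
    if (threadIdx.x % 32 == 0)                     // one report per warp
        printf("block %d, warp %d starts at global id %d\n", blockIdx.x, warp, i);
}

int main() {
    dim3 block(256); // 256 threads per block = 8 warps of 32
    dim3 grid(4);    // 4 blocks -> 1024 threads in total
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize(); // wait for the device so printf output is flushed
    return 0;
}
```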
how do we choose grid & block dimensions
- each block runs on a streaming multiprocessor (SM)
- each SM has a max number of threads per block
- too few blocks = idle SMs = wasted performance
- choose based on:
- problem size
- hardware constraints
- maximising GPU usage
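One common recipe, sketched: pick a block size that is a multiple of the 32-thread warp size, then round the grid up so every element gets a thread (this reuses the hypothetical addOne kernel from the earlier sketch):

```cuda
// host-side launch helper, assuming addOne and a device pointer as above
void launchAddOne(float *dev, int n) {
    int threadsPerBlock = 256;  // multiple of the warp size (32), within the per-block limit
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // ceiling division covers all n elements
    addOne<<<blocks, threadsPerBlock>>>(dev, n);
}
```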
what is a CPU optimised for
sequential code
- most of the chip area = cache & control logic
- handles complex logic
what is a GPU optimised for
data-parallel tasks
- most of the chip area = processing logic (SIMD)
- minimal control logic & cache
- hides memory latency by oversubscribing threads to cores
- designed to maximise FLOPs - graphics & scientific applications
how does memory & thread scheduling work with NVIDIA Fermi
- global DRAM interfaces surround the chip & connect SMs to external VRAM
- VRAM is GPU’s dedicated memory
- host interface: connects GPU to CPU through PCIe
- GigaThread scheduler assigns thread blocks to SMs
why should we maximise warp throughput in GPUs
- GPUs are most effective when many warps are active, maximising CUDA core utilisation
- Fermi architecture: dispatches 2 warps every 2 clock cycles
- structural hazards - can block warps
- memory latency - can be hidden by having other ready warps available
what is the warp execution flow
- each warp executes one instruction at a time
- threads use different execution units:
- ALU ops
- loads/stores
- Special ops
what are the 4 levels in the CUDA memory hierarchy
- registers
- L1 cache
- L2 cache
- global memory (VRAM)
what are registers used for in CUDA
- private to each thread, extremely fast
- used for temporary variables during execution
what is L1 cache used for in the CUDA memory hierarchy
- shared among threads on the same SM
- caches global memory to hide latency
what is L2 cache used for in the CUDA memory hierarchy
- shared across all SMs
- backup if data misses in L1
what is VRAM used for in CUDA
- largest but slowest
- shared across the whole GPU
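A hypothetical kernel sketching where data lives in this hierarchy:

```cuda
__global__ void scale(float *g, int n) { // g points into global memory (VRAM)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float r = g[i];  // the load passes through L2 (all SMs) and L1 (this SM) caches
    r *= 2.0f;       // r lives in a per-thread register: the fastest storage
    g[i] = r;        // the store heads back out towards global memory
}
```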
how does CUDA use shared memory for thread communication
- registers are private - inaccessible by others
- shared memory is per block - threads in the same block can share data via shared memory
- avoid write hazards: assign specific threads to write to specific shared memory locations
- synchronise threads: use barriers to ensure all writes finish before reads begin
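A sketch of this pattern: a hypothetical block-level reversal, assuming n is a multiple of blockDim.x and blocks of at most 256 threads:

```cuda
__global__ void blockReverse(float *d, int n) {
    __shared__ float tile[256];   // per-block shared memory, visible to the whole block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    if (i < n) tile[t] = d[i];    // each thread writes only its own slot: no write hazard
    __syncthreads();              // barrier: all writes finish before any reads begin
    if (i < n) d[i] = tile[blockDim.x - 1 - t]; // now safe to read another thread's slot
}
```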
describe Intel’s Gen8 GPU execution unit
- basic compute unit
- SMT with 7 hardware threads
- contains two SIMD floating point units
- includes branch and send units for control flow & memory access
describe instruction flow in Intel’s Gen8 GPU
- cycle begins: 1 instruction fetched per active thread
- each thread has its own superscalar pipeline
describe subslices in Intel’s Gen8 GPU
- includes up to 8 EUs and shared resources
- thread dispatcher assigns threads to EUs for execution
- instruction cache is shared - stores program code for active threads
describe slice-level organisation in Intel’s Gen8 GPU
- slices group multiple subslices - e.g. 3 subslices × 8 EUs = 24 EUs
- enables sharing of temporary variables across EUs for efficient processing
when does a program benefit from GPU acceleration
- GPUs are SIMD processors with hundreds/thousands of lightweight cores
- best suited for tasks with large-scale parallelism
- ideal for processing large data sets concurrently using many lightweight threads