Lecture 11 - GPUs Flashcards
(11 cards)
Programming GPUs
OpenCL is a framework with a C-like kernel language for programming heterogeneous hardware, including GPUs. It aims for a “write once, run anywhere” approach and exposes a C API with C++ bindings.
OpenCL Specifications
The OpenCL specification is divided into four main parts: Platform model, Execution model, Memory model, and Programming model.
Platform Model
- Specifies the host/device relationship: the host is the processor that coordinates execution, and a device is a processor that executes OpenCL code. Device-side functions are called kernels, and the devices that run them don’t have to be GPUs.
- A device is divided into one or more compute units, which are further divided into processing elements, each with its own program counter.
Execution Model
- Describes how OpenCL is configured on the host and how kernels are executed: it covers setting up the OpenCL context, host-device interaction, and the concurrency model for kernel execution.
- The execution model organizes a kernel’s execution on a device into work-items, which are grouped into work-groups.
Memory Model
- Defines an abstract memory hierarchy, independent of the underlying memory architecture but closely resembling modern GPU architecture. This model can be adapted for other architectures like FPGAs.
- The programmer allocates memory to spaces within this hierarchy, and the runtime system maps these spaces onto the physical memory hierarchy. The spaces are private memory (per work-item), local memory (per work-group), and global memory (visible to all work-items).
Programming Model
Describes how the concurrency model is mapped to physical hardware, with contexts executing kernels mapped to actual device hardware units.
Typical Setup
A typical setup involves an x86 host CPU and an OpenCL GPU device. The host CPU sets up the kernel for the device and instantiates it with a specified degree of parallelism. The GPU device executes the kernel, with input/output handled via the host.
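As a rough sketch, the host-side sequence uses the standard OpenCL C API roughly as follows (error handling omitted; `src` is assumed to hold the kernel source string, `"vecadd"` and `N` are illustrative):

```c
#include <CL/cl.h>

// Sketch only: discover a platform and GPU device, build a context/queue.
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

// Compile the kernel source on the host and instantiate the kernel.
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(prog, "vecadd", NULL);

// Launch with a chosen degree of parallelism: one work-item per element.
size_t global_size = N;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, NULL);
```

The host stays in control throughout: it selects the device, compiles the kernel, and decides how many work-items to launch.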
Concurrency Model
OpenCL uses a hierarchical concurrency model consisting of work-groups and work-items, promoting scalability. Each kernel is launched over an n-dimensional range (NDRange): a 1-, 2-, or 3-dimensional index space of work-items, generally matching the dimensionality of the input or output data.
Vector Add Example
Vector addition adds corresponding elements from two arrays, with each work-item (thread) performing one addition. OpenCL lets you define a kernel for such an operation, with work-items and work-groups managing the parallel execution.
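The corresponding OpenCL C kernel looks roughly like this (parameter names are illustrative); each work-item reads its global ID and adds one pair of elements:

```c
__kernel void vecadd(__global const float *a,
                     __global const float *b,
                     __global float *c)
{
    int i = get_global_id(0);  // unique index of this work-item
    c[i] = a[i] + b[i];        // one addition per work-item
}
```

The host would launch this over a 1-D NDRange whose size equals the array length.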
Thread Structures
Massively parallel programs are often written such that each thread computes one part of a problem. The thread structure is usually arranged in the same shape as the data. OpenCL’s thread structure is designed to be scalable, using work-items organized into work-groups. A work-item can identify itself uniquely either by its global ID, or by its work-group ID combined with its local ID.
Memory Model Details
The OpenCL memory model defines various types of memories, closely related to GPU memory hierarchy. These include global memory, constant memory, local memory, and private memory, each with different scopes and accessibility. Memory management is explicit, requiring data movement between host and device memory.
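The explicit data movement looks roughly like this on the host side (real OpenCL calls, but `ctx`, `queue`, `h_a`, `h_c`, `d_c`, and `N` are assumed to exist from earlier setup; error checks omitted):

```c
// Allocate a device buffer and copy input from host to device memory.
cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                            N * sizeof(float), NULL, NULL);
clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                     N * sizeof(float), h_a, 0, NULL, NULL);

// ... set kernel arguments and launch the kernel ...

// Copy the result from device memory back to the host.
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                    N * sizeof(float), h_c, 0, NULL, NULL);
```

Nothing moves between host and device unless the program requests it, which is what "explicit memory management" means here.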