Chapter 6 - CUDA Flashcards
What are GPGPUs?
General-purpose GPUs, i.e. GPUs used for computation beyond graphics.
What is CUDA?
NVIDIA's API for general-purpose programming on GPUs.
What is the architecture of GPUs?
SIMD (single control unit, multiple datapaths). An instruction is fetched from memory and broadcast to the datapaths; each datapath either executes the instruction on its own data or remains idle.
GPUs consist of one or more SIMD processors.
A Streaming Multiprocessor (SM) consists of one or more SIMD processors.
On GPUs, the datapaths are called cores or Streaming Processors (SPs). There can be multiple SPs within an SM.
What does an SM consist of?
Several control units and many datapaths.
SMs operate asynchronously with respect to each other.
There is no penalty if one SM executes one branch of a conditional (if) while another SM executes the other (else).
A small block of memory shared among all the SPs of the SM.
What is the Host?
CPU and its associated memory
What is the device?
GPU and its associated memory
Draw a diagram of a CPU and a GPU
What is heterogeneous computing?
Writing programs that will run on devices with different architectures.
What is a kernel?
A function started by the host, but run on the device.
Declared with the __global__ qualifier.
Must return void.
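A minimal sketch of a kernel definition (the name hello_kernel is hypothetical; device-side printf requires compute capability 2.0 or higher):

#include <stdio.h>

__global__ void hello_kernel(void) {
    printf("Hello from thread %d!\n", threadIdx.x);
}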
How can threads running in CUDA be synchronized?
cudaDeviceSynchronize();
The host (main) waits until the device threads have finished executing the kernel.
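A minimal usage sketch (my_kernel is a hypothetical kernel):

my_kernel<<<1, 256>>>();    // returns immediately
cudaDeviceSynchronize();    // host blocks here until the kernel has finished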
How are kernels run?
Kernel_name<<<grid_dims, block_dims>>>(args);
grid_dims and block_dims together determine how many threads are started on the GPU.
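For example, a hypothetical launch of 2 blocks with 128 threads each (256 threads in total):

vector_add<<<2, 128>>>(x, y, n);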
What variables are defined by CUDA when a kernel is started?
Four structs, each with an x-, y-, and z-field:
threadIdx: rank of the thread within its block
blockDim: dimensions of the blocks
blockIdx: rank of the current block within the grid
gridDim: dimensions of the grid
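A sketch that prints the built-in variables (the kernel name is hypothetical; device-side printf requires compute capability 2.0 or higher):

__global__ void print_coords(void) {
    printf("block (%d,%d,%d), thread (%d,%d,%d)\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z);
}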
What does it mean that a call to a kernel is asynchronous?
The call returns immediately.
To keep the host from terminating before the kernels have completed execution, cudaDeviceSynchronize can be used.
How are threads organized in a kernel?
Each individual thread will execute on one SP.
<<<grid_dims, block_dims>>>
grid_dims: how many SMs to utilize, i.e. how many blocks in each dimension of the grid
block_dims: how many SPs to utilize, i.e. how many threads to run on a single SM
If grid_dims or block_dims is a plain integer, the x-value of the grid or block dimension will be that integer, and the y- and z-dimensions will be set to 1.
To specify the y- and z-dimensions in addition to x, the variable must have type dim3.
How is a dimension variable specified using dim3?
dim3 grid_dims;
grid_dims.x = 2;
grid_dims.y = 3;
grid_dims.z = 4;
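Equivalently, dim3 provides a constructor; dimensions left out default to 1:

dim3 grid_dims(2, 3, 4);    // same as the field assignments above
dim3 block_dims(32, 4);     // z defaults to 1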
What are some conditions of thread blocks?
All blocks have the same dimensions
All blocks can execute independently of each other
All blocks can execute in any order
Each thread block must be able to complete its execution regardless of the state of the other blocks.
What is the compute capability?
There are limits on the number of threads in each block and on the number of blocks.
These limits are described by the compute capability, a number of the form a.b
a: 1, 2, 3, 5, 6, 7, 8
b: 0-7 (depends on the value of a)
The compute capability determines how many threads each block can have and how big the dimensions of blocks and grids can be.
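The compute capability of a device can be queried at runtime; a minimal sketch using cudaGetDeviceProperties:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);    // properties of device 0
printf("Compute capability: %d.%d\n", prop.major, prop.minor);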
How can you calculate the global index of a thread?
x = blockDim.x * blockIdx.x + threadIdx.x
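A typical use, sketched as a hypothetical vector-addition kernel; the guard is needed because the grid may contain more threads than array elements:

__global__ void vec_add(const float* x, const float* y, float* z, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;    // global index
    if (i < n)
        z[i] = x[i] + y[i];
}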
How is memory allocated on the device?
cudaMalloc(void** ptr, size_t size)
Memory allocated on device must be explicitly freed
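A minimal allocation sketch (device_x and n are hypothetical):

float* device_x;
cudaMalloc((void**) &device_x, n * sizeof(float));    // allocate n floats on the device
// ... use device_x in kernels ...
cudaFree(device_x);                                   // must be freed explicitly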
What does the qualifier __device__ do?
Placed in front of a function.
Indicates that the function can only be called from the device.
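A sketch with a hypothetical __device__ helper called from a kernel:

__device__ float square(float v) {    // callable only from device code
    return v * v;
}

__global__ void apply_square(float* x, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        x[i] = square(x[i]);          // kernel calls the helper on the device
}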
How is device memory freed?
cudaFree(data);
Note that cudaFree takes the device pointer itself, not its address.
How is data copied to and from devices?
cudaMemcpy(
void* dest,
void* src,
size_t count,
cudaMemcpyKind kind
)
kind: specifies where src and dest are located (host or device)
The function is blocking: the host waits until the copy has completed, and a copy in the default stream also waits for previously launched kernels to finish, so no explicit synchronization is needed.
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);
host_x and device_x are pointers
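Putting the pieces together, a sketch of the usual host-side pattern (all names are hypothetical):

size_t size = n * sizeof(float);
float* device_x;
cudaMalloc((void**) &device_x, size);
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);    // host -> device
my_kernel<<<blocks, threads_per_block>>>(device_x, n);
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);    // device -> host; also waits for the kernel
cudaFree(device_x);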
What is the return type of a kernel?
void
What does kernel-declaration look like?
__global__ void kernelName();