Chapter 6 - CUDA Flashcards

1
Q

What are GPGPUs?

A

General-purpose GPUs: GPUs used for general-purpose computation, not only graphics.

2
Q

What is CUDA?

A

NVIDIA's API for general-purpose programming on GPUs.

3
Q

What is the architecture of GPUs?

A

SIMD (single control unit, multiple datapaths). An instruction is fetched from memory and broadcast to the multiple datapaths. Each datapath either executes the instruction on its data or remains idle.

GPUs consist of one or more SIMD processors.
A Streaming Multiprocessor (SM) consists of one or more SIMD processors.

On GPUs, the datapaths are called cores or streaming processors (SPs). There can be multiple SPs within an SM.

4
Q

What does a SM consist of?

A

An SM can have several control units and many datapaths.

SMs operate asynchronously.
There is no penalty if one SM executes one part of a conditional (the if) while another SM executes the other (the else).

Each SM has a small block of memory shared amongst all of its SPs.

5
Q

What is the Host?

A

CPU and its associated memory

6
Q

What is the device?

A

GPU and its associated memory

7
Q

Draw a diagram of a CPU and a GPU

A
8
Q

What is heterogeneous computing?

A

Writing programs that will run on devices with different architectures.

9
Q

What is a kernel?

A

A function started by the host, but run by the device.

Declared with the __global__ qualifier.

Returns void.
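
A minimal sketch (the kernel name and message are illustrative):

#include <cstdio>

// Declared __global__, returns void; launched by the host, executed on the device.
__global__ void hello_kernel(void) {
    printf("Hello from thread %d\n", threadIdx.x);
}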

10
Q

How can threads running in CUDA be synchronized?

A

cudaDeviceSynchronize();

Main (the host) waits until the device threads have finished executing the kernel.

11
Q

How are kernels run?

A

kernel_name<<<grid_dims, block_dims>>>(args);

grid_dims and block_dims determine how many threads are started on the GPU.
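
A sketch of a typical launch, assuming a kernel vec_add and n data elements (names are illustrative):

int threads_per_block = 256;
int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up so every element gets a thread
vec_add<<<blocks, threads_per_block>>>(x, y, z, n);
cudaDeviceSynchronize();  // wait for the kernel to finish before using the results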

12
Q

What variables are defined by CUDA when a kernel is started?

A

4 structs, each with an x-, y- and z-field:

threadIdx: the rank of the thread within its block

blockDim: the dimensions of the blocks

blockIdx: the rank of the current block within the grid

gridDim: the dimensions of the grid
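
A sketch that prints these four structs from every thread (kernel name is illustrative):

#include <cstdio>

__global__ void show_ids(void) {
    printf("thread (%d,%d,%d) of blockDim (%d,%d,%d), block (%d,%d,%d) of gridDim (%d,%d,%d)\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           blockIdx.x, blockIdx.y, blockIdx.z,
           gridDim.x, gridDim.y, gridDim.z);
}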

13
Q

What does it mean that a call to a kernel is asynchronous?

A

The call returns immediately.

To avoid the host terminating before the kernels have completed execution, cudaDeviceSynchronize can be used.

14
Q

How are threads organized in a kernel?

A

Each individual thread executes on one SP.

<<<grid_dims, block_dims>>>

grid_dims: how many blocks to start in each dimension (how many SMs to utilize)

block_dims: how many threads to run in each dimension of a single block (how many SPs to utilize within an SM)

If grid_dims or block_dims is an integer, the x-value of the grid or block dimension is that integer, and the y- and z-dimensions are set to 1.

To specify the y- and z-dimensions in addition to x, the variable must be of type dim3.

15
Q

How is a dimension variable specified using dim3?

A

dim3 grid_dims;

grid_dims.x = 2;
grid_dims.y = 3;
grid_dims.z = 4;

16
Q

What are some conditions of thread blocks?

A

All blocks have the same dimensions.

All blocks can execute independently of each other.

All blocks can execute in any order.

Each thread block must be able to complete its execution regardless of the state of the other blocks.

17
Q

What is the compute capability?

A

There are limits on the number of threads in each block and on the number of blocks.

This limit has the form a.b
a: 1, 2, 3, 5, 6, 7, 8
b: 0-7 (depends on the value of a)

The compute capability specifies how many threads each block can have, and how big the dimensions of blocks and grids can be.
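
A sketch of querying a device's compute capability and limits at runtime with cudaGetDeviceProperties (device 0 assumed):

#include <cstdio>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                       // query device 0
    printf("compute capability %d.%d\n", prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}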

18
Q

How can you calculate the global index of a thread?

A

x = blockDim.x * blockIdx.x + threadIdx.x
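
A sketch of a vector-add kernel built around this global index (names are illustrative):

__global__ void vec_add(const float* x, const float* y, float* z, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // guard: the grid may contain more threads than n
        z[i] = x[i] + y[i];
}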

19
Q

How is memory allocated to devices?

A

cudaMalloc(void** ptr, size_t size)

Memory allocated on the device must be explicitly freed.

20
Q

What does the qualifier __device__ do?

A

Put in front of functions.
Indicates that a function can only be called from the device.
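
A sketch of a __device__ helper called from a kernel (names are illustrative):

// Callable only from device code, e.g. from within a kernel.
__device__ float square(float x) { return x * x; }

__global__ void square_all(float* data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}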

21
Q

How is device memory freed?

A

cudaFree(data);

The device pointer itself is passed, not its address.

22
Q

How is data copied to and from devices?

A

cudaMemcpy(
void* dest,
void* src,
size_t count,
cudaMemcpyKind kind
)

kind: indicates where src and dest are located (host or device).

The function is synchronous: the host waits for previously launched kernels to finish and for the copy to complete, so the program doesn't need to synchronize explicitly.

cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);

cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);

host_x and device_x are pointers.
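
A sketch of the typical host-side flow, assuming a kernel my_kernel that takes a device pointer and a length (all names are illustrative):

#include <cstdlib>

__global__ void my_kernel(float* x, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int n = 1 << 20;
    size_t size = n * sizeof(float);
    float* host_x = (float*) malloc(size);
    float* device_x;
    cudaMalloc(&device_x, size);                                 // allocate device memory

    cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);  // host -> device
    my_kernel<<<(n + 255) / 256, 256>>>(device_x, n);            // launch the kernel
    cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);  // device -> host (waits for the kernel)

    cudaFree(device_x);                                          // device memory must be freed explicitly
    free(host_x);
    return 0;
}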

23
Q

What is the return type of a kernel?

A

void

24
Q

What does kernel-declaration look like?

A

__global__ void kernelName();

25
Q

What is a similarity between CUDA, Pthreads and OpenMP?

A

CUDA threads are also allocated stacks and local variables.

26
Q

What are warp shuffles?

A

Functions that allow the threads within a warp to read variables stored by other threads in the warp, i.e. to read from registers used by the other threads in the warp.

27
Q

How is memory laid out in CUDA?

A

Each SM has "shared" memory that is accessible to all of the SPs within that SM. All of the SPs and all of the threads also have access to "global" memory. Shared memory locations are few but fast; global memory locations are many but slow. The fastest memory is the registers.

28
Q

How are local variables stored in CUDA?

A

It depends on the total available memory and the program's memory usage. If there is enough storage, they are stored in registers; if not, they are stored in a thread-private region of global memory.

29
Q

What is a warp?

A

A set of threads with consecutive ranks belonging to a block. The number of threads is 32, though this can change in future CUDA implementations. The system-initialized variable warpSize stores the size.

30
Q

How do threads in a warp behave?

A

They execute in a SIMD fashion: threads within the same warp must execute the same statement, while threads in different warps can execute different statements.

31
Q

What does it mean when threads are said to have diverged?

A

Threads within a warp attempt to execute different statements, e.g. they take different branches of an if-else statement. When the diverged threads have finished executing the different statements, they begin executing the same statements again; at that point the threads have converged.

32
Q

What is a thread's lane? How can it be calculated?

A

Its rank within a warp:

lane = threadIdx.x % warpSize

33
Q

How does the warp shuffle function __shfl_down_sync work?

A

__shfl_down_sync(unsigned mask, float var, unsigned diff, int width = warpSize)

mask: indicates which threads participate in the call. A bit corresponding to a thread's lane must be set for each participating thread; this ensures all threads in the call have converged (arrived at the call) before they begin executing the function.

var: when lane x calls the function, the value stored in var on the thread with lane (x + diff) is returned to the current thread x.

diff: unsigned, hence greater than or equal to 0; the value returned therefore comes from a higher lane (hence the name shfl_down).

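A sketch of a warp-level sum built on __shfl_down_sync, assuming all 32 threads of the warp participate (hence the full mask):

// After the loop, lane 0 holds the sum of the values from all 32 lanes.
__device__ float warp_sum(float val) {
    for (int diff = warpSize / 2; diff > 0; diff /= 2)
        val += __shfl_down_sync(0xffffffff, val, diff);  // read from lane (x + diff)
    return val;  // only meaningful on lane 0
}
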
34
Q

What are some possible issues with warp shuffles?

A

If not all of the threads in a warp call the function, or the warp does not have warpSize threads (or a multiple of it), the function may return undefined results.

35
Q

What happens if thread x calls a warp shuffle and its corresponding source lane does not?

A

The value returned to x is undefined.

36
Q

How can you create shared variables?

A

__shared__ float shared_vals[32];

The definition needs to be inside the kernel.

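A sketch of a kernel that uses such a shared array, assuming blocks of 32 threads (names are illustrative):

__global__ void block_sum(const float* in, float* block_results) {
    __shared__ float shared_vals[32];            // shared by all threads in the block
    shared_vals[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                             // wait until every thread has stored its value
    if (threadIdx.x == 0) {                      // one thread sums the block's values
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; i++)
            sum += shared_vals[i];
        block_results[blockIdx.x] = sum;
    }
}
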
37
Q

What is an SM?

A

Streaming Multiprocessor (SM): the closest thing the GPU has to a CPU core. It contains 128 32-bit floating point units (128 'cores').

38
Q

What are similarities between CPUs and GPUs?

A

Both are general-purpose processor cores, and both use superscalar architectures.

39
Q

How are threads executed in GPUs?

A

They are executed in groups of 32 (warps). All threads in a warp execute the same instructions (GPUs are designed to do the same thing over and over).

40
Q

What is a main reason for using warps?

A

Only one instruction decode is needed for 32 threads.

41
Q

What is thread divergence?

A

When a branch occurs, all threads in the warp must participate, even if the branch is not in their execution path.

if: if one thread chooses one of the paths, the other threads must participate in executing that path.
loop: all threads must participate until the final thread has finished iterating.

Example:
Thread 1: if = true, a = 3; else: a = N/A
Thread 2: if = true, a = N/A; else: a = 10
Both threads execute along both paths, though they might not do useful work (N/A).

42
Q

When does thread divergence become a problem?

A

When threads are likely to follow different execution paths:
- long if/else statements where threads are likely to choose different paths
- loops with a highly variable number of iterations

Fixes:
- rewrite the branch using arithmetic
- rewrite multiple nested loops as a single loop

43
Q

What class of parallelism does warp-based execution use in Flynn's taxonomy?

A

SIMT

44
Q

What is the difference between SIMD and SIMT?

A

SIMD: a single instruction executes multiple data operations in a single thread.
SIMT: a single instruction executes a single data operation in multiple threads.

45
Q

How does the CPU do a context switch?

A

1. Store the PC
2. Store all registers to the stack
3. Load the new thread's registers from its stack
4. Jump to the new thread's PC
5. Continue execution

46
Q

How is context switching done on GPUs?

A

A context switch happens every single clock cycle. The register file stores the registers of all threads, so there is no need to swap registers. Each cycle, 4 warps that are able to execute an instruction are chosen.

47
Q

What is the memory system of GPUs?

A

A shared L2 cache. Each SM has its own L1 cache plus a separate instruction cache. Part of the L1 cache can be used as temporary storage known as shared memory.

48
Q

What is the measure of GPU utilisation?

A

Occupancy = cycles the SM is busy / total cycles

49
Q

What is a GPU grid?

A

It contains all the threads we want to run.

dim3 gridDim = {x, y, z};

50
Q

What is a GPU block?

A

A collection of threads, all running within a single SM.

dim3 blockDim = {x, y, z};

51
Q

What happens when a kernel is launched?

A

A queue of blocks is created. A block runs until all of its threads have completed, and blocks may run in any order. Once a block has completed, the next block in the queue is allocated to that SM. Blocks should be dimensioned so that they contain a multiple of 32 threads.

52
Q

What are the desired dimensions of a block?

A

A multiple of 32 threads.

53
Q

What does __syncthreads() do?

A

It enables interaction between threads within a block: it is a barrier for all threads in the block.

54
Q

What mechanism do GPUs have to avoid race conditions?

A

Atomic operations:

int atomicAdd(int* address, int val)
int atomicSub(int* address, int val)
int atomicMax(int* address, int val)

55
Q

What is shared memory?

A

The part of the SM's L1 cache used for temporary storage. It is useful if the same data will be used many times, can be used to communicate between warps in a block, and is allocated on a per-block basis.

It stores:
- intermediate results
- data to be reused
- values exchanged between warps in a block

56
Q

What limits do SMs have?

A

There is a limit on the number of simultaneously executing threads, and misconfigurations can decrease performance. In the Ada Lovelace architecture:
- max threads per block: 1024
- max blocks per SM: 24
- max warps per SM: 48 (1536 threads)
- max registers per thread: 255
- shared memory: 100 KB

Register requirements limit the number of warps that can execute simultaneously in an SM. Blocks have a constant number of warps and cannot be partially allocated to an SM, even if the SM's register file has space for some of the warps in the block. The shared memory required by each block also limits the number of warps.

57
Q

How are kernels launched?

A

dim3 gridDims = {1, 2, 3};
dim3 blockDims = {2, 2, 3};
kernelName<<<gridDims, blockDims>>>(param1, param2);

No dimension can be 0.

58
Q

How are registers used?

A

Each thread has a fixed number of registers, so a warp uses 32x that number. Register values are kept in register files within the SM. Register requirements limit the number of warps that can execute simultaneously in an SM.

59
Q

What are some block size trade-offs?

A

Large blocks:
- more ability to cooperate between threads
- better reuse of shared memory

Small blocks:
- less waiting for all warps to complete
- may improve occupancy by allowing more warps to execute simultaneously

60
Q

What are cooperative groups?

A

Threads in blocks and warps partitioned into groups that work together. The collaboration stems from communication between threads within the same warp being cheap.

61
Q

How do you implement cooperative groups?

A

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

62
Q

What are some types of cooperative groups?

A

Coalesced group
Block group
Grid group
Cluster group

63
Q

What is a coalesced group?

A

The threads in the current warp, but only the ones that are executing at that point in time.

Example:

__global__ void kernel() {
    if (threadIdx.x < 12) return;
    cg::coalesced_group warp = cg::coalesced_threads();
}

20 threads would be active in the warp.

64
Q

What is a block group?

A

A group with all the threads in the current block.

cg::thread_block block = cg::this_thread_block();

65
Q

What is a grid group?

A

A group of all the threads in the entire grid.

cg::grid_group grid = cg::this_grid();

The kernel cannot be launched with <<<>>>; it must be launched with:

cudaLaunchCooperativeKernel(kernel, grid_dims, block_dims, args);

66
Q

What is a cluster group?

A

A union of multiple thread blocks.

67
Q

What is group partitioning?

A

Creating smaller subgroups from larger ones.

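A sketch of partitioning the block group into 32-thread tiles with cg::tiled_partition (kernel name is illustrative):

#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void kernel(void) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);  // 32-thread subgroups
    tile.sync();                                                      // barrier over the tile only
    if (tile.thread_rank() == 0)
        printf("first thread of a tile\n");
}
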
68
Q

How are barriers used on groups?

A

group.sync();

Works on any group or partition.

69
Q

How are instructions executed in terms of GPU structure?

A

All threads are executed grouped as a warp.

70
Q

What are the 8 GPU limits?

A

1. Max 1024 threads per block
2. Max 24 blocks per SM
3. Max 48 warps per SM
4. Max registers per thread: 255
5. Shared memory: 100 KB
6. Register requirements limit the number of warps that can execute simultaneously in an SM
7. Blocks have a constant number of warps and cannot be partially allocated to an SM
8. The shared memory required by each block limits the number of warps

71
Q

How do register requirements limit the number of warps that can be executed simultaneously in an SM?

72
Q

How does the shared memory required by each block limit the number of warps?

73
Q

What are two common issues with memory in GPUs?

A

Non-coalesced memory reads
Atomic write contention

74
Q

What are coalesced memory reads?

A

GPU cache lines are often 128 bytes (32 threads * 4 bytes). Memory allocated with cudaMalloc is always aligned, meaning byte 0 is the start of a cache line. Even if only one byte is read, the whole cache line must be loaded. Reads are coalesced when 100% of the bandwidth is utilised by reading one cache line per warp, with each thread reading 4 bytes: 4 bytes * 32 threads = 128 bytes (the whole cache line).

75
Q

Code an example where a thread only uses one value from each cache line

A

__global__ void kernel(int* array, int n) {
    int value = array[32 * threadIdx.x];
}

As each int is 4 bytes, every thread skips ahead 32 * 4 = 128 bytes, meaning it reads from the next cache line. One cache line must be read in per thread.

76
Q

Give a coding example of a coalesced read

A

__global__ void kernel(int* array, int n) {
    int value = array[threadIdx.x];
}

77
Q

What is atomic write contention?

A

When multiple threads try to perform an atomic operation on the same address, their effects are serialised because the threads need to wait for each other.

__global__ void kernel(int* array, int* oddCount) {
    int value = array[threadIdx.x];
    if (value % 2 == 1) {
        atomicAdd(oddCount, 1);
    }
}

78
Q

What is a solution to atomic write contention?

A

Do a reduction within the warp, then have a single thread perform the atomic operation.

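A sketch of this idea (warp-aggregated counting): reduce with shuffles, then let lane 0 do the single atomicAdd. It assumes the whole warp reaches this code, hence the full mask; names are illustrative.

// Each thread contributes flag (0 or 1); one atomic per warp instead of one per thread.
__device__ void count_flags(int flag, int* counter) {
    int total = flag;
    for (int diff = 16; diff > 0; diff /= 2)         // warp-level reduction
        total += __shfl_down_sync(0xffffffff, total, diff);
    if (threadIdx.x % warpSize == 0)                 // lane 0 performs the single atomic
        atomicAdd(counter, total);
}
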
79
Q

What is a common problem when using arrays of structs, and how can this be solved?

A

Misaligned (strided) reads.

struct Int3 { int x; int y; int z; };
Int3 var;
var.x = varArray[threadIndex].x;

Memory: x y z x y z x y z x y z
index:  0     1     2     3

If every xyz-block sits in its own cache line, each thread reads only one value, and from its own cache line.

Solve it by using a struct of arrays, which stores all the x values together, and likewise for y and z:

struct ArrayOfInt3 { int* x; int* y; int* z; };
int x_value = arrayOfInt3.x[index];

Memory: x x x x y y y y z z z z

79
Q

What are vectorized loads?

A

If the data type is exactly 4, 8 or 16 bytes, the memory system guarantees that the data is loaded in one operation. In that case there is no need to use a struct of arrays.

80
Q

What is register spilling?

A

Temporarily writing some register values to memory.

Upside: more threads can run simultaneously.
Downside: more memory transactions.

The compiler spills registers when it thinks doing so will improve performance.

81
Q

How can register spilling be avoided?

A

Identify the parts of the kernel where many variables must be kept live simultaneously:
- nested function calls
- loops

Consider recalculating values if that is not too expensive, and cut fields from structs that are unrelated to the kernel.

82
Q

What is a shuffle instruction?

A

It exchanges values within a warp: each thread provides a value, and each thread reads a value provided by another thread.

__shfl_sync(unsigned mask, T var, int srcLane)

mask: defines which lanes are included in the call, often __activemask() (all active threads in the warp).

Thread 3 executes this, sending the value 7 and reading from lane 8:
int out = __shfl_sync(__activemask(), 7, 8);

Thread 10 executes this, sending the value 11 and receiving the value from lane 3, which is 7:
int out = __shfl_sync(__activemask(), 11, 3);

83
Q

Which threads can communicate using shuffle instructions?

A

The threads must be in the same block, and must be in the same warp or in a cooperative group of up to 32 threads.

84
Q

What happens if a thread using a shuffle instruction reads from a lane that is not participating?

A

The result is undefined.

85
Q

What are the four shuffle instructions?

A

Read from any thread:
__shfl_sync(mask, var, srcLane)

Read from the thread with laneId = (current lane - delta):
__shfl_up_sync(mask, var, delta)

Read from the thread with laneId = (current lane + delta):
__shfl_down_sync(mask, var, delta)

Read from the thread with laneId = (current lane XOR laneMask):
__shfl_xor_sync(mask, var, laneMask)

86
Q

How can shuffle instructions be used to create a sum?

A

int threadValue = 12;  // this thread's value
int sum = threadValue;
sum += __shfl_xor_sync(__activemask(), sum, 16);
sum += __shfl_xor_sync(__activemask(), sum, 8);
sum += __shfl_xor_sync(__activemask(), sum, 4);
sum += __shfl_xor_sync(__activemask(), sum, 2);
sum += __shfl_xor_sync(__activemask(), sum, 1);
return sum;

This creates a butterfly reduction (cross pattern): first values are exchanged at distance 16, creating one cross, then at distance 8, creating 2 crosses, then at distance 4, and so on. At the end every thread in the warp holds the full sum.

87
Q

How can shuffle instructions be used to broadcast values?

A

Example: reserving a block of 32 entries in a buffer.

__shfl_sync(__activemask(), value, 0);

All threads read from lane 0: lane 0 calculates the value (e.g. reserves the block), then all threads can use that value in further computation.

88
Q

What is warp voting?

A

Each thread in the warp sets one bit in a 32-bit integer. The bit index corresponds to the lane index, and only active threads vote.

89
Q

What are the warp voting instructions?

A

Create a 32-bit integer where each lane sets one bit:
unsigned int __ballot_sync(mask, bool predicate)

Returns true if all threads vote true:
bool __all_sync(mask, bool predicate)

Returns true if any thread votes true:
bool __any_sync(mask, bool predicate)

Reverses the bits of a 32-bit integer:
unsigned int __brev(unsigned int mask)

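A sketch using __ballot_sync together with the bit-count intrinsic __popc to count how many values in the warp are odd (assumes the whole warp participates; names are illustrative):

__device__ int count_odd_in_warp(int value) {
    unsigned ballot = __ballot_sync(0xffffffff, value % 2 == 1);  // one bit per lane that voted true
    return __popc(ballot);                                        // number of set bits
}
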
90
Q

What are some use cases of warp voting?

A

Identifying elements that should be removed.

91
Q

What are warp reductions?

A

Functions used to reduce the thread values of a warp into one result. They have the form:

__func_name(unsigned mask, unsigned/int value)
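
One such function is __reduce_add_sync (available on devices of compute capability 8.0 and newer); a sketch, assuming the whole warp participates:

__device__ unsigned warp_total(unsigned value) {
    return __reduce_add_sync(0xffffffff, value);  // every lane receives the warp-wide sum
}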