Chapter 6 - CUDA Flashcards

1
Q

What are GPGPUs?

A

General purpose GPUs

2
Q

What is CUDA?

A

API for general purpose programming on GPUs.

3
Q

What is the architecture of GPUs?

A

SIMD (single control unit, multiple datapaths). An instruction is fetched from memory and broadcast to the datapaths. Each datapath either executes the instruction on its own data or remains idle.

GPUs consist of one or more SIMD processors.
A Streaming Multiprocessor (SM) consists of one or more SIMD processors.

On GPUs, the datapaths are called cores or Streaming Processors (SPs). There can be multiple SPs within an SM.

4
Q

What does a SM consist of?

A

Several control units and many datapaths.

SMs operate asynchronously: there is no penalty if one SM executes one branch of a conditional (the if) while another SM executes the other (the else).

A small block of memory shared amongst all of its SPs.

5
Q

What is the Host?

A

CPU and its associated memory

6
Q

What is the device?

A

GPU and its associated memory

7
Q

Draw a diagram of a CPU and a GPU

A
8
Q

What is heterogeneous computing?

A

Writing programs that will run on devices with different architectures.

9
Q

What is a kernel?

A

A function started by the host but run on the device.

Declared with the __global__ qualifier.

Its return type is void.

10
Q

How can threads running in CUDA be synchronized?

A

cudaDeviceSynchronize();

Main waits until the device threads have finished executing the kernel.

11
Q

How are kernels run?

A

kernel_name<<<grid_dims, block_dims>>>(args);

grid_dims and block_dims together determine how many threads to start on the GPU.
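A minimal sketch of a launch (the kernel name hello and the dimensions are illustrative, not from the card):

#include <stdio.h>

// Illustrative kernel: each thread prints its block and thread index.
__global__ void hello(void) {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello<<<2, 128>>>();        // grid_dims = 2 blocks, block_dims = 128 threads per block
    cudaDeviceSynchronize();    // wait for the device before main exits
    return 0;
}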

12
Q

What variables are defined by CUDA when a kernel is started?

A

Four structs, each with x-, y- and z-fields:

threadIdx: defines rank of thread within block

blockDim: dimensions of the blocks

blockIdx: rank of current block in grid

gridDim: Dimensions of grid
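A small sketch (the kernel name is illustrative) that reads the built-in structs; each thread can locate itself in the grid from these values alone:

#include <stdio.h>

__global__ void show_ids(void) {
    // threadIdx, blockIdx, blockDim and gridDim are provided by CUDA.
    printf("thread %d/%d in block %d/%d\n",
           threadIdx.x, blockDim.x, blockIdx.x, gridDim.x);
}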

13
Q

What does it mean that a call to a kernel is asynchronous?

A

The call returns immediately.

To avoid the host terminating before kernels have completed execution, cudaDeviceSynchronize can be used.

14
Q

How are threads organized in a kernel?

A

Each individual thread executes on one SP.

<<<grid_dims, block_dims>>>

grid_dims: how many SMs to utilize - how many blocks in each dimension.

block_dims: how many SPs to utilize - how many threads to run on a single SM.

If grid_dims or block_dims is an integer, the x-value of the grid or block dimension is that integer, and the y- and z-dimensions are set to 1.

To specify the y- and z-dimensions in addition to x, the variable must be of type dim3.

15
Q

How is a dimension variable specified using dim3?

A

dim3 grid_dims;

grid_dims.x = 2;
grid_dims.y = 3;
grid_dims.z = 4;

16
Q

What are some conditions of thread blocks?

A

All blocks have the same dimensions

All blocks can execute independently of each other

All blocks can execute in any order

A thread block must be able to complete its execution regardless of the state of the other blocks.

17
Q

What is the compute capability?

A

There are limits on the number of threads in each block and on the number of blocks. These limits are tied to the compute capability, which has the form a.b
a: 1, 2, 3, 5, 6, 7, 8
b: 0-7 (depends on the value of a)

The compute capability determines how many threads each block can have, and how big the dimensions of blocks and grids can be.

18
Q

How can you calculate the global index of a thread?

A

x = blockDim.x * blockIdx.x + threadIdx.x
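A sketch of the classic use of this formula, a vector-add kernel (the names x, y, z and n are illustrative):

__global__ void vec_add(const float* x, const float* y, float* z, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // global index of this thread
    if (i < n) {                                     // guard: the grid may contain extra threads
        z[i] = x[i] + y[i];
    }
}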

19
Q

How is memory allocated to devices?

A

cudaMalloc(void** ptr, size_t size)

Memory allocated on device must be explicitly freed

20
Q

What does the qualifier __device__ do?

A

Put in front of functions.
Indicates that a function can only be called from the device.

21
Q

How is device memory freed?

A

cudaFree(data);

(data is the device pointer that was allocated with cudaMalloc.)

22
Q

How is data copied to and from devices?

A

cudaMemcpy(
void* dest,
const void* src,
size_t count,
cudaMemcpyKind kind
)

kind: indicates where src and dest are located.

The function is synchronous: the host waits for previously launched kernels to finish and for the copy to complete, so no explicit synchronization is needed.

cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);

cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);

host_x and device_x are pointers
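A hedged end-to-end sketch of the usual allocate / copy / launch / copy / free flow (the kernel scale and the sizes are illustrative):

#include <stdlib.h>

// Illustrative kernel: doubles each element in place.
__global__ void scale(float* x, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int n = 1 << 20;
    size_t size = n * sizeof(float);

    float* host_x = (float*) malloc(size);
    for (int i = 0; i < n; i++) host_x[i] = 1.0f;

    float* device_x;
    cudaMalloc((void**)&device_x, size);                          // allocate on the device

    cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);   // host -> device
    scale<<<(n + 255) / 256, 256>>>(device_x, n);                 // launch
    cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);   // device -> host (waits for the kernel)

    cudaFree(device_x);                                           // device memory must be freed explicitly
    free(host_x);
    return 0;
}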

23
Q

What is the return type of a kernel

A

void

24
Q

What does kernel-declaration look like?

A

__global__ void kernelName();

25
Q

What is a similarity between CUDA, Pthreads, and OpenMP?

A

CUDA threads are also allocated stacks and local variables.

26
Q

What are warp shuffles?

A

Functions that allow the threads in a warp to read variables stored by other threads in the same warp.

In other words, they let a thread read from registers used by other threads in the warp.

27
Q

How is memory laid out in CUDA?

A

Each SM has access to "shared" memory. This shared memory is accessible to all of the SPs within that SM.

All of the SPs and all of the threads have access to "global" memory.

Shared memory is small but fast; global memory is large but slow.

The fastest memory of all is the registers.

28
Q

How are local variables stored in CUDA?

A

It depends on the total available memory and the program's memory usage.

If there is enough storage, they are stored in registers.
If not, the local variables are stored in a thread-private region of global memory.

29
Q

What is a warp?

A

A set of threads with consecutive ranks belonging to a block.

The number of threads is 32 - this could change in future CUDA implementations.

The system-initialized variable warpSize stores the size.

30
Q

How do threads in a warp behave?

A

They execute in a SIMD fashion

Threads in different warps can execute different statements; threads within the same warp must execute the same statement.

31
Q

What does it mean when threads are said to be diverged?

A

Threads within a warp attempt to execute different statements.

E.g. take different branches in an if-else statement.

When diverged threads have finished executing the different statements, they resume executing the same statements. At that point the threads have converged.

32
Q

What is a thread’s lane?
How can it be calculated?

A

Its rank within a warp.
lane = threadIdx.x % warpSize

33
Q

How does the warp shuffle function __shfl_down_sync work?

A

__shfl_down_sync(
unsigned mask,
float var,
unsigned diff,
int width = warpSize
)

mask: indicates which threads participate in the call. A bit corresponding to each participating thread's lane must be set; this ensures all threads in the call have converged (arrived at the call) before they execute the function.

var: when the thread with lane x calls the function, the value stored in var on the thread with lane (x + diff) is returned to thread x.

diff: unsigned, so it is >= 0; the returned value therefore comes from a higher lane (hence the name shfl_down).
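A sketch of the standard warp-sum pattern built on __shfl_down_sync (assumes all 32 lanes of the warp participate; after the loop only lane 0 holds the complete sum):

__device__ float warp_sum(float val) {
    unsigned mask = 0xffffffff;                      // all 32 lanes participate
    for (int diff = warpSize / 2; diff > 0; diff /= 2) {
        val += __shfl_down_sync(mask, val, diff);    // add the value held diff lanes above
    }
    return val;                                      // the complete sum is valid in lane 0
}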

34
Q

What are some possible issues with warp shuffles?

A

If not all threads in a warp call the function, or the block size is not a multiple of warpSize (so the last warp is incomplete), the function may return undefined results.

35
Q

What happens if x calls a warp shuffle and its corresponding thread lane does not?

A

Thread x gets an undefined result.

36
Q

How can you create shared variables?

A

__shared__ float shared_vals[32];

The definition must be inside the kernel.
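A minimal sketch (assuming the kernel is launched with 32 threads per block, matching the array size): threads stage values in shared memory, synchronise, then read another thread's value:

__global__ void reverse_block(const float* in, float* out) {
    __shared__ float shared_vals[32];              // defined inside the kernel, shared by the block

    shared_vals[threadIdx.x] = in[threadIdx.x];    // each thread writes one slot
    __syncthreads();                               // wait until every thread has written

    out[threadIdx.x] = shared_vals[blockDim.x - 1 - threadIdx.x];   // read a different thread's value
}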

37
Q

What is a SM?

A

Streaming Multiprocessor (SM)
Closest thing the GPU has to a CPU core

128 32-bit floating point units (128 ‘cores’)

38
Q

What are similarities between CPUs and GPUs?

A

Both are general purpose processor cores

Both use superscalar architectures

39
Q

How are threads executed in GPUs?

A

Executed in groups of 32 (warps)

All threads in a warp execute the same instructions (GPUs are designed to do the same thing over and over)

40
Q

What is a main reason for using warps?

A

Only one decode needed for 32 threads

41
Q

What is thread divergence?

A

When a branch occurs, all threads in the warp must participate, even if the branch is not in their execution path.

if: if one thread chooses one of the paths, the other threads must participate in executing that path.

loop: all threads must participate until the final thread has finished iterating.

Thread 1 (condition true):
if: a = 3
else: N/A

Thread 2 (condition false):
if: N/A
else: a = 10

Both threads step through both paths, even though part of that work is not useful (N/A).

42
Q

When does thread divergence become a problem?

A

When threads follow different execution paths for a long time:

- Long if/else blocks where threads are likely to choose different paths
- Loops with a highly variable number of iterations

Fixes (see the sketch below):
- Rewrite the branch using math
- Rewrite multiple nested loops as one single loop
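A hedged sketch of the "rewrite the branch using math" fix (names are illustrative; flag holds 0 or 1). The branchy version makes lanes that disagree on flag diverge; the arithmetic version runs the same instructions on every lane:

__global__ void blend_branchy(const float* a, const float* b, const int* flag, float* out) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (flag[i]) out[i] = a[i];                    // lanes that disagree on flag[i] diverge
    else         out[i] = b[i];
}

__global__ void blend_math(const float* a, const float* b, const int* flag, float* out) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float f = (float) flag[i];                     // 0.0f or 1.0f
    out[i] = f * a[i] + (1.0f - f) * b[i];         // no branch: all lanes execute the same path
}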

43
Q

What class of parallelism does warp-based execution use in Flynn’s taxonomy?

A

SIMT

44
Q

What is the difference between SIMD and SIMT?

A

SIMD: Single instruction executes multiple data operations in a single thread

SIMT: A single instruction executes a single data operation in multiple threads

45
Q

How does the CPU do a context switch?

A

Store PC

Store all registers to the stack

Load the new thread's registers from the stack

Jump to PC of new thread

Continue execution

46
Q

How is context switching done on GPUs?

A

Context switch every single clock cycle

The register file stores the registers of all threads, so there is no need to swap registers on a context switch.

Each cycle, 4 warps are chosen that are able to execute an instruction

47
Q

What is the memory system of GPUs?

A

Shared L2 cache

Each SM has its own L1 cache plus a separate instruction cache. Part of the L1 cache can be used as temporary storage known as shared memory.

48
Q

What is the measure of GPU utilisation?

A

Occupancy = cycles the SM is busy / total cycles

49
Q

What is a GPU grid?

A

Contains all threads we want to run.

dim3 gridDim = {x, y, z}

50
Q

What is a GPU block?

A

Collection of threads, all running within a single SM

dim3 blockDim = {x, y, z}

51
Q

What happens when a kernel is launched?

A

A queue of blocks is created

A block runs until all of its threads have completed

Blocks run in any order

Once a block has completed, the next block is allocated to that SM

Blocks should be dimensioned such that they contain a multiple of 32 threads

52
Q

What is the desired dimensions of a block?

A

A multiple of 32 threads

53
Q

What does __syncthreads() do?

A

Enables interaction between threads within a block.

It is a barrier for all threads in the block.

54
Q

What mechanism does GPUs have to avoid race conditions?

A

Atomic operations

int atomicAdd(int* address, int val)
int atomicSub(int* address, int val)
int atomicMax(int* address, int val)
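A small sketch using atomicAdd (the kernel and parameter names are illustrative): every thread that finds a positive element bumps a shared counter, and the atomic guarantees no update is lost:

__global__ void count_positive(const int* data, int n, int* count) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        atomicAdd(count, 1);    // concurrent updates are serialised, not lost
    }
}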

55
Q

What is shared memory?

A

Part of the SMs L1 cache used for temporary storage.

Useful if the same data will be used many times.

Can be used to communicate between warps in a block.

Allocated on a per-block basis.

Stores:
- intermediate results
- data to be reused
- exchange values between warps in a block

56
Q

What limits do SMs have?

A

Number of simultaneously executing threads

Misconfigurations can decrease performance

In the Ada Lovelace architecture:
- Max threads per block: 1024
- Max blocks per SM: 24
- Max warps per SM: 48 (1536 threads)
- Max registers per thread: 255
- Shared memory: 100 KB

Register requirements limit the number of warps that can be executed simultaneously in an SM.

A block has a constant number of warps and cannot be partially allocated to an SM, even if the SM's register file has space for some of the warps in the block.

The shared memory required by each block limits the number of warps.

57
Q

How are kernels launched?

A

dim3 gridDims = {1, 2, 3};
dim3 blockDims = {2, 2, 3};

kernelName<<<gridDims, blockDims>>>(param1, param2);

No dimension can be 0.

58
Q

How are registers used?

A

Each thread has a fixed number of registers

A warp uses 32x that number of registers

register values are kept in register files within the SM

register requirements limit the number of warps that can be executed simultaneously in an SM

59
Q

What are some block size trade offs?

A

Large:
- More ability to cooperate between threads
- better reuse of shared memory

Small:
- Less waiting for all warps to complete
- May improve occupancy by allowing more warps to execute simultaneously

60
Q

What are cooperative groups?

A

Threads in blocks and warps partitioned into groups that work together.

Collaboration stems from communication between threads within the same warp being cheap

61
Q

How do you implement cooperative groups?

A

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

62
Q

What are some types of cooperative groups?

A

Coalesced group
Block group
Grid group
Cluster group

63
Q

What is a Coalesced group?

A

Threads in the current warp, but only the ones that are executing at that point in time

Example:

__global__ void kernel(){
    if(threadIdx.x < 12) return;
    cg::coalesced_group warp = cg::coalesced_threads();
}

20 threads would be active in the warp (lanes 12-31).

64
Q

What is a block group?

A

A group with all threads in the current block

cg::thread_block block = cg::this_thread_block();

65
Q

What is a grid group?

A

Group of all threads in the entire grid

cg::grid_group grid = cg::this_grid();

Cannot be launched with <<<>>>; must use:
cudaLaunchCooperativeKernel(kernel, gridDims, blockDims, args);

66
Q

What is a cluster group?

A

A union of multiple thread blocks

67
Q

What is group partitioning?

A

Creating smaller subgroups from larger ones
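A sketch of partitioning with the cooperative-groups API (assuming the usual cg namespace alias from the include card): the block group is split into 32-thread tiles, and each tile can synchronise on its own:

__global__ void partition_example(void) {
    cg::thread_block block = cg::this_thread_block();

    // Split the block into tiles of 32 threads (one warp each).
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int lane = tile.thread_rank();   // rank within the tile
    tile.sync();                     // barrier for this tile only
}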

68
Q

How are barriers used on groups?

A

group.sync();

Works on any group or partition

69
Q

How are instructions executed in terms of GPU structure?

A

All threads execute grouped into warps.

70
Q

What are the 8 GPU limits?

A

1. Max 1024 threads per block

2. Max 24 blocks per SM

3. Max 48 warps per SM

4. Max 255 registers per thread

5. Shared memory: 100 KB

6. Register requirements limit the number of warps that can be executed simultaneously in an SM

7. Blocks have a constant number of warps and cannot be partially allocated to an SM

8. The shared memory required by each block limits the number of warps

71
Q

How do register requirements limit the number of warps that can be executed simultaneously in an SM?

A

The SM's register file has a fixed size and holds the registers of all resident threads. Each warp needs 32 × (registers per thread), so only as many warps can be resident as fit in the register file.
72
Q

How does shared memory required by each block limit the number of warps?

A

Shared memory is allocated per block from the SM's fixed pool (100 KB in Ada Lovelace). Only as many blocks can be resident as have shared-memory allocations that fit in that pool, which in turn caps the number of resident warps.
73
Q

What are two common issues with memory in GPUs?

A

Non-coalesced memory reads
Atomic write contention

74
Q

What are coalesced memory reads?

A

GPU cache lines are often 128 bytes (32 threads * 4 bytes).

Memory allocated with cudaMalloc is always aligned, meaning byte 0 is the start of a cache line.

Even if only one byte is read, the whole cache line must be loaded.

Coalesced reads are when 100% of the bandwidth is utilised: the warp reads one cache line, with each thread reading 4 bytes.

4 bytes * 32 threads = 128 bytes (the whole cache line)

75
Q

Code an example where a thread only uses one value from a cache line

A

__global__ void kernel(int* array, int n){
    int value = array[32 * threadIdx.x];
}

Each int is 4 bytes, so every thread skips ahead 32 * 4 = 128 bytes, meaning it reads from a different cache line than its neighbour.

One cache line must be read in per thread.

76
Q

Give a coding example of a coalesced read

A

__global__ void kernel(int* array, int n){
    int value = array[threadIdx.x];
}

77
Q

What is atomic write contention?

A

When multiple threads try to do an atomic operation on the same address, their effects are serialised because the threads need to wait for each other.

__global__ void kernel(int* array, int* oddCount){
    int value = array[threadIdx.x];
    if(value % 2 == 1){
        atomicAdd(oddCount, 1);
    }
}

78
Q

What is a solution to atomic write contention?

A

Do a reduction within the warp, then have a single thread do the atomic operation
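A hedged sketch of this fix applied to the odd-count kernel from the previous card (assumes a full warp is active): a butterfly reduction combines the per-thread flags, and only lane 0 touches global memory:

__global__ void count_odd_reduced(const int* array, int* oddCount) {
    int value = array[blockDim.x * blockIdx.x + threadIdx.x];
    int isOdd = (value % 2 == 1) ? 1 : 0;

    // Butterfly reduction: afterwards every lane holds the warp's total.
    for (int offset = 16; offset > 0; offset /= 2) {
        isOdd += __shfl_xor_sync(0xffffffff, isOdd, offset);
    }

    if (threadIdx.x % warpSize == 0) {
        atomicAdd(oddCount, isOdd);   // one atomic per warp instead of up to 32
    }
}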

79
Q

What is a common problem when using arrays of structs, and how can this be solved?

A

Misaligned reads.

struct int4 {
    int x;
    int y;
    int z;
};

int4 var;
var.x = varArray[threadIndex].x;

Memory layout (array of structs): x y z x y z x y z x y z
index:                            0     1     2     3

If every xyz group sits in its own cache line, each thread reads only one value from its own cache line - one cache line per thread.

Solve by using a struct of arrays: this stores all x values together, and likewise all y and all z values.

struct arrayOfInt4 {
    int* x;
    int* y;
    int* z;
};

arrayOfInt4 soa;   // holds the three separate arrays
int x_value = soa.x[index];

Memory layout (struct of arrays):
x x x x
y y y y
z z z z

79
Q

What is vectorized loads?

A

If the data type is exactly 4, 8 or 16 bytes, the memory system guarantees the data is loaded in one operation.
A struct of arrays is not needed in this case.
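A sketch of a vectorized load using CUDA's built-in int4 type (the kernel name is illustrative; in and out are assumed to be 16-byte aligned, which cudaMalloc guarantees):

__global__ void copy_vectorized(const int4* in, int4* out, int n4) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n4) {
        int4 v = in[i];    // one 16-byte load per thread
        out[i] = v;        // one 16-byte store per thread
    }
}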

80
Q

What is register spilling?

A

Temporarily write some register values to memory

Upside: more threads can run simultaneously.

Downside: more memory transactions.

The compiler spills registers when it thinks this will improve performance.

81
Q

How can register spilling be avoided?

A

Identify the parts of the kernel where many variables must be kept live simultaneously:
- nested function calls
- loops

Consider recalculating values if that is not too expensive.

Cut fields from structs that are unrelated to the kernel.

82
Q

What is a shuffle instruction?

A

Exchange values within a warp.

Each thread provides a value.

Each thread reads a value provided by another thread.

__shfl_sync(unsigned mask, T var, int srcLane)

mask: defines which lanes are included in the call, often __activemask() - all active threads in the warp.

Thread 3 executes this - it sends the value 7 and reads from lane 8:
int out = __shfl_sync(__activemask(), 7, 8);

Thread 10 executes this - it sends the value 11 and receives the value from lane 3, which is 7:
int out = __shfl_sync(__activemask(), 11, 3);

83
Q

What threads can communicate using shuffle instructions?

A

Threads must be in the same block

Threads must be in the same warp, or in a cooperative group of up to 32 threads.

84
Q

What happens if a thread using shuffle instructions reads from a lane that is not participating?

A

Undefined result

85
Q

What are the four shuffle instructions

A

Read from any thread:
__shfl_sync(mask, var, srcLane)

Read from thread with laneid = (current lane - delta)
__shfl_up_sync(mask, var, delta)

Read from thread with laneid = (current lane + delta)
__shfl_down_sync(mask, var, delta)

Read from thread with laneId XORed with laneMask
__shfl_xor_sync(mask, var, laneMask)

86
Q

How can shuffle instructions be used to create sum?

A

int threadValue = 12;   // this thread's value
int sum = threadValue;

sum += __shfl_xor_sync(__activemask(), sum, 16);
sum += __shfl_xor_sync(__activemask(), sum, 8);
sum += __shfl_xor_sync(__activemask(), sum, 4);
sum += __shfl_xor_sync(__activemask(), sum, 2);
sum += __shfl_xor_sync(__activemask(), sum, 1);

return sum;

This creates a butterfly reduction (cross pattern): first lanes 16 apart exchange values (one big cross), then lanes 8 apart (two crosses), then 4 apart, and so on, until every lane holds the full warp sum.

87
Q

How can shuffle instructions be used to broadcast values?

A

Example use case: reserving a block of 32 entries in a buffer for the whole warp.

__shfl_sync(__activemask(), value, 0);

All threads read from lane 0: lane 0 calculates value, then every thread can use it in further computation. See the sketch below.
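A hedged sketch of the buffer-reservation pattern mentioned above (counter and buffer are illustrative names; assumes the full warp is active): lane 0 reserves 32 slots with a single atomic and broadcasts the base index:

__global__ void reserve_and_write(int* counter, int* buffer) {
    int lane = threadIdx.x % warpSize;
    int myValue = blockDim.x * blockIdx.x + threadIdx.x;   // value this thread wants to store

    int base = 0;
    if (lane == 0) {
        base = atomicAdd(counter, 32);          // lane 0 reserves 32 entries for the warp
    }
    base = __shfl_sync(0xffffffff, base, 0);    // broadcast the base index from lane 0

    buffer[base + lane] = myValue;              // each lane writes into its own reserved slot
}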

88
Q

What is warp voting?

A

Each thread in the warp sets one bit in a 32 bit integer

Bit index corresponds to the lane index

only active threads vote

89
Q

What are the warp voting instructions?

A

Create a 32-bit integer where each lane sets one bit:
unsigned int __ballot_sync(mask, bool predicate)

Returns true if all threads vote true:
bool __all_sync(mask, bool predicate)

Returns true if any thread votes true:
bool __any_sync(mask, bool predicate)

Reverses the bit order of a 32-bit integer:
unsigned int __brev(unsigned int mask)
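A sketch combining __ballot_sync with a per-warp count (this reworks the earlier odd-count example; __popc counts the set bits of a 32-bit integer):

__global__ void count_odd_voting(const int* array, int* oddCount) {
    int value = array[blockDim.x * blockIdx.x + threadIdx.x];
    unsigned mask = __activemask();

    // Each lane sets its bit if its value is odd.
    unsigned ballot = __ballot_sync(mask, value % 2 == 1);

    if (threadIdx.x % warpSize == 0) {
        atomicAdd(oddCount, __popc(ballot));   // one atomic per warp
    }
}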

90
Q

What are some usecases of warp voting?

A

Identify elements that should be removed

91
Q

What are warp reductions?

A

Functions used to reduce the values held by the threads in a warp into one result.

General form: __reduce_<op>_sync(unsigned mask, unsigned/int value), e.g. __reduce_add_sync.
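A hedged sketch using __reduce_add_sync (assumption: these reduce intrinsics require compute capability 8.0 or newer; the kernel and output layout are illustrative):

__global__ void warp_reduce_sum(const int* array, int* warpSums) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int value = array[i];

    // Every participating lane receives the sum across the warp.
    int sum = __reduce_add_sync(__activemask(), value);

    if (threadIdx.x % warpSize == 0) {
        warpSums[i / warpSize] = sum;   // one result per warp
    }
}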