Theory Flashcards

1
Q

What is easier to improve, bandwidth or latency?

A

Bandwidth

Because of this, when accessing memory it is better to fetch as much data as we can per access, to make up for the memory latency

2
Q

In caches, what are the size restrictions?

A

Cache block length: a power of 2

Number of sets: a power of 2

Number of blocks in a set (ways of associativity): any integer

3
Q

What are Bernstein’s conditions for out-of-order execution?

A

Statements S1 and S2 will produce the same result when run in any order if:

  • S1 does not read what S2 writes
  • S1 does not write what S2 reads
  • S1 and S2 do not write to the same place
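
A minimal C sketch of how the conditions play out (the variables are made up for illustration):

    int a = 1, b = 2, x, y;

    /* All three conditions hold: S1 and S2 touch disjoint locations,
       so they can run in any order. */
    x = a + 1;   /* S1: reads a, writes x */
    y = b + 1;   /* S2: reads b, writes y */

    /* The second condition is violated: S1 writes the x that S2 reads,
       so the order matters. */
    x = a + 1;   /* S1: writes x */
    y = x + 1;   /* S2: reads x  */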
4
Q

What is a superscalar processor?

A

Launches multiple instructions at once (done in the control path)

Executes multiple instructions at once by replicating the ALU, so that there are multiple execution units

5
Q

What are vector registers?

A

Extra-wide registers that can store several consecutive values at once, e.g.:

Vector load: a[0], a[1], a[2], a[3]

Vector registers allow for vector operations, where an operation is applied to all vector elements in parallel

6
Q

What is Flynn’s taxonomy?

A

Ways to sort parallel programs

SISD: Von Neumann

SIMD: Vector machine

MISD: Theoretical

MIMD: Processes and threads. Fully independent processors, but the program makes them work towards the same goal; they are not synchronized

7
Q

What are the ways to sort parallel programs?

A

Flynn’s taxonomy, and shared vs. distributed memory

8
Q

What is shared- and distributed memory?

A

Shared:
Partitions the memory image of a single process among threads

Distributed:
Works with several processes, each with its own memory

9
Q

What components are allocated to threads?

A

Stack,
Instruction pointer

10
Q

What is invisible communication between threads?

A

When one thread accesses memory outside its own stack, the change is instantly visible to the rest of the threads

11
Q

Pros and cons of shared memory?

A

Pro:
- only 1 copy of the shared data
- everyone can modify it (must be coordinated)

Con:
- Cannot span separate computers (threads live within a process’ address space - so their memory must be managed by one single OS)
- Only get as many threads as fit into one machine

12
Q

Pros and cons of distributed memory?

A

Pros:
- A process’ memory cannot be involuntarily corrupted by others
- Processes only receive what they ask for, and store it in a designated place
- Can run on the same or different computers - if we run out of processors, we can just add a machine

Cons:
- Data gets copied: because of message passing, it can use twice the memory

13
Q

What is multi-computer parallelism?

A

When processes in a distributed memory program run on different computers.

The processes communicate over a network

14
Q

What are some types of interconnects used in shared memory systems?

A

Crossbar: looks like a grid

(Fat) trees: give non-uniform memory access (NUMA) effects - remote memory is slower. Associate processors with the memory that is their responsibility

Mesh: constant number of links per unit; messages are routed through the network in several hops. Communication latency grows linearly with distance

Torus: a mesh that wraps around the edges

15
Q

What are interconnect fabrics?

A

The combination of interconnect types (graph shapes) found in a computer

16
Q

What are inherently sequential computations?

A

Computations that cannot be parallelized, because instructions are dependent on each other and can’t run in any order.

17
Q

What is the formula for parallel execution time?

A

Tp = f*T + ((1 - f) * T) / p

T: Sum of the time costs of all operations in the program

p: number of processors, where the parallel operations are divided

(1-f): fraction of operations that can be parallelized

f: fraction of sequential operations the program requires
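
A minimal C sketch of the formula; the numbers in the example are purely illustrative:

    #include <stdio.h>

    /* Parallel execution time: Tp = f*T + ((1-f)*T) / p */
    double parallel_time(double f, double T, int p) {
        return f * T + ((1.0 - f) * T) / p;
    }

    int main(void) {
        /* 10% sequential, 100 time units of work, 10 processors */
        printf("%g\n", parallel_time(0.10, 100.0, 10));  /* 10 + 90/10 = 19 */
        return 0;
    }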

18
Q

What is the formula for speedup?

A

Speedup is how much faster a fast solution runs compared to a slow one

S = Tslow / Tfast

If S = 4, the faster solution is 4 times quicker

19
Q

What is the formula for speedup as a function of number of processors?

A

S(p) = Ts/Tp = (f*T + (1-f)*T) / (f*T + ((1-f)*T) / p)

20
Q

What is the formula for sequential time?

A

Ts = fT + (1-f)T

21
Q

What is linear speedup?

A

When every operation in a program can be parallelized

S(p) = p

22
Q

What is Amdahl’s law?

A

When a program contains a sequential part, the fastest theoretical execution time it could run at, given infinite processors, would be the time spent executing these operations in sequence.

Assumes constant serial time. Everything is calculated in proportion to an amount of work taking T time sequentially.

If we have an infinite number of available processors, the speedup would reach the limit

lim p->infinity S(p) = 1/f

23
Q

What is scaled speedup?

A

If we assume constant parallel time:

Tp = f*T + (1-f)*T

Meaning, every time a processor is added, (1-f)*T units of work are added to occupy the additional processor.

Bigger problems take longer to run sequentially:

Ts = f*T + p*(1-f)*T

Formula for scaled speedup:
Ss(p) = Ts/Tp = f + p*(1-f), using the new formulas for Tp and Ts

24
Q

What is Gustafson’s law?

A

Observes that scaled speedup does not approach any limit as p grows, as long as the problem is sized up in proportion to the machine:

Ss(p) = f + p(1-f)
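
A minimal C sketch contrasting the two laws; the sequential fraction f = 0.1 is illustrative:

    #include <stdio.h>

    /* Amdahl (fixed problem size):  S(p)  = 1 / (f + (1-f)/p) */
    /* Gustafson (scaled problem):   Ss(p) = f + p*(1-f)       */
    int main(void) {
        double f = 0.10;
        for (int p = 1; p <= 1024; p *= 4) {
            double amdahl    = 1.0 / (f + (1.0 - f) / p);
            double gustafson = f + p * (1.0 - f);
            printf("p=%4d  S(p)=%6.2f  Ss(p)=%7.2f\n", p, amdahl, gustafson);
        }
        /* S(p) flattens out below 1/f = 10; Ss(p) keeps growing linearly. */
        return 0;
    }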

25
Q

What is the difference between scaled speedup and speedup?

A

Scaled speedup: x times the work can be done in the same amount of time.

Speedup: the same amount of work can be done in 1/x of the time.
26
Q

What are scaling curves?

A

The graph of S(p) as a function of p.
27
Q

What is strong scaling?

A

The scaling curves of Amdahl's law. The best case is linear scaling: S(p) = p.

As p grows, the curves approach the limit 1/f, which shrinks as the fraction f of sequential operations in the program increases.

Studies of how speedup changes with processor count are said to be carried out in the strong mode.
28
Q

What is weak scaling?

A

The scaling curves of Gustafson's law. They are all linear, but their gradient is not quite 1:

Ss(p) = f + p*(1-f), i.e. a line with gradient (1-f)

Studies of how scaled speedup changes with processor count are carried out in the weak mode.
29
Q

What is efficiency?

A

E(p) = S(p) / p

The speedup divided equally amongst the processors; gives a sense of how much each processor contributed to the speedup.
30
Q

What is horizontal scaling?

A

Upgrading the system with more of the same components as were available before.

Improves the parallel part of execution.
31
Q

What is vertical scaling?

A

Upgrading the system with new, more powerful components.

Improves the sequential part of execution.
32
Q

What is an assumption that Gustafson makes?

A

That the parallel workload can grow without needing to increase the sequential part.
33
Q

What is the Hockney model?

A

When we know the message size n, we can estimate transmission time as the sum of latency and n times the inverse bandwidth:

T_comm(n) = a + n*B⁻¹

Latency (a): time it takes to get between communication points, decided by the distance between them, not the message size.

Inverse bandwidth (B⁻¹): seconds/byte.
34
Q

What is the ping-pong test of communication speed?

A

Start the clock.
Send a message from A to B.
Send a message from B to A.
Repeat this many times.
Stop the clock.
Divide the time by 2 (for the two directions) and by n (the number of repetitions).
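
A minimal MPI sketch of the test in C; the message size and repetition count are arbitrary, and it assumes at least 2 ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { N = 1 << 20, REPS = 1000 };
        static char buf[N];

        double start = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* divide by 2 directions and by the number of round trips */
        if (rank == 0)
            printf("avg one-way time: %g s\n", (MPI_Wtime() - start) / (2.0 * REPS));

        MPI_Finalize();
        return 0;
    }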
35
Q

How can we measure latency?

A

Do the ping-pong test with empty or 1-byte messages. The time taken will in this case be dominated by latency.
36
Q

How can we measure inverse bandwidth?

A

Do the ping-pong test with a small number of huge messages. In this case, bandwidth will dominate the time taken.
37
Q

How can bandwidth be expanded?

A

By adding extra lanes to the interconnect fabric.
38
Q

How can latency be improved?

A

With great difficulty; it is essentially limited by the speed of light. It can however be masked, e.g. using MPI_Isend.
39
Q

What is a Symmetric MultiProcessor (SMP, or dancehall architecture)?

A

An early design: one processor core per chip, clock rates for CPU and memory were comparable, and there were no caches.

The structure was a grid of memory banks in the middle (e.g. 3x3 banks) with CPUs along each side; all memory is shared by every CPU.

Any CPU reads/writes to any memory bank at the same speed (Uniform Memory Access, UMA). Because of this, any CPU can contact any other at the same cost.

This design introduced race conditions.
40
Q

How were race conditions handled in SMPs (Symmetric MultiProcessors)?

A

The interconnect fabric and CPUs supported atomic operations:

test-and-set: set a value to 1 and return what it was before (great for spin-locking)

fetch-and-increment: increase a value in memory and return what it was before (great for obtaining ticket numbers in a queue)

fetch-and-add: fetch-and-increment with an arbitrary value

compare-and-swap: check if a value is equal to a supplied value; if it is, exchange it for a number from the CPU, and return whether it succeeded
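
A sketch of the same operations using C11 atomics; the function names are mine, not from any SMP instruction set:

    #include <stdatomic.h>

    atomic_flag lock_bit = ATOMIC_FLAG_INIT;
    atomic_int  ticket   = 0;
    atomic_int  value    = 0;

    void spin_lock(void) {
        /* test-and-set: set to 1, loop as long as it was already 1 */
        while (atomic_flag_test_and_set(&lock_bit))
            ;
    }

    void spin_unlock(void) {
        atomic_flag_clear(&lock_bit);
    }

    int take_ticket(void) {
        /* fetch-and-increment: returns the value before the increment */
        return atomic_fetch_add(&ticket, 1);
    }

    int try_swap(int expected, int desired) {
        /* compare-and-swap: returns whether the exchange succeeded */
        return atomic_compare_exchange_strong(&value, &expected, desired);
    }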
41
Q

What are the fetch-and-phi operations?

A

The atomic operations that were implemented in SMPs to handle race conditions.
42
Q

What caused memory access to no longer be uniform in SMPs?

A

Access to closer memory banks grew faster than access to remote banks: NUMA, non-uniform memory access.
43
Q

What is cache coherent NUMA (ccNUMA)?

A

Caches are introduced to reduce the memory latency gap between close and far-away memory. This works for the one CPU holding the cached copy, but makes it worse for the rest.
44
Q

What does SMP now stand for?

A

Shared Memory Processor; memory access is no longer symmetric.
45
Q

What is Load Linked/Store Conditional?

A

An approach to atomic operations.

LL fetches a value into a register and temporarily tags the memory bank it came from. While the value is in the register, the CPU's entire instruction set can manipulate it.

SC tries to write the result back to the tagged memory bank and returns whether it succeeded. If it fails (because someone altered the value in the meantime), the value is not stored, and the program gets to know about the failure.
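
On LL/SC machines (e.g. ARM, POWER), a C11 weak compare-exchange loop typically compiles down to exactly this retry pattern; a minimal sketch:

    #include <stdatomic.h>

    void atomic_add(_Atomic long *target, long delta) {
        long old = atomic_load(target);
        /* A weak CAS may fail like a failed SC; on failure, old is
           refreshed with the current value and we retry. */
        while (!atomic_compare_exchange_weak(target, &old, old + delta))
            ;
    }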
46
Q

What is atomic reservoir memory?

A

Separate memory banks are wired directly to the processor, bypassing all caching mechanisms. Slow, but all reads/writes are atomic.
47
Q

How can x86_64 ops be made atomic?

A

Instructions that include a whole read-modify-write cycle in one instruction can be made atomic by placing the keyword 'lock' in front of the instruction in the assembly code.

This gives exclusive access either to the cache line holding the value, if no other core has it, or to the entire memory bus.
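
A minimal GCC/Clang inline-assembly sketch of the lock prefix on x86_64:

    /* The lock prefix makes the whole read-modify-write cycle of incq atomic. */
    static inline void atomic_inc(long *x) {
        __asm__ __volatile__("lock incq %0" : "+m"(*x));
    }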
48
Q

What is cache coherence?

A

The problem of keeping multiple CPU caches updated with each other's modified values.

Solutions: snooping, or a directory.
49
Q

What is snooping?

A

A solution to cache coherence: CPUs listen in on each other's memory traffic via shared branches of the interconnect.

Can be combined with write-through and write-back caches.
50
Q

What is a directory?

A

A solution to cache coherence: maintaining a centralized registry of cache lines and the various states they are in.
51
Q

Describe snooping with write-through

A

Done on machines with a single, shared memory bus as interconnect. One CPU writes the change through onto the bus; the other CPUs listen and copy the change into their own caches. The change is also written back to memory.
52
Q

Describe snooping with write-back

A

The CPU alerts the memory bus that the cache line is dirty; only one bit (the dirty bit) is broadcast to the other CPUs. The other CPUs do not use this cache line until it is clean again. When the changes are written back by the first CPU, memory is updated and the second CPU fetches the fresh copy.
53
Q

What is an issue with snooping?

A

The cost of broadcasting changes: it is only fast for processor counts small enough that broadcasts are efficient.
54
Q

Describe the directory solution to cache coherence

A

Keep a table that records which CPUs have copies of which memory blocks. This needs quite a bit of memory, as the table can grow large.

The table has one entry per memory block we want to track; each entry contains:
- bits to record the state (exclusive-or-shared, modified, uncached)
- a bit vector, one bit for each CPU
55
Q

Give an example of directory entries with the full bit vector format

A

Memory blocks: 1, 2, 3, 4, 5, 6, 7, 8
CPU 0: caches 3, 4
CPU 1: caches 3, 4
CPU 2: caches 5, 6
CPU 3: caches nothing

Directory (block - CPUs - state):
[1, 2] - 0000 - uncached
[3, 4] - 1100 - shared
[5, 6] - 0010 - exclusive
[7, 8] - 0000 - uncached
56
Q

What is a directory with coarse bit vector format?

A

Each bit in the vector indicates whether one or more CPUs have a modified copy. The CPUs that share this memory sort out coherence between themselves.
57
Q

How is memory allocation done for directory systems?

A

Some systems allocate a fixed part of general system RAM. This is fine for smaller systems, but bigger systems will notice that some of their RAM cannot be used for other purposes.

Some systems install additional memory banks dedicated to the directory.
58
Q

What is tiling?

A

A way to improve cache performance: restructure a computation so it works on small blocks (tiles) of data that fit in cache, finishing with each tile before moving on.
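
A minimal C sketch of tiling applied to a matrix transpose; N and the tile size B are illustrative (B is chosen so the tiles of both arrays fit comfortably in cache):

    #define N 4096
    #define B 64

    void transpose_tiled(double *dst, const double *src) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                /* work on one cache-friendly B x B tile at a time */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j * N + i] = src[i * N + j];
    }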
59
Q

What can improve cache performance?

A

Vector registers

Tiling
60
Q

What are intrinsics?

A

Constructs that behave like function calls, but that the compiler recognizes as shorthand notation for assembly instructions. Useful when programming vector registers by hand, in situations where the compiler does not recognize a structure as vectorizable.

__m128d: a 128-bit SIMD register type that can hold 2 doubles. It is a blob of bytes which can be put into a vector register in an instant and fits there.
61
Q

What is _mm_malloc(size, alignment)?

A

An aligned allocator: the alignment argument gives a number that evenly divides the starting address of the allocation. Vector registers load faster when the addresses they load from are clean multiples of the register size.

For 128-bit register values, the alignment can be 16.
62
Q

How do you transfer data from memory to a SIMD register?

A

__m128d my_two_doubles = _mm_load_pd(two_doubles);

The address in two_doubles must be 16-byte aligned.

Load two copies of one double:
__m128d two_copies = _mm_load_pd1(&a);
63
Q

How are SIMD data moved from registers to memory?

A

_mm_store_pd(two_doubles, my_two_doubles);

The first argument is the (16-byte aligned) destination address; the second is the register value.
64
Q

How do you do pairwise addition/multiplication of two SIMD registers?

A

__m128d sums = _mm_add_pd(ab, cd);  gives {a+c, b+d}

__m128d products = _mm_mul_pd(ab, cd);  gives {a*c, b*d}
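
Putting the four previous cards together, a minimal self-contained SSE2 sketch in C:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* 16-byte alignment so the 128-bit loads/stores are clean */
        double *ab = _mm_malloc(2 * sizeof(double), 16);
        double *cd = _mm_malloc(2 * sizeof(double), 16);
        ab[0] = 1.0; ab[1] = 2.0;
        cd[0] = 3.0; cd[1] = 4.0;

        __m128d va  = _mm_load_pd(ab);     /* {1, 2} */
        __m128d vb  = _mm_load_pd(cd);     /* {3, 4} */
        __m128d sum = _mm_add_pd(va, vb);  /* {1+3, 2+4} */

        double out[2] __attribute__((aligned(16)));
        _mm_store_pd(out, sum);
        printf("%g %g\n", out[0], out[1]);  /* 4 6 */

        _mm_free(ab);
        _mm_free(cd);
        return 0;
    }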
65
Q

What models are used in processing?

A

Problem model: what we want to calculate; a simplified representation of a real thing.

Programming model: the operations used to express the calculation; a simplified representation of machine instructions.

Processing model: expectations about what the machine will do; a simplified representation of the hardware.

Actual computer: the actual hardware.
66
Q

What useful thing do Amdahl's and Gustafson's laws model?

A

Both are performance models. They give realistic estimates of whether or not program performance will improve if we invest in more hardware to run it on.
67
Q

What useful thing does the Hockney model model?

A

Whether program performance is constrained by the size of its messages or by the number of messages.
68
Q

What are MIPS and FLOPS?

A

Millions of Instructions Per Second

Floating-point Operations Per Second
69
Q

What is the issue with benchmarking?

A

As a benchmark is so specific, it does not give a good overview of, or statistics for, actual performance.
70
Q

What is memory bandwidth?

A

The maximum number of bytes the interconnect can transport between CPU and memory each second (bytes/second).
71
Q

What is operational intensity/arithmetic intensity?

A

Instructions use operands. Divide the number of operations by the number of bytes they are applied to:

operations / byte [FLOP/byte]
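
A worked example in C, assuming 8-byte doubles:

    /* daxpy: per element, 2 FLOPs (one mul, one add);
       traffic: read x (8 B) + read y (8 B) + write y (8 B) = 24 B.
       Intensity = 2/24, about 0.08 FLOP/byte: strongly memory bound. */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }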
72
Q

What is the rate at which a program can run, given the rate it can read and write at?

A

Operational intensity * memory bandwidth:

bytes/s * FLOP/byte = FLOP/s

If the data transport is not fast enough to supply the processor with operands for all the instructions in the program, the program has to wait.
73
Q

What is a roofline model?

A

x-axis: arithmetic intensity (FLOP/byte)
y-axis: performance (FLOP/s)

Peak performance is a horizontal line at a given y-value; this is the roofline of the graph. Up until that point, attainable performance is a straight line through the origin whose gradient is the memory bandwidth:

bytes/s * FLOP/byte = FLOP/s
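
A minimal C sketch of the roofline formula, attainable = min(peak, bandwidth * intensity); the machine numbers are made up:

    #include <stdio.h>

    int main(void) {
        double peak = 100e9;  /* 100 GFLOP/s */
        double bw   = 50e9;   /* 50 GB/s     */
        for (double I = 0.125; I <= 8.0; I *= 2) {
            double perf = bw * I < peak ? bw * I : peak;
            printf("I=%5.3f FLOP/B -> %5.1f GFLOP/s\n", I, perf / 1e9);
        }
        /* Ridge point at I = peak/bw = 2 FLOP/byte. */
        return 0;
    }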
74
Q

What is the memory bound region in a roofline graph?

A

The region of programs whose intensity makes them run at a speed capped by the memory bandwidth: their FLOP/byte is lower than the ridge-point intensity.
75
Q

What is the ridge point in roofline analysis?

A

The intensity (FLOP/byte) at which a program goes from being memory bound to being compute bound.
76
What is the "comput bound" region in a rooftop analysis?
When program intensity is so that the program run at the speed capped by the processor
77
Q

Where do we find the roofline of a machine?

A

From theoretical numbers for the hardware, or by running benchmarks that stress computing capability or memory.
78
Q

What are some properties of dense linear algebra?

A

Long arrays of consecutive values tightly packed together in memory.

Operations are complicated calculations that require many instructions per element.
79
Q

What is sparse linear algebra?

A

Similar to dense linear algebra, but many/most of the elements are zero, which makes it meaningless to read them.

The data structures used are instead lists of indices and their non-zero values:

values[], row[], col[]
80
Q

What are unstructured grids?

A

Similar to structured grids, but without the assumption that neighbours are evenly spaced.
81
Q

What are n-body problems?

A

A list of coordinates for things that push each other around (stars, billiard balls).

Bottleneck: finding neighbours. Does everybody affect everybody, or do only neighbours affect each other?
82
Q

What are Monte Carlo methods?

A

Calculations that approach their solution by accumulating random numbers; additional numbers should contribute to the solution no matter what their values are.

Performance challenge: pseudo-random number generators often have sequential dependences between numbers.
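
A minimal C sketch: a Monte Carlo estimate of pi. rand_r is POSIX; giving each thread its own seed variable is one way around the sequential dependence of a shared generator:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        unsigned seed = 42;  /* per-thread seed in a parallel version */
        long hits = 0, n = 10000000;
        for (long i = 0; i < n; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;  /* every sample refines the estimate */
        }
        printf("pi ~ %f\n", 4.0 * hits / n);
        return 0;
    }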
83
Q

What are the 7 Berkeley kernels?

A

Dense linear algebra
Sparse linear algebra
Structured grids
Unstructured grids
N-body problems
Monte Carlo methods
Spectral methods
84
Q

What is multiple issue?

A

Replicate the ALU and extend the decoder, so that several independent instructions can be dispatched simultaneously.
85
Q

What is Simultaneous MultiThreading (SMT)?

A

A way to utilize the different parts of the ALU with instructions that require different calculation logic. Instead of leaving ALU parts idle, run multiple instruction streams at the same time.

Replicate the instruction pointer and decoding unit, receive 2 instruction streams, and merge them together when their needs don't conflict.

This allows, for example, a 4-core processor to run 8 threads.
86
Q

When will SMT improve performance?

A

It doesn't happen often that two instruction streams only depend on different ALU units; threads often need the same units, making the second one wait. SMT improves performance in the cases where the threads do not need the same units, which statistically only happens every so often.
87
Q

What is superlinear speedup?

A

Times when we can measure S(p) > p, which goes against Amdahl's law. This can happen when a problem is split up into smaller parts that suddenly fit in a faster part of memory (e.g. a higher cache level).
88
Q

What is load balancing?

A

Ways of mitigating an unbalanced workload. Parallel programs often have synchronization points; if the work is unevenly distributed between processes, one process will lag behind, and the execution time is held back by the slowest process.
89
Q

What are the 3 kinds of load balancing?

A

Static: embed the partitioning of the program directly into the source code.

Semi-static: examine the workload at program start, divide it then, and run with this initial partitioning.

Dynamic: shift work around between participants while the program is running.
90
Q

What is the Master/Worker pattern?

A

A dynamic load balancing technique. One rank is the master and maintains a queue of similar-sized tasks; the rest of the ranks are workers, which are assigned tasks by the master (or take tasks from the queue and inform the master).

Limited scalability due to centralized control: if the number of workers gets too big, they can overwhelm the master.
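
A minimal MPI sketch of the control flow, assuming integer task ids and a stop signal of -1 (task contents are left out):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {  /* master: hand out task ids, then stop signals */
            int next = 0, ntasks = 100, stop = -1;
            MPI_Status st;
            /* one initial "ready" plus one result per task will arrive */
            for (int msgs = 0; msgs < ntasks + (size - 1); msgs++) {
                int result;
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
                int task = (next < ntasks) ? next++ : stop;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        } else {          /* worker: ask for work, process until stopped */
            int task, ready = 0;
            MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (task < 0) break;
                /* ... process task ... */
                MPI_Send(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }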