Thread and data level parallelism Flashcards

1
Q

What is the difference between instruction-level-, thread-level- and data-level parallelism?

A

TLP increases overall throughput across threads.

ILP and DLP focus on increasing the IPC of a single thread.

2
Q

What is ILP?

A

Exploits parallelism within a single program, e.g. via OoO execution, to execute multiple independent instructions simultaneously (loop iterations, etc.).

3
Q

What is TLP?

A

Exploits parallelism between independent threads. Running more applications in parallel increases overall system performance.

However, it does not necessarily make the individual applications perform better/faster.

4
Q

What is DLP?

A

Exploits parallelism by operating on multiple data elements simultaneously. E.g. if a loop updates all elements of an array, the program can instead state that a certain operation applies to every element; the operation is then performed on all array elements at the same time.

5
Q

What is throughput?

A

Total number of completed instructions across all threads.

Higher throughput allows more programs to run simultaneously.

6
Q

What is IPC?

A

Average number of instructions completed per cycle for a given thread.

Higher IPC makes individual programs more responsive.
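
As a worked example: a thread that completes 8 instructions in 4 cycles has an IPC of 8 / 4 = 2.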

7
Q

What is a thread?

A

An independent stream of execution within a process.

8
Q

What is multithreading?

A

Scheduling multiple threads on a single core.

Independent state is duplicated for each thread (register file, PC, page table).

Memory is shared between threads via the virtual memory mechanisms.

HW must support thread switching; its latency must be much lower than that of a software context switch.

9
Q

What are two types of switch strategies?

A

Coarse-grained and fine-grained switching.

10
Q

What are coarse-grained switch strategies?

A

Switch threads on long stalls (L2 miss, TLB miss)

Advantages:
- Low HW cost, since thread switches can be slow
- Fast forward progress: a thread is only switched out when it would be stalled anyway

Disadvantages:
- The CPU only issues from one thread. On a stall, the pipeline must be flushed before the new thread can issue
- The new thread must then refill the pipeline from the start - a restart penalty
- Extra HW is needed to detect the costly stall and trigger the switch

11
Q

What are fine-grained switch strategies?

A

Switches between threads every cycle, interleaving the different threads.

Usually round-robin, skipping stalled threads.

Advantages:
- The pipeline need not be refilled on a stall
- Hides both short and long stalls

Disadvantages:
- Slows down execution of individual threads
- Requires extra HW support

12
Q

What is Simultaneous Multithreading (SMT)?

A

Instructions from multiple threads are issued within the same cycle, filling otherwise empty issue slots. An advantage is that this makes it much more likely that all available slots can be filled (see the issue-slot sketch below).

The motivation for SMT is that dynamically scheduled processors already have HW mechanisms to support multithreading.
- Lots of physical registers because of register renaming
- Dynamic execution scheduling

Required hardware:
- Per-thread renaming table
- Separate PC
- Separate ROBs with commit capabilities
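
A simplified issue-slot picture (illustrative, assuming a 4-wide issue machine running two threads A and B; '.' marks an empty slot):

Fine-grained MT: cycle 1: A A A .   cycle 2: B B . .   (one thread per cycle)
SMT:             cycle 1: A A B B   cycle 2: A B B .   (threads share each cycle)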

13
Q

What are some design challenges with SMTs?

A

Need a large register file:
- Many physical registers are needed to map the architectural registers of every thread.

Must also avoid worsening the critical path:
- Don’t introduce additional bottlenecks
- From the issue stage to the execution stage, each thread should make roughly as much progress as if it were running on its own processor.

Make sure that threads don’t worsen the cache behaviour of other threads. This can happen if the working set of one thread only barely fits within the cache: the next thread will then evict it to bring in its own data.
- Threads should be “good neighbours”
- Avoid evicting each other’s working sets
- Possibly share resources, or enforce fairness between them

14
Q

How does renaming work with SMTs?

A

Each thread has its own mapping table, so if two threads use the same architectural registers, these can be mapped to different physical registers.

15
Q

Why don’t we need to add more physical registers to support SMT?

A

Because a thread only uses the whole physical register file when it runs at peak performance. With SMT this won’t be the case, as the individual threads won’t be running at peak performance.

16
Q

What does an OoO pipeline look like with SMTs?

A

Separate PCs for each thread, used to fetch instructions from the instruction cache. The fetch stage needs to support providing instructions to multiple threads at the same time.

Separate renaming units for each thread, with no crossing wires.

Separate ROBs for each thread (in-order commit, precise exceptions).

17
Q

How do dynamic processors find parallelism?

A

Through OoO execution of independent instructions.

18
Q

What are the basic elements of DLP?

A

Designing systems around the data elements directly, instead of having the hardware expose and exploit parallelism from the instruction stream.

This is done through vectorization.

19
Q

What is vectorization?

A

Expressing computation over large sets of data, where one operation is applied to all data points.

20
Q

What is SIMD?

A

Extends existing hardware features to enable a simplified form of vector execution.

Splits registers into multiple data elements.

Execution units work on parallel lanes, one per data element.

The programmer must convert their vectors to formats (e.g. lengths) that are supported in hardware.
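
As a concrete sketch of this model, here is vector addition of 16-bit elements using x86 SSE2 intrinsics (the function name and choice of element width are our own; a real implementation also needs a scalar tail loop when n is not a multiple of 8):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Add two arrays of 16-bit elements, eight lanes per 128-bit register. */
void add_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* one instruction performs eight 16-bit additions in parallel */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
    }
}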

21
Q

What are vector architectures?

A

Explicitly introduce vector registers and vector processing.

Vectors up to a maximum size are supported. This size is generally larger than in SIMD systems.

HW can be designed to speed up vector fetch and addressing.

22
Q

What types of applications benefit from SIMD?

A

Ones using narrow data types (8-32 bit data): multiple such elements can fit in a single register.

Existing register structures can be reused, and with minor modifications to the execution units, operations can be performed on multiple data elements packed into a single register.

23
Q

How is SIMD designed?

A

Use the full width of existing registers and split operations across it.

Specific instructions identify how to split up the registers (1x64, 2x32, etc.).

24
Q

Give an example of how registers can be modified in a SIMD design

A

Take a 64-bit register and treat it as four 16-bit sub-registers (see the sketch below).
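
A software sketch of the same idea ("SIMD within a register", illustrative rather than any specific ISA): treat a plain 64-bit integer as four 16-bit lanes and keep carries from crossing lane boundaries:

#include <stdint.h>

/* Add four 16-bit lanes packed into a 64-bit word, without letting
   a carry spill from one lane into the next. */
uint64_t add4x16(uint64_t x, uint64_t y) {
    const uint64_t MSB = 0x8000800080008000ULL; /* top bit of each lane */
    /* add the low 15 bits of each lane, then fold the top bits back in */
    return ((x & ~MSB) + (y & ~MSB)) ^ ((x ^ y) & MSB);
}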

25
Q

What are some benefits of SIMD?

A

Performance:
- Multiplicative factor on processing speed
- Latency stays the same

HW implementation has low cost:
- Logical operations already work
- ALUs only need minor adjustments

26
Q

What are some limitations of SIMD?

A

Instruction operand encoding sometimes gives bad alignment, reducing the performance benefit.

The programmer needs to make sure data is aligned, as SIMD does not provide sophisticated addressing modes.

No mask registers.

Challenging to program and compile:
- The program must explicitly tile in cases where alignment is not exact.
- This puts more responsibility on the programmer.

27
Q

How does vector addition benefit from vector architectures?

A

for (i = 0; i < 100; i++) a[i] = b[i] + c[i];

In a vector architecture, this entire loop can be expressed as a single instruction:
A = B + C
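
As a concrete sketch, GCC/Clang vector extensions let C express this one-instruction form directly (the type name v4si is our own choice):

/* GCC/Clang vector extension: four 32-bit ints per vector value */
typedef int v4si __attribute__((vector_size(16)));

v4si vec_add(v4si b, v4si c) {
    return b + c;   /* the whole vector addition, no explicit loop */
}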

28
Q

What are some benefits of vector architectures?

A

Easier to program.

Parallelism is expressed very explicitly, which makes for more efficient hardware.

Vector elements are inherently independent, which gives much less control overhead.

A lot of data is processed at any given time, since the vector elements are executed in parallel.

Defining vector properties abstractly gives more freedom in how they are implemented in HW; the same code can scale to different designs.

Vector processing maps well to software through vector-length (vector_len) properties.

29
Q

How are vector architectures designed?

A

Use vector registers; these are very large, each storing (Vector_len / Data_len) elements.

FU execution time depends on the type of operation and the number of elements in a vector.

FUs support accesses to individual vector elements.
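
Worked example: with a 512-bit vector register (Vector_len = 512) and 64-bit elements (Data_len = 64), each register stores 512 / 64 = 8 elements.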

30
Q

How is vector data loaded from memory?

A

The base address is calculated.

The vector data is then brought into the vector registers.

31
Q

What are some limitations of vectors?

A

Reduced performance for control-flow-dominated workloads.

Data must be vectorizable.

The memory system must be capable of servicing large amounts of data with low latency, so that the program stays responsive even when large units of data are being processed.

32
Q

How have SIMD and Vector converged over time?

A

SIMD extensions slowly morphed to become more vector-like.

SIMD is a constrained vector model.

Vector processing is in general a powerful tool for data processing.

However, SIMD does not have the flexible vector addressing modes that vector architectures have.

33
Q

What is start-up time?

A

Time from when data is requested until the first word is read back into the system.

34
Q

After the initial start-up time, how much data should the memory system provide per cycle?

A

LANES * Data_len bits per cycle.
In this case, as much data as possible is provided each cycle.

Lanes: number of parallel executions we can do per cycle

Data_len (DLEN): number of bits in each data element

This can be achieved using memory banks (interleaving)
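
Worked example: with 4 lanes and DLEN = 64 bits, the memory system must sustain 4 * 64 = 256 bits per cycle after start-up.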

35
Q

What must vector architectures support to deal with strided load-stores?

A

For example, matrix multiplication: a row from the first matrix is multiplied with a column from the other matrix. This means the vector unit needs to handle both adjacent (unit-stride) and strided accesses.

This can for example be done by specifying the stride in the instruction:
ld v1, number_elements, [base_address], stride

The same load instruction can then be used to implement all the different strides.
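
To illustrate where strides come from: in a row-major C matrix, walking down a column steps through memory one full row at a time (names below are illustrative):

#define N 64

/* Sum column j of a row-major N x N matrix: consecutive accesses are
   N doubles apart in memory, i.e. a stride of N * sizeof(double) bytes. */
double column_sum(double A[N][N], int j) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += A[i][j];   /* strided access down the column */
    return sum;
}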

36
Q

What are sparse matrices, and what are some of their properties?

A

Matrices with mostly zero elements. The zero elements need no processing, but we can’t know whether an element is zero until it has already been loaded.

Gather-scatter can be used to take care of this issue.

37
Q

What is gather-scatter with regard to sparse matrices?

A

Uses metadata to find all non-zero elements (gather),
processes these,
then writes the results back to those elements in the matrix (scatter).

LVI: Load Vector Indexed (gather)
SVI: Store Vector Indexed (scatter)
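
A scalar C sketch of what the LVI/SVI pair accomplishes, assuming an index array idx[] listing the positions of the non-zero elements (all names are illustrative):

/* Gather the non-zero elements into a dense buffer, process them,
   then scatter the results back to their original positions. */
void gather_process_scatter(double *a, const int *idx, int nnz) {
    double dense[nnz];                 /* C99 VLA; assumes nnz > 0 */
    for (int i = 0; i < nnz; i++)      /* gather (LVI) */
        dense[i] = a[idx[i]];
    for (int i = 0; i < nnz; i++)      /* process the dense vector */
        dense[i] *= 2.0;
    for (int i = 0; i < nnz; i++)      /* scatter (SVI) */
        a[idx[i]] = dense[i];
}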

38
Q

What is DLP acceleration?

A

Techniques to improve the performance of DLP execution.

39
Q

What are 3 ways of accelerating DLP?

A

Speeding up dependency handling

Reducing impact of divergence

Parallelizing data element execution

40
Q

What is chaining of instructions?

A

Elements within a vector are independent, but there can still be dependencies between vectors.

For example, each element of vector 2 may have a true data dependence on the element in the same position of vector 1.

Some data elements of the source vector complete before the whole vector operation has finished. Chaining exploits this: execution of the dependent vector instruction begins before the previous one finishes.

In effect, vectors are deconstructed into element registers, and dependent operations issue as soon as their source operands have been computed in the source vector (see the timeline below).
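
A simplified timeline (assuming one element per cycle per FU, purely illustrative):

v3 = v1 * v2:  e0 e1 e2 e3 e4 ...
v5 = v3 + v4:     e0 e1 e2 e3 ...   (each add issues as soon as the
                                     matching multiply result is ready)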

41
Q

What are multiple lanes?

A

Split vector execution across multiple execution units, instead of executing a whole vector on a single unit, and store each element's result in its correct position in the result vector.

42
Q

What are predicates (vector-mask control)?

A

Take the outcome of an if-else evaluation and create a mask vector, with the elements that evaluate to true set to 1 and the others to 0.

The operation is applied to an element only if its mask entry is enabled; the rest of the elements become zero.

This ensures that an entire vector can still move as one (see the sketch below).
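
A scalar C sketch of vector-mask control, following the semantics above (masked-off elements are written as zero; names are illustrative):

/* Vectorizable form of: if (b[i] != 0) a[i] = a[i] / b[i]; else a[i] = 0; */
void masked_div(double *a, const double *b, int n) {
    unsigned char mask[n];             /* C99 VLA; assumes n > 0 */
    for (int i = 0; i < n; i++)        /* 1. build the mask vector */
        mask[i] = (b[i] != 0.0);
    for (int i = 0; i < n; i++)        /* 2. apply the op under the mask */
        a[i] = mask[i] ? a[i] / b[i] : 0.0;
}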