Thread and data level parallelism Flashcards

1
Q

What is the difference between instruction-level-, thread-level- and data-level parallelism?

A

TLP increases overall throughput across threads.

ILP and DLP focus on increasing the IPC of a single thread.

2
Q

What is ILP?

A

Exploits parallelism within a single program, e.g. via OoO execution, to execute multiple independent instructions simultaneously (loop iterations, etc.).

3
Q

What is TLP?

A

Exploits parallelism between independent threads. Running more applications in parallel increases overall system performance.

However, it does not necessarily make the individual applications perform better/faster.

4
Q

What is DLP?

A

Exploits parallelism by operating on multiple data elements simultaneously. E.g. if a loop updates all elements of an array, the program can instead state that a certain operation applies to every element; the operation is then performed on all array elements at the same time.

5
Q

What is throughput?

A

Total number of completed instructions across all threads.

Higher throughput allows more programs to run simultaneously.

6
Q

What is IPC?

A

Average number of instructions completed per cycle for a given thread.

Higher IPC makes individual programs more responsive.
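
As a worked example: a thread that completes 8 instructions in 4 cycles has an IPC of 8 / 4 = 2.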

7
Q

What is a thread?

A

An independent stream of execution within a process.

8
Q

What is multithreading?

A

Scheduling multiple threads on a single core.

Independent state is duplicated for each thread (register file, PC, page table).

Memory is shared between threads via the virtual memory mechanisms.

HW must support thread switching; its latency must be much lower than that of a software context switch.

9
Q

What are two types of switch strategies?

A

Coarse-grained and fine-grained switching.

10
Q

What are coarse-grained switch strategies?

A

Switch threads on long stalls (L2 miss, TLB miss)

Advantages:
- Low HW cost, since thread switches can be slow
- Fast forward progress: a thread is only switched out when it would be stalled anyway

Disadvantages:
- The CPU only issues from one thread. On a stall, the pipeline must be flushed before the new thread can issue
- The new thread must then refill the pipeline from the start - a restart penalty
- Extra HW is needed to detect the costly stall and trigger the switch

11
Q

What are fine-grained switch strategies?

A

Switches between threads every cycle, interleaving the different threads.

Usually round-robin, skipping stalled threads.

Advantages:
- The pipeline need not be refilled on a stall
- Hides both short and long stalls

Disadvantages:
- Slows down execution of individual threads
- Requires extra HW support

12
Q

What is Simultaneous Multithreading (SMT)?

A

Instructions from multiple threads are issued within the same cycle, filling otherwise empty issue slots. An advantage is that this makes it much more likely that all available slots can be filled (see the issue-slot sketch below).

The motivation for SMT is that dynamically scheduled processors already have HW mechanisms to support multithreading.
- Lots of physical registers because of register renaming
- Dynamic execution scheduling

Required hardware:
- Per-thread renaming table
- Separate PC
- Separate ROBs with commit capabilities
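
A simplified issue-slot picture (illustrative, assuming a 4-wide issue machine running two threads A and B; '.' marks an empty slot):

Fine-grained MT: cycle 1: A A A .   cycle 2: B B . .   (one thread per cycle)
SMT:             cycle 1: A A B B   cycle 2: A B B .   (threads share each cycle)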

13
Q

What are some design challenges with SMTs?

A

Need a large register file:
- Many physical registers are needed to map the architectural registers of every thread.

Must also avoid worsening the critical path:
- Don’t introduce additional bottlenecks
- From the issue stage to the execution stage, each thread should make roughly as much progress as if it were running on its own processor.

Make sure that threads don’t worsen the cache behaviour of other threads. This can happen if the working set of one thread only barely fits within the cache: the next thread will then evict it to bring in its own data.
- Threads should be “good neighbours”
- Avoid evicting each other’s working sets
- Possibly share resources, or enforce fairness between them

14
Q

How does renaming work with SMTs?

A

Each thread has its own mapping table, so if two threads use the same architectural registers, these can be mapped to different physical registers.

15
Q

Why don’t we need to add more physical registers to support SMT?

A

Because a thread only uses the whole physical register file when it runs at peak performance. With SMT this won’t be the case, as the individual threads won’t be running at peak performance.

16
Q

What does an OoO pipeline look like with SMTs?

A

Separate PCs for each thread, used to fetch instructions from the instruction cache. The fetch stage needs to support providing instructions to multiple threads at the same time.

Separate renaming units for each thread, with no crossing wires.

Separate ROBs for each thread (in-order commit, precise exceptions).

17
Q

How do dynamic processors find parallelism?

A

Through OoO execution of independent instructions.

18
Q

What are the basic elements of DLP?

A

Designing systems around the data elements directly, instead of having the hardware expose and exploit parallelism from the instruction stream.

This is done through vectorization.

19
Q

What is vectorization?

A

Expressing computation over large sets of data, where one operation is applied to all data points.

20
Q

What is SIMD?

A

Extends existing hardware features to enable a simplified form of vector execution.

Splits registers into multiple data elements.

Execution units work on parallel lanes, one per data element.

The programmer must convert their vectors to formats (e.g. lengths) that are supported in hardware.
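
As a concrete sketch of this model, here is vector addition of 16-bit elements using x86 SSE2 intrinsics (the function name and choice of element width are our own; a real implementation also needs a scalar tail loop when n is not a multiple of 8):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Add two arrays of 16-bit elements, eight lanes per 128-bit register. */
void add_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* one instruction performs eight 16-bit additions in parallel */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
    }
}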

21
Q

What are vector architectures?

A

Explicitly introduce vector registers and vector processing.

Vectors up to a maximum size are supported. This size is generally larger than in SIMD systems.

HW can be designed to speed up vector fetch and addressing.

22
Q

What types of applications benefit from SIMD?

A

Ones using narrow data types (8-32 bit data): multiple such elements can fit in a single register.

Existing register structures can be reused, and with minor modifications to the execution units, operations can be performed on multiple data elements packed into a single register.

23
Q

How is SIMD designed?

A

Use the full width of existing registers and split operations across it.

Specific instructions identify how to split up the registers (1x64, 2x32, etc.).

24
Q

Give an example of how registers can be modified in a SIMD design

A

Take a 64-bit register and treat it as four 16-bit sub-registers (see the sketch below).
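
A software sketch of the same idea ("SIMD within a register", illustrative rather than any specific ISA): treat a plain 64-bit integer as four 16-bit lanes and keep carries from crossing lane boundaries:

#include <stdint.h>

/* Add four 16-bit lanes packed into a 64-bit word, without letting
   a carry spill from one lane into the next. */
uint64_t add4x16(uint64_t x, uint64_t y) {
    const uint64_t MSB = 0x8000800080008000ULL; /* top bit of each lane */
    /* add the low 15 bits of each lane, then fold the top bits back in */
    return ((x & ~MSB) + (y & ~MSB)) ^ ((x ^ y) & MSB);
}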

25
Q

What are some benefits of SIMD?

A

Performance:
- Multiplicative factor on processing speed
- Latency stays the same

HW implementation has low cost:
- Logical operations already work
- ALUs only need minor adjustments

26
Q

What are some limitations of SIMD?

A

Instruction operand encoding sometimes gives bad alignment, reducing the performance benefit.

The programmer needs to make sure data is aligned, as SIMD does not provide sophisticated addressing modes.

No mask registers.

Challenging to program and compile:
- The program must explicitly tile in cases where alignment is not exact.
- This puts more responsibility on the programmer.

27
Q

How does vector addition benefit from vector architectures?

A

for (i = 0; i < 100; i++) a[i] = b[i] + c[i];

In a vector architecture, this entire loop can be expressed as a single instruction:
A = B + C
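
As a concrete sketch, GCC/Clang vector extensions let C express this one-instruction form directly (the type name v4si is our own choice):

/* GCC/Clang vector extension: four 32-bit ints per vector value */
typedef int v4si __attribute__((vector_size(16)));

v4si vec_add(v4si b, v4si c) {
    return b + c;   /* the whole vector addition, no explicit loop */
}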

28
Q

What are some benefits of vector architectures?

A

Easier to program.

Parallelism is expressed very explicitly, which makes for more efficient hardware.

Vector elements are inherently independent, which gives much less control overhead.

A lot of data is processed at any given time, since the vector elements are executed in parallel.

Defining vector properties abstractly gives more freedom in how they are implemented in HW; the same code can scale to different designs.

Vector processing maps well to software through vector-length (vector_len) properties.

29
Q

How are vector architectures designed?

A

Use vector registers; these are very large, each storing (Vector_len / Data_len) elements.

FU execution time depends on the type of operation and the number of elements in a vector.

FUs support accesses to individual vector elements.
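
Worked example: with a 512-bit vector register (Vector_len = 512) and 64-bit elements (Data_len = 64), each register stores 512 / 64 = 8 elements.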

30
Q

How is vector data loaded from memory?

A

The base address is calculated.

The vector data is then brought into the vector registers.

31
Q

What are some limitations of vectors?

A

Reduced performance for control-flow-dominated workloads.

Data must be vectorizable.

The memory system must be capable of servicing large amounts of data with low latency, so that the program stays responsive even when large units of data are being processed.

32
Q

How have SIMD and Vector converged over time?

A

SIMD extensions slowly morphed to become more vector-like.

SIMD is a constrained vector model.

Vector processing is in general a powerful tool for data processing.

However, SIMD does not have the flexible vector addressing modes that vector architectures have.

33
Q

What is start-up time?

A

Time from when data is requested until the first word is read back into the system.

34
Q

After the initial start-up time, how much data should the memory system provide per cycle?

A

LANES * Data_len bits per cycle.
In this case, as much data as possible is provided each cycle.

Lanes: number of parallel executions we can do per cycle

Data_len (DLEN): number of bits in each data element

This can be achieved using memory banks (interleaving)
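
Worked example: with 4 lanes and DLEN = 64 bits, the memory system must sustain 4 * 64 = 256 bits per cycle after start-up.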

35
Q

What must vector architectures support to deal with strided load-stores?

A

For example, matrix multiplication: a row from the first matrix is multiplied with a column from the other matrix. This means the vector unit needs to handle both adjacent (unit-stride) and strided accesses.

This can for example be done by specifying the stride in the instruction:
ld v1, number_elements, [base_address], stride

The same load instruction can then be used to implement all the different strides.
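
To illustrate where strides come from: in a row-major C matrix, walking down a column steps through memory one full row at a time (names below are illustrative):

#define N 64

/* Sum column j of a row-major N x N matrix: consecutive accesses are
   N doubles apart in memory, i.e. a stride of N * sizeof(double) bytes. */
double column_sum(double A[N][N], int j) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += A[i][j];   /* strided access down the column */
    return sum;
}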

36
Q

What are sparse matrices, and what are some of their properties?

A

Matrices with mostly zero elements. The zero elements need no processing, but we can’t know whether an element is zero until it has already been loaded.

Gather-scatter can be used to take care of this issue.

37
Q

What is gather-scatter with regard to sparse matrices?

A

Uses metadata to find all non-zero elements (gather),
processes these,
then writes the results back to those elements in the matrix (scatter).

LVI: Load Vector Indexed (gather)
SVI: Store Vector Indexed (scatter)
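
A scalar C sketch of what the LVI/SVI pair accomplishes, assuming an index array idx[] listing the positions of the non-zero elements (all names are illustrative):

/* Gather the non-zero elements into a dense buffer, process them,
   then scatter the results back to their original positions. */
void gather_process_scatter(double *a, const int *idx, int nnz) {
    double dense[nnz];                 /* C99 VLA; assumes nnz > 0 */
    for (int i = 0; i < nnz; i++)      /* gather (LVI) */
        dense[i] = a[idx[i]];
    for (int i = 0; i < nnz; i++)      /* process the dense vector */
        dense[i] *= 2.0;
    for (int i = 0; i < nnz; i++)      /* scatter (SVI) */
        a[idx[i]] = dense[i];
}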

38
Q

What is DLP acceleration?

A

Techniques to improve the performance of DLP execution.

39
Q

What are 3 ways of accelerating DLP?

A

Speeding up dependency handling

Reducing impact of divergence

Parallelizing data element execution

40
Q

What is chaining of instructions?

A

Elements within a vector are independent, but there can still be dependencies between vectors.

For example, each element of vector 2 may have a true data dependence on the element in the same position of vector 1.

Some data elements of the source vector complete before the whole vector operation has finished. Chaining exploits this: execution of the dependent vector instruction begins before the previous one finishes.

In effect, vectors are deconstructed into element registers, and dependent operations issue as soon as their source operands have been computed in the source vector (see the timeline below).
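
A simplified timeline (assuming one element per cycle per FU, purely illustrative):

v3 = v1 * v2:  e0 e1 e2 e3 e4 ...
v5 = v3 + v4:     e0 e1 e2 e3 ...   (each add issues as soon as the
                                     matching multiply result is ready)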

41
Q

What are multiple lanes?

A

Split vector execution across multiple execution units, instead of executing a whole vector on a single unit, and store each element's result in its correct position in the result vector.

42
Q

What are predicates (vector-mask control)?

A

Take the outcome of an if-else evaluation and create a mask vector, with the elements that evaluate to true set to 1 and the others to 0.

The operation is applied to an element only if its mask entry is enabled; the rest of the elements become zero.

This ensures that an entire vector can still move as one (see the sketch below).
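
A scalar C sketch of vector-mask control, following the semantics above (masked-off elements are written as zero; names are illustrative):

/* Vectorizable form of: if (b[i] != 0) a[i] = a[i] / b[i]; else a[i] = 0; */
void masked_div(double *a, const double *b, int n) {
    unsigned char mask[n];             /* C99 VLA; assumes n > 0 */
    for (int i = 0; i < n; i++)        /* 1. build the mask vector */
        mask[i] = (b[i] != 0.0);
    for (int i = 0; i < n; i++)        /* 2. apply the op under the mask */
        a[i] = mask[i] ? a[i] / b[i] : 0.0;
}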