Lecture 10 - Data Level Parallelism Flashcards
(13 cards)
1
Q
Introduction
A
- Covers the transition from Instruction-Level Parallelism (ILP) to Data-Level Parallelism (DLP), focusing on performing operations on data in parallel.
- SIMD (Single Instruction Multiple Data) is a key concept in DLP.
2
Q
Uses of SIMD Architectures
A
- SIMD architectures are effective for matrix-oriented scientific computing and media-oriented image and sound processing.
- SIMD is more energy-efficient than MIMD (Multiple Instruction Multiple Data) because it fetches one instruction per data operation, making it suitable for mobile devices.
- SIMD allows programmers to maintain a sequential programming mindset, although compilers may generate the actual SIMD instructions.
3
Q
SIMD Architectures
A
- SIMD architectures include vector architectures (e.g., DSPs), SIMD extensions (e.g., MMX/SSE/AVX), and Graphics Processor Units (GPUs).
- SIMD extensions on x86 processors have shown rapid growth in SIMD width (128 to 512 bits), indicating a trend toward increased DLP, though AVX-512 has since been deprecated (a usage sketch follows this list).
- SIMD offers greater potential speedup compared to MIMD.
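As a concrete illustration (not from the lecture), here is a minimal C sketch of a SIMD extension in use: one 256-bit AVX instruction performs four double-precision additions. It assumes an x86 CPU with AVX and a compiler flag such as -mavx.

```c
#include <immintrin.h>

// One 256-bit AVX add: four double-precision additions per instruction.
void add4(const double *a, const double *b, double *c) {
    __m256d va = _mm256_loadu_pd(a);     // load 4 doubles from a
    __m256d vb = _mm256_loadu_pd(b);     // load 4 doubles from b
    __m256d vc = _mm256_add_pd(va, vb);  // 4 additions, 1 instruction
    _mm256_storeu_pd(c, vc);             // store 4 results
}
```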
4
Q
Scalar vs. Vector Operation
A
- A scalar operation processes one pair of operands per instruction, so adding two n-element arrays takes n separate add instructions.
- A vector operation applies a single instruction to entire vectors of operands, e.g., one ADDVV.D adds all element pairs of two vector registers (see the sketch below).
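A minimal C sketch of the contrast; the vector side is described in comments, since it is a single hardware instruction rather than C source:

```c
// Scalar: the add is issued once per iteration, so n instructions execute.
void add_scalar(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
// Vector: conceptually, the whole loop above becomes one instruction,
// e.g. VMIPS "ADDVV.D V3,V1,V2" adds 64 element pairs at once.
```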
5
Q
Vector Architectures
A
- The basic idea of vector architectures involves reading sets of data elements into vector registers, operating on those registers, and dispersing the results back into memory.
- Vector registers are managed by the compiler to hide memory latency and efficiently access regularly arranged data, thus leveraging memory bandwidth.
6
Q
VMIPS Instructions
A
- VMIPS is presented as an example architecture loosely based on the Cray-1.
- VMIPS features vector registers (64 elements, 64 bits/element), vector functional units (fully pipelined, with hazard detection), a vector load-store unit (fully pipelined, one word per clock cycle after initial latency), and scalar registers.
- Key VMIPS instructions include ADDVV.D (add two vectors), ADDVS.D (add scalar to a vector), and LV/SV (vector load and store).
- The document uses the DAXPY (Y := A * X + Y) operation as an example to demonstrate the efficiency of VMIPS, requiring far fewer instructions than MIPS.
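For reference, a C sketch of DAXPY; the comment summarizes the VMIPS instruction sequence from the classic Hennessy & Patterson example, which this card appears to follow:

```c
// DAXPY: Y := a*X + Y.
// On VMIPS the whole loop maps to roughly six instructions
// (L.D, LV, MULVS.D, LV, ADDVV.D, SV), versus on the order of
// 600 dynamic MIPS instructions for 64 elements.
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```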
7
Q
Vector Execution Time
A
- Vector execution time depends on operand vector length, structural hazards, and data dependencies.
- VMIPS functional units process one element per clock cycle, making execution time approximately equal to the vector length.
- Concepts like “convoy” (a set of vector instructions that can execute together without hazards) and “chime” (the unit of time to execute one convoy) are introduced to analyze vector execution timing.
- For example, the DAXPY sequence executes in three convoys (LV with MULVS.D, LV with ADDVV.D, then SV), i.e., three chimes, or roughly 3 × 64 = 192 cycles for 64-element vectors. (Pairing a load with the operation that consumes it relies on chaining; see the next card.)
8
Q
Chaining
A
- Chaining allows vector operations with read-after-write dependencies to be placed in the same convoy, improving performance: e.g., MULVS.D can begin consuming elements of a vector load as soon as each element arrives, rather than waiting for the entire LV to complete.
9
Q
Optimizations
A
- The document discusses optimizations such as using multiple lanes to process more than one element per clock cycle: with four lanes, a 64-element vector operation completes in roughly 16 cycles instead of 64.
10
Q
Vector Length Register (VLR)
A
- The Vector Length Register (VLR) is used when the vector length is not known at compile time.
- Strip mining is employed for vectors exceeding the maximum vector length (MVL): the loop is split into pieces of at most MVL elements, as sketched below.
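A minimal C sketch of strip mining, assuming a maximum vector length (MVL) of 64 as in VMIPS; each inner loop corresponds to one vector operation whose length is written into the VLR:

```c
// Strip-mined DAXPY: handle arbitrary n in pieces of at most MVL elements.
void daxpy_strip(int n, double a, const double *x, double *y) {
    enum { MVL = 64 };                          // assumed max vector length
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL; // value placed in the VLR
        for (int j = 0; j < vl; j++)            // runs as one vector op of length vl
            y[i + j] = a * x[i + j] + y[i + j];
    }
}
```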
11
Q
Vector Mask Registers
A
- Vector mask registers allow operations to apply selectively to vector elements, which makes loops containing conditionals vectorizable (see the sketch below).
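A C sketch of the kind of conditional loop that mask registers make vectorizable; on a vector machine the comparison sets the mask bits and the arithmetic executes only where a bit is 1 (in VMIPS, roughly SNEVS.D followed by SUBVV.D):

```c
// Only elements with x[i] != 0 are updated; a vector machine runs the
// subtraction across the whole vector under the mask set by the compare.
void masked_sub(int n, double *x, const double *y) {
    for (int i = 0; i < n; i++)
        if (x[i] != 0.0)   // sets the vector mask bits
            x[i] -= y[i];  // executes under the mask
}
```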
12
Q
Memory Banks
A
- The memory system in vector architectures must supply high bandwidth via multiple memory banks, with independent address control per bank and support for non-sequential word access.
- For example, if a bank is busy for 6 clock cycles per access, at least 6 independent banks are needed to sustain one access per clock.
13
Q
Stride
A
- Vector stride is the spacing in memory between consecutively accessed elements of a vector (see the sketch below).
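A C sketch of a non-unit-stride access pattern: reading column j of a row-major matrix touches every N-th element, so a vector load would use a stride of N elements (VMIPS provides LVWS/SVWS for strided access):

```c
// Column access in a row-major M x N matrix: stride of N doubles.
double column_sum(const double *a, int M, int N, int j) {
    double sum = 0.0;
    for (int i = 0; i < M; i++)
        sum += a[i * N + j];  // consecutive accesses are N elements apart
    return sum;
}
```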