Lecture 10 - Data Level Parallelism Flashcards
(13 cards)
1
Q
Introduction
A
- Covers the transition from Instruction-Level Parallelism (ILP) to Data-Level Parallelism (DLP), focusing on performing operations on data in parallel.
- SIMD (Single Instruction Multiple Data) is a key concept in DLP.
2
Q
Uses of SIMD Architectures
A
- SIMD architectures are effective for matrix-oriented scientific computing and media-oriented image and sound processing.
- SIMD is more energy-efficient than MIMD (Multiple Instruction Multiple Data) because it fetches one instruction per data operation, making it suitable for mobile devices.
- SIMD allows programmers to maintain a sequential programming mindset, although compilers may generate the actual SIMD instructions.
3
Q
SIMD Architectures
A
- SIMD architectures include vector architectures (e.g., DSPs), SIMD extensions (e.g., MMX/SSE/AVX), and Graphics Processor Units (GPUs).
- SIMD extensions on x86 processors have shown rapid growth in SIMD width (128 to 512 bits), indicating a trend toward increased DLP, though AVX-512 has since been deprecated (a usage sketch follows this list).
- SIMD offers greater potential speedup compared to MIMD.
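As a concrete illustration (not from the lecture), here is a minimal C sketch of a SIMD extension in use: one 256-bit AVX instruction performs four double-precision additions. It assumes an x86 CPU with AVX and a compiler flag such as -mavx.

```c
#include <immintrin.h>

// One 256-bit AVX add: four double-precision additions per instruction.
void add4(const double *a, const double *b, double *c) {
    __m256d va = _mm256_loadu_pd(a);     // load 4 doubles from a
    __m256d vb = _mm256_loadu_pd(b);     // load 4 doubles from b
    __m256d vc = _mm256_add_pd(va, vb);  // 4 additions, 1 instruction
    _mm256_storeu_pd(c, vc);             // store 4 results
}
```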
4
Q
Scalar vs. Vector Operation
A
- A scalar operation processes one pair of operands per instruction, so adding two n-element arrays takes n separate add instructions.
- A vector operation applies a single instruction to entire vectors of operands, e.g., one ADDVV.D adds all element pairs of two vector registers (see the sketch below).
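A minimal C sketch of the contrast; the vector side is described in comments, since it is a single hardware instruction rather than C source:

```c
// Scalar: the add is issued once per iteration, so n instructions execute.
void add_scalar(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
// Vector: conceptually, the whole loop above becomes one instruction,
// e.g. VMIPS "ADDVV.D V3,V1,V2" adds 64 element pairs at once.
```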
5
Q
Vector Architectures
A
- The basic idea of vector architectures involves reading sets of data elements into vector registers, operating on those registers, and dispersing the results back into memory.
- Vector registers are managed by the compiler to hide memory latency and efficiently access regularly arranged data, thus leveraging memory bandwidth.
6
Q
VMIPS Instructions
A
- VMIPS is presented as an example architecture loosely based on the Cray-1.
- VMIPS features vector registers (64 elements, 64 bits/element), vector functional units (fully pipelined, with hazard detection), a vector load-store unit (fully pipelined, one word per clock cycle after initial latency), and scalar registers.
- Key VMIPS instructions include ADDVV.D (add two vectors), ADDVS.D (add scalar to a vector), and LV/SV (vector load and store).
- The document uses the DAXPY (Y := A * X + Y) operation as an example to demonstrate the efficiency of VMIPS, requiring far fewer instructions than MIPS.
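For reference, a C sketch of DAXPY; the comment summarizes the VMIPS instruction sequence from the classic Hennessy & Patterson example, which this card appears to follow:

```c
// DAXPY: Y := a*X + Y.
// On VMIPS the whole loop maps to roughly six instructions
// (L.D, LV, MULVS.D, LV, ADDVV.D, SV), versus on the order of
// 600 dynamic MIPS instructions for 64 elements.
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```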
7
Q
Vector Execution Time
A
- Vector execution time depends on operand vector length, structural hazards, and data dependencies.
- VMIPS functional units process one element per clock cycle, making execution time approximately equal to the vector length.
- Concepts like “convoy” (a set of vector instructions that can execute together without hazards) and “chime” (the unit of time to execute one convoy) are introduced to analyze vector execution timing.
- For example, the DAXPY sequence executes in three convoys (LV with MULVS.D, LV with ADDVV.D, then SV), i.e., three chimes, or roughly 3 × 64 = 192 cycles for 64-element vectors. (Pairing a load with the operation that consumes it relies on chaining; see the next card.)
8
Q
Chaining
A
- Chaining allows vector operations with read-after-write dependencies to be placed in the same convoy, improving performance: e.g., MULVS.D can begin consuming elements of a vector load as soon as each element arrives, rather than waiting for the entire LV to complete.
9
Q
Optimizations
A
- The document discusses optimizations such as using multiple lanes to process more than one element per clock cycle: with four lanes, a 64-element vector operation completes in roughly 16 cycles instead of 64.
10
Q
Vector Length Register (VLR)
A
- The Vector Length Register (VLR) is used when the vector length is not known at compile time.
- Strip mining is employed for vectors exceeding the maximum vector length (MVL): the loop is split into pieces of at most MVL elements, as sketched below.
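A minimal C sketch of strip mining, assuming a maximum vector length (MVL) of 64 as in VMIPS; each inner loop corresponds to one vector operation whose length is written into the VLR:

```c
// Strip-mined DAXPY: handle arbitrary n in pieces of at most MVL elements.
void daxpy_strip(int n, double a, const double *x, double *y) {
    enum { MVL = 64 };                          // assumed max vector length
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL; // value placed in the VLR
        for (int j = 0; j < vl; j++)            // runs as one vector op of length vl
            y[i + j] = a * x[i + j] + y[i + j];
    }
}
```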
11
Q
Vector Mask Registers
A
- Vector mask registers allow operations to apply selectively to vector elements, which makes loops containing conditionals vectorizable (see the sketch below).
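A C sketch of the kind of conditional loop that mask registers make vectorizable; on a vector machine the comparison sets the mask bits and the arithmetic executes only where a bit is 1 (in VMIPS, roughly SNEVS.D followed by SUBVV.D):

```c
// Only elements with x[i] != 0 are updated; a vector machine runs the
// subtraction across the whole vector under the mask set by the compare.
void masked_sub(int n, double *x, const double *y) {
    for (int i = 0; i < n; i++)
        if (x[i] != 0.0)   // sets the vector mask bits
            x[i] -= y[i];  // executes under the mask
}
```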
12
Q
Memory Banks
A
- The memory system in vector architectures must supply high bandwidth via multiple memory banks, with independent address control per bank and support for non-sequential word access.
- For example, if a bank is busy for 6 clock cycles per access, at least 6 independent banks are needed to sustain one access per clock.
13
Q
Stride
A
- Vector stride is the spacing in memory between consecutively accessed elements of a vector (see the sketch below).
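A C sketch of a non-unit-stride access pattern: reading column j of a row-major matrix touches every N-th element, so a vector load would use a stride of N elements (VMIPS provides LVWS/SVWS for strided access):

```c
// Column access in a row-major M x N matrix: stride of N doubles.
double column_sum(const double *a, int M, int N, int j) {
    double sum = 0.0;
    for (int i = 0; i < M; i++)
        sum += a[i * N + j];  // consecutive accesses are N elements apart
    return sum;
}
```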