Multicore Systems Flashcards

Week 2.10 (48 cards)

1
Q

what are the 3 drivers of growth in microprocessor performance

A
  1. higher clock frequency
  2. increased transistor density
  3. smarter on-chip architecture designs
2
Q

what 3 things are now used in processor designs to exploit ILP

A
  1. pipelining
  2. superscalar
  3. SMT
3
Q

describe pipelining

A
  • enables overlapping of instructions
  • while one instruction is in one stage, another can occupy a different stage
  • increases instruction throughput and performance
4
Q

limit of pipelining

A

deeper pipelines require more control logic, with diminishing returns

5
Q

describe superscalar

A
  • replicate execution units to create multiple pipelines
  • multiple instructions are issued and executed in parallel during each instruction cycle
  • parallelism is limited by instruction dependencies
6
Q

limit of superscalar

A

code rarely offers parallelism beyond 6 pipelines

7
Q

describe SMT

A
  • allows multiple threads to share pipeline resources within a single core
  • instructions from different threads can be issued in the same cycle
8
Q

limit of SMT

A

complexity in scheduling and resource sharing limits scalability

9
Q

describe the effect of increasing clock speeds & complexity on power consumption

A

^ clock speed & added complexity = ^ power density

10
Q

what is the response to rising power density

A

^ on-chip cache memory, as cache consumes less power than logic

11
Q

what does multicore offer in terms of performance

A

if software is parallelisable = near-linear performance
- large caches are underutilised by single threads
- multiple cores and threads better exploit on-chip cache memory

12
Q

what does amdahl’s law represent

A

theoretical speedup for a single application on N cores

13
Q

amdahl’s law formula

A

Speedup(N) = 1/((1 - f) + f/N)
where:
- f = fraction of the program that is parallelisable
- 1 - f = fraction that is inherently sequential

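The formula above can be checked with a short script (a sketch; `amdahl_speedup` is an illustrative name, not from the source):

```python
def amdahl_speedup(f, n):
    """Amdahl's law: speedup on n cores when a fraction f
    of the program is parallelisable (0 <= f <= 1)."""
    return 1.0 / ((1.0 - f) + f / n)

# Even a highly parallel program (f = 0.9) gains far less
# than 8x on 8 cores: the sequential 10% dominates.
print(round(amdahl_speedup(0.9, 8), 2))     # 4.71
print(round(amdahl_speedup(0.9, 1000), 2))  # 9.91, capped near 1/(1-f) = 10
```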
14
Q

what does amdahl’s law assume

A

perfect parallelism with no scheduling overhead

15
Q

what is a more realistic assumption about amdahl’s law

A

as core count increases, overhead grows, so speedup peaks and then degrades

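One hedged way to model this peak-and-degrade behaviour is to add a linear per-core overhead term k·N to Amdahl's denominator (the term and the constants f = 0.95, k = 0.005 are illustrative assumptions, not from the source):

```python
def speedup_with_overhead(f, n, k):
    """Amdahl's law extended with a linear per-core overhead
    term k*n (scheduling, communication, coherence traffic)."""
    return 1.0 / ((1.0 - f) + f / n + k * n)

# Speedup rises, peaks, then degrades once overhead
# outgrows the remaining parallel gain.
best_n = max(range(1, 65), key=lambda n: speedup_with_overhead(0.95, n, 0.005))
print(best_n)  # 14 -- beyond this point, adding cores hurts
```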
16
Q

4 applications that scale well with multicore systems

A
  1. multithreaded native apps
  2. multiprocess applications
  3. java applications
  4. multi-instance applications
17
Q

define threading granularity

A

refers to the smallest unit of work that can be parallelised

18
Q

what does fine-grained threading give

A

flexibility
- BUT ^ overhead from management

19
Q

trade off of threading granularity

A

parallelism vs overhead

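The trade-off can be sketched with a toy cost model (all numbers and names are illustrative assumptions): splitting work into more chunks exposes more parallelism, but each chunk adds management overhead.

```python
def run_time(work, cores, n_chunks, per_chunk_overhead):
    """Toy model: parallel compute time plus a linear cost
    for creating/scheduling each chunk of work."""
    compute = work / min(n_chunks, cores)  # cores sit idle if chunks < cores
    overhead = n_chunks * per_chunk_overhead
    return compute + overhead

# 1000 units of work on 8 cores, 2 units of overhead per chunk:
print(run_time(1000, 8, 1, 2))    # 1002.0 - too coarse, 7 cores idle
print(run_time(1000, 8, 8, 2))    # 141.0  - balanced
print(run_time(1000, 8, 500, 2))  # 1125.0 - too fine, overhead dominates
```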
20
Q

what does rendering use

A

hierarchical thread structure

21
Q

what are the 4 multicore cache organisations

A
  1. dedicated L1 cache
  2. dedicated L2 cache
  3. shared L2 cache
  4. shared L3 cache
22
Q

5 advantages of shared higher-level cache

A
  1. constructive interference
  2. no data replication needed
  3. dynamic cache allocation
  4. easier inter-core communication
  5. simplified coherency
23
Q

why is constructive interference an advantage of shared higher-level cache

A

threads on different cores benefit from shared data already located in cache

24
Q

why is dynamic cache allocation an advantage of shared higher-level cache

A

cores can use more or less cache depending on workload needs

25
Q

why is simplified coherency an advantage of shared higher-level cache

A

coherence issues are limited to private (lower-level) caches, reducing overhead
26
Q

what is the trend in cache hierarchy design

A

as memory capacity and core counts increase, cache coherency becomes more important
27
Q

new cache hierarchy design

A

L1 per core, L2 shared among 2-4 cores, and global shared L3
28
Q

why is SMT deployed in multicore design

A
  • SMT increases the number of hardware threads per chip
  • as software becomes more parallel, SMT becomes more attractive than purely superscalar designs
29
Q

define homogeneous multicore

A

all cores are identical
30
Q

define heterogeneous multicore

A
  • different types of cores on one chip
  • mixing cores in this context means using cores with different ISAs optimised for different tasks
31
Q

describe CPU/GPU multicore

A
  • GPUs support thousands of parallel threads
  • combining CPUs & GPUs improves flexibility & performance across diverse workloads
  • CPUs & GPUs share key on-chip resources such as last-level cache, interconnect & memory controllers
32
Q

what are the 2 solutions to CPU/GPU cache issues

A
  1. physical memory partitioned between CPU & GPU
  2. enabling shared access to memory & unified execution
33
Q

how does physical memory partitioned between CPU & GPU work

A
  • CPU had to explicitly copy data to GPU memory
  • GPU copied results back after computation
  • significant performance penalty
34
Q

what are the 2 challenges with CPU/GPU multicore

A
  1. ensuring coordination & correctness between different types of cores
  2. cache sharing - differences in access patterns & sensitivity
35
Q

how does enabling shared access to memory & unified execution work

A
  • shared virtual memory
  • demand paging
  • cache coherence
  • unified programming interface
36
Q

describe CPU/DSP multicore

A

DSPs excel at ultra-fast, math-intensive operations
37
Q

uses of CPU/DSP

A
  • cellphones
  • modems
  • sound cards
  • hard drives
38
Q

describe ARM big.LITTLE architecture

A
  • Cortex-A15: high performance for intensive tasks
  • Cortex-A7: power-efficient for light tasks
  • designed for smartphones & tablets
  • use low-power cores most of the time, and switch to high-performance cores when needed
  • extends battery life while still meeting user performance expectations
39
Q

describe Cortex-A15

A
  • out-of-order, superscalar processor
  • pipeline length: 15-24 stages
  • 3 instructions per cycle
  • 8 execution units, each with its own queue
40
Q

describe Cortex-A7

A
  • simpler, energy-efficient in-order processor
  • pipeline length: 8-10 stages
  • uses a single queue
  • supports dual-issue: 2 instructions per cycle
  • low transistor count - low power consumption
41
Q

describe cache design in heterogeneous multicore systems

A
  • use dedicated L2 caches per processor type
  • hardware-based cache coherence preferred in SoCs
42
Q

describe Intel Core i7 architecture

A
  • 6-core processor with each core having a private L2 cache
  • all cores share a 12 MB L3 cache
  • uses hardware prefetching
  • integrated DDR3 memory controller
  • uses QPI (QuickPath Interconnect)
43
Q

describe Cortex-A15 MPCore

A
  • high-performance applications
  • homogeneous multicore processor with multiple A15 cores
44
Q

how does interrupt handling occur in Cortex-A15 MPCore

A
  • Generic Interrupt Controller manages & routes interrupts to A15 cores and is accessed by each core via the SCU
    ○ mask and prioritise interrupts
    ○ route interrupts to one or more CPUs
    ○ track interrupt status
    ○ generate interrupts via software
  • interrupt distributor sends highest-priority pending interrupt to each CPU interface
  • CPU interface sends
    ○ acknowledge signal
    ○ end of interrupt
45
Q

describe cache coherency in Cortex-A15 MPCore

A
  • L1 cache coherence is maintained by the SCU using the MESI protocol
  • L2 cache coherence uses a hybrid MESI/MOESI protocol
46
Q

describe the multichip module (MCM) in the IBM EC12

A

integrates 6 high-speed processor units, each with 4 cores, & 2 storage control chips for interconnection to other MCMs
47
Q

describe the four-level cache architecture in the IBM EC12

A
  • L1 & L2 caches are private & use a write-through policy
  • L3 shared for high capacity & fault tolerance
  • L4 cache (on SC chips) acts as a coherence manager
48
Q

what is the IBM EC12 best suited for

A
  • suited for transaction-heavy workloads with high data sharing & context switching
  • prioritises minimising main memory access by maximising on-chip cache