Multicore Systems Flashcards

Week 2.10 (48 cards)

1
Q

what are the 3 drivers of growth in microprocessor performance

A
  1. higher clock frequency
  2. increased transistor density
  3. smarter on-chip architecture designs
2
Q

what 3 things are now used in processor designs to exploit ILP

A
  1. pipelining
  2. superscalar
  3. SMT
3
Q

describe pipelining

A
  • enables overlapping of instructions
  • while one instruction is in one stage, another can occupy a different stage
  • increases instruction throughput and performance
4
Q

limit of pipelining

A

deeper pipelines require more control logic, with diminishing returns

5
Q

describe superscalar

A
  • replicate execution units to create multiple pipelines
  • multiple instructions are issued and executed in parallel during each instruction cycle
  • parallelism is limited by instruction dependencies
6
Q

limit of superscalar

A

code rarely offers parallelism beyond 6 pipelines

7
Q

describe SMT

A
  • allows multiple threads to share pipeline resources within a single core
  • instructions from different threads can be issued in the same cycle
8
Q

limit of SMT

A

complexity in scheduling and resource sharing limits scalability

9
Q

describe the effect of increasing clock speeds & complexity on power consumption

A

^ clock speed & added complexity = ^ power density

10
Q

what is the response to rising power density

A

^ on-chip cache memory, as cache consumes less power than logic

11
Q

what does multicore offer in terms of performance

A

if software is parallelisable = near-linear performance
- large caches are underutilised by single threads
- multiple cores and threads better exploit on-chip cache memory

12
Q

what does amdahl’s law represent

A

theoretical speedup for a single application on N cores

13
Q

amdahl’s law formula

A

Speedup(N) = 1/((1 - f) + f/N)
where:
- f = fraction of the program that is parallelisable
- 1 - f = fraction that is inherently sequential

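The formula above can be checked with a short script (a sketch; `amdahl_speedup` is an illustrative name, not from the source):

```python
def amdahl_speedup(f, n):
    """Amdahl's law: speedup on n cores when a fraction f
    of the program is parallelisable (0 <= f <= 1)."""
    return 1.0 / ((1.0 - f) + f / n)

# Even a highly parallel program (f = 0.9) gains far less
# than 8x on 8 cores: the sequential 10% dominates.
print(round(amdahl_speedup(0.9, 8), 2))     # 4.71
print(round(amdahl_speedup(0.9, 1000), 2))  # 9.91, capped near 1/(1-f) = 10
```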
14
Q

what does amdahl’s law assume

A

perfect parallelism with no scheduling overhead

15
Q

what is a more realistic assumption about amdahl’s law

A

as core count increases, overhead grows, so speedup peaks and then degrades

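One hedged way to model this peak-and-degrade behaviour is to add a linear per-core overhead term k·N to Amdahl's denominator (the term and the constants f = 0.95, k = 0.005 are illustrative assumptions, not from the source):

```python
def speedup_with_overhead(f, n, k):
    """Amdahl's law extended with a linear per-core overhead
    term k*n (scheduling, communication, coherence traffic)."""
    return 1.0 / ((1.0 - f) + f / n + k * n)

# Speedup rises, peaks, then degrades once overhead
# outgrows the remaining parallel gain.
best_n = max(range(1, 65), key=lambda n: speedup_with_overhead(0.95, n, 0.005))
print(best_n)  # 14 -- beyond this point, adding cores hurts
```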
16
Q

4 applications that scale well with multicore systems

A
  1. multithreaded native apps
  2. multiprocess applications
  3. java applications
  4. multi-instance applications
17
Q

define threading granularity

A

refers to the smallest unit of work that can be parallelised

18
Q

what does fine-grained threading give

A

flexibility
- BUT ^ overhead from management

19
Q

trade off of threading granularity

A

parallelism vs overhead

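The trade-off can be sketched with a toy cost model (all numbers and names are illustrative assumptions): splitting work into more chunks exposes more parallelism, but each chunk adds management overhead.

```python
def run_time(work, cores, n_chunks, per_chunk_overhead):
    """Toy model: parallel compute time plus a linear cost
    for creating/scheduling each chunk of work."""
    compute = work / min(n_chunks, cores)  # cores sit idle if chunks < cores
    overhead = n_chunks * per_chunk_overhead
    return compute + overhead

# 1000 units of work on 8 cores, 2 units of overhead per chunk:
print(run_time(1000, 8, 1, 2))    # 1002.0 - too coarse, 7 cores idle
print(run_time(1000, 8, 8, 2))    # 141.0  - balanced
print(run_time(1000, 8, 500, 2))  # 1125.0 - too fine, overhead dominates
```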
20
Q

what does rendering use

A

hierarchical thread structure

21
Q

what are the 4 multicore cache organisations

A
  1. dedicated L1 cache
  2. dedicated L2 cache
  3. shared L2 cache
  4. shared L3 cache
22
Q

5 advantages of shared higher-level cache

A
  1. constructive interference
  2. no data replication needed
  3. dynamic cache allocation
  4. easier inter-core communication
  5. simplified coherency
23
Q

why is constructive interference an advantage of shared higher-level cache

A

threads on different cores benefit from shared data already located in cache

24
Q

why is dynamic cache allocation an advantage of shared higher-level cache

A

cores can use more or less cache depending on workload needs

25
Q

why is simplified coherency an advantage of shared higher-level cache

A

coherence issues are limited to private (lower-level) caches, reducing overhead
26
Q

what is the trend in cache hierarchy design

A

as memory capacity and core counts increase, cache coherency becomes more important
27
Q

new cache hierarchy design

A

L1 per core, L2 shared among 2-4 cores, and global shared L3
28
Q

why is SMT deployed in multicore design

A
  • SMT increases the number of hardware threads per chip
  • as software becomes more parallel, SMT becomes more attractive than purely superscalar designs
29
Q

define homogeneous multicore

A

all cores are identical
30
Q

define heterogeneous multicore

A
  • different types of cores on one chip
  • mixing cores in this context means using cores with different ISAs optimised for different tasks
31
Q

describe CPU/GPU multicore

A
  • GPUs support thousands of parallel threads
  • combining CPUs & GPUs improves flexibility & performance across diverse workloads
  • CPUs & GPUs share key on-chip resources such as last-level cache, interconnect & memory controllers
32
Q

what are the 2 solutions to CPU/GPU cache issues

A
  1. physical memory partitioned between CPU & GPU
  2. enabling shared access to memory & unified execution
33
Q

how does physical memory partitioned between CPU & GPU work

A
  • CPU had to explicitly copy data to GPU memory
  • GPU copied results back after computation
  • significant performance penalty
34
Q

what are the 2 challenges with CPU/GPU multicore

A
  1. ensuring coordination & correctness between different types of cores
  2. cache sharing - differences in access patterns & sensitivity
35
Q

how does enabling shared access to memory & unified execution work

A
  • shared virtual memory
  • demand paging
  • cache coherence
  • unified programming interface
36
Q

describe CPU/DSP multicore

A

DSPs excel at ultra-fast, math-intensive operations
37
Q

uses of CPU/DSP

A
  • cellphones
  • modems
  • sound cards
  • hard drives
38
Q

describe ARM big.LITTLE architecture

A
  • Cortex-A15: high performance for intensive tasks
  • Cortex-A7: power-efficient for light tasks
  • designed for smartphones & tablets
  • use low-power cores most of the time, and switch to high-performance cores when needed
  • extends battery life while still meeting user performance expectations
39
Q

describe Cortex-A15

A
  • out-of-order, superscalar processor
  • pipeline length: 15-24 stages
  • 3 instructions per cycle
  • 8 execution units, each with its own queue
40
Q

describe Cortex-A7

A
  • simpler, energy-efficient in-order processor
  • pipeline length: 8-10 stages
  • uses a single queue
  • supports dual-issue: 2 instructions per cycle
  • low transistor count - low power consumption
41
Q

describe cache design in heterogeneous multicore systems

A
  • use dedicated L2 caches per processor type
  • hardware-based cache coherence preferred in SoCs
42
Q

describe Intel Core i7 architecture

A
  • 6-core processor with each core having a private L2 cache
  • all cores share a 12 MB L3 cache
  • uses hardware prefetching
  • integrated DDR3 memory controller
  • uses QPI (QuickPath Interconnect)
43
Q

describe Cortex-A15 MPCore

A
  • high-performance applications
  • homogeneous multicore processor with multiple A15 cores
44
Q

how does interrupt handling occur in Cortex-A15 MPCore

A
  • Generic Interrupt Controller manages & routes interrupts to A15 cores and is accessed by each core via the SCU
    ○ mask and prioritise interrupts
    ○ route interrupts to one or more CPUs
    ○ track interrupt status
    ○ generate interrupts via software
  • interrupt distributor sends highest-priority pending interrupt to each CPU interface
  • CPU interface sends
    ○ acknowledge signal
    ○ end of interrupt
45
Q

describe cache coherency in Cortex-A15 MPCore

A
  • L1 cache coherence is maintained by the SCU using the MESI protocol
  • L2 cache coherence uses a hybrid MESI/MOESI protocol
46
Q

describe the multichip module (MCM) in the IBM EC12

A

integrates 6 high-speed processor units, each with 4 cores, & 2 storage control chips for interconnection to other MCMs
47
Q

describe the four-level cache architecture in the IBM EC12

A
  • L1 & L2 caches are private & use a write-through policy
  • L3 shared for high capacity & fault tolerance
  • L4 cache (on SC chips) acts as a coherence manager
48
Q

what is the IBM EC12 best suited for

A
  • suited for transaction-heavy workloads with high data sharing & context switching
  • prioritises minimising main memory access by maximising on-chip cache