Accelerators Flashcards

1
Q

What are the current technology trends for single core performance?

A

Transistor counts are still increasing, now down to 5 nm. There is talk of going to 3 nm, but beyond that would be very challenging. So although we are still getting more transistors, this trend will taper off.

Frequency has flattened out since the early 2000s.
Single-thread performance has also levelled off since the early 2000s, and power has flattened out too.

The number of logical cores has increased.

2
Q

What is the main reason that single-thread performance, typical power, and frequency have flattened out?

A

The end of Dennard scaling: we have become power limited.

3
Q

What are the technology trends for multi-core performance?

A

Clock frequencies are flat.

After Dennard scaling stopped (early 2000s, around 2005), we started doing multicore.

For a number of years after that, it was still possible to keep increasing the number of cores, reaching around 8 cores by 2015.

4
Q

Describe the development of performance from 1978 to 2019

A

Early days (1978-86):
- CISC
- a lot of innovation in the ISAs themselves
- 2x performance increase every 3.5 years

Mid 80s (1986-2003):
- RISC introduced
- easier to optimise these ISAs for HW
- 2x increase every 1.5 years

Early 2000s (2003-2011):
- end of Dennard scaling
- static power (leakage) became so large that we could no longer keep scaling processors
- 2x increase every 3.5 years
- multicore begins

Early 2010s (2011-2015):
- Amdahl's law limits the performance increase
- if we keep increasing the number of cores, at some point the SW reaches a limit on how much it can be parallelized
- 2x increase every 6 years

Currently (2015-2018, probably still):
- 2x increase every 20 years

5
Q

As technology nodes decrease in size, what happens to the power per nm^2?

A

The size of technology nodes decreases through the years, while power per square nm increases.

In the early 2000s the power increased more slowly; after around 2009 it started to increase more rapidly. This is when the relative power per nm^2 crosses the size in nm.

The power of each transistor no longer scales down as well as its size does. This means that each transistor now needs to use relatively more power.

6
Q

What does the term Dark Silicon mean?

A

Dennard scaling described constant power density: if a technology node shrinks to 25% of its original size, four such nodes can be powered with the same amount of power as one node at 100% size.

Now the power of each node no longer decreases as much as its size does: a node at 25% size requires more than 1/4 of the original power. So the original power budget might only be enough to power, e.g., 40% of the total chip area, not even two of the new nodes, although four of them fit. (See the figure in the Accelerator video at 6:11.)

Since we have become power limited, we cannot have all the transistors that fit on a chip powered on at the same time; if we powered everything on the chip, we would draw too much power. The fraction of the chip that must be kept powered off is the dark silicon.

Because of this, computer architecture has become power limited.
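
Worked numbers for the example above, using the 40% figure: if the budget powers only 40% of the chip area and each scaled node occupies 25% of it, then 0.40 / 0.25 = 1.6 nodes can be on at once, even though 4 fit; the remaining 60% of the area stays dark.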

7
Q

When executing a RISC instruction, how much energy does it take? (Example; don't know for which node exactly.)

A

125 pJ

The overhead of executing the instruction (fetch, decode, ...) accounts for most of the energy.

The ALU takes some of the energy.

8
Q

When executing a load/store instruction, how much energy does it take? (Example; don't know for which node exactly.)

A

150 pJ

Accessing the data cache takes some energy; even more at the lower cache levels.

Overhead takes some energy.

The ALU takes some energy.

9
Q

When executing a 32-bit addition, how much energy does it take? (Example; don't know for which node exactly.)

A

7 pJ

So the addition by itself takes very little energy compared to a whole instruction.

10
Q

When executing an 8-bit addition, how much energy does it take? (Example; don't know for which node exactly.)

A

0.2-0.5 pJ

11
Q

When executing an SP (single-precision, 32-bit) floating-point operation, how much energy does it take? (Example; don't know for which node exactly.)

A

15-20 pJ

12
Q

What are some advanced microarchitectural techniques used to push performance, and are they energy efficient?

A

These tend to be very energy inefficient.

OoO execution:
- large instruction window (ROB)
- complicated schedulers (reservation stations, wake-up logic)

Pipelines, becoming deeper (15+ stages):
- logic and control (hazards)
- pipeline registers

Branch prediction:
- predictors becoming larger
- flushing the pipeline on a mispredict wastes energy

Complex memory hierarchy:
- large caches
- multiple levels
- deep hierarchies are wasteful for, e.g., streaming programs, where a line of data is fetched once and never used again

Prefetchers:
- on a mispredicted prefetch, the energy is wasted

Multithreading

Multiprocessing

13
Q

What can be done to try and improve energy efficiency?

A

SIMD extension units/vectorisation:
- Instead of fetching one instruction for each data element, apply one instruction to multiple data elements
- this allows the fetch/decode etc. to be done only once per vector (see the sketch below)
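
A minimal sketch of the idea in C, using x86 AVX intrinsics (assumes an AVX-capable CPU; function and variable names are made up for illustration):

```c
#include <immintrin.h>  // x86 AVX intrinsics

// Adds two float arrays. Each vector instruction covers 8 elements,
// so the fetch/decode overhead is paid once per 8 elements instead
// of once per element.
void vec_add(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)  // scalar tail for the leftover elements
        c[i] = a[i] + b[i];
}
```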

GPUs

14
Q

Describe some properties of GPUs

A

GPUs use heavy multithreading combined with SIMD-like behaviour: one instruction is fetched and executed on multiple data elements.

This reduces energy overhead per data element.

15
Q

How does vectorisation affect energy efficiency?

A

Reducing the number of instructions reduces the total overhead, and thereby improves energy efficiency.
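
A rough back-of-the-envelope using the example figures from the earlier cards, assuming an 8-wide vector unit: a scalar RISC instruction costs ~125 pJ, of which only ~7 pJ is the 32-bit addition itself, leaving ~118 pJ of overhead. One vector instruction pays that overhead once for 8 elements: 7 + 118/8 ≈ 22 pJ per element instead of 125 pJ, roughly a 6x improvement.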

16
Q

What are heterogeneous architectures?

A

There is a large number of transistors available.

Multi-core alone is not enough to improve performance.

So application-specific accelerators (GPU, encryption, TCP, etc.) are used alongside the general-purpose cores to improve performance.

17
Q

What are some HW challenges in heterogeneous architectures?

A

Identifying suitable domains to optimise the architecture for: finding something common enough that it is worth building a separate accelerator for it.

Efficient data management: how to handle data movement between devices.

18
Q

What are some SW challenges in heterogeneous architectures?

A

SW must be adapted to fit the new compute engines.

There are many types of compute engines to optimise for.

Portability.

19
Q

Why does energy efficiency increase with specialization?

A

A general-purpose design takes more energy because it must handle a variety of different cases, whereas with specialization we can design the HW to be optimised for specific tasks.

20
Q

Name the 5 general guidelines for accelerators.

A
  1. Use dedicated memories to reduce the distance over which data is moved. Instead of caches and deep hierarchies that are invisible to the SW, use memories that give the SW explicit control over moving data in and out (see the sketch after this list).
  2. Invest the resources saved by dropping advanced microarchitectural optimizations into more arithmetic units or bigger memories, and try to utilize as many of them at the same time as possible, rather than spending resources on advanced scheduling components.
  3. Use the easiest form of parallelism that matches the domain.
  4. Reduce data size and type to the simplest needed for the domain; general purpose would need larger sizes, i.e. 32/64 bit.
  5. Use a domain-specific programming language to port code to the accelerator, and make sure the code stays supported from generation to generation.
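
A minimal sketch of guideline 1 in C: software-managed data movement into a scratchpad. All names here (dma_copy_in, scratch) are hypothetical stand-ins; real accelerators expose similar but HW-specific primitives.

```c
#include <string.h>

#define SCRATCH_WORDS 1024
static float scratch[SCRATCH_WORDS];  // stands in for a small local SRAM

// Hypothetical DMA primitive: on real HW this would program a DMA
// engine; a plain memcpy stands in so the sketch runs anywhere.
static void dma_copy_in(float *dst, const float *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof *src);
}

// The SW decides explicitly what moves into the near memory and when,
// instead of relying on a transparent cache hierarchy.
// Assumes n <= SCRATCH_WORDS.
float sum_tile(const float *dram, int n) {
    dma_copy_in(scratch, dram, n);
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += scratch[i];  // every access hits the scratchpad
    return acc;
}
```
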
21
Q

Name one example of a domain for accelerators.

A

AI: much of it can be reduced to matrix multiplication.

AI has a use case that is common.

This use case can be supported by one operation (matrix multiplication).

Specialised accelerators can be built to support it (see the sketch below).
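
A minimal sketch in plain C of the one operation such an accelerator is built around (a real accelerator would implement this as dedicated hardware, e.g. a systolic array, rather than loops):

```c
// C = A * B for n x n row-major matrices: three nested loops of
// multiply-accumulate, the core workload AI accelerators target.
void matmul(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}
```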

22
Q

What are sparse tensors?

A

When training neural networks we often have dense matrices with a weight in each matrix element. However, not all of these weights are necessary to keep the precision/accuracy of the network, so many of them can be pruned away, which leaves 0-entries.

The 0-entries contribute nothing to the computation (a multiplication with 0 gives 0), so they can be compressed away and the weight matrices stored in compressed form. This takes up less space and means fewer computations need to be done (see the sketch below).
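
A minimal sketch of the idea in C using the standard CSR (compressed sparse row) format; the struct layout is illustrative, not from the source:

```c
// CSR stores only the nonzero weights, so pruned (zero) entries cost
// neither storage nor multiply operations.
typedef struct {
    int rows;
    const int   *row_start;  // rows+1 entries; row i spans [row_start[i], row_start[i+1])
    const int   *col;        // column index of each stored nonzero
    const float *val;        // the nonzero weight values
} csr_matrix;

// Sparse matrix-vector product y = M * x: multiplications with the
// pruned zero entries are simply never performed.
void spmv(const csr_matrix *m, const float *x, float *y) {
    for (int i = 0; i < m->rows; i++) {
        float acc = 0.0f;
        for (int k = m->row_start[i]; k < m->row_start[i + 1]; k++)
            acc += m->val[k] * x[m->col[k]];
        y[i] = acc;
    }
}
```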

23
Q

What are FPGAs? (Field-Programmable Gate Arrays)

A

One way of doing acceleration.

Circuits that can be reprogrammed.

It is now common to be able to rent machines with built-in FPGAs, where you can map specific problems down to the hardware to improve application performance.

24
Q

What can be used on the SW side of accelerators?

A

Nvidia CUDA

OpenCL

TensorFlow - very common with AI, neural networks

FPGAs:
- Need to generate bitfiles that represent the circuit that will perform the computations
- Often done in an HDL such as SystemVerilog or Verilog
- It is becoming more common to use high-level synthesis

25
Q

What is high level synthesis?

A

Going from a more conventional programming language such as C, the code is taken through a high-level synthesis tool (kind of like a compiler) that generates, for example, Verilog, which can then be mapped down to the FPGA.
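
A minimal sketch of what HLS input can look like: plain C with tool directives. The pragma syntax here is Xilinx/Vitis-HLS-style; other tools use different directives.

```c
// Plain C that an HLS tool can compile into a Verilog circuit.
// The pragma asks the tool to unroll the loop into 8 parallel adders,
// trading chip area for throughput.
void vec_add_hw(const int a[64], const int b[64], int c[64]) {
    for (int i = 0; i < 64; i++) {
#pragma HLS unroll factor=8
        c[i] = a[i] + b[i];
    }
}
```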
