Collected Concepts Flashcards

1
Q

Flynn’s Taxonomy

A
  • Early attempt to classify different types of parallel machines
  • Expressed in terms of Instruction Stream and/or Data Stream
2
Q

Single Instruction Single Data (SISD)

A
  • A single stream of instructions operates on a single stream of data, one element at a time
3
Q

Multiple Instruction Multiple Data (MIMD)

A
  • A collection of cores (CPUs), each executing its own sequence of instructions and each operating on a separate stream of data
  • Most modern processors are in this category
4
Q

Single Instruction Multiple Data (SIMD)

A
  • A collection of cores (CPUs) performing the same sequence of instructions on multiple streams of data in ‘lockstep’
  • Currently used in graphics applications:
    • GPUs
    • Single general-purpose CPUs working on packed data
5
Q

Single Program Multiple Data (SPMD)

A
  • More general version of SIMD
  • All processing elements execute the same program, but on different data
  • Doesn’t need to be in ‘lockstep’
6
Q

A Grid

A
  • 2D interconnection of processors
  • Processors can communicate with neighbours
  • Memory is usually private to each core
  • Communication beyond immediate neighbours requires staged (multi-hop) communication
7
Q

Torus

A
  • Same as Grid, but with ‘Wraparound’
  • More symmetrical structure, which enhances communication
8
Q

Bus Interconnection

A
  • Simple to build
  • Has to be shared using ‘time-slicing’
  • Memory can be local to each CPU/core
  • Can add shared memory to bus
  • All communications have equal access time (as long as there is no contention)
9
Q

Shared Memory

A

Memory is accessible from every part of the computation

10
Q

Distributed Memory

A
  • Memory is distributed among parts of the computation and (usually) only accessible from one part of the computation
  • Easier to build.
  • Can have higher total communication bandwidth.
  • If the problem is suited to Message Passing, it runs better on Distributed Memory hardware
11
Q

Shared Memory [Hardware View]

A
  • Memory is physically organised so that it can be accessed from any processor/core
  • Difficult to build
  • Better supports Data Sharing programming models
    • Which are easier for general-purpose computing
  • Most general-purpose hardware, in particular chip multiprocessors, is Shared Memory
12
Q

Distributed Memory [Hardware View]

A

The memory is physically connected to only one CPU/core and only that core can directly access it.

13
Q

Shared Memory [Software View]

A
  • Usually called Data Sharing
  • A program has global memory which is accessible from any part of the program
  • Communication between parallel parts can take place by writes to and reads from that global memory.
  • Easier than Message Passing parallel programming.
14
Q

Distributed Memory [Software View]

A
  • Called Message Passing
  • Separate parts of the program have local memory on which they can operate
  • Communication between parts can only take place by sending messages between the parts.
  • Programming becomes harder the less regular the problem becomes and the more dynamic its structure is.
15
Q

Programming Model

A

Software view available to the programmer

16
Q

Data Sharing on Distributed Memory

A
  • Underneath, code running on one core can only communicate with another using messages
  • This allows shared memory to be simulated on top
  • Will be slow if there are lots of remote memory accesses
17
Q

Message Passing on Shared Memory

A
  • Any core can write to a shared memory location which can be read by another core
  • An area of memory can be used as a communication buffer between separate threads of computation
  • A simple implementation of a message passing parallel programming model is possible
  • If there is only local communication, Distributed Memory would be better due to higher total bandwidth
18
Q

Threads

A
  • A thread is a flow of control executing a program
  • A process can consist of one or more threads
  • All threads in the same process share the same memory address space
  • If shared memory is not present, threads have to communicate via messages.
19
Q

Data Parallelism

A
  • Divide computation into equal-sized chunks, each working on part of the whole
  • Easiest form of parallelism
  • Works best when there are no data dependencies between these chunks
20
Q

Explicit Parallelism

A

Programmer spells out what should be done in parallel and what should be done in sequence

21
Q

Implicit parallelism

A

The system is supposed to work the parallelism out for itself

22
Q

Cache

A
  • Fast, small, local memory that holds recently used data and instructions
  • Used because main memory is too slow (50-100x slower) to keep up with the processor
  • If the program’s ‘working set’ can be kept in the caches most of the time, the processor can run at full speed
  • New data/instructions needed by the program have to be fetched from main memory (on each cache miss)
  • Newly written data in the cache must eventually be written back to main memory
23
Q

Cache Coherence Problem

A
  • When there are multiple processors, problems arise:
    • They may share variables, resulting in data inconsistency
    • Two processors writing to the same location results in two different cache copies
  • A write-through policy can be used
    • Every write is updated in memory
    • Results in delay, as every write/read across processors could involve two main memory accesses
  • The write-back policy
    • Used nowadays as it’s more efficient
24
Q

Cache States

A
  • Each cache line has 2 control bits
    • Invalid - there may be an address match, but the data is not valid (need to go to memory and fetch it)
    • Dirty - cache line is valid and has been written to, but the latest values have not been updated in memory
  • A third, implicit state exists - Valid
    • A valid cache entry exists and the line has the same values as main memory
25
Q

Bus Snooping

A
  • A snooping unit attached to every cache on the bus
  • Observes all the transactions on the bus and is able to modify the cache independently of the core
  • Takes action on seeing pertinent transfers between another core’s cache and memory
  • Able to send messages to other caches
26
Q

Cache State Transitions

A
  1. Accesses to be made to main memory
  2. Messages to be sent between caches
  3. Changes between the valid states
27
Q

Types of Messages Between Cores

A
  • 2 types of messages:
    • Read requests for a cache line
    • Invalidates of other caches’ lines
  • Can be extended beyond 2 cores
    • Any core with a valid value can respond to a read request
    • There can only be one Dirty copy; all other copies must be Invalid
28
Q

Problems with Bus Snooping

A
  • Cache coherence implies that all cores see the invalidate at the same time (in the same bus cycle)
  • The more cores connected to the bus, the more difficult this gets
  • With more cores:
    • The bus gets physically longer - signals take longer
    • The bus has more capacitance, slowing it down
  • If the bus cycle is slowed, performance suffers
  • The coherence protocol is a major limitation on the number of cores which can be supported
29
Q

MSI Protocol

A
  • Derived from a consideration of the uniprocessor cache states, which include Valid
  • The result is the MSI protocol:
    • **M**odified - the only valid copy in any cache, and it differs from memory
    • **S**hared - valid, but other caches might have it, and its value is the same as memory
    • **I**nvalid - out of date and can’t be used
30
Q

MESI Protocol

A
  • Extension of the MSI protocol; more widely used because of its significant effect on bus usage
  • The Shared (Valid) state is split into two states:
    • **M**odified
    • **E**xclusive (unshared) - there are no copies in other caches
    • **S**hared
    • **I**nvalid
  • Results in less use of bus bandwidth
31
Q

MOESI Protocol

A
  • Further optimisation of MESI
  • The Modified state is split:
    • **M**odified
    • **O**wned - the cache contains a copy which differs from memory, and there are copies in other caches which are tagged as Shared
  • Allows the latest value to be shared without having to write it back to memory
32
Q

Directory Based Coherence

A
  • Coherence scheme without a bus (for example, on a grid)
  • A centralised directory holds information about every value (every cache line) in the memory
  • Drawbacks:
    • Takes a significant number of CPU cycles due to the lack of bus communication
    • Due to longer delays, ‘handshakes’ are usually needed to work correctly
    • Unlikely solution for heavily used shared memory
33
Q

Directory Structure

A
  • Each directory entry contains:
    • Information on which core has a copy
    • Whether or not the copy is dirty
  • A core wishing to make a memory access may need to query the directory about the state of the line to be accessed
34
Q

Barrier

A
  • A place where a number of threads “meet up”
  • When all threads reach it, they can all proceed
  • But threads need to wait until the last one arrives
  • Very good for data parallelism
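A minimal sketch of a barrier in use, assuming POSIX threads (pthread_barrier_t is not part of the card material, just an illustration):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

pthread_barrier_t barrier;

void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld: phase 1 done\n", id);
    pthread_barrier_wait(&barrier);   /* wait until the last thread arrives */
    printf("thread %ld: phase 2 starts\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```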
35
Q

Locks

A
  • Used to achieve “correctness”
  • If two pieces of code take a lock on the same object, one should start and complete before the other starts
  • This preserves the normal (sequential) meaning of the chunks of code
  • Only one thread can lock any particular object at a time
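A minimal sketch, assuming POSIX threads (the names lock and counter are illustrative):

```c
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;                     /* shared object protected by the lock */

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread can hold the lock   */
        counter++;                    /* this chunk completes before another
                                         thread's locked chunk starts        */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
```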
36
Q

Sequential Consistency

A
  • We have sequential consistency if both:
    • Method calls appear to happen in a one-at-a-time sequential order, and
    • Method calls appear to take effect in program order
37
Q

Problems with Locks

A
  • Deadlock - two or more competing actions are each waiting for the other to finish, and thus neither ever does
  • Mistaken use of conditions (the “Lost WakeUp” problem)
38
Q

Coarse-grained Granularity

A
  • The code which depends on obtaining a lock is too large
  • Results in limited parallelism
39
Q

Fine-grained Granularity

A
  • The code which depends on obtaining a lock is too small
  • Results in a lot of work obtaining and releasing locks, hence the program is harder to write
40
Q

Binary Semaphore

A
  • Hardware-supported construct
  • A single shared boolean variable S ‘protects’ a shared resource
    • S == 1 means the resource is free
    • S == 0 means the resource is in use
  • Operations (atomic):
    • wait(S): wait until S != 0, then set S = 0
    • signal(S): set S = 1
  • S may be cached, hence ‘indivisible’ behaviour might require coherence operations in the cache
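A minimal busy-waiting sketch of the two operations, assuming C11 atomics stand in for the hardware support (the function names are illustrative):

```c
#include <stdatomic.h>

atomic_int S = 1;                  /* 1 = resource free, 0 = in use */

void sem_wait(atomic_int *s) {
    int expected = 1;
    /* atomically change 1 -> 0; keep retrying while the resource is in use */
    while (!atomic_compare_exchange_weak(s, &expected, 0))
        expected = 1;              /* CAS overwrote expected on failure; reset */
}

void sem_signal(atomic_int *s) {
    atomic_store(s, 1);            /* mark the resource free again */
}
```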
41
Q

Test_and_set Instruction (tas)

A
  • Simple solution on older processors
  • tas R2:
    • If the memory location addressed by R2 is 0, set it to 1 and set the processor ‘zero’ flag
    • Else, clear the zero flag
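Expressed in C, the behaviour looks roughly like this; a sketch only, since in hardware the whole body executes as one indivisible step, which plain C cannot express:

```c
/* What tas does, written out sequentially; the hardware performs
   the read-test-write below as a single atomic operation. */
int test_and_set(volatile int *addr) {
    int old = *addr;      /* read the location                           */
    if (old == 0)
        *addr = 1;        /* claim it (hardware also sets the zero flag) */
    return old;           /* old == 0 means the caller acquired it       */
}
```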
42
Q

Fetch_and_Add

A
  • Returns the value of a memory location and increments it
  • Atomic instruction
  • Read-Modify-Write instruction (RMW)
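C11 exposes this RMW operation directly; a small sketch (the ticket-counter use is illustrative):

```c
#include <stdatomic.h>

atomic_int next_ticket = 0;

/* Each caller gets a unique ticket: the read, the add and the
   write back happen as one atomic read-modify-write operation. */
int take_ticket(void) {
    return atomic_fetch_add(&next_ticket, 1);
}
```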
43
Q

Compare_and_Swap

A
  • Compares a memory location to a value (in a register) and swaps in another value if they are equal
  • Atomic instruction
  • Read-Modify-Write instruction (RMW)
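The C11 equivalent; a minimal sketch (the wrapper name is illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Swap new_val into *loc only if *loc still equals expected;
   returns true on success, false if another thread changed it first. */
bool cas_int(atomic_int *loc, int expected, int new_val) {
    return atomic_compare_exchange_strong(loc, &expected, new_val);
}
```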
44
Q

Load_Linked/Store_Conditional Overview

A
  • Synchronisation mechanism used in modern RISC processors
  • Uses separate load and store instructions (similar to ordinary load and store)
  • However, they have additional effects on processor state which allow them, as a pair, to act atomically
  • Knowing that anything between ldl and stc has executed atomically is very powerful for certain things
45
Q

Load_Linked (ldl)

A
  • ldl R1, R2
    • Loads R1 with the value addressed in memory by R2
    • Sets a ‘load linked flag’ on the core that executes it
    • Records the address in a ‘locked address register’ on the core that executes it
46
Q

Store_Conditional (stc)

A
  • stc R1, R2
    • Stores the value in R1 into the location addressed in memory by R2
    • Only succeeds if the ‘load linked flag’ is set
    • The value of the ‘load linked flag’ is returned in R1
    • The ‘load linked flag’ is cleared
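A sketch of how the pair is used, e.g. for an atomic increment; ll() and sc() are hypothetical C wrappers for the instructions (there is no portable C API for them):

```c
extern int ll(int *addr);            /* hypothetical wrapper for ldl       */
extern int sc(int *addr, int val);   /* hypothetical wrapper for stc,
                                        returns 1 if the store succeeded   */

void atomic_increment(int *counter) {
    int old, ok;
    do {
        old = ll(counter);           /* linked load sets the flag          */
        ok  = sc(counter, old + 1);  /* fails if anyone wrote in between   */
    } while (!ok);                   /* retry until the pair ran atomically */
}
```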
47
Q

Load Linked Flag

A
  • The flag is cleared on a core which has it set if a write (from another core) occurs to the locked address and, hence, an invalidate message is sent
  • Detected by comparison with the ‘locked address register’ - the processor must monitor (snoop) the memory address bus to detect this
  • This will occur because a write has occurred to the shared variable and, hence, another core has probably got the semaphore
48
Q

Spin Lock

A
  • Waiting for a lock by sitting in a loop
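A minimal spin lock sketch using the C11 atomic_flag type (which behaves like test_and_set):

```c
#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(atomic_flag *l) {
    /* atomic_flag_test_and_set returns the previous value,
       so we keep spinning while another thread holds the lock */
    while (atomic_flag_test_and_set(l))
        ;                            /* sit in the loop ('spin') */
}

void spin_unlock(atomic_flag *l) {
    atomic_flag_clear(l);
}
```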
49
Q

OpenMP

A
  • Used to create parallel programs for shared memory computers
  • Goals:
    • Parallel performance
    • Sequential equivalence
    • Incremental parallelism
50
Q

Parallel Regions

A
  • Having identified a part of the sequential program which would benefit, annotate it so that multiple threads execute this parallel region
  • Join up again at the end
  • Variables declared before the parallel region are _shared_, whereas those declared within it are _private_ (local to a thread)
  • The code block following the directive must be a structured block (single point of entry at the top, single point of exit at the bottom)
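A minimal sketch of a parallel region in C (compiled with -fopenmp; the printed messages are illustrative):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int nthreads = 0;                    /* declared before the region: shared */
    #pragma omp parallel
    {                                    /* structured block: one entry, one exit */
        int id = omp_get_thread_num();   /* declared inside the region: private */
        #pragma omp single
        nthreads = omp_get_num_threads();
        printf("hello from thread %d\n", id);
    }                                    /* threads join up again here */
    printf("ran with %d threads\n", nthreads);
    return 0;
}
```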
51
Q

Loop-splitting

A
  • A way of sharing work between threads
  • Process:
    • Find a time-consuming loop
    • Restructure it to make iterations independent of each other
    • Split parts of the ‘iteration space’ across different threads
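A sketch of the result, assuming a loop whose iterations are already independent (vector_add is an illustrative name):

```c
/* Each iteration is independent, so the iteration space can be
   split between threads with a single OpenMP directive. */
void vector_add(int n, const double *a, const double *b, double *c) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];          /* each thread handles a chunk of i */
}
```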
52
Synchronization Constructs [in C]
* flush - used to ensure memory consistency between threads * critical - indicates a critical section (can only be executed by one thread) * barrier - can be added explicitly * All done using pragmas
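A sketch combining critical and barrier (do_work is a stand-in for real per-thread computation):

```c
#include <stdio.h>

double do_work(void) { return 1.0; }   /* stand-in for per-thread work */

int main(void) {
    double total = 0.0;
    #pragma omp parallel
    {
        double local = do_work();
        #pragma omp critical
        total += local;                /* one thread at a time updates total */

        #pragma omp barrier            /* explicit barrier: all threads wait */
        /* past the barrier, every thread sees the completed total */
    }
    printf("total = %f\n", total);
    return 0;
}
```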
53
Q

Loop Scheduling

A
  • Users can affect the scheduling of loop iterations to threads (which iterations to give to which thread), and whether to do it statically or dynamically
  • #pragma omp parallel for schedule(dynamic,10)
    • Says the iteration space should be broken into chunks of 10 iterations; each thread takes one chunk at a time and comes back for more when done
  • A static allocation would divide the chunks evenly between the threads at the start
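In context, assuming iterations whose cost varies (expensive is an illustrative stand-in):

```c
/* Stand-in for work whose cost varies per iteration. */
static double expensive(int i) {
    double x = 0;
    for (int j = 0; j < i; j++) x += j;
    return x;
}

double results[1000];

void compute(void) {
    /* chunks of 10 iterations: each thread grabs a chunk,
       then comes back for another when it finishes */
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < 1000; i++)
        results[i] = expensive(i);
}
```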
54
Q

Message Passing Interface (MPI)

A
  • Most widely used API for parallel programming
  • Library for C & Fortran
  • Used to program a wide range of architectures:
    • Distributed memory machines
    • MPPs
    • Shared memory machines
55
Q

MPI Structure

A
  • Messages are identified by:
    • Sending process
    • Receiving process
    • An integer tag
  • To avoid confusion, these are defined within a communication context
  • Processes are grouped - initially there is a single group, but it can be split later
  • Communication context + process group forms a communicator (e.g. MPI_COMM_WORLD)
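A minimal sketch of two processes exchanging a message inside MPI_COMM_WORLD:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id in the group */

    if (rank == 0) {
        value = 42;
        /* identified by receiver (rank 1), tag (0) and communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```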
56
Q

Speculation

A
  • If there are idle resources, we might as well use them speculatively (to do work which may turn out to be useful)
  • Incorrect speculation uses resources that would otherwise be idle (so nothing is wasted, except power)
57
Q

Thread Level Speculation (TLS)

A
  • A technique for running a fully serial program in parallel
  • Principle:
    • Divide the single-threaded code into separate threads
    • Run the threads in parallel
    • Detect any problems and handle them
58
Q

Loop-based TLS

A
  • Each data item in each thread has a tag:
    • Not accessed (N)
    • Modified (M)
    • Speculatively loaded (S)
    • Speculatively loaded and later modified (SM)
  • In practice this means more work on each memory access
59
Q

Procedure-based TLS

A
  • Execute the serial code and, on encountering a procedure call (method/function), split the execution into two threads
  • Hardware support:
    • Every core has a cache which can be used as the speculative data buffer
    • Snooping protocols can be used to perform the reading and conflict detection operations
    • But cache capacity is limited
  • Process:
    • The procedure body is executed in the main thread
    • The code beyond the call is executed speculatively
    • The threads re-join when the call finishes, and validation is done
      • Success - the speculative thread continues as the main thread
      • Failure - the code after the call is re-executed in the old main thread
60
Q

Transactional Memory (TM)

A
  • Provides an implementation of transactions within a parallel data sharing context
  • Promises to provide a simplified programming model when parallel threads share updateable memory (i.e. to replace locks and barriers)
  • Performance is not good for programs with a significant amount of sharing
61
Q

ACID

A
  • Atomicity (applicable to TM)
  • Consistency
  • Isolation (applicable to TM)
  • Durability
62
Q

Transactions

A
  • Indivisible (atomic): executes either as a whole or not at all
  • While executing, none of its changes can be observed from outside
  • When it completes, any changes become apparent outside, and they all become apparent at the same time
    • All the changes are simultaneous
63
Q

TM Locks

A
  • Writing correct code with locks is tricky; transactions should be simpler
  • Locks are not composable
    • Users need to know about the locks to avoid deadlocks
    • Transactions shouldn’t suffer from this
64
Q

TM Commit and Abort

A
  • Transaction algorithm:
    • Starts
    • Reads some shared variables
    • Possibly writes some shared variables
    • Finally attempts to commit
      • Succeeds if there is nothing wrong
      • Fails (aborts) if it cannot commit - and then the entire transaction must be repeated
65
Q

TM Abort Reasons

A
  • A variable it has read has since been written by another transaction (R-W clash)
  • A variable it has written to has also been written to by another transaction (W-W clash)
66
Q

Readset and Writeset

A
  • A transaction keeps track of all the shared variables it has read - its **readset**
  • It keeps track of all the shared variables it has written to - its **writeset**
  • It uses both to determine clashes with other transactions
67
Q

TM Versioning

A
  • If the transaction writes to a variable and later reads it, it needs to see the value it wrote, not the original value
  • If the transaction aborts, the original value needs to still be there
  • Two ways to implement this:
    • Eager Versioning / Direct Update
    • Lazy Versioning / Deferred Update
68
Q

Eager Versioning (Direct Update)

A
  • Change the shared variable, but keep a private log of the original version
  • On commit, throw away the log
  • On abort, restore the original value from the log
  • Trusts clash detection to preserve isolation
  • Efficient if aborts are rare
69
Q

Lazy Versioning (Deferred Update)

A
  • Keep a private version of everything shared that a transaction changes
  • Reads should refer to these private versions
  • On **commit**, copy the private version values into main memory
  • On **abort**, throw away the private versions
  • Efficient if aborts are common (not rare!)
70
Q

Validation (Conflict Detection)

A
  • The process of checking that transactions do not clash
  • Two ways:
    • Lazy validation
    • Eager validation
71
Q

Lazy Validation

A
  • Do the validation when the transaction tries to commit
  • At this stage, all the reads and writes to shared variables are known
  • All conflict detection needs to do is work out whether it would be OK to commit this transaction’s result
72
Q

Eager Validation

A
  • Every write to a shared variable by a transaction can trigger a check for clashing transactions amongst those currently running
  • After identifying which transactions clash:
    • Choose which one to abort
    • Or delay some instead
  • Probably more work - but aborting transactions which would eventually fail anyway saves wasted effort
73
Q

Strong Isolation

A
  • Shared variables can only be accessed within transactions
74
Q

Weak Isolation

A
  • Shared variables can be used outside transactions as well as inside
75
Q

Basis of TM Operation

A
  • Declare sections of the code to be ‘atomic’ (a transaction)
  • When in a transaction:
    • Keep note of all addresses loaded from (readset)
    • Buffer all writes, keeping local values; do not write them to main shared memory (writeset)
  • If it is detected that another thread has written to something in the readset (a conflict), abort the transaction and restart
  • If the end of the transaction is reached successfully, commit the writeset to main memory
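A sketch of the programmer-visible side, assuming GCC’s experimental transactional memory extension (compile with -fgnu-tm); the account transfer is illustrative:

```c
long balance_a = 100, balance_b = 0;   /* shared variables */

void transfer(long amount) {
    __transaction_atomic {             /* declared atomic: a transaction    */
        balance_a -= amount;           /* writes are buffered (writeset)    */
        balance_b += amount;           /* reads are tracked (readset)       */
    }                                  /* commit both writes together, or
                                          abort and re-execute on conflict  */
}
```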
76
Q

Cache-based TM

A
  • The cache can store both the writeset and the readset
    • Needs extra tags on cache entries
  • Modified writeback protocol - writes of transactional data are only made from cache to main memory if the transaction commits
  • The snooping protocol is modified:
    • When writing a transactional variable, an invalidate is broadcast as usual
    • Any core seeing an invalidate which matches one of its readset entries must invalidate all its transactional data, abort its transaction and restart it
    • When a transaction terminates, it can commit (flush) its writeset to main memory
  • Issues:
    • Limited cache size - limits the readset/writeset
    • Risk of livelock
77
Q

Transactional Coherence and Consistency (TCC)

A
  • A large number of optimisations
  • As yet no practical implementations, but simulated with encouraging results
78
Q

TCC Principles

A
  • Each core still has a local cache in which it stores its readset during a transaction, with tags to differentiate it from other cached data
  • Writes are stored in a separate buffer (can be a RAM-based queue)
  • Writes to shared data only occur in transactions - they are buffered locally, so there is no need to broadcast invalidations
  • A transaction reaching its end successfully must commit its data to main memory:
    • Before doing this, it broadcasts all the addresses in its write buffer to all other cores
    • Cores executing a transaction must compare all the addresses in the packet with the readset in their cache
    • If there are any matches, the comparing transaction must abort and restart
  • No two transactions must be trying to commit conflicting data at the same time
    • A central ‘permission to commit’ resource could be a bottleneck
  • Lazy validation is in place
  • Has no livelock
  • Less synchronisation, due to inter-core communication only at transaction commit
79
Q

Cache Coherence

A
  • The consistency of shared resource data that ends up stored in multiple local caches