Collected Concepts Flashcards
Flynn’s Taxonomy
- Early attempt to classify different types of parallel machines
- Expressed in terms of the number of Instruction Streams and Data Streams
Single Instruction Single Data (SISD)
- A single stream of instructions operates on a single stream of data, one element at a time
Multiple Instruction Multiple Data (MIMD)
- A collection of cores (CPUs), each executing its own sequence of instructions and each operating on a separate stream of data
- Most modern processors are in this category
Single Instruction Multiple Data (SIMD)
- A collection of cores (CPUs) performing the same sequence of instructions on multiple streams of data in ‘lockstep’
- Currently used in graphics applications:
- GPUs
- Single general-purpose CPUs work on packed data (see the sketch below)
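A minimal sketch of the packed-data case, assuming an x86 CPU with SSE support and a C compiler; the arrays and values are made up for illustration:

```c
/* Packed-data SIMD on a general-purpose CPU (assumes x86 with SSE). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* One instruction stream, multiple data elements: a single packed
     * add operates on four floats in lockstep. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```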
Single Program Multiple Data (SPMD)
- A more general version of SIMD
- All execute the same code, but on different data
- Doesn’t need to be in ‘lockstep’ (see the sketch below)
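A minimal SPMD sketch, assuming an MPI installation (MPI programs are a common example of the SPMD style); compile with mpicc and launch with, e.g., mpirun -np 4. The chunk boundaries are invented for illustration:

```c
/* SPMD with MPI: every process runs the same program, but works on
 * different data selected by its rank; there is no lockstep requirement. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same code everywhere; the data each process handles depends on its rank. */
    int my_first = rank * 100;          /* illustrative chunk boundaries */
    int my_last  = my_first + 99;
    printf("Process %d of %d works on elements %d..%d\n",
           rank, size, my_first, my_last);

    MPI_Finalize();
    return 0;
}
```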
A Grid
- 2D interconnection of processors
- Processors can communicate with neighbours
- Memory is usually private to each core
- Communication beyond immediate neighbours requires staged (multi-hop) transfers
Torus
- Same as a Grid, but with ‘wraparound’ connections
- A more symmetrical structure, which enhances communication
Bus Interconnection
- Simple to build
- Has to be shared using ‘time-slicing’
- Memory can be local to each CPU/core
- Can add shared memory to bus
- All communications have equal access time (as long as there is no contention)
Shared Memory
Memory is accessible from every part of the computation
Distributed Memory
- Memory is distributed among parts of the computation and (usually) only accessible from one part of the computation
- Easier to build.
- Can have higher total communication bandwidth.
- If the problem is suited to Message Passing, it runs better on Distributed Memory hardware
Shared Memory [Hardware View]
- Memory is physically organised so that it can be accessed from any processor/core
- Difficult to build
- Better supports Data Sharing programming models
- Which are easier to use for general-purpose computing
- Most general-purpose hardware, in particular chip multiprocessors, is Shared Memory
Distributed Memory [Hardware View]
The memory is physically connected to only one CPU/core and only that core can directly access it.
Shared Memory [Software View]
- Usually called Data Sharing
- A program has global memory which is accessible from any part of the program
- Communication between parallel parts can take place by writes to and reads from that global memory.
- Easier than Message Passing parallel programming
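A minimal Data Sharing sketch using POSIX threads (compile with -pthread); the variable names are illustrative, and the mutex stands in for whatever synchronisation the program needs:

```c
/* Two threads communicate purely through reads and writes of global memory. */
#include <pthread.h>
#include <stdio.h>

int shared_value;                        /* global memory, visible to all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                   /* communicate by writing global memory */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);               /* after the join, the write is visible */

    pthread_mutex_lock(&lock);
    printf("main read %d from shared memory\n", shared_value);
    pthread_mutex_unlock(&lock);
    return 0;
}
```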
Distributed Memory [Software View]
- Called Message Passing
- Separate parts of the program have local memory on which they can operate
- Communication between parts can only take place by sending messages between the parts.
- Programming becomes harder the less regular the problem becomes and the more dynamic its structure is.
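A minimal Message Passing sketch, assuming an MPI installation (run with, e.g., mpirun -np 2); rank 0’s value exists only in its local memory until it is explicitly sent to rank 1:

```c
/* Message Passing with MPI: data moves between local memories only via messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int local = 123;                 /* only rank 0 can see this value */
        MPI_Send(&local, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int received;
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d by message\n", received);
    }

    MPI_Finalize();
    return 0;
}
```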
Programming Model
Software view available to the programmer
Data Sharing on Distributed Memory
- Code running on one core can only communicate with code on another core using messages
- Allows simulation of shared memory
- Will be slow if we have lots of remote memory accesses
Message Passing on Shared Memory
- Any core can write to a shared memory location which can be read by another core
- An area of memory can be used as a communication buffer between separate threads of computation
- A simple implementation of a message passing parallel programming model is possible
- If there is only local communication, Distributed Memory would be better due to higher total bandwidth
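A sketch of this idea using POSIX threads: an area of shared memory (a one-slot ‘mailbox’ plus a flag) acts as the communication buffer; the mutex and condition variable are assumed implementation details, not part of any particular message-passing library:

```c
/* Message passing implemented on top of shared memory (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

static int mailbox;                    /* shared buffer holding the 'message' */
static int full = 0;                   /* flag: does the mailbox hold a message? */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

void *sender(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    mailbox = 7;                       /* 'send': write the message into the buffer */
    full = 1;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);

    pthread_mutex_lock(&m);            /* 'receive': wait until a message arrives */
    while (!full)
        pthread_cond_wait(&c, &m);
    printf("received message: %d\n", mailbox);
    pthread_mutex_unlock(&m);

    pthread_join(t, NULL);
    return 0;
}
```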
Threads
- A thread is a flow of control executing a program
- A process can consist of one or more threads
- All threads in the same process share the same memory address space
- If shared memory is not present, threads have to communicate via messages.
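A minimal sketch with POSIX threads (compile with -pthread) showing that both threads see the same global variable at the same address, i.e. they share one address space:

```c
#include <pthread.h>
#include <stdio.h>

int counter = 0;                        /* lives at one address for the whole process */

void *worker(void *arg) {
    (void)arg;
    printf("worker sees counter at %p\n", (void *)&counter);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    printf("main   sees counter at %p\n", (void *)&counter);
    return 0;
}
```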
Data Parallelism
- Divide the computation into equal-sized chunks, each working on part of the whole
- Easiest form of parallelism
- Works best when there are no data dependencies between these chunks
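A minimal data-parallel sketch, assuming a compiler with OpenMP support (e.g. gcc -fopenmp); the runtime divides the loop iterations into chunks and hands each chunk to a thread:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* No data dependencies between iterations, so the chunks are independent. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    printf("b[N-1] = %.1f (using up to %d threads)\n",
           b[N - 1], omp_get_max_threads());
    return 0;
}
```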
Explicit Parallelism
The programmer spells out what should be done in parallel and what should be done in sequence
Implicit parallelism
The system is supposed to work out the parallelism for itself
Cache
- Fast small local memory that holds recently used data and instructions
- Used because main memory is too slow (roughly 50-100x slower) to keep up with the processor
- If the program’s ‘working set’ can be kept in the caches most of the time, the processor can run at full speed
- New data/instructions needed by the program have to be fetched from main memory (on each cache miss)
- Newly written data in cache must eventually be written back to main memory
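A small sketch of why the working set and access pattern matter, assuming a C compiler; the matrix size and timing method are arbitrary, and the actual numbers depend on the machine and compiler settings:

```c
/* The two loops do the same work, but the row-order loop walks memory
 * sequentially (cache friendly), while the column-order loop jumps
 * N*sizeof(double) bytes per access and misses far more often. */
#include <stdio.h>
#include <time.h>

#define N 2000
static double m[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;

    double sum = 0.0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)         /* row order: consecutive addresses */
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    clock_t t1 = clock();
    for (int j = 0; j < N; j++)         /* column order: strided addresses */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    clock_t t2 = clock();

    printf("row order:    %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column order: %.3fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("(sum = %f)\n", sum);
    return 0;
}
```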
Cache Coherence Problem
- When there are multiple processors, problems arise:
- They may share variables, resulting in data inconsistency.
- Two processors writing to the same location results in two different cached copies
- A Write-Through policy can be used
- Every write is also updated in main memory
- Results in delays, as every write/read across processors could involve two main memory accesses
- A Write-Back policy is used nowadays, as it is more efficient
Cache States
- Cache has 2 control bits
- Invalid - there may be an address match, but the data is not valid (it must be fetched from main memory)
- Dirty - the cache line is valid and has been written to, but the latest values have not yet been updated in main memory
- A third, implicit state exists - Valid
- A valid cache entry exists and the line holds the same values as main memory