Multicore and cache coherency Flashcards
What are 2 types of coherency protocols?
Snooping coherence protocols
Directory-based coherence protocols
What is cache coherency?
How to keep the memory coherent across the different caches in a system.
How memory updates are propagated through the system.
What are some trends that occurred in uniprocessor design that motivated the use of multicore? (3)
Single-core designs became very complex, and more difficult to verify
The speed of light puts limits on wire length and on how far signals can travel in a cycle. Larger cores require signals to travel farther
Diminishing returns from ILP; it is difficult to extract more ILP from a single thread
What are some advantages of a multiprocessor design? (4)
Increased performance through a different type of parallelism (task-based, thread-based instead of ILP)
Multichip: Put multiple CPUs into the same machine.
Multicore: Put multiple cores on one chip
Can keep the design of one core quite simple/small, and instead replicate it across multiple cores. The cores must be connected, which does add some new complexity
How does the demand in server vs. desktop performance motivate multicore design?
More and more cloud computing, less need to have higher performance on personal computers.
Graphics performance is off-loaded to GPUs.
Servers are able to have a lot of TLP
How does technology issues motivate multicore designs?
Increasing the complexity of a single core gives more problems with power and cooling.
Having multiple cores, with lower frequency, allows us to keep the throughput while lowering the power.
What are the two types of memory hierarchy structures used in multicore systems?
Centralized memory
Distributed memory
What is centralized memory?
Uniform memory access (UMA) - only one physical memory
Each core has their own L1 cache
L2 caches are shared between pairs, or sets, of cores.
L3 and main memory are shared by all cores.
Constant latency between the memory layers: the same latency between L1 and L2, between L2 and L3, and so on.
What are a pro and con with having constant latencies between memory layers in centralized memory?
Pro:
Every load and store takes the same amount of time, so the programmer does not need to optimize for memory placement.
Con:
All accesses to main memory travel on the bus between L3 and main memory. This is more difficult to scale, as all traffic goes through one point
What is distributed memory?
Each core has its own L1.
Two cores, or sets of cores, share an L2
The L2 caches are connected to a network that distributes accesses to different banks of L3, and possibly to a divided set of main-memory pools
Can have both non-uniform (NUMA) and uniform (UMA) memory access:
Depending on which memory bank is assigned to your core, the latency can vary based on how close or far away it is. There are also many memory controllers, and the distance between these and the cores can vary.
What is a pro and con with distributed memory?
Pro:
Distributed accesses, less congestion on an individual resource.
Better scaling by using physically separate memories.
Con:
Network becomes more complex.
What type of address spaces does distributed memory have?
Can have either a shared address space or several separate address spaces.
Shared: supports both the shared-memory and the message-passing programming model
Separate: supports only the message-passing programming model, even for multiple cores on the same die. Easier to scale across devices (separate servers, cloud, etc.)
What programming models work for shared memory?
pthreads, OpenMP
Data exchange between threads happens via memory (loads, stores, atomics). In-memory synchronization primitives are used to coordinate access (locks, semaphores, etc.)
The main thing to take care of is synchronizing the threads.
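A minimal sketch of the shared-memory model using Python's `threading` module as a stand-in for pthreads/OpenMP: the threads exchange data through a shared variable, and a lock is the in-memory synchronization primitive that coordinates access.

```python
import threading

counter = 0                  # shared data: all threads see the same memory
lock = threading.Lock()      # in-memory synchronization primitive

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # coordinate access so increments don't race
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: without the lock, some updates could be lost
```

Without the lock, two threads can read the same old value of `counter` and one increment is lost; the lock serializes the read-modify-write.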
What programming models work for distributed memory?
MPI
Common in supercomputers, or systems with multiple servers.
Cannot access data of another core.
Data is exchanged via messages (send, recv)
Synchronization primitives are implemented via messages (e.g., barriers)
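A toy sketch of the message-passing model, using queues as send/recv channels between two "ranks" (real MPI uses `MPI_Send`/`MPI_Recv` between separate processes; the rank names and the doubling computation here are invented for illustration):

```python
from queue import Queue
from threading import Thread

# Each "rank" owns its own data and communicates only via send/recv
# channels (modeled here as queues).
to_rank1 = Queue()
to_rank0 = Queue()
results = []

def rank0():
    local = 21               # private data; rank 1 cannot read it directly
    to_rank1.put(local)      # send(local) to rank 1
    results.append(to_rank0.get())  # recv the reply

def rank1():
    value = to_rank1.get()   # recv from rank 0
    to_rank0.put(value * 2)  # send back a computed result

t0, t1 = Thread(target=rank0), Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # [42]
```

The key contrast with shared memory: rank 1 never touches rank 0's `local` variable directly; all data moves through explicit messages.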
What is the (bus) snooping cache coherency protocol?
Each cache maintains local status
All caches monitor a broadcast medium, an interconnect network that is seen by all the caches.
There are write-invalidate and write-update variants. Each cache keeps track of all updates happening in the system
What is the directory based coherency protocol?
Status is stored in a shared location (centralized or distributed). A directory tracks, for each cache block, who is sharing it.
The directory keeps track of which caches have which data, and makes sure the relevant locations are updated.
More complicated: need to handle deadlock, livelock, starvation, and consistency models, and make sure messages are sent and received in the correct order.
Describe the write invalidate protocol of snooping.
Writes require exclusive access.
Let's say we have 4 processors. Processor 1 wants to read and gets a cache miss. The read miss is broadcast to all other caches.
As no other cache has this data, the request goes to main memory, which responds to the request.
Processor 1 now has a local copy of the data and marks it 'shared'.
If processor 2 wants to read the data, it sends the request out on the interconnect. The data can now either be read from main memory, or a slightly more advanced protocol will see that another processor already has this data and fetch it from there instead.
If processor 1 wants to write to this address, which is now shared between multiple processors, an invalidate signal is sent over the interconnect. The other processors check whether they have a local copy of the data and, if so, invalidate it.
Processor 1 then marks the data as 'modified'. 'Modified' means that when this cache line is evicted, it must be written back to memory.
Now processor 2 wants to read the modified data. It first sends a read request on the interconnect. Processor 1, which has the modified version of the data, sends an abort signal, signalling that it holds a modified copy. The memory access is aborted, processor 1 writes the value back to main memory, and the data is then supplied to processor 2; both processors label the line 'shared'.
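The trace above can be sketched as a toy two-cache model (one cache line, states 'I'/'S'/'M'); this is a simplified illustration of the write-invalidate idea, not a full protocol implementation:

```python
# Toy write-invalidate model: two caches, one line.
# 'I' = invalid, 'S' = shared, 'M' = modified.
state = {1: 'I', 2: 'I'}

def read(p):
    if state[p] == 'I':                     # read miss goes on the bus
        for q in state:
            if q != p and state[q] == 'M':  # a modified copy exists:
                state[q] = 'S'              # owner writes back, becomes shared
        state[p] = 'S'

def write(p):
    for q in state:
        if q != p:
            state[q] = 'I'                  # invalidate all other copies
    state[p] = 'M'

read(1);  assert state == {1: 'S', 2: 'I'}  # P1 read miss -> shared
read(2);  assert state == {1: 'S', 2: 'S'}  # P2 reads, both shared
write(1); assert state == {1: 'M', 2: 'I'}  # P1 writes, P2 invalidated
read(2);  assert state == {1: 'S', 2: 'S'}  # P1 writes back, both shared
```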
Describe the Write update (write broadcast) snooping protocol
Often combined with write-through.
On a write, broadcast the modified data on the interconnect; the caches that share this cache line read the data and update their local copies.
Memory is also updated because we write through cache changes.
Compare the invalidate and update snooping protocols.
Update:
Simpler: don’t need to keep track of different copies of the data, as all copies are always updated.
However, every write sends the data out on the interconnect, which generates a lot of communication traffic.
Invalidate:
Only the first write to a shared line sends out an invalidate signal; subsequent writes hit locally. Saves bus and memory traffic
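The traffic difference can be made concrete with a small count of bus messages, assuming one core writes the same shared line repeatedly (a simplified model that ignores reads and write-through memory traffic):

```python
def traffic(protocol, n_writes):
    """Bus messages for n_writes by one core to a line shared with another core."""
    messages, other_has_copy = 0, True
    for _ in range(n_writes):
        if protocol == 'update':
            messages += 1            # broadcast the new data on every write
        elif other_has_copy:
            messages += 1            # first write: invalidate the other copy
            other_has_copy = False   # later writes hit the now-exclusive line
    return messages

print(traffic('update', 100), traffic('invalidate', 100))  # 100 1
```

This is why invalidate wins when one core writes a line many times, while update can win when other cores keep reading the freshly written data.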
When implementing a cache coherency policy, what is added to the cache lines?
From earlier: tag, data, and valid bit
A number of status bits that encode which coherence state the cache line is in.
What is the MSI protocol?
Protocol for indicating what status a cache block is in.
M: Modified (this core has written to this block, all other copies must be invalidated).
S: Shared (multiple cores can have the same cache block in the shared state)
I: Invalid
Every cache block has its own state machine
Describe the Snoopy protocol
Invalidation protocol
Write-back cache
Each block is in one state
Shared (read only): clean in all caches, up to date in memory, block can be read
Modified: A single cache has the only copy, it’s writable, and dirty
Invalid: Block contains no data
How does the state machine for an MSI work, for CPU requests?
Initial: Invalid
Invalid -> Read miss: invalid -> shared, read miss on bus
Invalid -> Write: invalid -> modified, write miss on bus
Shared -> read hit: the CPU is free to read this data as much as it wants. A read hit to this block keeps it in the same state.
Shared -> read miss: happens if the cache is full and we need to evict a line. The shared block is evicted, a new one is fetched, and that block is in the shared state
Modified -> read/write hit: can just read/write the value, as this is the only valid copy.
Modified -> write miss: write back the current cache block, evict it, and fetch a new one, which will be in the 'modified' state.
Modified -> read miss: write back the block, place a read miss on the bus, go into the shared state.
Shared -> write: go to modified; place an invalidate on the bus so the other cores invalidate their copies
How does the state machine for an MSI work, for bus requests?
Modified -> bus read miss: Write back block, abort memory access, shared
Modified -> bus write miss: someone else wants to write this block. This cache must write the block back and abort the other core's memory access (supplying the up-to-date data). Since the other core will modify the data, this cache's copy goes to the invalid state.
Shared -> Bus write miss: Data is written to by someone else, meaning we no longer have the correct data -> Invalid
Shared -> bus read miss: stay in shared
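The two cards above can be collected into a single transition table, encoded here as a Python dict mapping (state, event) to (next state, bus action). This is a toy encoding of the per-block state machine described above, not a full simulator:

```python
# MSI transitions: (state, event) -> (next_state, action)
msi = {
    # CPU-side events
    ('I', 'cpu_read'):       ('S', 'place read miss on bus'),
    ('I', 'cpu_write'):      ('M', 'place write miss on bus'),
    ('S', 'cpu_read_hit'):   ('S', 'none'),
    ('S', 'cpu_read_miss'):  ('S', 'evict; place read miss on bus'),
    ('S', 'cpu_write'):      ('M', 'invalidate other copies'),
    ('M', 'cpu_read_hit'):   ('M', 'none'),
    ('M', 'cpu_write_hit'):  ('M', 'none'),
    ('M', 'cpu_read_miss'):  ('S', 'write back; place read miss on bus'),
    ('M', 'cpu_write_miss'): ('M', 'write back; place write miss on bus'),
    # Bus-side (snooped) events
    ('M', 'bus_read_miss'):  ('S', 'write back; abort memory access'),
    ('M', 'bus_write_miss'): ('I', 'write back; abort memory access'),
    ('S', 'bus_read_miss'):  ('S', 'none'),
    ('S', 'bus_write_miss'): ('I', 'none'),
}

# Walk one block through a short event trace:
state = 'I'
for event in ['cpu_read', 'cpu_write', 'bus_read_miss']:
    state, action = msi[(state, event)]
print(state)  # 'S': I -read-> S -write-> M -snooped read-> S
```

Looking transitions up in a table like this is also how one might implement the state machine in a cache simulator: one table, one current state per block.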