Multicore and cache coherency Flashcards
What are 2 types of coherency protocols?
Snooping coherence protocols
Directory-based coherence protocols
What is cache coherency?
How to keep the memory coherent across the different caches in a system.
How memory updates are propagated through the system.
What are some trends that occurred in uniprocessor design that motivated the use of multicore? (3)
Single-core designs became very complex, and more difficult to verify
The speed of light puts limits on wire length and on how far signals can travel in a cycle. Larger cores require signals to travel farther
Diminishing returns from ILP; it is difficult to extract more ILP from a single thread
What are some advantages of a multiprocessor design? (4)
Increased performance through a different type of parallelism (task-based, thread-based instead of ILP)
Multichip: Put multiple CPUs into the same machine.
Multicore: Put multiple cores on one chip
Can keep the design of one core quite simple/small, and instead replicate it across multiple cores. The cores must be connected, which does add some new complexity
How does the demand in server vs. desktop performance motivate multicore design?
More and more cloud computing, less need to have higher performance on personal computers.
Graphics performance is off-loaded to GPUs.
Servers are able to have a lot of TLP
How does technology issues motivate multicore designs?
Increasing the complexity of a single core gives more problems with power and cooling.
Having multiple cores, with lower frequency, allows us to keep the throughput while lowering the power.
What are the two types of memory hierarchy structures used in multicore systems?
Centralized memory
Distributed memory
What is centralized memory?
Uniform memory access (UMA) - only one physical memory
Each core has their own L1 cache
L2 caches are shared between pairs, or sets, of cores.
L3 and main memory are shared by all cores.
Constant latency between the memory layers: the same latency between L1 and L2, between L2 and L3, and so on.
What are a pro and con with having constant latencies between memory layers in centralized memory?
Pro:
Every load and store takes the same amount of time, so the programmer does not need to optimize for memory placement.
Con:
All accesses to main memory travel on the bus between L3 and main memory. This is more difficult to scale, as all traffic goes through one point
What is distributed memory?
Each core has its own L1.
Two cores, or sets of cores, share an L2
The L2 caches are connected to a network that distributes accesses to different banks of L3, and possibly to a divided set of main-memory pools
Can have both non-uniform (NUMA) and uniform (UMA) memory access:
Depending on which memory bank is assigned to your core, the latency can vary based on how close or far away it is. There are also many memory controllers, and the distance between these and the cores can vary.
What is a pro and con with distributed memory?
Pro:
Distributed accesses, less congestion on an individual resource.
Better scaling by using physically separate memories.
Con:
Network becomes more complex.
What type of address spaces does distributed memory have?
Can have either a shared address space or several separate address spaces.
Shared: supports both the shared-memory and the message-passing programming model
Separate: supports only the message-passing programming model, even for multiple cores on the same die. Easier to scale across devices (separate servers, cloud, etc.)
What programming models work for shared memory?
pthreads, OpenMP
Data exchange between threads happens via memory (loads, stores, atomics). In-memory synchronization primitives are used to coordinate access (locks, semaphores, etc.)
The main thing to take care of is synchronizing the threads.
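A minimal sketch of the shared-memory model using Python's `threading` module as a stand-in for pthreads/OpenMP: the threads exchange data through a shared variable, and a lock is the in-memory synchronization primitive that coordinates access.

```python
import threading

counter = 0                  # shared data: all threads see the same memory
lock = threading.Lock()      # in-memory synchronization primitive

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # coordinate access so increments don't race
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: without the lock, some updates could be lost
```

Without the lock, two threads can read the same old value of `counter` and one increment is lost; the lock serializes the read-modify-write.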
What programming models work for distributed memory?
MPI
Common in supercomputers, or systems with multiple servers.
Cannot access data of another core.
Data is exchanged via messages (send, recv)
Synchronization primitives are implemented via messages (e.g., barriers)
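A toy sketch of the message-passing model, using queues as send/recv channels between two "ranks" (real MPI uses `MPI_Send`/`MPI_Recv` between separate processes; the rank names and the doubling computation here are invented for illustration):

```python
from queue import Queue
from threading import Thread

# Each "rank" owns its own data and communicates only via send/recv
# channels (modeled here as queues).
to_rank1 = Queue()
to_rank0 = Queue()
results = []

def rank0():
    local = 21               # private data; rank 1 cannot read it directly
    to_rank1.put(local)      # send(local) to rank 1
    results.append(to_rank0.get())  # recv the reply

def rank1():
    value = to_rank1.get()   # recv from rank 0
    to_rank0.put(value * 2)  # send back a computed result

t0, t1 = Thread(target=rank0), Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # [42]
```

The key contrast with shared memory: rank 1 never touches rank 0's `local` variable directly; all data moves through explicit messages.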
What is the (bus) snooping cache coherency protocol?
Each cache maintains local status
All caches monitor a broadcast medium, an interconnect network that is seen by all the caches.
There are write-invalidate and write-update variants. Each cache keeps track of all updates happening in the system
What is the directory based coherency protocol?
Status is stored in a shared location (centralized or distributed). A directory tracks, for each cache block, who is sharing it.
The directory keeps track of which caches have which data, and makes sure the relevant locations are updated.
More complicated: need to handle deadlock, livelock, starvation, and consistency models, and make sure messages are sent and received in the correct order.
Describe the write invalidate protocol of snooping.
Writes require exclusive access.
Let's say we have 4 processors. Processor 1 wants to read and gets a cache miss. The read miss is broadcast to all other caches.
As no other cache has this data, the request goes to main memory, which responds to the request.
Processor 1 now has a local copy of the data and marks it 'shared'.
If processor 2 wants to read the data, it sends the request out on the interconnect. The data can now either be read from main memory, or a slightly more advanced protocol will see that another processor already has this data and fetch it from there instead.
If processor 1 wants to write to this address, which is now shared between multiple processors, an invalidate signal is sent over the interconnect. The other processors check whether they have a local copy of the data and, if so, invalidate it.
Processor 1 then marks the data as 'modified'. 'Modified' means that when this cache line is evicted, it must be written back to memory.
Now processor 2 wants to read the modified data. It first sends a read request on the interconnect. Processor 1, which has the modified version of the data, sends an abort signal, signalling that it holds a modified copy. The memory access is aborted, processor 1 writes the value back to main memory, and the data is then supplied to processor 2; both processors label the line 'shared'.
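The trace above can be sketched as a toy two-cache model (one cache line, states 'I'/'S'/'M'); this is a simplified illustration of the write-invalidate idea, not a full protocol implementation:

```python
# Toy write-invalidate model: two caches, one line.
# 'I' = invalid, 'S' = shared, 'M' = modified.
state = {1: 'I', 2: 'I'}

def read(p):
    if state[p] == 'I':                     # read miss goes on the bus
        for q in state:
            if q != p and state[q] == 'M':  # a modified copy exists:
                state[q] = 'S'              # owner writes back, becomes shared
        state[p] = 'S'

def write(p):
    for q in state:
        if q != p:
            state[q] = 'I'                  # invalidate all other copies
    state[p] = 'M'

read(1);  assert state == {1: 'S', 2: 'I'}  # P1 read miss -> shared
read(2);  assert state == {1: 'S', 2: 'S'}  # P2 reads, both shared
write(1); assert state == {1: 'M', 2: 'I'}  # P1 writes, P2 invalidated
read(2);  assert state == {1: 'S', 2: 'S'}  # P1 writes back, both shared
```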
Describe the Write update (write broadcast) snooping protocol
Often combined with write-through.
On a write, broadcast the modified data on the interconnect; the caches that share this cache line read the data and update their local copies.
Memory is also updated because we write through cache changes.
Compare the invalidate and update snooping protocols.
Update:
Simpler: don’t need to keep track of different copies of the data, as all copies are always updated.
However, every write sends the data out on the interconnect, which generates a lot of communication traffic.
Invalidate:
Only the first write to a shared line sends out an invalidate signal; subsequent writes hit locally. Saves bus and memory traffic
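The traffic difference can be made concrete with a small count of bus messages, assuming one core writes the same shared line repeatedly (a simplified model that ignores reads and write-through memory traffic):

```python
def traffic(protocol, n_writes):
    """Bus messages for n_writes by one core to a line shared with another core."""
    messages, other_has_copy = 0, True
    for _ in range(n_writes):
        if protocol == 'update':
            messages += 1            # broadcast the new data on every write
        elif other_has_copy:
            messages += 1            # first write: invalidate the other copy
            other_has_copy = False   # later writes hit the now-exclusive line
    return messages

print(traffic('update', 100), traffic('invalidate', 100))  # 100 1
```

This is why invalidate wins when one core writes a line many times, while update can win when other cores keep reading the freshly written data.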
When implementing a cache coherency policy, what is added to the cache lines?
From earlier: tag, data, and valid bit
A number of status bits that encode which coherence state the cache line is in.
What is the MSI protocol?
Protocol for indicating what status a cache block is in.
M: Modified (this core has written to this block, all other copies must be invalidated).
S: Shared (multiple cores can have the same cache block in the shared state)
I: Invalid
Every cache block has its own state machine
Describe the Snoopy protocol
Invalidation protocol
Write-back cache
Each block is in one state
Shared (read only): clean in all caches, up to date in memory, block can be read
Modified: A single cache has the only copy, it’s writable, and dirty
Invalid: Block contains no data
How does the state machine for an MSI work, for CPU requests?
Initial: Invalid
Invalid -> Read miss: invalid -> shared, read miss on bus
Invalid -> Write: invalid -> modified, write miss on bus
Shared -> read hit: the CPU is free to read this data as much as it wants. A read hit to this block keeps it in the same state.
Shared -> read miss: happens if the cache is full and we need to evict a line. The shared block is evicted, a new one is fetched, and that block is in the shared state
Modified -> read/write hit: can just read/write the value, as this is the only valid copy.
Modified -> write miss: write back the current cache block, evict it, and fetch a new one, which will be in the 'modified' state.
Modified -> read miss: write back the block, place a read miss on the bus, go into the shared state.
Shared -> write: go to modified; place an invalidate on the bus so the other cores invalidate their copies
How does the state machine for an MSI work, for bus requests?
Modified -> bus read miss: Write back block, abort memory access, shared
Modified -> bus write miss: someone else wants to write this block. This cache must write the block back and abort the other core's memory access (supplying the up-to-date data). Since the other core will modify the data, this cache's copy goes to the invalid state.
Shared -> Bus write miss: Data is written to by someone else, meaning we no longer have the correct data -> Invalid
Shared -> bus read miss: stay in shared
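The two cards above can be collected into a single transition table, encoded here as a Python dict mapping (state, event) to (next state, bus action). This is a toy encoding of the per-block state machine described above, not a full simulator:

```python
# MSI transitions: (state, event) -> (next_state, action)
msi = {
    # CPU-side events
    ('I', 'cpu_read'):       ('S', 'place read miss on bus'),
    ('I', 'cpu_write'):      ('M', 'place write miss on bus'),
    ('S', 'cpu_read_hit'):   ('S', 'none'),
    ('S', 'cpu_read_miss'):  ('S', 'evict; place read miss on bus'),
    ('S', 'cpu_write'):      ('M', 'invalidate other copies'),
    ('M', 'cpu_read_hit'):   ('M', 'none'),
    ('M', 'cpu_write_hit'):  ('M', 'none'),
    ('M', 'cpu_read_miss'):  ('S', 'write back; place read miss on bus'),
    ('M', 'cpu_write_miss'): ('M', 'write back; place write miss on bus'),
    # Bus-side (snooped) events
    ('M', 'bus_read_miss'):  ('S', 'write back; abort memory access'),
    ('M', 'bus_write_miss'): ('I', 'write back; abort memory access'),
    ('S', 'bus_read_miss'):  ('S', 'none'),
    ('S', 'bus_write_miss'): ('I', 'none'),
}

# Walk one block through a short event trace:
state = 'I'
for event in ['cpu_read', 'cpu_write', 'bus_read_miss']:
    state, action = msi[(state, event)]
print(state)  # 'S': I -read-> S -write-> M -snooped read-> S
```

Looking transitions up in a table like this is also how one might implement the state machine in a cache simulator: one table, one current state per block.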