Parallel Systems: Cache Coherence, Synchronization, Communication Flashcards

(72 cards)

1
Q

What is the Shared Memory Machine Model?

A

Also called Dance Hall Architecture

• Each CPU has its own cache
• All CPUs sit on one side of an interconnection network, with the memory modules on the other side

2
Q

What is the Symmetric Multiprocessor (SMP) Model?

A

• Each CPU has its own cache, but main memory is shared by all CPUs
• "Symmetric" because every CPU sees the same access time to shared memory

3
Q

What is the Distributed Shared Memory Model?

A

Each CPU has its own cache and local memory, but can access other CPUs’ memory via the network

4
Q

How long does it take the SMP model to access main memory and cache?

A
• 100+ cycles to get to main memory
• ~2 cycles to access cache

5
Q

What happens on update to a shared cache entry?

A
• Caches snoop on the bus (check for updates to shared entries)
• On a write, either invalidate all other cached copies OR update them all

6
Q

What is the memory consistency model?

A
• Guarantees the order in which memory accesses appear to execute
• Sequential consistency: program order within each thread + arbitrary interleaving across threads
7
Q

What is memory consistency?

A

What is the model presented to the programmer

8
Q

What is cache coherence?

A

How is the system implementing the model in the presence of private caches?

Shared address space + cache coherence

9
Q

What is NCC?

A

Non-Cache-Coherent

• Shared address space, but hardware does not keep the private caches coherent

10
Q

What are the operations associated with hardware cache coherence?

A
• Write-invalidate: invalidates the cache entry in other caches if present
• Write-update: sends the updated value to update other caches’ entries

11
Q

What is the issue with scalability?

A
• More processors don’t necessarily increase performance
• Increased overhead

Can still exploit parallelism, though

12
Q

What are synchronization primitives for shared memory programming?

A
• Locks (exclusive and shared)
• Barriers

13
Q

What are atomic operations?

A
• Operations guaranteed to execute without interruption (the instruction appears indivisible)
14
Q

What are examples of atomic operations?

A
• Test-and-set
• Fetch-and-increment
• Fetch_and_Φ (Φ can be anything)

Basically read-modify-write (RMW) operations
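As a minimal sketch (assuming C11 atomics; the function names here are illustrative, not from the lecture), two of these primitives can be expressed as:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Test-and-set: atomically set the flag and return its old value.
       Returns false exactly once per "unset" flag, so it can build a lock. */
    bool test_and_set(atomic_bool *flag) {
        return atomic_exchange(flag, true);
    }

    /* Fetch-and-increment: atomically add 1 and return the old value. */
    int fetch_and_increment(atomic_int *counter) {
        return atomic_fetch_add(counter, 1);
    }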

15
Q

Are atomic operations sufficient to ensure synchronization?

A
• No. Atomicity must be guaranteed system-wide
• Don’t look at the cache; go straight to memory

16
Q

What are the issues with scalability with synchronization?

A
  • Latency
  • Waiting Time (application problem)
  • Contention
17
Q

What are the goals of spin locks?

A
• Exploit the cache
• Avoid disrupting actual workers
• Avoid contention
18
Q

What are spin locks?

A
• Spin lock resides in the cache
• Waiting processes spin on a cached copy of the lock (see the sketch below)
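A minimal C11 sketch of this spin-on-read idea (names are illustrative): the waiter re-reads its cached copy of the lock word and only issues the atomic exchange when the lock looks free:

    #include <stdatomic.h>
    #include <stdbool.h>

    void spin_lock(atomic_bool *lock) {
        while (atomic_exchange(lock, true)) {   /* try to grab the lock */
            while (atomic_load(lock))           /* spin on the cached copy; */
                ;                               /* no bus traffic until release */
        }
    }

    void spin_unlock(atomic_bool *lock) {
        atomic_store(lock, false);              /* invalidates waiters' copies */
    }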

19
Q

What are the issues with spin locks?

A
• Contention on lock release
• Takes O(N²) bus operations to invalidate and update all waiters’ copies

20
Q

What are spin locks with delay?

A
• Delay after lock release
• Delays should be staggered
• Delay with exponential back-off (sketch below)
• Don’t need cache coherence for this to work
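A sketch of the exponential back-off variant under the same C11 assumptions (the delay constants are arbitrary); because each waiter delays privately instead of spinning on a shared copy, this also works without cache coherence:

    #include <stdatomic.h>
    #include <stdbool.h>

    void spin_lock_backoff(atomic_bool *lock) {
        unsigned delay = 16;                       /* initial delay, arbitrary */
        while (atomic_exchange(lock, true)) {      /* failed attempt */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                  /* private busy-wait */
            if (delay < 1u << 16)
                delay *= 2;                        /* exponential back-off, capped */
        }
    }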
21
Q

What is ticket lock?

A
• Similar to a deli counter: take a number and wait to be called
• The lock holds the ticket number of the process it is currently serving
• Other processes have to wait until their number is called (sketch below)
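A minimal C11 sketch of the idea (field names are illustrative):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* next number the "deli counter" hands out */
        atomic_uint now_serving;   /* number currently being served */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1);  /* take a ticket */
        while (atomic_load(&l->now_serving) != me)
            ;                                                /* wait for your number */
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add(&l->now_serving, 1);                /* call the next number */
    }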
22
Q

What are the issues with ticket lock?

A
• Contention can happen, since all waiters read the same location from memory
• An invalidate-based protocol will cause contention on every release

23
Q

What are array based queueing locks?

A
• Circular queue (array)
• Each slot holds one of two values: has_lock or must_wait
• Array size N = number of processors
• queue_last pointer = pointer to the last position in the queue
• Each process remembers where it was placed in the queue
• Processes spin on their own slot until it says has_lock
• On unlock, set your own slot to must_wait and set your neighbor’s slot to has_lock (sketch below)
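A sketch in the same C11 style (illustrative names; per-slot padding to avoid false sharing is omitted, and slot 0 must start as has_lock):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 64   /* array size = number of processors */

    typedef struct {
        atomic_bool must_wait[N];   /* true = must_wait, false = has_lock;   */
                                    /* initialize slot 0 to false, rest true */
        atomic_uint queue_last;     /* index of the last position in queue   */
    } array_lock_t;

    unsigned array_acquire(array_lock_t *l) {
        unsigned slot = atomic_fetch_add(&l->queue_last, 1) % N;  /* take a slot */
        while (atomic_load(&l->must_wait[slot]))
            ;                                     /* spin only on your own slot */
        return slot;                              /* caller keeps this for unlock */
    }

    void array_release(array_lock_t *l, unsigned slot) {
        atomic_store(&l->must_wait[slot], true);             /* reset own slot */
        atomic_store(&l->must_wait[(slot + 1) % N], false);  /* neighbor gets lock */
    }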
24
Q

What is false sharing?

A
• A cache block may hold multiple variables
• A variable appears shared purely because of the cache layout (how data is stored in the cache)

25
Q

What is gang scheduling?

A

OS schedules multiple threads of the same process concurrently

26
Q

Why use spin locks rather than blocking waiting threads?

A

• Critical sections should be small
• Letting waiting threads spin rather than block avoids context-switching overhead

27
Q

What is a linked-list-based queueing lock?

A

• L (the lock) stores the last_requestor in the queue
• The lock points to the current process holding it
• The next process in line spins (on its own node) until the lock is passed to it

28
Q

What is the issue with linked-list-based queueing locks?

A

Maintaining the linked list itself needs mutual exclusion in order to work

29
Q

What does a linked-list-based queueing lock do on unlock?

A

• Remove the current process from the queue
• Must check whether a new request is being formed, by checking who is next (see the sketch below)
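A C11 sketch of such a lock (this is essentially the MCS lock; type and field names are illustrative). The compare-and-swap in release is what checks whether a new request is being formed:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        _Atomic(struct qnode *) next;   /* who is behind me in the queue */
        atomic_bool locked;             /* private variable I spin on */
    } qnode_t;

    typedef _Atomic(qnode_t *) list_lock_t;   /* points to the last requestor */

    void list_lock_acquire(list_lock_t *lock, qnode_t *me) {
        atomic_store(&me->next, NULL);
        qnode_t *prev = atomic_exchange(lock, me);  /* join the tail (one RMW) */
        if (prev != NULL) {                         /* lock is held */
            atomic_store(&me->locked, true);
            atomic_store(&prev->next, me);          /* link in behind the holder */
            while (atomic_load(&me->locked))
                ;                                   /* spin on private variable */
        }
    }

    void list_lock_release(list_lock_t *lock, qnode_t *me) {
        if (atomic_load(&me->next) == NULL) {       /* nobody visibly behind me */
            qnode_t *expected = me;
            /* If the lock still points at me, no new request exists: done. */
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;
            /* Otherwise a new request is being formed: wait for it to link in. */
            while (atomic_load(&me->next) == NULL)
                ;
        }
        qnode_t *succ = atomic_load(&me->next);
        atomic_store(&succ->locked, false);         /* signal only the next waiter */
    }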
30
Q

What are the benefits of a linked-list-based queueing lock?

A

• Fair
• Spins on a private variable
• Only one processor is signaled when the lock is released
• Only one RMW primitive per critical-section execution
• Space complexity proportional to the number of sharers

31
Q

What are the drawbacks of a linked-list-based queueing lock?

A

• Latency to get the lock could be higher due to linked-list maintenance overhead
• Needs RMW support on the architecture

32
Q

What are barrier algorithms?

A

Threads do work until they reach a meeting point, where they wait until everyone arrives

33
Q

What is a counting barrier?

A

• As each process reaches the barrier, it decrements a counter
• If you are not last, spin on the counter
• If you are the last process, reset the counter and release all processes to continue (sketch below)
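A C11 sketch with the reset-wait fix described in the next card (N is an example value; even so, reuse in a tight loop can race, which is what the sense-reversing barrier removes):

    #include <stdatomic.h>

    #define N 4   /* number of participating processes (example value) */

    atomic_int count = N;

    void counting_barrier(void) {
        if (atomic_fetch_sub(&count, 1) == 1) {  /* I was the last arriver */
            atomic_store(&count, N);             /* reset and release everyone */
        } else {
            while (atomic_load(&count) != N)     /* not last: spin until reset */
                ;
        }
    }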
34
Q

What are issues with counting barriers?

A

• The counter may not be reset before fast processes reach the next barrier
• Fix: add a second check (spin on count == 0, then wait until count is reset to N)

35
Q

What is a sense-reversing barrier?

A

Two shared variables:
• sense: boolean that flips when all processes reach the barrier
• count

All except the last process: decrement the counter and spin on sense
Last process: reset count to N and flip the sense variable (sketch below)
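A C11 sketch (illustrative names; my_sense is each thread's private sense, flipped on every barrier episode):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 4   /* number of participating processes (example value) */

    atomic_int count = N;
    atomic_bool sense = false;   /* flips once per barrier episode */

    void sense_barrier(bool *my_sense) {
        *my_sense = !*my_sense;                  /* the sense this episode ends on */
        if (atomic_fetch_sub(&count, 1) == 1) {  /* last process to arrive */
            atomic_store(&count, N);             /* reset count first... */
            atomic_store(&sense, *my_sense);     /* ...then flip sense: release all */
        } else {
            while (atomic_load(&sense) != *my_sense)
                ;                                /* spin on the sense variable */
        }
    }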
36
Q

What is a tree barrier?

A

• Processors are leaves of a tree hierarchy
• Two phases: arrival phase and wake-up phase

Arrival phase:
• At each branch point, when the count of arrived children reaches 0, decrement the count of the parent
• Keep going until you reach the root

Wake-up phase:
• Reset the senses and counters back down the tree

37
Q

What are the problems with tree barriers?

A

• The spin memory location cannot be statically determined
• The arity of the tree determines how many processors read from the same variable (contention)
• Spinning on remote memory if on an NCC machine
• May cause contention on the network if on an NCC NUMA machine

38
Q

How does the MCS Tree Barrier (4-ary arrival tree) work?

A

• Each parent has a child vector that marks how many children it has and who they are
• The child vector indicates the status of each child
• Each processor has one statically assigned spin location
• Parents notify children to wake up
• The wake-up phase starts when all children have been marked as arrived in the arrival tree

39
Q

What are the benefits of the MCS Tree Barrier?

A

• Doesn’t need an RMW instruction (only one processor reads/writes each location)
• Doesn’t need mutual exclusion
• Takes O(n) space
• Takes O(log n) network transactions
• Works on NCC and CC NUMA machines

40
Q

How does the MCS Tree Barrier work on a NUMA machine?

A

• Memory is accessible to everyone
• Remote memory is accessible via the network
• Every CPU has associated local memory

41
Q

How does the MCS Tree Barrier work on an NCC machine?

A

The cache only caches data from local memory

42
Q

How does the Tournament Barrier work?

A

• N players play log2(N) rounds
• Works with message passing
• Rounds are rigged: the winning processor of each match is pre-selected
• On a shared memory multiprocessor, the winning processor can spin locally
• The spin location is kept static

43
Q

What are the virtues of the Tournament Barrier?

A

• Static determination of the spin location
• No need for fetch-and-Φ
• Takes O(n) space
• Can exploit parallelism if an ICN (interconnection network) is available
• Also works on a cluster machine

44
Q

What are the drawbacks of the Tournament Barrier?

A

Does not exploit spatial locality

45
Q

How does the Dissemination Barrier work?

A

• Gossip-style protocol over ceil(log2(N)) rounds
• In round k, processor i sends a message to processor (i + 2^k) mod N and waits for a message from (i - 2^k) mod N
• O(N) communication events per round
• A round ends when you have sent your message and received the one you expect (sketch below)
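A single-episode C11 sketch of the round structure over shared flags (illustrative; the real algorithm adds sense reversal so the flags can be reused across barrier episodes):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N      8   /* number of processors (need not be a power of 2) */
    #define ROUNDS 3   /* ceil(log2(N)) */

    atomic_bool arrived[ROUNDS][N];   /* arrived[k][i]: i got its round-k message */

    void dissemination_barrier(int me) {
        for (int k = 0; k < ROUNDS; k++) {
            int partner = (me + (1 << k)) % N;
            atomic_store(&arrived[k][partner], true);  /* "send" to (me + 2^k) mod N */
            while (!atomic_load(&arrived[k][me]))
                ;                                      /* "receive" from (me - 2^k) mod N */
        }
    }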
46
Q

What are the virtues of the Dissemination Barrier?

A

• No hierarchy (processors can communicate in parallel)
• No pairwise synchronization required
• Each node is independent
• Total communication: O(N log2 N)
• The amount of communication in each round is constant and equal to N
• Works on NCC, MP, and SM machines

47
Q

What is the trade-off when it comes to cross-domain communication?

A

Safety vs. performance

48
Q

When are the procedures serving a call determined?

A

At run time

49
Q

Where do RPCs do their work?

A

At run time, in the kernel

50
Q

What work is involved in an RPC?

A

• 2 context switches (client and server address-space switches in the kernel)
• 2 traps
• 4 copies being made from client to server and back

51
Q

How do RPC calls work in parallel systems?

A

• The kernel has no clue about the semantics of the RPC call
• Client and server facilitate the call via a client stub and a server stub

52
Q

What is the flow for client-to-server process communication?

A

• Client writes its arguments into the client stub as an RPC message
• The RPC message is copied into a kernel buffer
• The kernel copies the message into the server stub in the server domain
• The server unpacks the message and delivers it to the server procedure

Total of 4 possible copies
53
Q

How can the client-to-server RPC be made more efficient? (How can we reduce from 4 total copies?)

A

• Bind the client to the server through a name server (a one-time cost)
• The name server sits above the kernel and is accessible by all processes
• The kernel can import the entry point for the server code into the client for ease of access

54
Q

When is the kernel needed in the more efficient implementation of client-to-server RPC?

A

• On import to the client
• When the entry point is set up

55
Q

How does the client-to-server RPC call bind the client to the server?

A

• Using the A-stack in a common communication area
• The client stub does most of the work

56
Q

What is the cost of changing protection domains?

A

• Implicit cost (losing locality)
• Context switching

57
Q

How does the binding object work in an RPC call?

A

• The client calls the server code, which issues a call trap to the kernel
• The binding object is forwarded through the PD (procedure descriptor) to the server
• Arguments are passed into the A-stack by the client
• The server extracts the arguments from the A-stack and puts them on its execution stack
• The result from the server is passed back through the A-stack to the client

Number of copies reduced to 2

58
Q

What are the processes of copying data to and from the A-stack called?

A

Marshaling and unmarshaling

59
Q

Is it possible to reduce the number of copies in the RPC call to one?

A

Yes. Using Modula-3, a pointer to the arguments can be passed in place of the A-stack copy

60
Q

What are areas that can be optimized further in an RPC call?

A

• Thread switching from client to server (can run the client thread in the server domain to save a context switch)
• Can use Liedtke’s domain-packing trick to avoid the implicit cost
61
Q

What is unavoidable when it comes to RPC calls?

A

• Client/server traps
• Switching domains
• Loss of locality

62
Q

How does RPC work on SMP machines?

A

Exploits multiple processors by preloading the server domain onto a processor and keeping its caches warm

63
Q

What is cache affinity scheduling?

A

• Threads prefer to run on processors that have run them before, to take advantage of existing cache entries
• Other threads can run on that processor in the meantime, but will pollute the cache

64
Q

What are the different scheduling policies?

A

• First come, first served: emphasizes fairness, doesn’t care about cache affinity
• Fixed processor: a thread always runs on a set processor
• Last processor: a thread runs on the processor it last ran on
• Minimum intervening: a thread runs on the processor with the fewest intervening threads
• Minimum intervening plus queue: a thread runs on the processor with the minimum of waiting plus intervening threads

65
Q

What scheduling policies are good for light loads?

A

Minimum intervening, minimum intervening plus queue

66
Q

What scheduling policy is good for heavy loads?

A

Fixed processor

67
Q

What is the drawback of the first come, first served scheduling policy?

A

High variance in response time

68
Q

What are implementation issues when it comes to scheduling?

A

• A global queue doesn’t scale well
• Affinity-based local queues: occupancy is policy-specific, and processors can steal work from other queues
• A thread’s priority plays a role in determining its position in the queue

69
Q

How do we calculate thread priority?

A

Priority_i = BasePriority_i + Age_i + Affinity_i
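For example (with hypothetical values): a thread whose base priority BasePriority_i is 4, that has aged 3 ticks in the queue (Age_i = 3), and that has an affinity bonus of 2 on this processor (Affinity_i = 2) gets Priority_i = 4 + 3 + 2 = 9.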
70
Q

What are performance metrics for measuring scheduling?

A

• Throughput (system-centric)
• Response time (user-centric)
• Variance (user-centric)
• Cache reload time vs. memory footprint

71
Q

How do you choose which scheduler to use?

A

It will depend on the architecture and the workload

72
Q

How can schedulers be optimized?

A

Inserting an idle loop into processor scheduling can help increase affinity (like mutual exclusion locks with delay)