Final Flashcards

1
Q

How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?

A

An OS scheduler is responsible for selecting a task (process or thread) and having it run on a CPU. Scheduling occurs when the CPU becomes idle and the scheduler chooses a task from the ready queue. How a task is selected depends on the scheduling policy/algorithm (e.g., FIFO, priority, time slicing). The run queue is the data structure that implements the scheduling mechanism, and it is tightly coupled with the scheduling algorithm.

2
Q

What are the overheads associated with scheduling?

A

The time spent running the scheduling algorithm and context switching to the selected task: cycles during which no useful application work is done.

3
Q

Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?

A

CPU-bound tasks benefit the most from long timeslices: this limits the context-switching overhead and keeps CPU utilization and throughput high. I/O-bound tasks prefer shorter timeslices: this keeps both CPU and device utilization high, and in most cases an I/O-bound task will yield for an I/O operation before its timeslice expires anyway. Shorter timeslices also give the user the perception that the system is more responsive.

4
Q

Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…

A

Compute phases: pay attention to the timeslice. I/O phases: pay attention to the frequency of I/O operations.
Throughput = # tasks / total time to complete them all
Avg. completion time = sum over tasks of (completion time - arrival time) / # tasks
Avg. wait time = sum over tasks of (time from arrival until the task first runs) / # tasks
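As a rough illustration (not from the course materials), here is a minimal C sketch that computes these metrics for a hypothetical set of tasks that all arrive at t = 0 and run to completion one after another, in the order given:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical execution times in seconds, already ordered shortest-first
     * (the order SJF would choose); all tasks arrive at t = 0. */
    double exec[] = {1.0, 1.0, 10.0};
    int n = sizeof(exec) / sizeof(exec[0]);
    double t = 0.0, total_completion = 0.0, total_wait = 0.0;

    for (int i = 0; i < n; i++) {
        total_wait += t;          /* time task i spent waiting before running */
        t += exec[i];             /* task i completes at time t               */
        total_completion += t;
    }

    printf("throughput          = %.2f tasks/s\n", n / t);
    printf("avg completion time = %.2f s\n", total_completion / n);
    printf("avg wait time       = %.2f s\n", total_wait / n);
    return 0;
}
```

With these numbers it prints 0.25 tasks/s, 5s, and 1s, matching the SJF quiz later in the deck.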

5
Q

Do you understand the motivation behind the multi-level feedback queue, why different queues have different timeslices, how do threads move between these queues… Can you contrast this with the O(1) scheduler?

A

A multi-level feedback queue lets a single structure serve both CPU- and I/O-bound tasks without knowing the task type in advance. Since different timeslice values benefit CPU- and I/O-bound tasks to different degrees, the MLFQ has multiple queues, each with a different timeslice; this is the motivation for the structure. Tasks enter at the topmost (shortest-timeslice) queue: if a task yields before its timeslice expires, it stays at that level; if it uses up its timeslice, it gets pushed down to a lower level. Tasks get a priority boost when they release the CPU for I/O. The O(1) scheduler is similar in that it associates timeslice values with priority levels. The biggest differences are that it relies on two arrays of queues (active and expired) and that its feedback is based on sleep time (time spent waiting/idling): longer sleep implies an I/O-intensive task and earns a priority boost (-5), while little sleep implies a CPU-intensive task and lowers the priority (+5).
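A minimal sketch (illustrative only) of the feedback rule described above; the number of levels and the timeslice values are made up, and real schedulers track much more state:

```c
#include <stdio.h>

#define NUM_LEVELS 3

/* Hypothetical timeslices: short at the top, long at the bottom. */
static const int timeslice_ms[NUM_LEVELS] = {10, 20, 40};

struct task { int level; };   /* 0 = topmost (highest-priority) queue */

void feedback(struct task *t, int ran_ms, int yielded_for_io) {
    if (yielded_for_io) {
        if (t->level > 0) t->level--;                   /* I/O-bound: boost  */
    } else if (ran_ms >= timeslice_ms[t->level]) {
        if (t->level < NUM_LEVELS - 1) t->level++;      /* CPU-bound: demote */
    }
}

int main(void) {
    struct task t = { .level = 0 };
    feedback(&t, 10, 0);   /* used its whole timeslice -> pushed down */
    printf("after CPU burst: level %d\n", t.level);
    feedback(&t, 3, 1);    /* yielded for I/O -> boosted back up      */
    printf("after I/O yield: level %d\n", t.level);
    return 0;
}
```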

6
Q

Do you understand what were the problems with the O(1) scheduler which led to the CFS?

A

The problem was that tasks in the active queue had to wait until all tasks had exhausted their timeslices before the active and expired queues were swapped. This created too much jitter for applications that needed realtime-like performance, such as Skype. CFS instead relies on a balanced (red-black) tree for its runqueue, which allows much more frequent updates of task priority/runtime.

7
Q

Thinking about Fedorova’s paper on scheduling for chip multi processors, what’s the goal of the scheduler she’s arguing for?

A

To better handle resource contention in order to deliver better application performance.

8
Q

What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput.

A

Traditionally, we look at instructions per cycle (IPC). Compute-bound tasks have an IPC close to 1, while memory-bound tasks are closer to 0. Fedorova proposed using cycles per instruction (CPI): compute-bound tasks have a CPI close to 1, while memory-bound tasks have a CPI much higher than 1. To maximize system throughput, she proposed giving each core a mixed-CPI workload, which leads to a well-utilized processor pipeline and high overall IPC. Workloads with similar CPI values on every core either create resource contention (all memory-bound) or leave resources idle and waste cycles on other cores.

9
Q

Linux Scheduler Quiz: What was the main reason the Linux O(1) scheduler was replaced by the CFS scheduler?

A) Scheduling a task under high loads took an unpredictable amount of time.

B) Low-priority tasks could wait indefinitely and starve.

C) Interactive tasks could wait unpredictable amounts of time to be scheduled.

A

C

10
Q

Shortest Job First (SJF) Performance Quiz:

Assume SJF is used to schedule tasks T1, T2, T3. Also, make the following assumptions:

  • the scheduler does not preempt tasks
  • known execution times: T1=1s, T2=10s, T3=1s
  • all tasks arrive at the same time, t=0

Calculate the throughput, avg. completion time, and avg. wait time.

A

In SJF order, the tasks run T1, T3, T2, completing at t = 1s, 2s, and 12s.
Throughput: 3 tasks / 12s = 0.25 tasks/s
Avg. completion time: (1 + 2 + 12) / 3 = 5s
Avg. wait time: (0 + 1 + 2) / 3 = 1s

11
Q

How do we deal with the fact that processes address more memory than physically available? What’s demand paging?

A

Because a process's virtual address space can be larger than the available physical memory, we rely on page swapping and demand paging. Processes will rarely need the full theoretical amount of virtual memory at once, so with demand paging pages are swapped in and out of physical memory and in and out of a swap partition (e.g., disk, flash device) as they are actually needed.

12
Q

How does page replacement work?

A

When a page is referenced but not present in memory, the MMU will see that the page table entry has the present bit set to 0, raise a fault, and trap to the OS. The OS determines whether the page was swapped out to disk and, if so, issues the disk access required to bring the page back in. To decide which pages should be swapped out, the OS uses history-based prediction such as Least Recently Used (LRU), which relies on the access bit.

13
Q

What happens when a process tries to modify a page that’s write protected/how does COW work?

A

The Copy-On-Write (COW) mechanism avoids unnecessary copying of a process's virtual address space. On process creation, the new process's virtual address space is mapped to the original pages, and the shared memory is write-protected. For reads, this saves memory and the time to copy. When either process issues a write, the MMU raises a page fault; the OS then creates a copy of the affected pages and updates the page tables of each process. The copy cost is paid only on demand and only if necessary.

14
Q

How does address translation work?

A

Every CPU package is equipped with a Memory Management Unit (MMU). The CPU issues virtual addresses to the MMU, and the MMU is responsible for translating them into physical addresses (or generating a fault). Registers are used to store pointers to the active page tables or, for segments, details such as the segment size and number.
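A toy illustration of the translation itself, assuming 32-bit virtual addresses, 4KB pages, and a flat page table that maps virtual page numbers straight to physical frame numbers (a real MMU does this in hardware, and real page table entries carry more than a frame number):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4KB pages -> 12 offset bits */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

uint32_t translate(const uint32_t *page_table, uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number      */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* offset within the page   */
    uint32_t pfn    = page_table[vpn];          /* physical frame number    */
    return (pfn << PAGE_SHIFT) | offset;        /* physical address         */
}

int main(void) {
    uint32_t page_table[16] = {0};
    page_table[2] = 7;                          /* map VPN 2 -> PFN 7       */
    printf("0x%x\n", translate(page_table, 0x2abc));   /* prints 0x7abc     */
    return 0;
}
```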

15
Q

What’s the role of the TLB?

A

The Translation Lookaside Buffer (TLB) is a cache of valid virtual-to-physical address translations that is consulted by the MMU. The TLB is used to speed up address translation.

16
Q

How does the OS map the memory allocated to a process to the underlying physical memory?

A

The OS creates a page table per process. The page table contains the entries that map a page from virtual memory to a page frame in physical memory. Memory is allocated only when it is needed so the mapping doesn’t occur until a process attempts to access it.

17
Q

What happens when a process tries to access a page not present in physical memory?

A

The page table entry will show, via the "valid bit", that the page is not present in physical memory. When the process accesses the page, the hardware MMU (memory management unit) sees the valid bit is 0, raises a fault, and traps to the OS. The OS assumes control and decides what to do; for example, if the page was swapped out to disk, the OS pulls the page back in from disk.

18
Q

What happens when a process tries to access a page that hasn’t been allocated to it?

A

The page table entry will show, via the "valid bit", that no physical memory has been allocated for that page. When the process accesses it, the hardware MMU (memory management unit) sees the valid bit is 0, raises a fault, and traps to the OS. The OS assumes control and decides what to do; if the access is permitted, the OS allocates physical memory, updates the mapping, and lets the process continue.

19
Q

Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table…

A

For an address size of X bits (e.g., a 64-bit architecture), a process can address a 2^X address space (2^64). For a page size of 2^Y bytes (4KB = 2^12 bytes), Y bits are used for the offset, so a flat page table needs 2^(X-Y) entries (2^(64-12) = 2^52). Since each entry is X bits (X/8 bytes), the page table size is 2^(X-Y) * X/8 bytes (2^52 * 8B = 32PB).

20
Q

Do you understand the benefits of hierarchical page tables? For a given address format, can you work out the sizes of the page table structures in different layers?

A

Multi-level page tables don't require a page table entry for every virtual address. As a result, we have smaller internal page tables/directories and can cover the address space with more granularity. The downside is that address translation requires more steps (more tables to walk). For example, take a 32-bit address broken into a 12-bit segment, a 10-bit segment, and another 10-bit segment. The first segment indexes the outer directory, which points to up to 2^12 page tables. Each inner page table maintains 2^10 entries, and each entry can address 2^10 bytes, so one inner page table covers 2^10 x 2^10 = 2^20 bytes (1MB) of the address space.

21
Q

Page Table Size Quiz:

On a 12-bit architecture, what is the number of entries in the page table if the page size is 32 bytes? How about 512 bytes? (Assume a single-level page table.)

A

128 entries and 8 entries, respectively. From the page size, determine the number of offset bits: 32 bytes = 2^5, so 5 bits are used for the offset. Subtract the offset bits from the address size (12 - 5 = 7) and raise 2 to that power to get the number of entries: 2^7 = 128. For 512-byte pages: 512 = 2^9, 12 - 9 = 3, so 2^3 = 8 entries.
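The same arithmetic as a small C sketch; the inner loop just computes log2 of the page size to get the number of offset bits:

```c
#include <stdio.h>

int main(void) {
    int addr_bits = 12;                       /* 12-bit architecture        */
    int page_sizes[] = {32, 512};             /* page sizes in bytes        */

    for (int i = 0; i < 2; i++) {
        int offset_bits = 0;
        for (int p = page_sizes[i]; p > 1; p >>= 1)
            offset_bits++;                    /* log2(page size)            */
        printf("page size %3dB -> %d entries\n",
               page_sizes[i], 1 << (addr_bits - offset_bits));
    }
    return 0;
}
```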

22
Q

For processes to share memory, what does the OS need to do? Do they use the same virtual addresses to access the same memory?

A

The OS establishes a shared channel between the processes by mapping certain physical memory pages into the virtual address space of each process. The virtual addresses used by each process don't need to be the same.
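One common way to set up such a channel is the POSIX shared-memory API; a minimal sketch, with an arbitrary object name and size (error handling trimmed to a bare minimum):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/demo_shm";           /* arbitrary example name     */
    size_t size = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);  /* create/open object */
    if (fd < 0) return 1;
    if (ftruncate(fd, size) < 0) return 1;            /* set its size       */

    /* Map it into this process's address space; a peer doing the same gets
     * the same physical pages, possibly at a different virtual address.    */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    strcpy(p, "hello");                       /* visible to the peer        */

    munmap(p, size);
    close(fd);
    shm_unlink(name);                         /* remove the object when done */
    return 0;
}
```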

23
Q

For processes to communicate using a shared memory-based communication channel, do they still have to copy data from one location to another?

A

Data copies are potentially reduced but not completely eliminated. A process doesn't need to make a local copy if it only needs to read the data in the shared memory. However, there will likely be situations where the data in shared memory has to be copied into the process's own memory in order to operate on it.

24
Q

What are the costs associated with copying vs. (re-/m)mapping?

A

For copying data via messages, a cost in CPU cycles is paid to copy the data to/from the message channel (ports) on every exchange. For memory mapping, most of the cost is the CPU cycles needed to map the physical memory into the process's virtual address space, but this is paid only at setup time (or amortized across many uses); after that, placing data in the channel is an ordinary memory access, so the copy cost is minimal.

25
Q

What are the tradeoffs between message-based vs. shared-memory-based communication?

A

Message-based IPC pros: simplicity; everything is handled by the OS (e.g., channel management, synchronization).
Message-based IPC cons: overhead of user/kernel crossings, which require context switching; data is copied twice (into and out of the kernel).
Shared-memory IPC pros: after the initial setup, the OS is out of the way; no more user/kernel crossings; data copies are potentially reduced (e.g., for read-only use cases).
Shared-memory IPC cons: more complexity; the developer must handle synchronization, the communication protocol, shared buffer management, etc.

26
Q

What are different ways you can implement synchronization between different processes (think what kinds of options you had in Project 3).

A

Pthread mechanisms like mutexes and condition variables (placed in the shared memory region so both processes can use them). Binary semaphores, which use values of 0 and 1 to provide mutex-like behavior for controlling access to the shared memory. Also message queues, which can provide protocol-like behavior by executing an operation only once a confirmation message has been sent/received.
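As one concrete option, a named POSIX semaphore initialized to 1 behaves like the binary semaphore mentioned above; a minimal sketch with an arbitrary semaphore name:

```c
#include <fcntl.h>
#include <semaphore.h>

int main(void) {
    /* Both processes open the same name; initial value 1 means "unlocked". */
    sem_t *sem = sem_open("/demo_sem", O_CREAT, 0600, 1);
    if (sem == SEM_FAILED) return 1;

    sem_wait(sem);      /* "lock": decrements 1 -> 0, or blocks if already 0 */
    /* ... critical section touching the shared-memory region ...            */
    sem_post(sem);      /* "unlock": increments 0 -> 1, wakes one waiter     */

    sem_close(sem);
    return 0;
}
```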

27
Q

IPC Comparison Quiz :

Consider using IPC to communicate between processes. You can either use a message-passing or a memory-based API. Which one do you think will perform better?

a) message-passing
b) shared memory
c) neither; it depends

A

c) neither; it depends

28
Q

To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?

A

Because with concurrent threads, a purely software solution cannot efficiently guarantee mutual exclusion. Whether your code uses a while loop or an if/else statement, checking the lock value and setting the lock value are separate operations that multiple threads can interleave. Hardware atomic instructions such as test_and_set, read_and_increment, or compare_and_swap perform the check and the update as a single indivisible operation, which is what a correct lock acquisition loop is built on.
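A minimal spinlock sketch built directly on an atomic test-and-set, here using the GCC/Clang __atomic builtins as the hardware-provided primitive (an illustration, not code from the course):

```c
#include <stdbool.h>

static bool lock_flag = false;

void spin_lock(void) {
    /* __atomic_test_and_set atomically sets the flag and returns its old
     * value; keep spinning until we are the thread that flipped 0 -> 1.   */
    while (__atomic_test_and_set(&lock_flag, __ATOMIC_ACQUIRE))
        ;   /* spin */
}

void spin_unlock(void) {
    __atomic_clear(&lock_flag, __ATOMIC_RELEASE);
}

int main(void) {
    spin_lock();
    /* ... critical section ... */
    spin_unlock();
    return 0;
}
```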

29
Q

Why are spinlocks useful? Would you use a spinlock in every place where you’re currently using a mutex?

A

Spinlocks are useful because the waiting thread keeps spinning, burning CPU cycles while repeatedly checking whether the lock has become available. Unlike a mutex, a spinlock doesn't block and wait to be signaled, so for small critical sections, or when the thread has no other work to do, acquiring the lock can be faster. Spinlocks also serve as a basic building block for more complex synchronization constructs. No: spinlocks provide mutual exclusion just as mutexes do, so they overlap, and there are many cases where it makes more sense for a thread to block so that another thread can do useful work, which is exactly what a mutex allows.

30
Q

Do you understand why it is useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?

A

Higher-level synchronization constructs like reader-writer locks or monitors are more powerful because they abstract away much of the complexity involved in using basic constructs like mutexes, condition variables, and spinlocks. Using those low-level constructs directly is error prone, which hurts correctness and ease of use, and they lack built-in ways to express additional semantics such as the type of access (read vs. write) or priority.

31
Q

Can you work through the evolution of the spinlock implementations described in the Anderson paper, from basic test-and-set to the queuing lock? Do you understand what issue with an earlier implementation is addressed with a subsequent spinlock implementation?

A

Test-and-Set: Low latency and low delay, but very bad contention. The lock holder has to contend with all the other test_and_set operations, and there is no clear way to give it priority. In addition, the atomic instruction bypasses the cache and goes directly to memory on every spin. It isn't even included in the paper's comparison figure.
Test-and-Test-and-Set (spin on read): Spins on the cached value of the lock first and only executes test_and_set once the lock looks free. Latency and delay are a little worse, and it performs well under light load. With a write-update coherence strategy, performance is acceptable because the cache gets updated with the new lock value; with write-invalidate, it is the worst, creating a lot of contention and coherence traffic.
Delay locks: A delay is introduced either after the lock is released or after every memory reference to the lock. The goal is to spread out the threads; this improves contention but makes delay worse. Static delays do a bit better under heavy load, dynamic delays under lighter load. Delaying after each memory reference is better than delaying only after the lock is freed.
Queuing lock: Uses an array of flags with one element per thread, holding the values must_wait or has_lock, and relies on the atomic read_and_increment to assign each arriving thread a slot in the queue. This lock is great for contention and delay and performs best under heavy load; under light load it is the worst due to the overhead of read_and_increment and the queue setup.
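For comparison with the plain test-and-set sketch earlier in the deck, a minimal "spin on read" (test-and-test-and-set) sketch, again using the GCC/Clang __atomic builtins as a stand-in for the paper's primitives:

```c
#include <stdbool.h>

static bool lock_flag = false;

void ttas_lock(void) {
    for (;;) {
        /* test: a plain read that can be served from the local cache */
        while (__atomic_load_n(&lock_flag, __ATOMIC_RELAXED))
            ;   /* spin on the cached value */
        /* test-and-set: only now issue the expensive atomic operation */
        if (!__atomic_test_and_set(&lock_flag, __ATOMIC_ACQUIRE))
            return;   /* we acquired the lock */
    }
}

void ttas_unlock(void) {
    __atomic_clear(&lock_flag, __ATOMIC_RELEASE);
}

int main(void) {
    ttas_lock();
    /* ... critical section ... */
    ttas_unlock();
    return 0;
}
```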

32
Q

What are the steps in sending a command to a device (say packet, or file block)? What are the steps in receiving something from a device?

A

Starting with the user process, a system call (e.g., send data, read file) goes to the kernel. The kernel runs the in-kernel stack for the applicable device (the TCP/IP stack to form a packet, or the filesystem to determine which disk block stores the file data). The kernel then invokes the applicable device driver, which performs the device-specific configuration of the request (e.g., transmit the packet data, issue the disk head movement). Once configured, the device performs the request (transmits the packet, reads the block from disk), and the results traverse the same steps in reverse. There is also another path, called OS bypass, in which the user process interacts with the device directly, without going through the kernel.

33
Q

What are the basic differences in using programmed I/O vs. DMA support?

A

With programmed I/O (PIO), the CPU writes the command register and then performs every subsequent data-register access needed for the transfer. For the same transfer with DMA support, the CPU writes the command register once and makes one setup call to the DMA controller describing the memory address and size of the buffer to be transferred; the DMA controller then handles the data movement. DMA is preferred for large transfers, while programmed I/O is preferred for frequent transfers of small amounts of data, since it avoids the cost of the DMA setup.

34
Q

For block storage devices, do you understand the basic virtual file system stack, the purpose of the different entities?

A

At the top of the stack, user applications interface with files via the POSIX API. Next is the kernel file system (FS), which takes application-level reads/writes and determines where and how to find the file blocks and access them. The FS relies on the generic block layer to interact with a particular device driver and to interpret its responses, and the device driver speaks the device-specific API to the device. The file is the main abstraction of the VFS and is represented to applications by a file descriptor. Each file has an inode structure, which holds an index of all the blocks of that file. A dentry (directory entry) tracks a single path component of a file/directory that has been accessed, and a superblock tracks information about how the filesystem is laid out on disk.

35
Q

Do you understand the relationship between the various data structures (block sizes, addressing scheme, etc.) and the total size of the files or the file system that can be supported on a system?

A

The size of the inode directly determines the limit on file size. An inode contains direct pointers to file blocks, so with 4B pointers to 1KB blocks, a 128B inode would hold 32 pointers (128/4) and the file size limit would be 32KB. To address this, inodes also have indirect pointers that point to a block of pointers: with 1KB blocks and 4B block pointers, each indirect block holds 256 pointers (1KB/4B). The maximum file size is then (number of blocks reachable via direct + single-indirect + double-indirect + triple-indirect pointers) x block size. Using the previous example with 12 direct pointers: (12 + 256 + 256^2 + 256^3) x 1KB ≈ 16GB.
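The same calculation as a small C sketch, assuming 12 direct pointers, 4-byte block pointers, and 1KB blocks (the numbers used above):

```c
#include <stdio.h>

int main(void) {
    unsigned long long block = 1024;             /* block size in bytes       */
    unsigned long long ppb   = block / 4;        /* pointers per block = 256  */

    unsigned long long blocks = 12               /* direct pointers           */
                              + ppb              /* single indirect           */
                              + ppb * ppb        /* double indirect           */
                              + ppb * ppb * ppb; /* triple indirect           */

    printf("max file size = %llu bytes (~%llu GB)\n",
           blocks * block, (blocks * block) >> 30);
    return 0;
}
```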

36
Q

For the virtual file system stack, we mention several optimizations that can reduce the overheads associated with accessing the physical device. Do you understand how each of these optimizations changes how or how much we need to access the device?

A

Buffer caches in main memory: files are read/written through the cache and periodically flushed to disk, reducing the number of disk accesses.
I/O scheduling: maximizes sequential over random disk access by reordering requests to reduce disk head movement. For example, if the disk head is at block 15 and writes for blocks 25 and 17 are pending, block 17 is written first and then block 25.
Prefetching: increases cache hits by exploiting locality and reading additional blocks; for example, when block 17 is read, blocks 18 and 19 are read as well, avoiding future disk reads.
Journaling/logging: writes are first recorded sequentially in a log rather than going straight to their (random) disk locations; the log is periodically applied to the proper locations on disk, reducing random disk access.

37
Q

What is virtualization? What’s the history behind it?

A

Virtualization originated in the 1960s at IBM, where a few large mainframes were shared by many users and business services. Virtualization allows concurrent execution of multiple OSs (and their apps) on the same physical machine. Each OS thinks it "owns" the hardware resources, though what it is actually presented with is a virtual machine. The virtualization layer that manages the physical hardware is referred to as the virtual machine monitor (VMM) or hypervisor. Examples are Xen and ESX.

38
Q

What’s hosted vs. bare-metal virtualization?

A

Bare-metal (also known as hypervisor-based) virtualization relies on a VMM that manages all hardware resources and supports the execution of VMs; for device interactions, it relies on a service (privileged) VM. Hosted virtualization is where the host OS owns the interaction with devices via its own device drivers (no service VM) and a special VMM kernel module provides the hardware interfaces to the VMs. A hosted setup can run native applications directly alongside the VMs. An example is KVM (kernel-based VM).

39
Q

What’s paravirtualization, why is it useful?

A

Paravirtualization gives up on the idea that guest VMs run unmodified. Instead, each guest knows it is running virtualized and can make explicit calls (hypercalls) to the hypervisor. The goal is to avoid the overhead of inspecting and rewriting the guest binary and thereby improve performance. It was originally adopted and made popular by Xen.

40
Q

What were the problems with virtualizing x86? How does protection of x86 used to work and how does it work now? How were/are the virtualization problems on x86 fixed?

A

Pre-2005, x86 had only the 4 protection rings, with the hypervisor in ring 0 and the guest OS in ring 1. When called from ring 1, privileged operations on the flags register (e.g., POPF and PUSHF) failed silently: the hypervisor was never notified, so it never changed its settings, and the guest OS assumed the operations had succeeded. One solution was binary translation, which rewrites the VM binary so it never executes the 17 instructions that cause this problem. Since 2005, hardware extensions (AMD-V, Intel VT) add root/non-root modes and close these holes so that trap-and-emulate works.

41
Q

How does device virtualization work? What's a passthrough vs. a split-device model?

A

Device virtualization is how a guest VM accesses hardware devices. In the passthrough model, the VMM gives the guest VM's device driver direct access permissions to the hardware device, so the guest accesses the device directly and bypasses the VMM. Unfortunately, this makes it difficult to share the device across VMs, requires that the device exactly match what the guest VM's driver expects, and makes VM migration harder because the guest is not decoupled from the hardware. In the split-device driver model, device access is split between a front-end driver in the guest VM and a back-end driver in the service VM (or host). The guest must install the modified front-end driver, so only paravirtualized guests are supported. This model eliminates the emulation overhead of the hypervisor-direct model and allows better management of shared devices.

42
Q

What’s the motivation for RPC?

A

In reviewing the requirements of typical IPC-based applications, it was observed that a lot of common boilerplate code was being rewritten again and again; the primary difference between applications was the protocol definition. RPC gives the developer the ability to define the protocol details while the necessary boilerplate code is auto-generated.

43
Q

What are the various design points that have to be sorted out in implementing an RPC runtime (e.g., binding process, failure semantics, interface specification… )? What are some of the options and associated tradeoffs?

A

Binding: the mechanism that lets a client determine which server to connect to and how. Clients can consult a registry, which can be an online distributed service that any server registers with, or a dedicated process running on every machine (in which case the client must know the machine address and port number).
Interface Definition Language (IDL): defines how to describe the procedures and how to package arguments and results for communication between client and server. The IDL can be language-agnostic (e.g., XDR in SunRPC) or language-specific (e.g., Java in Java RMI); a language-specific IDL mainly benefits those who already use that language.
Pointers as arguments: either disallow them or serialize the pointed-to data. Pointers cause problems because they refer to a location in another process's or machine's address space, so the only way to make them work is to send the pointed-to data as part of the call.
Partial failures: it is hard to identify the cause of an error because so many components could have failed. RPC provides a special error notification that serves as a catch-all for any type of failure without specifying what caused it.

44
Q

What’s specifically done in Sun RPC for these design points – you should easily understand this from your project?

A

IDL: XDR (language-agnostic).
Pointers: allowed; the pointed-to data is serialized.
Failures: a retry mechanism when a connection times out; errors are returned with as much meaningful information as possible.

45
Q

What’s marshalling/unmarshaling?

A

Marshalling is when a procedure and its arguments are serialized/encoded into a contiguous buffer that is sent in the RPC call, laid out so that the receiver can identify the procedure and its arguments in order. Unmarshalling involves reading that buffer, identifying the procedure and the data types, and parsing out the data; as a result, the arguments are placed in the receiver's address space and initialized. Marshalling/unmarshalling routines aren't written by hand; they are generated by the RPC system's compiler.

46
Q

How does an RPC runtime serialize and deserialize complex variable size data structures? What’s specifically done in Sun RPC/XDR?

A

For XDR encoding, all data types are encoded in multiples of 4 bytes. For variable-length data such as strings or arrays, the transmission buffer requires 4 bytes for the length, X bytes for the data (1 byte per character), and any padding bytes needed to bring the total to a multiple of 4 bytes.
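A tiny sketch of that sizing rule for a variable-length string/opaque field (illustrative only; in practice the encoding is done by the XDR routines generated by rpcgen):

```c
#include <stdio.h>

/* Wire size of an XDR string/opaque field: a 4-byte length word, the data
 * itself, then padding so the data occupies a multiple of 4 bytes. */
unsigned xdr_string_wire_size(unsigned len) {
    unsigned padded = (len + 3) & ~3u;   /* round data length up to 4 bytes */
    return 4 + padded;                   /* length field + padded data      */
}

int main(void) {
    printf("%u bytes\n", xdr_string_wire_size(10));   /* 4 + 12 = 16 */
    return 0;
}
```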

47
Q

When sharing state, what are the tradeoffs associated with the sharing granularity?

A

Sharing at the granularity of either the cache line or the variable is too fine-grained and would lead to too much coherence traffic. Instead, it is preferred to share at the granularity of a page which makes sense for the OS. However, we need to be careful of false sharing. Sharing at the granularity of the object also works but it is dependent on the language runtime.

48
Q

For distributed state management systems (think distributed shared memory) what are the basic mechanisms needed to maintain consistency – e.g., do you know why it is useful to use ‘home nodes’, why do we differentiate between a global index structure to find the home nodes and local index structures used by the home nodes to track information about the portion of the state they are responsible for.

A

A global index structure provides a map to all ‘home nodes’ so that we know where the most up-to-date copy of a piece of state resides. A ‘home node’ maintains a local index for the portion of state it is responsible for and drives coherence with the other nodes; it relies on that local index to tell it who else has a copy of the data.

49
Q

What’s a consistency model?

A

It’s a guarantee that state/memory will behave correctly (access ordered and memory propagated) if and only if the software follows specific rules (eg use of locks, atomic operations, counters).

50
Q

What are the different guarantees that change in the different models we mentioned – strict, sequential, causal, weak…?

A

Strict consistency: updates are visible everywhere immediately and in the same order. In practice this can't be guaranteed even on a single SMP without locking and synchronization; in distributed systems, latency and message reordering/loss make it impossible.
Sequential consistency: memory updates from different processors may be arbitrarily interleaved, but all processes see the same ordering of updates, and updates from the same process appear in the order they were issued.
Causal consistency: writes that are causally related (e.g., P2 read P1's write before issuing its own write) are guaranteed to be seen in the correct order; concurrent, unrelated writes have no ordering guarantee.
Weak consistency: relies on synchronization points; a process sees other processes' updates when it synchronizes, and how often synchronization points occur determines the consistency observed. This limits data movement and coherence operations, but extra state must be maintained to support the additional operations.

51
Q

When managing large-scale distributed systems and services, what are the pros and cons with adopting a homogeneous vs. a heterogeneous design?

A

Homogeneous pros: keeps the front end (e.g., the load balancer) simple.
Homogeneous cons: can't benefit from per-node caching, since any node may serve any request.
Heterogeneous pros: different nodes handle different tasks/requests, so they benefit from locality and caching.
Heterogeneous cons: requires a more complex front end and more complex node management.
