CS 6200 - Final Provided Review Questions (63 flashcards)

How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?

scheduler dispatches tasks from runqueue
• scheduling policy/algorithm decides which task to schedule

Tasks enter the runqueue after:
• creation (fork)
• interrupt
• I/O Completion
• expiration of time slice

Scheduler Runs when:
• CPU becomes idle
• Timeslice expires
• New task becomes available

When dispatching a thread:
• context switch to selected thread
• enter user mode
• set PC to next instruction of scheduled task


Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?

• For CPU-bound tasks, longer timeslices are better: fewer context switches (less overhead), which keeps CPU utilization and throughput high.
• For I/O-bound tasks, shorter timeslices are better: I/O operations get issued sooner, giving better user-perceived performance.


Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…

Throughput = tasks / sec
• total # of tasks / (timestamp when all tasks are done)

Avg Completion Time
• sum of each task's time to complete / # of tasks completed
• (if all tasks arrive at t=0, a task's time to complete is just its completion timestamp)

Avg Wait Time
• sum of each task's wait time before it first runs / # of tasks
• (if all tasks arrive at t=0, a task's wait is just the timestamp it started at)
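As a worked example (a hypothetical FCFS workload, all tasks arriving at t=0, execution times made up for illustration):

```python
# Hypothetical workload: three tasks arrive at t=0 and run FCFS to
# completion with execution times 1s, 10s, 1s.
exec_times = [1, 10, 1]

start_times, finish_times = [], []
t = 0
for e in exec_times:
    start_times.append(t)    # wait time = start - arrival (arrival = 0)
    t += e
    finish_times.append(t)

throughput = len(exec_times) / finish_times[-1]       # tasks per second
avg_completion = sum(finish_times) / len(exec_times)  # completion = finish timestamp here
avg_wait = sum(start_times) / len(exec_times)

print(throughput, avg_completion, avg_wait)  # -> 0.25 8.0 4.0
```

Note how the long middle task inflates both average completion time and average wait for everything queued behind it.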


Can you contrast the MLFQ with the O(1) scheduler? Do you understand what were the problems with the O(1) scheduler which led to the CFS?

O(1) scheduler:
• Two arrays (runqueues): Active and Expired
• Active array is used to pick the next task to run
• If a task is preempted before finishing its timeslice, it is put back in the active array with the remaining portion of its timeslice
• Once its timeslice expires, the task is moved to the Expired array
• Once the active array is empty, the pointers are swapped and the expired array becomes the active array

The O(1) scheduler was replaced because it hurt the performance of interactive tasks: once on the expired list, a task wouldn't run again until the entire active array was exhausted. It also made no fairness guarantees.

CFS - only for non-real-time tasks
• Red-black tree (self-balancing: all paths to leaves are approximately the same length)
• ordered by virtual runtime (vruntime - time spent on the CPU)
• Always schedules the task with the least vruntime (generally the leftmost node)
• periodically updates the current task's vruntime
• then compares the current vruntime to the leftmost node's
• schedules whichever is lower
• if swapping to another task, the old task is moved to its proper place in the tree
• vruntime progresses faster for lower-priority tasks
• vruntime progresses slower for higher-priority tasks
• Uses the same tree for all priorities

• selecting a task => O(1) time
• adding a task => O(log n)
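A toy sketch of the pick-lowest-vruntime idea (a binary heap stands in for the kernel's red-black tree; task names, weights, and starting vruntimes are made up):

```python
import heapq

# (vruntime, task) pairs; the heap keeps the least-vruntime task on top,
# like the leftmost node of CFS's red-black tree.
tree = [(0.0, "A"), (1.0, "B"), (2.0, "C")]
heapq.heapify(tree)

def run_next(tree, timeslice, weight=1.0):
    """Run the leftmost (least-vruntime) task, then reinsert it."""
    vruntime, name = heapq.heappop(tree)
    # vruntime advances more slowly for high-weight (high-priority) tasks,
    # so they get picked more often
    heapq.heappush(tree, (vruntime + timeslice / weight, name))
    return name

order = [run_next(tree, timeslice=2.0) for _ in range(4)]
print(order)  # -> ['A', 'B', 'A', 'C']
```

Each pick is O(1) (peek the minimum); each reinsertion is O(log n), matching the complexities above.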


What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput.

Fedorova proposes using CPI (cycles per instruction) to judge the job type - low CPI suggests CPU-bound, high CPI suggests memory-bound - and adjusting scheduling accordingly. Historic information (e.g., sleep time) doesn't work here, because memory-bound tasks don't sleep while waiting on memory.

Sources for such metrics: hardware counters (cache misses, IPC, power + energy data) exposed via software interfaces and tools such as oprofile or Linux perf.


What happens when a process tries to modify a page that’s write protected/how does COW work?

Tries to modify a page that's write protected (COW = copy-on-write):
• When a new process is created (fork), it gets a new virtual address space that points to the same physical pages as the parent. This avoids copying everything up front when it may never be needed.
• Those shared pages are all marked write-protected.
• If the new process tries to write to a write-protected page, the processor traps into the OS, which determines it's a (COW) page fault, copies that page for the writing process, and updates its VA => PA mapping.
• This way only the pages that are actually modified get copied, saving space and time - a lazy copy.


How do we deal with the fact that processes address more memory than physically available? What’s demand paging? How does page replacement work?

Processes won't use all their memory all the time.

We can save physical page frames out to secondary storage.

Demand Paging:
• swapped in/out of memory and a swap partition (disk)

Page Replacement:
• if present bit is 0 (not in memory):
• traps into OS
• determines it's a page fault
• Submits memory request to pull it from disk
• puts it into free location in physical memory (DRAM)
• Updates page table with new free location (VA is unchanged).
• program counter reset to make memory access again, but this time it will be present


How does address translation work? What’s the role of the TLB?

Address Translation:
• Virtual Address is mapped to a PA via the page table.
• VA consists of N bit offset (LSBs), then further chunks which represent indexes into the page table, page directory, table of directories, etc.
• Offset is ultimately used to get to the memory location inside the physical page.

TLB - cache of valid VA to PA translations. If present in the cache, no need to do any operations.
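A minimal single-level translation sketch (assuming 32-bit addresses and 4KB pages; the page-table contents here are made up for illustration, not a model of any specific MMU):

```python
PAGE_BITS = 12                     # 4KB pages -> 12 offset bits
page_table = {0x12345: 0x00042}    # VPN -> PFN (hypothetical mapping)
tlb = {}                           # cache of recent VPN -> PFN translations

def translate(va):
    vpn = va >> PAGE_BITS                    # upper bits index the page table
    offset = va & ((1 << PAGE_BITS) - 1)     # lower bits locate the byte in the page
    if vpn in tlb:                           # TLB hit: skip the page-table walk
        pfn = tlb[vpn]
    else:                                    # TLB miss: walk the page table
        pfn = page_table[vpn]
        tlb[vpn] = pfn
    return (pfn << PAGE_BITS) | offset       # offset carries over unchanged

print(hex(translate(0x12345678)))  # -> 0x42678
```

The second access to the same page would be served from the `tlb` dict, which is the whole point of the TLB: valid cached translations avoid the (possibly multi-level) table walk.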


Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table…

• address size: 32 or 64 bits, i.e. 2^32 or 2^64 addressable bytes (depends on architecture)
• address space size / page size = # of virtual pages (VPNs), e.g. 2^32 / page size
• page size is often 4KB (8KB, 2MB, 4MB, 1GB are also options)

• address size = 2^32
• page size = 4KB = 2^12
• (2^32 / 2^12) entries * 4B per entry = 4MB page table in memory, PER PROCESS!
• On a 64-bit architecture that number becomes enormous
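The same arithmetic in code:

```python
# Flat page-table size for a 32-bit address space, 4KB pages,
# 4-byte page-table entries.
address_bits = 32
page_size = 4 * 1024     # 2^12
pte_size = 4             # bytes per page-table entry

num_entries = 2**address_bits // page_size   # 2^20 virtual pages
table_bytes = num_entries * pte_size
print(table_bytes // (1024 * 1024), "MB per process")  # -> 4 MB per process
```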


Do you understand the benefits of hierarchical page tables? For a given address format, can you workout the sizes of the page table structures in different layers?

• Don't need page-table entries for the ENTIRE virtual address space - only for the regions actually in use.

• [P1 | P2 | Offset] - the offset works as before; each P# is an index into a table at that level.
• 2^(# offset bits) = page size
• 2^(# bits in the innermost P) = number of pages per page table
• 2^(# bits in the next level outward) = number of page tables
• Then directories, etc...

+ smaller inner tables give more granular coverage (unused regions need no inner tables)
- more memory accesses required per translation (increased translation latency)
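For a concrete (hypothetical) two-level format, take the classic 10/10/12-bit split of a 32-bit address:

```python
# [P1 | P2 | offset] = [10 | 10 | 12] bits
p1_bits, p2_bits, offset_bits = 10, 10, 12

page_size = 2**offset_bits        # bytes per page
pages_per_table = 2**p2_bits      # entries in each inner page table
num_page_tables = 2**p1_bits      # entries in the outer page directory

# address-space region covered by one inner page table:
region_per_table = pages_per_table * page_size

print(page_size, pages_per_table, num_page_tables, region_per_table)
# -> 4096 1024 1024 4194304  (4KB pages, each inner table covers 4MB)
```

So a process touching only a few 4MB regions allocates only those few inner tables instead of the full flat 4MB table.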


For processes to share memory, what does the OS need to do? Do they use the same virtual addresses to access the same memory?

The OS sets aside a section of physical memory and maps it into each process's virtual address space. The virtual addresses in each process do not have to be the same.


What are the tradeoffs between message-based vs. shared-memory-based communication?

• Message-based IPC is simple and has an established interface, so it's not very error prone. But the data must be copied from one process's user space into the kernel and then into the second process's user space, with 4 user/kernel crossings to do it.
• Shared memory has a heavy initial setup cost - mapping the shared region into each process - but incurs no kernel overhead after that. The region is shared directly between the two processes. It is much more error prone for developers, however, because there are no pre-defined signaling/synchronization mechanisms.


What are different ways you can implement synchronization between different processes (think what kinds of options you had in Project 3)?

• Shared Memory
• Mutexes + condition variables
• semaphores
• Message Queues
• Sockets


To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?

Because you need an operation that will both test the variable and then set it as a single operation, otherwise you could get race conditions. If not atomic, then multiple processors could see a lock was free and 'acquire' the lock and move into the critical section.

If it's an atomic instruction, only one processor at a time can 'test and set' the variable at a time.

Examples of atomic instructions: test_and_set, read_and_increment, compare_and_swap. Constructs like spinlocks and test_and_test_and_set are built on top of these.

Note: On SMP systems, atomic instructions go to the memory instead of the cache so they can be ordered/synchronized. Therefore, they take much longer and generate coherence traffic regardless of change.


Why are spinlocks useful? Would you use a spinlock in every place where you’re currently using a mutex?

Spinlocks are useful when the critical section is very short. Spinning avoids the cost of a context switch and of blocking/waking on a mutex and condition checks (many cycles, which may exceed the length of a short critical section).

No - really depends on what the critical section looks like.

It’s also important to note that spin-waiting can cause other processes to slow by consuming communication bandwidth (Anderson)


Do you understand why is it useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?

They are less error prone and more expressive. Monitors can have enter/exit code that is implicitly executed for you so you don't forget to do it. This prevents difficult to trace errors. They provide sync structures with a better granularity to the type of critical section they are protecting. Example: reader/writer locks are specific to the types of access they protect.


Which lock types are best under low and high load (test_and_set, queuing, test_and_test_and_set, etc)?

• High Load
• Queue is BEST under high load
• Static better than dynamic
• Low Load
• test_and_test_and_set is best
• Queueing lock is worst


What are the basic differences in using programmed I/O vs. DMA support?

Programmed I/O:
• No additional HW support needed; the CPU interacts directly with the device's command and data registers; takes multiple CPU store instructions; better for small data transfers.

DMA:
• Relies on a DMA controller; fewer store instructions, but DMA configuration is more complex and can take more overall CPU cycles; better for large data transfers.

So for a 1500B packet with 8-byte device registers, DMA needs only:
• 1 CPU access to request the packet transmission
• 1 DMA configuration operation

DMA config is more complex, so for smaller transfers, programmed I/O still takes less operations than DMA.

The DMA controller requires the data to stay in a memory buffer until the transfer completes - the pages are pinned (non-swappable).
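The operation counts above, worked out in code:

```python
import math

# 1500B packet pushed through 8-byte device registers.
packet_bytes, register_bytes = 1500, 8

# PIO: 1 store for the command + one store per register-full of data
pio_ops = 1 + math.ceil(packet_bytes / register_bytes)
# DMA: 1 store for the command + 1 DMA descriptor configuration
dma_ops = 1 + 1

print(pio_ops, dma_ops)  # -> 189 2
```

The DMA count stays at 2 regardless of packet size, but each of those operations is heavier; for transfers of only a few bytes the simple PIO stores can still be cheaper overall.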


Do you understand the relationship between the various data structures (block sizes, addressing scheme, etc.) and the total size of the files or the file system that can be supported on a system?

Data Structures and Relationships
• inodes point to file meta data + all the blocks that represent the file
• inodes have:
• direct blocks
• single indirect
• double indirect
• and so on...


For the virtual file system stack, we mention several optimizations that can reduce the overheads associated with accessing the physical device. Do you understand how each of these optimizations changes how or how much we need to access the device?

• buffer/caching - reduce # accesses
• in main memory
• read/write from cache
• periodically flush to disk (fsync)

• I/O Scheduling - reduce disk head movement
• maximize sequential vs random access
• write block 25, write block 17 => write 17, 25

• prefetching - increase cache hits
• leverages locality
• eg, read block 17 => read also 18, 19

• journaling/logging - reduce random access
• "describe" write in log: block, offset, value ...
• periodically apply update to proper disk locations


What’s hosted vs. bare-metal virtualization? What’s paravirtualization, why is it useful?

Hosted (type 2):
• Host OS has a VMM module that invokes drivers as needed
• Host owns all hardware
• e.g., KVM (Kernel-based Virtual Machine)

Bare-metal (type 1):
• A hypervisor (VMM) controls resource allocation to all of the VMs present and controls access to the underlying hardware
• Has a privileged service VM to deal with devices and device access
• e.g., ESX, Xen

Paravirtualization:
• A modified version of the guest OS, so that it's aware it is being virtualized
• Makes explicit calls to the hypervisor (hypercalls)


What were the problems with virtualizing x86? How does protection of x86 used to work and how does it work now?

Certain instructions would not trap into the hypervisor and therefore just would not get executed.

• Old Way:
• Only 4 rings, no support for root/non-root
• 17 instructions that were privileged and did not cause trap, thus failed silently
• interrupt enable/disable bit in privileged register, failed silently from Ring1 OS, so VMM didn't know it was trying to set it, and OS assumed it worked.

• New Way:
• Rewrite binary of guest VM so it never issues those 17 instructions (dynamic binary translation)
• dynamically inspect code block. if bad instructions not found, execute at HW speeds, if found, emulate desired behavior (possibly avoiding trap)
• cache translated block to improve efficiency, only analyze kernel code


What is passthrough vs. a split-device model?

Passthrough:
• VMM driver configures device access permissions
+ guest VM has exclusive, direct access to the device
- device sharing is difficult - no concurrent sharing
- VMM must have the exact type of device the VM expects (since the guest has direct access, it's very specific)
- VM migration is tricky b/c the destination must have the same hardware and state

Hypervisor-direct (emulation):
• VMM intercepts all device accesses
• emulates the device operation
+ device independent
- latency added to device operations
- hypervisor exposed to device-driver ecosystem complexities/bugs

Split-device model:
• device access split between a front end (guest) and a back end (service VM or host)
• back end is the regular device driver the host/Linux would use
• front-end driver MUST be modified - basically a wrapper that builds messages sent to the service VM instead of issuing requests directly to the hardware
• limited to paravirtualized guests that can install the special front-end device drivers
+ eliminates emulation overhead
+ allows better management of shared devices, b/c all requests are arbitrated by the service VM


What are the various design points that have to be sorted out in implementing an RPC runtime (e.g., binding process, failure semantics, interface specification… )? What are some of the options and associated tradeoffs?

RPC Requirements:
• Has synchronous call semantics
• error handling (type checking, packet bytes interpretation)
• cross machine conversion (endian-ness)
• higher level protocol (access control, fault tolerance, different transport protocols)

RPC Structure:
• Client calls fn that looks like a normal function
• This fn actually calls a stub that creates message and sends the request to the server
• server stub receives msg and deserializes/unpacks it
• server stub calls actual implementation and then process goes in reverse to return result

RPC Steps:
• register - server 'registers' procedures, arg types, location, etc so it can be discovered by clients
• binding - find and 'binds' to desired server that implements desired function
• client makes RPC
• client stub 'marshals' args (serialized into buffer)
• send - client sends message to server
• receive - server receives msg, passes to server stub
• stub 'unmarshals' args (extract args and create data structs)
• actual call: server stub calls the local implementation
• result: server computes results and follows steps backwards
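The stub steps above can be sketched with a toy marshalling example for a hypothetical add(i, j) call (the procedure id and the fixed three-integer message layout are made up for illustration, not Sun RPC's actual wire format):

```python
import struct

ADD_PROC = 1   # made-up procedure number, agreed on at register/bind time

def marshal_add(i, j):
    # client stub: serialize procedure id + args into a byte buffer,
    # in network (big-endian) order
    return struct.pack("!III", ADD_PROC, i, j)

def unmarshal_add(buf):
    # server stub: extract the fields and dispatch on the procedure id
    proc, i, j = struct.unpack("!III", buf)
    assert proc == ADD_PROC
    return i, j

msg = marshal_add(2, 3)
i, j = unmarshal_add(msg)
print(len(msg), i + j)  # -> 12 5  (12-byte message; server computes 5)
```

The result would travel back the same way: the server stub marshals the return value and the client stub unmarshals it into what looks like an ordinary function return.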

RPC Design Points:
• Binding - how to find the server
• IDL - how to talk to the server and package data
• Pointers - disallow vs serialize pointed data
• Partial failures - special error notifications


What’s specifically done in Sun RPC for these design points – you should easily understand this from your project?

Binding - per-machine registry daemon. talks to registry on target machine.

IDL - XDR (language agnostic specification and encoding)

Pointers - allowed and serialized

Partial Failures - retries; return as much info as possible


What’s marshalling/unmarshaling? How does an RPC runtime serialize and deserialize complex variable size data structures? What’s specifically done in Sun RPC/XDR?

Serialization/DeSerialization - converting from objects to a byte stream and byte stream to objects

Variable size structures are represented by a pointer to the start of the memory location and then a length of the object.

Sun's solution - XDR:
• All data types are encoded in multiples of 4 bytes
• Big-endian is the transmission standard
• Two's complement is used for integers
• IEEE format is used for floating point
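A sketch of XDR-style encoding for a variable-length byte string, illustrating the 4-byte rule: a big-endian 4-byte length, then the data padded out to a multiple of 4 bytes (simplified; see the XDR spec for the real encoding rules):

```python
import struct

def xdr_opaque(data: bytes) -> bytes:
    # length prefix in big-endian, then data, then zero padding to a
    # 4-byte boundary
    pad = (4 - len(data) % 4) % 4
    return struct.pack("!I", len(data)) + data + b"\x00" * pad

encoded = xdr_opaque(b"hello")
print(len(encoded))  # -> 12  (4 length bytes + 5 data bytes + 3 pad bytes)
```

This is how the runtime handles variable-size structures generally: transmit the length, then the contents, so the receiver knows how much to read back out.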


What are the tradeoffs (benefits and costs) associated with using techniques such as caching, replication, partitioning, in the implementation of a distributed service (think distributed file service).

+ performance, less network interaction
- complexity, coherence mechanisms and management

file sharing semantics:
• unix semantics - every write visible immediately (not realistic)
• session semantics (write back on close, update on open) - not a great solution still
• periodic updates - client writes back periodically and server invalidates periodically. Augment w/ flush/sync API.
• immutable files - never modify, only create new
• transactions - all changes are atomic

replication - each server has all files
+ load balancing, availability, fault tolerance
- writes are more complex (sync to all, write to one, prop to others)
- replicas must be reconciled (voting)

partitioning - each server contains some of the files
+ availability vs single server DFS, scalability (w/ file sys size), single file writes are simpler
- on failure, lose portion of data, load balancing harder, if not balanced could get hot spots
• both - files partitioned, partitions replicated


The Sprite caching paper motivates its design based on empirical data about how users access and share files. Do you understand how the empirical data translated in specific design decisions?

Concurrent writing was rare so optimized for common case and turn off caching in the rare instance of concurrent writers.

Files were often opened very briefly, not justifying high overhead of session semantics

Good portion of new data deleted within 30s - 5min, so write-back on close not necessary - collect dirty blocks from last writer when another client tries to open


When sharing state, what are the tradeoffs associated with the sharing granularity?

For small granularity, you are very specific about what is being shared (for instance, variables), however, that can generate too much traffic and DSM overheads.

With larger granularity, you cut down on DSM overheads, but now you need to worry about False sharing

e.g., with page granularity, two systems share a page that contains variables X and Y. One CPU cares only about X and the other only about Y. But from the DSM's perspective these look like concurrent accesses to shared state, so it triggers all the DSM coherence/invalidation overheads, even though the two aren't _actually_ sharing the same data - this is false sharing.


For distributed state management systems (think distributed shared memory) what are the basic mechanisms needed to maintain consistency – e.g., do you understand why is it useful to use ‘home nodes’, why do we differentiate between a global index structure to find the home nodes and local index structures used by the home nodes to track information about the portion of the state they are responsible for.

• home node drives coherence (like what the server maintains in a DFS)
• keeps state: page accesses, modifications, caching enabled/disabled, locked. Keeps track of who the current owner is.
• important b/c it's how you find a page - the manager node knows who the owner is.

• Owner: currently 'owns' the page, e.g. is the exclusive writer. Controls all state updates and drives consistency operations. The owner may change.

• Each page object has:
• address = node id + page frame number
• node id = home node

• Global Map (replicated):
• object (page) id => manager node id
• manager map available on each node
• Can also use a global mapping table where the object id is an index into the table, which then holds the manager node. This means the manager node can change, whereas in the other scheme the manager is fixed as part of the object id.

• Metadata for local pages (partitioned)
• per page metadata is distributed across managers


Do you have some ideas how would you go about implementing a distributed shared memory system?

• DSM needs to intercept memory accesses to see if it needs to do anything related to the DSM
• can do this via the trap - when the VA is accessed it will trap into the OS if it doesn't exist, this can then be passed to the DSM layer where it can then handle the access
• if content is cached on a node, DSM layer will ensure content is write protected and cause a trap if anyone tries to modify that content. Will then perform necessary coherence operations.
• Other MMU information is useful (dirty page)


What are the different guarantees that change in the different models we mentioned – strict, sequential, causal, weak?

strict - updates are instantaneous and visible everywhere immediately, in the order they occurred. Not actually achievable in DSM (or even on SMPs without extra locking/synchronization) - only a theoretical model, due to the latency and overheads of implementing it.

sequential - memory updates from different processors may be arbitrarily interleaved, but all processes should see the same values in the same order . Each process will see the same interleaving. Updates from the same process will not be arbitrarily interleaved. They must be observed by other processes in the actual order they occurred.

causal - causally related updates must be seen by all processors in the order of the causal relationship. E.g., P1 W_m1(x), P2 R_m1(x), P2 W_m2(y) => causal, b/c after P1 wrote x to m1, P2 read that value and then wrote y to m2, so the read of m1 may have influenced the value y. If writes are unrelated (no read-before-write across different variables), there are no ordering guarantees (these are called 'concurrent' writes). Writes from the same processor must still be observed by other processors in the order they occurred.

weak - Uses 'sync' points. A sync point makes sure all updates prior to this point will become visible to all other processors and all other updates on other processors so far will become visible here. Must be called by process performing the update AND process that wishes to know what updates have been made.

Can also use an entry/acquire and exit/release. entry/acquire means to receive (pull) all updates made by others while exit/release means push all updates to others.


When managing large-scale distributed systems and services, what are the pros and cons with adopting a homogeneous vs. a heterogeneous design?

Homogeneous (any node can handle any step):
+ simple design, easy to work with
+ easy to scale - just add more workers
- lose locality, b/c each worker does the entire process

Heterogeneous (nodes specialized for certain steps):
+ gain locality, since each worker is doing a specific task
- more complicated to scale: which specific tasks do you scale, and how do you identify which ones are the bottleneck?


Do you understand some of the enabling technologies that make cloud offerings broadly useful?

• High speed interconnects
• Distributed Memory and File Systems
• Virtualization (Server/Network)
• Distributed storage


Do you understand what about the cloud scales make it practical? Do you understand what about the cloud scales make failures unavoidable?

Practical Scalability:
• economies of scale - you buy a single set of hardware, then you can virtualize the servers and the network.
• You can host multiple tenants in a single location, on a single machine even.
• You can easily provision new systems as needed, or bring them down

Unavoidable Failures:
• With this many devices, each with a fixed level of reliability, something is going to fail eventually.


What are the overheads associated with scheduling?

• Context Switching + scheduler computation time


What is the motivation behind the multi-level feedback queue?

MLFQ - ability to dynamically move a task between different queues (I/O-bound, Mixed, CPU-bound). Lets a task live in the queue appropriate for its workload type.


Why do different queues in a MLFQ have different timeslices?

Because different workloads perform better with different time slices. CPU bound tasks favor long time slices and I/O bound tasks favor short time slices.


How do threads move between these queues in an MLFQ?

1. Task starts in the topmost (shortest-timeslice, I/O-bound) queue.
2. If it yields voluntarily (issues an I/O op), it stays in this queue. If it runs until timeslice expiration, it moves down into the Mixed queue.
3. Repeat: the task keeps moving down toward the CPU-bound queue as it uses full timeslices, and moves back up when it starts yielding.


Thinking about Fedorova’s paper on scheduling for chip multi processors, what’s the goal of the scheduler she’s arguing for?

The goal of her scheduler is to co-schedule CPU-bound and memory-bound tasks on the same CPU, maximizing utilization: hyperthreading allows a cheap context switch, so you can quickly hop over to kick off a long-running memory operation and then return to the CPU-bound workload - hiding memory access latency with cheap context switches.


How does the OS map the memory allocated to a process to the underlying physical memory?

• Page-based: virtual memory pages are mapped onto physical page frames
• virtual page address -> page table -> DRAM address
• Only the _pages_ are mapped; offsets within a page get you to specific pieces of information
• Virtual address = Virtual Page Number (VPN) : offset
• VPN is an index into the page table
• Indexing into the page table yields the Physical Frame Number (PFN), which combined with the offset (unchanged from the virtual address) gives the physical location in DRAM
• Physical memory is only allocated on first touch (only when actually needed)


What happens when a process tries to access a page not present in physical memory?

• If it just hasn't been allocated yet, then it allocates it (allocate on first touch)
• If it has been allocated but isn't in DRAM, then it raises a fault and traps to the OS
• If bringing it back into memory, it may not put it into the same place in physical memory so the page table gets updated for that VA


What happens when a process tries to access a page that hasn’t been allocated to it?

Tries to access page not allocated to it:
• page fault generated on kernel stack
• trap to OS
• raise SIGSEGV


For processes to communicate using a shared memory-based communication channel, do they still have to copy data from one location to another?

No kernel copies are needed - the memory is shared, so the processes only need to place data into the shared region; it never has to be copied in and out of the kernel.


What are the costs associated with copying vs. (re-/m)mapping?

• Copying incurs the cost of crossing in and out of the kernel on every transfer. Mapping establishes a VA => PA mapping to the shared memory region and can be done once per process, lasting until the process no longer needs the region. Because of this, the overall cost of mapping/unmapping is significantly less than repeatedly copying data.


Can you work through the evolution of the spinlock implementations described in the Anderson paper, from basic test-and-set to the queuing lock?

• test_and_set - performs an atomic operation that tests a variable and then sets its value to 1, returning the original value. If value is already set, setting it again has no effect.
` while (test_and_set(variable) == busy);` -- most basic spinlock
+ low latency (just atomic operation)
+ low delay (spinning continuously on atomic)
- bad contention - processor goes to memory on each test_and_set (b/c it's atomic)

• test_and_test_and_set - test cached value first, if it's free (0) then perform test_and_set.
+ OK latency
+ OK delay
- contention is better, but ...
• if non-CC (no cache coherence), no difference
• if WU (write-update), OK
• if WI (write-invalidate), horrible! Each test_and_set invalidates everyone else's cached copy, defeating the purpose of checking the cache, b/c now they all go to memory to get the value.

• test_and_test_and_set - delay after lock becomes free. do the normal test_and_test_and_set, but if lock is busy go into another normal while(lock == busy) loop, then after it becomes available create a delay before performing the outer loop check.
+ Contention - improved
+ Latency - OK
- Delay - much worse b/c of inherent delay. Could be delaying even with no contention.

• test_and_test_and_set - delay after each spin (instead of when lock becomes free)
+ Works on NCC architectures
- Delay = much worse

• queueing lock - an array with length == # of CPUs. One element == has-lock while all others are must-wait. Each CPU wanting the lock calls read_and_increment(queuelast) to get its own position in the array, then spins on its flag until it's no longer must-wait. After its critical section, it sets its own array slot back to must-wait and sets the next slot in line to has-lock.
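The queueing lock as a sketch (single-threaded illustration only: read_and_increment is a plain function here, but on real hardware it must be a single atomic instruction such as fetch-and-add):

```python
N = 4                      # number of CPUs
HAS_LOCK, MUST_WAIT = 0, 1

flags = [HAS_LOCK] + [MUST_WAIT] * (N - 1)   # slot 0 starts with the lock
queuelast = [0]                              # next free slot (boxed so we can mutate it)

def read_and_increment(cell):
    # stands in for an atomic fetch-and-add on real hardware
    old = cell[0]
    cell[0] = (old + 1) % N
    return old

def lock():
    my_slot = read_and_increment(queuelast)  # take a ticket
    while flags[my_slot] == MUST_WAIT:       # spin on a private slot: no
        pass                                 # contention on a shared variable
    return my_slot

def unlock(my_slot):
    flags[my_slot] = MUST_WAIT               # recycle this slot
    flags[(my_slot + 1) % N] = HAS_LOCK      # hand the lock to the next in line

s0 = lock()    # first CPU gets slot 0, which already holds the lock
unlock(s0)     # hands the lock to slot 1
s1 = lock()    # next CPU gets slot 1 and proceeds immediately
print(s0, s1)  # -> 0 1
```

Each CPU spins on its own array element, so a release touches only the next waiter's flag instead of invalidating every spinner's cache line - the key advantage over test_and_set variants under high load.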


What are the steps in sending a command to a device (say packet, or file block)? What are the steps in receiving something from a device?

Sending a command to a device (e.g., a packet):
• send data (user process) → form packet (kernel) → write Tx request record (driver) → perform transmission (device)

So for a 1500B packet with 8-byte device registers, programmed I/O would need:
• 1 CPU access to request packet transmission
• 188 for the data (~1500 / 8)
• total of 189 CPU instructions

• Receiving something back works in the reverse order (device retrieves and passes to driver, driver passes to kernel, kernel passes to user process)


For block storage devices, do you understand the basic virtual file system stack, the purpose of the different entities?

Virtual File System
file - elements on which VFS operates (identified by an inode)

file descriptor - OS representation of a file (open, read, write, etc)

inode - persistent representation of file "index"
• list of all data blocks
• device, permissions, size, etc

dentry - directory entry, corresponds to a single path component that is being traversed as we try to reach a file
• /users/ada => /, /users, /users/ada
• dentry cache (soft state - dentries don't exist on disk)

superblock - filesystem specific information regarding the FS layout


How do you calculate the max file size for an inode?

To calculate the max file size per inode:
• get the block size (e.g., 1KB = 1024B)
• get the size per pointer (e.g., 4B)
• calculate # of pointers per block:
• block size / pointer size = 1024 / 4 = 256 pointers
• this is the # of pointers in any indirect block
• calculate the total number of block pointers:
• # direct blocks + 256 (single indirect) + 256^2 (double indirect) and so on...
• calculate the total size:
• total # of block pointers * block size
• e.g. (12 + 256 + 256^2 + ...) * 1KB = max KB per inode
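The same calculation in code (assuming 12 direct pointers plus one single- and one double-indirect block, as in the example above):

```python
block_size = 1024   # 1KB blocks
ptr_size = 4        # 4B block pointers

ptrs_per_block = block_size // ptr_size   # 256 pointers in any indirect block

# 12 direct + one single-indirect + one double-indirect
total_blocks = 12 + ptrs_per_block + ptrs_per_block**2
max_file_bytes = total_blocks * block_size

print(total_blocks, max_file_bytes // 1024, "KB")  # -> 65804 65804 KB
```

With 1KB blocks the double-indirect term (256^2) dominates, giving a max file size of roughly 64MB.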


What is virtualization? What’s the history behind it?

Multiple OSes running on a single host, each one believing it is the only OS running and that it owns all the resources available to it.

• Started in the '60s: computers were huge and expensive, so a VMM let multiple OSes run on a single piece of hardware.


How were/are the virtualization problems on x86 fixed?

x86 HW Improvements:
• closed holes in x86 ISA
• added root/non-root modes
• VM Control structure
• extended page tables and tagged TLB with vm ids
• multiqueue devices and interrupt routing
• security and management support
• additional instructions to exercise above features


How does device virtualization work?

Device virtualization uses 3 key models (passthrough, hypervisor-direct emulation, and the split-device model) and standardizes the interface through which a VM interacts with devices.


What’s the motivation for RPC?

• Repeatedly implementing the same infrastructure over and over to do things like client-server communication. Lots of common steps related to remote IPC.
+ higher level interface for data movement and communication
+ error handling
+ hiding complexities of cross machine interactions


What are some of the design options in implementing a distributed service?

Design Options:
• Upload / Download - file moved to client, access done on client, client done => reupload to server
+ local read/writes on client
- entire file download/upload even for small access
- server gives up control - no idea what client is doing

• True Remote - every access is done to the server
+ file access is centralized, server has full control and knows state of access and file
- every file operation pays network costs
- limits server scalability

• Practical Solution - combination of the two
• allow client to store parts of files locally (blocks)
+ low latency on file operations
+ server load reduced, more scalable
• force clients to interact w/ server (frequently)
+ server knows what clients are doing, controls access permissions, easier to maintain consistency
- server more complex, requires different file sharing semantics


What are the tradeoffs associated with a stateless vs. stateful design in DFS?

Stateless vs Stateful:
• stateless - keeps no state
• only OK for the 'extreme' models, not the practical model
- cannot support caching and consistency management
- more bits transferred b/c every request self contained
+ no resources used on the server side (CPU/mem)
+ on failure, just restart - very resilient
• stateful - keeps client state
• needed for 'practical' model to track what's cached/accessed
+ can support locking, caching, incremental operations
- on failure, all state could be lost, so checkpointing and recovery mechanisms are needed
- overheads to maintain state and consistency


What are the major design considerations for creating a distributed service?

• where to cache
• which coherence mechanism to use
• when to write back / update
• session semantics - write back on close, update on open
• periodic updates - clients lease cached data (can be concurrent); server invalidates periodically to provide bounds on inconsistency
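The two update policies above can be contrasted in a short sketch. This is illustrative only; the class and method names are assumptions, and a plain dict stands in for the DFS server.

```python
class SessionCache:
    """Session semantics: read the file on open(), write back on close().
    Other clients see updates only after close()."""
    def __init__(self, server):
        self.server = server  # hypothetical dict-backed store
        self.path = None
        self.data = None

    def open(self, path):
        self.path = path
        self.data = self.server.get(path, "")

    def close(self):
        self.server[self.path] = self.data  # updates become visible here


class PeriodicCache:
    """Periodic updates: a cached copy is trusted only for `ttl` seconds
    (a lease); after that it must be re-fetched from the server."""
    def __init__(self, server, ttl=3.0):
        self.server = server
        self.ttl = ttl
        self.cache = {}  # path -> (data, fetched_at)

    def read(self, path, now):
        entry = self.cache.get(path)
        if entry is None or now - entry[1] >= self.ttl:
            # lease expired (or never held): go back to the server
            entry = (self.server.get(path, ""), now)
            self.cache[path] = entry
        return entry[0]
```

Note the tradeoff: session semantics keeps all operations local between open() and close(), while the periodic model bounds how stale a cached copy can ever be (here, by `ttl`).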


Do you understand what type of data structures were needed at the servers’ and at the clients’ side to support the operation of the Sprite system

Client data structures:
• cachable, cached blocks, dirty block timer (per block), version

Server data structures:
• readers, writer, version, cachable

All open() calls go through the server, which decides whether dirty blocks need to be pulled, whether caching is allowed, etc
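The server-side open() decision can be sketched from the state listed above. The field names (readers, writers, cachable, version) follow the flashcard; the helper logic is an illustrative assumption, not Sprite's actual code.

```python
class FileState:
    """Per-file state kept at the server (fields from the flashcard)."""
    def __init__(self):
        self.readers = 0
        self.writers = 0
        self.cachable = True
        self.version = 0

def server_open(state, mode):
    """Update state on open(); disable client caching on concurrent
    write-sharing (a writer plus any other open client), since that is
    when cached copies could diverge."""
    if mode == "w":
        state.writers += 1
        state.version += 1
    else:
        state.readers += 1
    state.cachable = not (state.writers >= 1
                          and state.readers + state.writers > 1)
    return state.cachable
```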


What’s a consistency model?

What is it?
• agreement b/t memory (state) and upper software layers. Memory behaves correctly if and only if software follows specific rules.
• access ordering
• propagation / visibility of updates


Do you understand the history and motivation behind cloud computing?

Capacity scales elastically with demand; by the law of large numbers (on average, a fixed amount of resources is needed while individual demands vary), shared resources are a good thing, allowing us to capitalize on economies of scale


What are the basic service models of cloud offerings?

• IaaS - provider supplies virtualized infrastructure (compute, storage, network); you manage the OS and everything above it (eg, Amazon EC2)
• PaaS - provider also manages the OS and runtime environment; you manage just your application and data (eg, Google App Engine)
• SaaS - provider manages the entire stack; you just use the software (eg, Gmail)


What are the basic deployment models of cloud offerings?

Basic (deployment) models:
• public (3rd party customers)
• private (leverage tech internally)
• hybrid (failover, deal with spikes, testing)
• community (certain types of users)


What state is maintained by the home node?

Per page (in DSM), the home node tracks:
• whether the page has been accessed
• whether it's been modified
• caching enabled/disabled
• locked or not


What are the NFS Semantics?

• session based when non-concurrent
• additionally uses periodic updates
• 3 seconds for files
• 30 seconds for directories
• Uses lease-based locking - the client must renew or release the lease within the time period, or else the server breaks the lease and reclaims the lock
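The lease mechanism can be sketched as follows. This is a minimal illustration of the idea, not the real NFS protocol; the class and method names are assumptions, and `now` is passed explicitly to make the timing testable.

```python
import time

class Lease:
    """Client-side view of a lease: valid for lease_secs after the last
    grant/renewal; renewing an expired lease fails, since the server may
    already have reclaimed the lock."""
    def __init__(self, lease_secs=30):
        self.lease_secs = lease_secs
        self.granted_at = None

    def acquire(self, now=None):
        self.granted_at = now if now is not None else time.time()

    def valid(self, now=None):
        now = now if now is not None else time.time()
        return (self.granted_at is not None
                and now - self.granted_at < self.lease_secs)

    def renew(self, now=None):
        now = now if now is not None else time.time()
        if not self.valid(now):
            raise TimeoutError("lease expired; server may have reclaimed it")
        self.granted_at = now  # clock restarts on each renewal
```

This mirrors the semantics above: as long as the client keeps renewing within the window, it holds the lock; miss the window and the lock is gone.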