CS 6200 — final Flashcards

1

How does scheduling work?

* CPU scheduler chooses one of the ready tasks to run on the CPU

* run the CPU scheduler when...:
** the CPU becomes idle
** a new task becomes ready
** the running task's timeslice expires (timeout)

which task should be selected depends on the scheduling policy/algorithm

how the scheduling gets done depends on the runqueue data structure

2

What are the basic steps and data structures involved in scheduling a thread on the CPU?

threads may become ready and enter the ready Q:
-after an IO op they were waiting on has completed
-or after they were woken up from waiting on an interrupt
-or when thread is created (forked)
-or if thread was previously running and it got interrupted and its timeslice expired

then thread is dispatched onto the CPU
-perform a context switch
-enter user mode
-set PC
-and go!

then new thread starts executing

runqueue data structure: organizes ready tasks
-can be a queue: FIFO (FCFS), ordered (SJF)
-can be a tree (SJF)

-may need to know heuristics based on history of last times that thread ran on CPU (SJF+Preemption)

-may need to know thread priority (Priority Scheduling)
--use multiple runqueues: one for each priority level
--or tree ordered based on priority

runqueue data structures:
-if we want I/O and CPU bound tasks to have different timeslice values, then...
--same runqueue, check type [and apply different policies, 1 for CPU, 1 for I/O]
--or 2 different structures with different policies [one for CPU one for I/O]

...or multilevel feedback queues with different timeslices
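
a minimal sketch (my own, not lecture code) of the "one runqueue per priority level" idea above — rq_enqueue/rq_pick_next are made-up names:
```
/* one FIFO runqueue per priority level; FCFS within a level (illustrative sketch) */
#include <stddef.h>

#define NPRIO 4                   /* priority levels, 0 = highest */

struct task {
    int id;
    struct task *next;            /* link for the FIFO queue */
};

struct runqueue {
    struct task *head[NPRIO];
    struct task *tail[NPRIO];
};

/* a task becomes ready: append it to the tail of its priority's queue */
void rq_enqueue(struct runqueue *rq, struct task *t, int prio) {
    t->next = NULL;
    if (rq->tail[prio])
        rq->tail[prio]->next = t;
    else
        rq->head[prio] = t;
    rq->tail[prio] = t;
}

/* dispatch: scan from highest to lowest priority, take the first task found */
struct task *rq_pick_next(struct runqueue *rq) {
    for (int p = 0; p < NPRIO; p++) {
        struct task *t = rq->head[p];
        if (t) {
            rq->head[p] = t->next;
            if (rq->head[p] == NULL)
                rq->tail[p] = NULL;
            return t;
        }
    }
    return NULL;                  /* no ready task: CPU idles */
}
```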

3

What are the overheads associated with scheduling?

interrupt [the running task],
[run the] schedule[r] to pick which task is next,
context switch [from one process to another]

* keep ts_time >> context_switch_time

4

Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?

*timeslice scheduler*
+short tasks finish sooner
+more responsive
+lengthy I/O ops initiated sooner
-overhead
* keep ts_time >> context_switch_time

in general:
-tput: higher ts value == better tput
-avg wait t: lower ts value == better avg wait t
-avg comp t: higher ts value == better avg comp t

*CPU bound tasks:*
* prefer longer timeslices
* limits context switching
* keeps CPU utilization and throughput high

*I/O bound tasks:*
* prefer shorter timeslices
* I/O bound tasks can issue I/O ops earlier
* keeps CPU and device utilization high
* better user-perceived performance

5

Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…

tput: num_tasks/total_time
avg comp t: (T_1 + T_2 + T_3)/num_tasks
avg wait t: (WT_1 + WT_2 + WT_3)/num_tasks

*FCFS*
ex: T1=1s, T2=10s, T3=1s
tput = 3/12s = 0.25 tasks/s
avg comp t = (1+11+12)/3 = 8s
avg wait t = (0+1+11)/3 = 4s

*SJF*
ex: T1=1s, T2=10s, T3=1s
tput = 3/(1+1+10)s = 0.25 tasks/s
avg comp t = (1+2+12)/3 = 5s
avg wait t = (0+2+1)/3 = 1s

*SJF + Preemption*

*Priority Scheduling*
ex: T1: {arr_t=5s, exec_t=3s, prio=1}, T2: {arr_t=3s, exec_t=4s, prio=2}, T3: {arr_t=0s, exec_t=4s, prio=3}
tput = 3/11s ≈ 0.27 tasks/s (total time = 3+4+4 = 11s; prio 1 = highest, consistent with the completion times below)
avg comp t = (8+10+11)/3 ≈ 9.7s
avg wait t: still meaningful with preemption — count all the time a task spends in the ready queue, not just the time until it first runs: T1 waits 0s, T2 waits 3s (preempted 5s-8s), T3 waits 7s (preempted 3s-10s) => avg wait t ≈ 3.3s

*Round Robin Scheduling*
-same as FCFS

*Round Robin with Priorities*

*Round Robin with Interleaving*
ex: ts = 1s; T1: {arr_t=0s, exec_t=1s}, T2: {arr_t=0s, exec_t=10s}, T3: {arr_t=0s, exec_t=1s}
tput = 3/12s = 0.25 tasks/s
avg comp t = (1+12+3)/3 = 5.33s
avg wait t = (0+1+2)/3= 1s

*I/O and CPU bound tasks*
-ex: ts = 1s; T1: {CPU bound, exec_t=10s, ctx_switch_t=0.1s}, T2: {I/O bound, exec_t=10s, ctx_switch_t=0.1s, I/O ops issued every 1s}
tput = 2/(10+10+19*0.1) = 0.091 tasks/s
avg wait t = (0 + (1+0.1))/2 = 0.55s
avg comp t = (((1+0.1)*18+1)+((1+0.1)*19+1))/2 = 21.35s
-ex: ts = 5s; T1: {CPU bound, exec_t=10s, ctx_switch_t=0.1s}, T2: {I/O bound, exec_t=10s, ctx_switch_t=0.1s, I/O ops issued every 1s}
tput = 2/(24.3) = 0.082 tasks/s
avg wait t = (0+5.1)/2 = 2.55s
avg comp t = (11.2+24.3)/2 = 17.75s
timeline for the ts = 5s example:
T1: [5] 5.1 [10] 11.2 (11.3)
T2: [1] 6.2 [2] 12.3 IO0.5 12.8 [3] 13.8 IO0.5 14.3 [4] 15.3 IO0.5 15.8 [5] 16.8 IO0.5 17.3 [6] 18.3 IO0.5 18.8 [7] 19.8 IO0.5 20.3 [8] 21.3 IO0.5 21.8 [9] 22.8 IO0.5 23.3 [10] 24.3
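
a quick sanity check (a sketch of mine, not from the lecture) that recomputes the FCFS numbers above (T1=1s, T2=10s, T3=1s, all arriving at t=0; wait = time spent in the ready queue before running):
```
#include <stdio.h>

int main(void) {
    double exec[] = {1, 10, 1};      /* execution times in FCFS order */
    int n = 3;
    double t = 0, total_comp = 0, total_wait = 0;

    for (int i = 0; i < n; i++) {
        total_wait += t;             /* task waits until everything before it finishes */
        t += exec[i];                /* task completes at time t */
        total_comp += t;
    }
    printf("throughput     = %.2f tasks/s\n", n / t);        /* 3/12 = 0.25       */
    printf("avg completion = %.2f s\n", total_comp / n);     /* (1+11+12)/3 = 8   */
    printf("avg wait       = %.2f s\n", total_wait / n);     /* (0+1+11)/3 = 4    */
    return 0;
}
```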

6

Do you understand the motivation behind the multi-level feedback queue, why different queues have different timeslices, how do threads move between these queues…

CPU-bound tasks prefer longer timeslices (because it minimizes context switching and keeps CPU utilization and throughput high) but I/O-bound tasks prefer shorter timeslices (because the I/O tasks get to execute sooner and it keeps CPU utilization, device utilization, and user-perceived performance high).

We can make them both happy by using a multi-level feedback queue (MLFQ) (like a queue of queues). The most I/O intensive tasks go in the highest priority queue, which has the shortest timeslice value; and the most CPU intensive tasks go in the lowest priority queue, which has the longest timeslice value.

+timeslicing provides benefits for I/O-bound tasks
+timeslicing avoids overheads for CPU-bound tasks

flow:
1. tasks enter topmost Q
2a. if task yields voluntarily: good choice, keep task at this level
2b. if task uses up entire timeslice: push down to lower level
3. task in lower Q gets a priority boost when releasing CPU due to I/O waits

MLFQ != Priority Queues
* different treatment of threads at each level
* [incorporates] feedback
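
a tiny sketch of the movement rules above (illustrative only; level counts and function names are made up):
```
#define NLEVELS 3                    /* level 0: shortest timeslice, highest priority */

struct mlfq_task {
    int level;                       /* which queue the task currently sits in */
};

/* 2b. task used its entire timeslice => treat as CPU bound => push down one level */
void on_timeslice_expired(struct mlfq_task *t) {
    if (t->level < NLEVELS - 1)
        t->level++;
}

/* 2a. task yielded voluntarily before the timeslice ended => keep it at this level */
void on_voluntary_yield(struct mlfq_task *t) {
    (void)t;                         /* good choice of queue, nothing to change */
}

/* 3. task in a lower queue released the CPU to wait on I/O => boost it up a level */
void on_io_wait(struct mlfq_task *t) {
    if (t->level > 0)
        t->level--;
}
```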

7

Can you contrast this [multi-level feedback queue] with the O(1) scheduler?

O(1) scheduler:
== constant time to select/add task, regardless of task count
-preemptive, priority-based
--realtime (0-99)
--timesharing (100-139)
---User processes:
----default=120, but can be adjusted with "nice value" (syscall) to be between -20 and 19

* O(1) scheduler is similar to [borrows from] MLFQ scheduler:
-associates different timeslice values with different priority levels
-uses some feedback of how task behaved in the past to determine how to adjust its priority level in the future
* O(1) scheduler is different from MLFQ scheduler:
-how it assigns the timeslice values to priorities
--depends on priority
--smallest for low priority
--highest for high priority
-how it uses the feedback
--sleep time: waiting/idling
--longer sleep=> interactive
---priority - 5 (boost)
--smaller sleep => compute intensive
---priority + 5 (lowered)

O(1) scheduler runqueue == 2 arrays of tasks:
-active [array]:
--used to pick next task to run
--constant time to add/select
--tasks remain in queue in active array until timeslice expires
-expired [array]:
--inactive list
--when no more tasks in active array => swap active and expired
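
rough sketch of the active/expired idea (heavily simplified: only a few levels instead of 140, no bitmap; names are invented):
```
#define NPRIO 4                      /* the real O(1) scheduler has 140 priority levels */

struct task;                         /* opaque here */

struct prio_array {
    struct task *queue[NPRIO];       /* one FIFO list per priority level */
};

struct o1_runqueue {
    struct prio_array arrays[2];
    struct prio_array *active;       /* tasks that still have timeslice left */
    struct prio_array *expired;      /* tasks whose timeslice ran out */
};

/* when the active array has no runnable tasks left, swap the two pointers: O(1) */
void swap_arrays(struct o1_runqueue *rq) {
    struct prio_array *tmp = rq->active;
    rq->active = rq->expired;
    rq->expired = tmp;
}
```
(the real scheduler also keeps a bitmap of non-empty priority levels, so finding the highest-priority runnable task is constant time as well)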

8

Do you understand what were the problems with the O(1) scheduler which led to the CFS?

workloads changed (apps in the Linux env were becoming more time-sensitive):
-O(1) scheduler significantly impacted performance of interactive tasks
-jitter introduced by O(1) scheduler was becoming unacceptable
-fairness [O(1) doesn't make any fairness guarantees]

*what was the main reason why the O(1) scheduler was replaced by the CFS scheduler?
because interactive tasks could wait unpredictable amounts of time to be scheduled

*CFS*
-runqueue == Red-Black Tree
--ordered by "vruntime"
--vruntime == time spent on the CPU
-always pick leftmost node (the task that has the least amount of time on the CPU)
-periodically increment vruntime
-compare to leftmost vruntime
--if currently running task has smaller runtime than what's in leftmost node of tree, keep running
--if currently running task has larger runtime than what's in leftmost node of tree, preempt the currently running task and place it appropriately in the tree
-vruntime progress rate depends on priority and niceness
--rate faster for low-priority
--rate slower for high-priority
--same tree for all priorities
-performance:
--select task [from runQ] = O(1)
--add task [to runQ] = O(logN)
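
a rough sketch of the vruntime bookkeeping (the weight scaling here is made up; real CFS uses a nice-to-weight table and keeps tasks in a red-black tree ordered by vruntime):
```
struct cfs_task {
    unsigned long vruntime;          /* virtual time spent on the CPU */
    unsigned int weight;             /* higher weight = higher priority */
};

/* charge the running task for 'delta' ns of CPU time: low-weight (low-priority)
 * tasks accumulate vruntime faster, so they get preempted sooner */
void account_runtime(struct cfs_task *curr, unsigned long delta) {
    curr->vruntime += delta * 1024 / curr->weight;
}

/* preempt if the leftmost (smallest-vruntime) task has now had less CPU than current */
int should_preempt(struct cfs_task *curr, struct cfs_task *leftmost) {
    return curr->vruntime > leftmost->vruntime;
}
```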

9

Thinking about Fedorova’s paper on scheduling for chip multi processors, what’s the goal of the scheduler she’s arguing for?
What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput.

goal:
-schedule a mix of CPU-bound and memory-bound threads on the chip multiprocessor so that both the CPU pipelines and the memory system stay well utilized => maximize overall system throughput
-the paper asks: is CPI (cycles per instruction) a good metric for telling these workloads apart? (if so, hoping that hardware engineers will add CPI counters to future hardware archs)

performance counters:
* CPI (cycles per instruction) / IPC (instructions per cycle = 1/CPI); max IPC = 1 per (single-issue) core
** memory bound: high CPI (low IPC)
** CPU bound: CPI of ~1 (low CPI, IPC near 1)
** if doing the experiment on a machine with 4 cores: then best possible aggregate IPC = 4
* NOT sleep time, bc the thread is not sleeping when waiting on memory, and bc software takes too much time to compute it
* hardware counters:
** L1, L2, LLC [last-level cache] misses
** IPC
** power and energy data
* software interface and tools:
** e.g. oprofile, Linux perf tool, etc. see oprofile website for hw counters avail on diff archs
* from hardware counters: (g)estimate what kind of resources a thread needs
** scheduler can make informed decisions
---typically multiple counters
---models with per arch thresholds
---based on well-understood workloads

10

How does the OS map the memory allocated to a process to the underlying physical memory?

*page-based memory mgmt*
* pages: virtual addr space subdivided into fixed-size chunks ("pages")
* page frames: physical memory is divided into "page frames" of the same size
* allocate: map pages -> page frames
* arbitrate: via page tables

*segment-based memory mgmt*
* segments: more flexibly sized (instead of fixed-size pages)
* allocate: segments
* arbitrate: segment registers

MMU (memory mgmt unit)
-translates virtual to physical addresses
-reports faults: illegal accesses (the addr it's requesting hasn't been allocated at all), permission errors, page not present in memory [and must be fetched from disk]

Registers:
-pointers to page table
-base and limit size, number of segments...

Cache Translation-Lookaside Buffer (TLB):
-contains valid VA-to-PA translations

Translation:
-the actual translation process occurs in the hardware

for each VA: an entry in PT is used to determine PA
-only 1st portion of VA "virtual page number (VPN)" is used to index into page table, rest of VA is offset
--produces "physical frame number (PFN)" which is "physical address" of physical frame in DRAM
-PFN needs to be summed with offset from VA to produce the actual PA, which can be used to reference the appropriate location in physical memory
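
a worked version of the VPN/offset split, assuming 4 kB pages (12-bit offset); a sketch only — real translation also checks valid/permission bits:
```
#include <stdint.h>

#define PAGE_SHIFT 12                                   /* 4 kB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

uint32_t vpn(uint32_t va)    { return va >> PAGE_SHIFT; }        /* index into the page table */
uint32_t offset(uint32_t va) { return va & (PAGE_SIZE - 1); }    /* byte within the page      */

/* PA = PFN shifted into place, plus the unchanged offset */
uint32_t phys_addr(uint32_t pfn, uint32_t va) {
    return (pfn << PAGE_SHIFT) + offset(va);
}
```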

11

What happens when a process tries to access a page not present in physical memory?

the MMU generates a fault because the page is not present in physical memory (because page table map entry valid bit=0) and must be fetched from disk.

ex:
1. we have an array and "we have already allocated memory to it" but we have not actually accessed it before (so the array's address in the page table is blank), so the OS has not actually allocated physical memory for it yet
2. the first time we access this memory, the OS will realize that there isn't physical mem that corresponds to that range of virtual mem addrs, so it'll take a page of physical mem that is free (ex: P2) and establish a mapping btwn the virtual addr (ex: V_k) and the physical addr of the page (ex: P2)
-physical mem for the array is only allocated when the proc first tries to access it, during the initialization routine ("allocation on first touch")
3. as long as a valid addr is being accessed, then on-fault, the mapping will be re-established btwn the valid VA and a valid location in physical mem

12

What happens when a process tries to access a page that hasn’t been allocated to it?

the MMU generates a fault because this is an illegal access (the addr hasn't been allocated to the process)

MMU relies on the bits in the page-table entry to establish the validity of the access.
If the hardware determines that a physical memory access cannot be performed, it causes a page fault.
->
If this happens, then the CPU places an error code on the kernel stack and generates a trap into the kernel.
->this invokes the page fault handler, which determines what action should be taken based on the error code;
-- page fault handler == determines action based on error code and faulting addr
-error code will contain info about whether the fault was caused bc the page was not present (so the page needs to be brought in from disk to mem) or whether there's a protection error (SIGSEGV)

13

What happens when a process tries to modify a page that’s write protected/how does COW work?

the MMU generates a fault because it does not have adequate permissions for the access

COW == Copy On Write
-another useful optimization/service that MMU hardware can perform
*basically: when processes are created, we don't copy over ALL of the pages for the new process; we let the different processes use the same orig pages, and only actually make a copy if one of the processes issues a write-request for one of the 'shared' pages


-on process creation:
--copy entire parent addr space
--many pages are static (don't change!)
---why keep multiple copies?
-[in order to avoid unnecessary copying,] on creation, the virtual addr space of the new proc [or portions of it, at least] will be mapped to the original page that had the original addr space content
--the same PA can be referred to by 2 completely different VAs from the 2 processes
--also need to make sure you write-protect the physical memory so you can track concurrent accesses to it

-if contents of orig page will just be read-only, then you save [$] memory AND time (that you would otherwise need to copy it)

-if someone issues write request for orig page, then MMU will detect that page is write-protected and generate page fault
--at that point, the OS will see the reason for the page fault and _then_ it'll create the copy, then updates the PTs of the 2 proc's as necessary; so just copies the pages that need to be updated
--copy cost is only paid when we need to do a write

14

How do we deal with the fact that processes address more memory than physically available?

...
-Apps use virtual memory; virtual addresses translate to physical addresses.

-OS must be able to allocate physical memory and arbitrate how its being accessed

*page-based memory mgmt*
-allocate:
--allocation [OS incorporates certain mechanisms/algorithm and data structures so it can keep track of how physical mem is used and what physical memory is free]
---map *pages* from virtual memory to *page frames* of physical memory
--replacement [since physical mem is smaller than virtual mem, some of the contents that are needed in the virtual addr space are likely not present in physical mem (they might be stored on some secondary storage, like on disk); so the OS has to have a mechanism to replace the contents of what's in physical memory with content that's on some temp storage: it has to determine when contents should be brought in from disk and which contents from mem should be stored on disk]
-arbitrate:
--addr translation and validation [btwn virtual addr to phys addr and validate it to ensure that its a legal access]

*segment-based mem mgmt*
-allocate: segments
-arbitrate: segment registers

-process doesn't use entire address space
-even on 32-bit arch, will not always use all of 4 GB
-but page table assumes an entry per VPN regardless of whether corresponding virtual mem is needed or not

Hierarchical page tables...

15

What’s demand paging?

demand paging:
-pages swapped in/out of memory and a swap partition (e.g. on disk)
[because virtual memory is much much larger than physical memory, the virtual memory page is not always in the physical memory; so the physical page frame is saved and restored to and from secondary storage]

16

How does page replacement work?

When a page is not present in memory, it has its present bit in the page table entry that's set to 0.
1. when there is a reference to that page, the MMU raises an exception (page fault)...
2. ...which causes a trap into the kernel. The kernel can determine that the exception is a page fault, and that it had previously swapped out this memory page onto disk, and can establish what is the correct disk access that needs to be performed...
3. ...and it will issue an I/O op to retrieve the page
4. once the page is brought into memory, the OS will determine a free frame where the page can be placed...
5. and can then use the PFN of that frame to update the PTE that corresponds to the VA of that page.
6. then control is pushed back to the process that caused the reference, and the PC will be restarted with the same instructions....

...except this time through, the page-table lookup will find a valid entry with a ref to that particular physical location (no fault is raised)

*original PA != PA after swap*
*if you do need the page to be constantly present in memory or if it has to maintain the same PA, then you can use page "pinning", which disables swapping (can be useful for devices that support DMA)

17

How does address translation work?

the actual translation process is done by the hardware.
the OS maintains certain info and data structures (such as page tables) thats necessary for translation process, but hardware performs the actual translation

for each VA: an entry in PT is used to determine PA
-only 1st portion of VA "virtual page number (VPN)" is used to index into page table, rest of VA is offset
--produces "physical frame number (PFN)" which is "physical address" of physical frame in DRAM
-PFN needs to be summed with offset from VA to produce the actual PA, which can be used to reference the appropriate location in physical memory

18

What’s the role of the TLB?

Cache-Translation Lookaside Buffer (TLB): a cache that contains valid virtual addr to physical addr translations
-MMU level address translation
-on TLB miss: page table access from memory
-has protection/validity bits
-small number of cached addrs => still a high TLB hit rate, thanks to temporal and spatial locality

-on x86 Core i7:
--per core: 64-entry data TLB, 128-entry instruction TLB
--512-entry shared second-level TLB

19

Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table…

*page-based memory mgmt*
* virtual addr space is subdivided into fixed-size chunks called "pages"
* physical memory is divided into "page frames" of same size

*segment-based memory mgmt*
* uses more flexibly sized "segments" (instead of fixed size pages)

size of pages in virtual memory == size of corresponding page frames in physical memory

for each VA: an entry in PT is used to determine PA
-only 1st portion of VA "virtual page number (VPN)" is used to index into page table, rest of VA is offset
--produces "physical frame number (PFN)" which is "physical address" of physical frame in DRAM
-PFN needs to be summed with offset from VA to produce the actual PA, which can be used to reference the appropriate location in physical memory

OS creates/maintains a page table for every process that it runs
-so on ctx switch: OS needs to make sure it switches to page table of newly switched proc

* to calculate the size of a page table, we know that a page table has a certain # of entries, which is equal to the number of VPNs that exist in a VA space
* for each one of these entries we know that page table needs to hold the PFN and some other info like permission bits

ex: 32-bit arch:
* each PTE (page table entry) = 4 B, including PFN and flags
* total # of PTEs depends on total # of VPNs: 2^32 / page_size
* total # of VPNs you can have depends on size of VAs and on size of page itself
** ex: 32-bit PAs, 32-bit VAs, 4 kB page size:
*** total # of VPNs = 2^32/page_size = 2^32/2^12 = 2^20 (~1M entries) => page table size = 2^20 * 4 B = 4 MB


32-bit arch:
* PTE = 4 bytes (including PFN and flags)
* # of VPNs = 2^32/page_size
* with page_size = 4 kB (others: 8 kB, 2 MB, 4 MB, 1 GB...): page table size = (2^32/2^12) * 4 B = 4 MB per process

64-bit arch:
* PTE = 8 bytes (including PFN and flags)
* # of VPNs = 2^64/page_size
* with page_size = 4 kB: page table size = (2^64/2^12) * 8 B = 32 PB (!) per process

How large is a page?

_from examples so far:_
* 10-bit offset => 2^10 = 1024 addressable bytes per page => 1 kB page size
* 12-bit offset => 2^12 = 4096 addressable bytes per page => 4 kB page size (2^12 = 2^10 * 2^2 = 4 * 1 kB)
** the offset determines the total # of bytes that can be addressed within a page
*** therefore the offset determines the page_size in those examples

_in real systems:_
* Linux/x86: 4 kB (regular pages); 2 MB (large pages, 21-bit offset, reduction factor on page table size = 2^9 = x512 vs 4 kB pages); 1 GB (huge pages, 30-bit offset, reduction factor = 2^18 ≈ x262k vs 4 kB pages)
* Solaris/SPARC: 8kB, 4MB, 2GB

20

Do you understand the benefits of hierarchical page tables?

It helps reduce the space requirements for a page table (compared to the flat, single-level PT design: [page_table_size (per process) = VPN * PTE; VPN = size_of_addresses/page_size; how you end up with needing 32PB per proc on 64-bit arch with 8B PTE and 4kB pagesize] ).

Hierarchical Page Tables:
-outer page table or top page table == page table directory
-internal page table == only for *valid* virtual memory regions
--so it doesn't allocate internal page tables for any holes in VM space
-on malloc: a new internal page table may be allocated

each inner page table addresses 2^10 * page_size of virtual memory (e.g. 2^10 entries * 1 kB pages = 1 MB per inner table)
-dont need an inner page table for each VM gap
--vs single-level PT design: where the page table has to be able to translate every single VA and has entries for every single VPN

Additional layers (further extension):
-page table directory pointer (3rd level)
-page table directory pointer map (4th level)
--important on 64-bit archs because the addr space is larger and [processes' addr spaces tend to be] more sparse
---larger gaps could save more internal page table components

Multi-level PT tradeoffs:
+smaller internal page tables/directories
+granularity of coverage
++potentially reduced page table size
-more memory accesses required for translation
--increased translation latency

21

For a given address format, can you workout the sizes of the page table structures in different layers?

one worked example (assuming a 2-level format like the lecture's: 32-bit VA split as | P1: 12 bits | P2: 10 bits | offset: 10 bits |):
-offset = 10 bits => page size = 2^10 = 1 kB
-P2 = 10 bits => each inner page table has 2^10 entries => covers 2^10 * 1 kB = 1 MB of virtual memory
-P1 = 12 bits => the outer page table (directory) has 2^12 entries, one per inner table
-with 4 B per entry: each inner table = 2^10 * 4 B = 4 kB; outer directory = 2^12 * 4 B = 16 kB; inner tables are only allocated for virtual memory regions that are actually valid
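
and the corresponding index extraction in code (same assumed 12|10|10 split; helper names are mine):
```
#include <stdint.h>

#define OFFSET_BITS 10
#define INNER_BITS  10

uint32_t outer_index(uint32_t va) { return va >> (OFFSET_BITS + INNER_BITS); }           /* top 12 bits   */
uint32_t inner_index(uint32_t va) { return (va >> OFFSET_BITS) & ((1u << INNER_BITS) - 1); }
uint32_t page_offset(uint32_t va) { return va & ((1u << OFFSET_BITS) - 1); }             /* bottom 10 bits */
```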

22

For processes to share memory, what does the OS need to do?

-processes read and write to shared memory region
-OS establishes shared channel between the processes
1. physical pages are mapped into virtual addr space
2. VA (P1) and VA (P2) map to the same physical addr
3. VA (P1) != VA (P2)
4. physical memory doesn't need to be contiguous

CPU has to map memory into target addr space
CPU has to copy data to a channel
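
a minimal sketch of setting up such a channel with the POSIX shared-memory API (the name "/my_channel" and the size are made up; error handling omitted); each process that calls this gets its own virtual address for the same physical pages:
```
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/my_channel"
#define SHM_SIZE 4096

void *open_channel(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);   /* named shared memory object */
    ftruncate(fd, SHM_SIZE);                                /* size the region            */
    /* map the same physical pages into this process's (different) virtual addresses */
    void *addr = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                              /* mapping stays valid        */
    return addr;
}
```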

23

Do they use the same virtual addresses to access the same memory?

no, VA of P1 != VA of P2

24

For processes to communicate using a shared memory-based communication channel, do they still have to copy data from one location to another?

yes, the CPU is used to copy the data into the channel when necessary, but in this case there are no user-kernel level switches required (unlike message-based)

25

What are the costs associated with copying vs. (re-/m)mapping?

Copy (Messages) vs Map (Shared Memory)
*BOTH: Goal: transfer data from one into target address space
*COPY:
-CPU cycles to copy data to/from port
--large data: t(copy) >> t(map)
*MAP:
-CPU cycles to map memory into addr space (mapping is costly operation)
-CPU to copy data to a channel
++set up once, use many times -> good payoff
++can perform well for one-time use

eg: tradeoff exercised in Windows "Local" Procedure Calls (LPC) [take advantage of this difference]:
-if data that needs to be transferred is < a certain threshold, then data is copied in and out of comm channel via a port-like iface
-else data is potentially copied once to make sure that its in a page-aligned area; then area is mapped into addr space of target proc

26

What are the tradeoffs between message-based vs. shared-memory-based communication?

*Message-Passing*:
+simplicity: kernel does channel management and synchronization
-overheads
* ex: pipes, message queues, sockets

*Shared-Memory IPC*:
+system calls only for setup
+data copies potentially reduced (but not eliminated)
-explicit synchronization
-communication protocol
-shared buffer management
-[all of the above are]...=>programmer responsibility

message passing: must perform multiple copies
shared mem: must establish all mappings among processes' address spaces and shared memory pages

27

What are different ways you can implement synchronization between different processes (think of what kinds of options you had in Project 3).

"like threads accessing a shared state in a single space ...but for processes"
Synchronization Method...
1. mechanisms supported by process threading library (pthreads)
2. OS-supported IPC for synchronization
Either method must coordinate...
-number of concurrent accesses to a shared segment
-when data is available and ready for consumption

*Pthreads Sync for IPC*:
-PTHREAD_PROCESS_SHARED;
-**synchronization data structures must be shared**
--pthread_mutexattr_t
--pthread_condattr_t

*Message Queues*:
-implement "mutual exclusion" via send/recv

*Semaphores*:
-binary semaphore <=> mutex
-if value == 0 -> stop/blocked
-if value == 1 -> decrement (lock) and go/proceed
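
a sketch of the PTHREAD_PROCESS_SHARED setup (struct/function names are mine): the key point is that the mutex itself must live inside the shared segment:
```
#include <pthread.h>

struct shm_data {
    pthread_mutex_t lock;      /* must be inside the shared memory region */
    int buffer[64];
};

void init_shared_mutex(struct shm_data *shm) {   /* shm points into the shared segment */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shm->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    /* pthread_condattr_setpshared() works the same way for condition variables */
}
```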

28

To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?

-sync mechanisms require low level support from hardware to guarantee the correctness of the sync construct
-hardware provides this low level support via atomic instructions

ALL sync methods require support from underlying hardware to atomically make updates to a memory location
-this is the only way they can actually guarantee:
--that the lock is properly acquired
--and that the state change is performed in a way that is safe
--and that it doesn't lead to any race conditions
--and that all threads in the system are in agreement about what exactly is the current state of the sync construct

examples of sync mechanisms:
-mutexes
-semaphores
-spinlocks
-R/W locks
-monitors
-serializers
-path expressions
-barriers
-rendezvous points
-optimistic wait-free sync (ex: RCU - read-copy-update)

atomic instructions == critical section with hardware-supported synchronization
Guarantees:
-atomicity
-mutual exclusion
-queue all concurrent instructions but one

examples of atomic instructions:
hardware-specific:
-test_and_set
-read_and_increment
-compare_and_swap
--specialize/optimize to available atomics
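
one way to see the hardware support from C: a test-and-set style spinlock built on C11 atomics (a sketch; a real lock would add backoff, and the flag must be initialized with ATOMIC_FLAG_INIT):
```
#include <stdatomic.h>

typedef atomic_flag spinlock_t;              /* declare as: spinlock_t l = ATOMIC_FLAG_INIT; */

void spin_lock(spinlock_t *l) {
    /* atomic_flag_test_and_set atomically sets the flag and returns the previous value,
       i.e. the C11 equivalent of a hardware test_and_set instruction */
    while (atomic_flag_test_and_set(l))
        ;                                    /* spin: lock was already busy */
}

void spin_unlock(spinlock_t *l) {
    atomic_flag_clear(l);                    /* mark the lock free */
}
```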

29

Why are spinlocks useful?

Would you use a spinlock in every place where you’re currently using a mutex?

spinlocks (basic sync construct, but also used to create more complex sync constructs)
-similar to a mutex:
--mutual exclusion
--lock and unlock (free)
-BUT
--when lock == busy: the thread trying to acquire it is SPINNING: running on the CPU and repeatedly checking to see if the lock becomes free
---(vs. with mutexes the thread is blocked: it relinquishes the CPU and allows another thread to run)

no [would not want to use a spinlock everywhere you're using a mutex ...because sometimes you don't want the thread to keep spinning on the CPU, would rather have it block and relinquish the CPU for another thread to run]

30

Do you understand why is it useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?

Limitations of mutexes and condition variables:
-error prone/correctness/ease of use
--unlock wrong mutex, signal wrong cond var
-lack of expressive power
--helper vars for access or priority control

sometimes its useful to distinguish btwn different types of accesses:
* read (never modify): shared access
* write (always modify): exclusive access
-use RWlocks:
--specify the type of access, then lock behaves accordingly

prev constructs require developers to pay attention to the use of pair-wise operations (lock/unlock, wait/signal, etc.): a significant cause of errors
-can use monitors to help; monitors specify:
--shared resource
--entry procedure
--possible condition variables
-monitors == high-level synchronization construct
-monitors == programming style

* R/W locks allow programmer to distinguish btwn types of accesses
* Monitors are high-level construct that allow programmer to not have to pay attention to pair-wise operations (less error prone?)

31

Can you work through the evolution of the spinlock implementations described in the Anderson paper, from basic test-and-set to the queuing lock? Do you understand what issue with an earlier implementation is addressed with a subsequent spinlock implementation?

-alternative implementations of spinlocks
-generalize techniques to other constructs


*test-and-set:*
```
spinlock_lock(lock):             // spin until free
    while (test_and_set(lock) == busy);
```
* atomically returns (tests) original value and sets new value = 1 (busy)
* 1st thread test_and_set(lock) => 0: free
* next ones: test_and_set(lock) => 1: busy
** reset lock to 1 (busy), but that's OK
+Latency: minimal (just atomic)
+Delay: potentially minimal (spinning continuously on atomic)
--Contention: processors go to memory on each spin (spinning on atomic is really bad!!!)


*test-and-test-and-set:*
```
// test (cache), test_and_set (memory)
// spins on cache (lock == busy)
// atomic test_and_set only if freed
spinlock_lock(lock):
    while ((lock == busy) OR (test_and_set(lock) == busy));
```
* just tests the cached value of the lock first to see if it's busy; doesn't try the atomic part of the while predicate until its cached value says the lock is free
+Latency: ... OK
+Delay: ... OK
--Contention: ...better but...:
---NCC (non-cache-coherent): makes no difference whatsoever
---CC-WU (cache-coherent with write-update): OK, although all processors will try to execute the test_and_set op at the same time
---CC-WI (cache-coherent with write-invalidate): horrible! bc the write just invalidates the cached values of the lock, so then all the processors have to go fetch the new value from memory
* PROBLEM SOLVED FROM PREV MODEL (test_and_set): in prev. model, it was constantly spinning on the atomic test_and_set; now we're spinning on the cached value first


*delay after lock release:*
```
spinlock_lock(lock):
    while ((lock == busy) OR (test_and_set(lock) == busy)) {
        // failed to get lock
        while (lock == busy);
        delay();
    }
```
* everyone sees lock is free
* not everyone attempts to acquire it
+Latency: ...OK
-Delay: ...much worse
+Contention: ...improved
* PROBLEM SOLVED FROM PREV MODEL (test_and_test_and_set): in prev. model, after lock was freed, everyone was trying to issue the atomic operation at the same time; now we're not all trying to issue the atomic at the same time


*delay after each lock reference*
```
spinlock_lock(lock):
    while ((lock == busy) OR (test_and_set(lock) == busy)) {
        delay();
    }
```
* doesn't spin constantly
* works on NCC architectures
* but can hurt delay even more
+Latency: ...OK
--Delay: ...*much worse*
+Contention: ...improved
* PROBLEM SOLVED FROM PREV MODEL (test_and_test_and_set): prev. model did not make a difference on NCC architectures


*queueing lock:*
```
init:
    flags[0] = has_lock;
    flags[1...p-1] = must_wait;
    queuelast = 0;                           // global var

lock:
    myplace = r&inc(queuelast);              // get ticket
    // spin on your own flag
    while (flags[myplace mod p] == must_wait);
    // now in critical section
    flags[myplace mod p] = must_wait;        // reset this slot for future reuse

unlock:
    flags[(myplace + 1) mod p] = has_lock;   // hand the lock to the next ticket holder
```
* set unique ticket for arriving thread
* assigned queue[ticket] is a private lock
* enter critical section when you have lock
* signal/set next lock holder on exit
--Latency: more costly r&inc [read and increment]
+Delay: directly signal next CPU/thread to run
+Contention: better! but requires CC and cacheline aligned elements
** DOWNSIDES:
--assumes arch has a read_and_increment atomic
--requires O(N) size
* PROBLEM SOLVED FROM PREV MODEL[S]:
** (test_and_set, test_and_test_and_set??, delay after lock release, delay after each lock reference): in prev. models, everyone sees that the lock is free at the same time
*** only 1 CPU/thread sees that the lock is free and tries to acquire the lock!
** (test_and_set, test_and_test_and_set): in prev. models, everyone tries to acquire lock at the same time once lock is freed

32

What are the steps in sending a command to a device (say packet, or file block)?
What are the steps in receiving something from a device?

user process->
[send data / read file]
->kernel->
[form pkt / determine disk block]
->driver->
[write TX request record / issue disk head move/read]
->device->
[perform transmission -> ethernet wifi card / read block from disk ->disk]
*receiving results or events from device just goes in reverse order*


1. process that requires access to device performs a system call where it specifies appropriate operation, process is requesting the operation from the kernel
2. OS runs in-kernel stack related to the particular device; (ex: the TCPIP stack to form the appropriate packet before the data is sent out to the network; or ex: the file system that's needed to determine the particular disk block that stores the file data)
3. OS invokes the appropriate device driver for the network or block device for the disk
4. then the device driver performs the configuration of the request to the device (ex: on the network side, the device driver will write out a record that configures the device to perform a transmission of the appropriate packet of data; or on the disk side, the device driver will issue certain commands to the disk that configure the disk head movement or where the data should be read from, etc)
5. [once the device is configured] the device performs the request (ex: network card performs the transmission of data; or disk device performs the block read operation from disk)
[6. any results from the device, or in general events originating on the devices, will traverse this call chain in the reverse order]


ex PIO: NIC, data == network packet; when a process running on CPU wants to transmit some data in the form of a network packet
1. write command to [command registers on device to] request packet transmission
2. copy packet to data registers
3. repeat until packet sent
e.g. 1500 B packet, 8 B registers or bus:
->1 (for bus command) + 188 (for data)
->189 CPU store instructions

ex DMA: NIC, data == network packet; when a process running on CPU wants to transmit some data in the form of a network packet
1. write command to [command registers on device to] request packet transmission
2. configure DMA controller with in-memory address and size of packet buffer [the location of the data we want to move and the size of the data we want to move]
e.g. 1500 B packet, 8 B registers or bus:
->1 store instruction + 1 DMA configure
->fewer steps but DMA config is more complex

33

What are the basic differences in using programmed I/O vs. DMA support?

Programmed I/O (PIO)
* no additional hardware support
* CPU "programs" the device
** via command registers
** data movement
ex: NIC, data == network packet
1. write command to request packet transmission
2. copy packet to data registers
3. repeat until packet sent


Direct Memory Access (DMA)
* relies on DMA controller
* CPU "programs" the device
** via command registers
** via DMA controls
ex: NIC, data == network packet
1. write command to request packet transmission
2. configure DMA controller with *in-memory address and size of packet buffer*
->fewer steps but DMA config is more complex
* for DMAs: data buffer must be present in physical memory until transfer completes
->pinning regions (non-swappable)

***for devices like a keyboard: PIO is better because keyboard does not transfer much data per keystroke, because configuring DMA may be more complex than to perform one or two extra store instructions

***for devices like a NIC: it depends if PIO or DMA is better:
****PIO is better if: sending out small packets that require that we perform < [# of cycles to configure a DMA controller] store instructions to the device's data registers
****DMA is better if: sending out larger packets

34

For block storage devices, do you understand the basic virtual file system stack, the purpose of the different entities?

Block device stack
* processes use files -> logical storage unit
* kernel file system (FS)
** where, how to find and access file
** OS specifies interface
* generic block layer
** OS standardized block interface
* device driver

user process (file) ->
kernel (FS, POSIX API) ->
generic block layer (OS standardized block interface) ->
driver ->
device (protocol-specific API) ->
block device (typical storage for files)

Virtual File System:
file-system interface->
VFS interface-->
->local file system type 1-> disk
->local file system type 2->disk
->remote file system type 1->network

Virtual File System Abstractions
* file == elements on which the VFS operates
* file descriptors == OS representation of file
** open, read, write, sendfile, lock, close
* inode == persistent representation of file "index"; index of all disk blocks corresponding to a file
** list of all data blocks
** device permissions, size, ...
* dentry == directory entry, corresponds to single path component
** /users/ada => /, /users, /users/ada
*** this is useful bc if we need to find another file in the directory /users/ada then we don't need to go thru the entire path and try to reread the files that correspond to all of these directories in order to get to the directory ada and ultimately to find the next file that we are looking for
** dentry cache
* superblock == file system specific information regarding the FS layout

35

Do you understand the relationship between the various data structures (block sizes, addressing scheme, etc.) and the total size of the files or the file system that can be supported on a system?

*inodes*
e.g. 128 B inode, 4 B block ptr
=> 128 B / 4 B = 32 addressable blocks * 1 kB block
=> 32 kB file size

*inodes with indirect pointers*
e.g. 4 B block ptr, 1 kB blocks
* direct pointer: "points to data block"
** 2 disk accesses
=> 1 kB per entry
* indirect pointer: "points to block of pointers"
=> 256 kB per entry
* double-indirect pointer: "points to block of block of pointers"
** up to 4 disk accesses
=> 64 MB
1 kB block / 4 byte => 256 ptrs

*maximum file size:*
number_of_addressable_blocks * block_size
number_of_addressable_blocks = 12 (direct) + blocks_addressable_by_single_indirect + blocks_addressable_by_double_indirect + blocks_addressable_by_triple_indirect

an inode has the following structure (figure omitted: 12 direct block pointers, plus one single-, one double-, and one triple-indirect pointer); each block pointer is 4 B.

if a block on disk is 1 kB, what is the maximum file size that can be supported by this inode structure (nearest GB)?
1 kB -> 256 ptrs (12 + 256 + 256^2 + 256^3) * 1 kB
= 16 GB

what is the maximum file size if a block on disk is 8 kB (nearest TB)?
8 kB block / 4 B ptr size = 2k ptrs/block
(12 + 2k + 2k^2 + 2k^3) * 8 kB
= 64 TB
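
a small sketch that recomputes the two answers above for a given block size:
```
#include <stdio.h>

int main(void) {
    double block = 1024;                  /* 1 kB; change to 8192 for the 8 kB case */
    double ptrs  = block / 4;             /* 4 B per block pointer                  */
    /* 12 direct + single- + double- + triple-indirect addressable blocks */
    double blocks = 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;
    printf("max file size = %.1f GB\n", blocks * block / (1024.0 * 1024 * 1024));
    return 0;                             /* ~16 GB for 1 kB blocks; ~64 TB for 8 kB */
}
```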

36

For the virtual file system stack, we mention several optimizations that can reduce the overheads associated with accessing the physical device. Do you understand how each of these optimizations changes how or how much we need to access the device?

caching/buffering => reduce # of disk accesses
-buffer cache in main memory
-read/write from cache
-periodically flush to disk - fsync()

I/O scheduling => reduce disk head movement
-maximize sequential vs. random access
-e.g. write block 25, write block 17 => write 17, 25

prefetching => increases cache hits
-leverages locality
-e.g. read block 17 => also read 18, 19

journalling/logging => reduce random access (ext3, ext4)
-"describe" write in log: block, offset, value...
-periodically apply updates to proper disk locations

37

What is virtualization?

virtualization allows concurrent execution of multiple OSs and their applications on the same physical machine

virtual resources == each OS thinks that it "owns" hardware resources

virtual machine (VM) == OS and apps and virtual resources (guest domain)

virtualization layer == management of physical hardware (virtual machine monitor [VMM], hypervisor)

"A virtual machine... is an efficient, isolated duplicate of the real machine supported by a virtual machine monitor (VMM)."
1. provides *environment essentially identical with the original machine*
2. programs show *at worst only minor decrease in speed*
3. VMM is in *complete control of system resources*

goals of VMM:
* fidelity
* performance

benefits of virtualization:
+consolidation: decrease cost, improve manageability
+migration: availability, reliability
+security
+debugging
+support for legacy OSes

38

What’s the history behind it [virtualization]?

-originated at IBM in the 60s
-but hasn't been used ubiquitously since then because:
--mainframes were not ubiquitous
--other hardware was cheap

more recently, we've come around to virtualization because:
-servers were being underutilized
-data centers were becoming too large
-companies had to hire more system admins
-companies were paying high utility bills to run and cool servers

39

What’s hosted vs. bare-metal virtualization?

*bare metal:*
-AKA "hypervisor-based"
-type 1
-VMM (hypervisor) manages all hardware resources and supports execution of VMs
-privileged, service VM to deal with devices (and other config and mgmt tasks)
ex: Xen (open source or Citrix XenServer); ESX (VMware); Microsoft Hyper-V

*hosted:*
-type 2
-host OS owns all hardware
-special VMM module provides hardware interfaces to VMs and deals with VM context switching
ex: KVM (kernel-based VM); Fusion; VirtualBox; VMWare Player

40

What’s paravirtualization, why is it useful?

goal: performance; give up on unmodified guests

approach: paravirtualization == modify guest so that...
-it knows its running virtualized
-it makes explicit calls to the hypervisor (hypercalls)
-hypercall (~system calls):
--package context info
--specify desired hypercall
--trap to VMM
* e.g. Xen == opensource hypervisor (XenSource->Citrix)

memory virtualization with paravirtualization:
-guest aware of virtualization
-no longer strict requirement on contiguous physical memory starting at 0
-explicitly registers page tables with hypervisor
-can "batch" page table updates to reduce VM exits
-...other optimizations

paravirtualization helped to resolve some of the issues with executing the 17 privileged instructions in x86 architectures that did not trap with the trap-and-emulate method

41

What were the problems with virtualizing x86?

used to use trap-and-emulate model:
* processor virtualization:
** guest instructions are executed directly by hardware
** for privileged ops: trap to hypervisor; hypervisor can:
*** terminate VM if illegal op
*** emulate behavior that guest OS was expecting from hardware if legal op


with x86 pre 2005:
* 4 rings, no root/non-root modes yet; hypervisor in ring 0, guest OS in ring 1
***but 17 privileged instructions that do not trap! fail silently!***
--hypervisor doesn't know, so it doesn't try to change the settings
--OS doesn't know, so it assumes that the change was successful

* e.g. interrupt enable/disable bit in a privileged register; POPF/PUSHF instructions that access it from ring 1 fail silently
** as a result:
*** guest VM could not request interrupts enabled
*** guest VM could not request interrupts disabled
*** guest VM could not find out what is the state of the interrupts enabled/disabled

42

How does protection of x86 used to work and how does it work now?

[Hardware Protection Levels
* commodity hardware has more than 2 protection levels]

***how it used to work:***
* e.g. x86 has 4 protection levels (rings)
** ring 3: apps
** ring 1: OS
** ring 0: hypervisor

***how it works now:***
* e.g. x86 has 4 protection levels (rings) and 2 protection modes (root and non-root)
** non-root: VMs
*** ring 3: apps
*** ring 0: OS
** root:
*** ring 0: hypervisor
** in non-root mode, certain ops are not permitted
*** so if the guest VM tries to execute one of these instructions, it causes a VMexit trap, which switches it to root mode and passes control to the hypervisor; hypervisor completes the op and passes control back to non-root via VMentry

AMD Pacifica & Intel Vanderpool Technology (Intel-VT), ~2005
* 'close holes' in x86 ISA
* modes: root/non-root (or 'host' and 'guest' mode)
* VM Control Structure
** per VCPU; 'walked' by hardware
* extended page tables and tagged TLB with VM ids
* multiqueue devices and interrupt routing
* security and management support
Also: => additional instructions to exercise the above features

43

How were/are the virtualization problems on x86 fixed?

Binary Translation:
* main idea: rewrite the VM binary to never issue those 17 instructions
* goal: full virtualization == guest OS not modified
* approach: dynamic binary translation
1. inspect code blocks to be executed
2. if needed, translate to alternative instruction sequence
-e.g. to emulate desired behavior, possibly even avoiding trap
3. otherwise, run at hardware speeds
* cache translated blocks to amortize translation costs

Paravirtualization:
* goal: performance; give up on unmodified guests
* approach: paravirtualization == modify guest so that...
** it knows its running virtualized
** it makes explicit calls to the hypervisor (hypercalls)
** hypercall (~system calls):
*** package context info
*** specify desired hypercall
*** trap to VMM
* e.g. Xen == opensource hypervisor (XenSource->Citrix)


***how it works now:***
* e.g. x86 has 4 protection levels (rings) and 2 protection modes (root and non-root)
** non-root: VMs
*** ring 3: apps
*** ring 0: OS
** root:
*** ring 0: hypervisor
** in non-root mode, certain ops are not permitted
*** so if the guest VM tries to execute one of these instructions, it causes a VMexit trap, which switches it to root mode and passes control to the hypervisor; hypervisor completes the op and passes control back to non-root via VMentry

AMD Pacifica & Intel Vanderpool Technology (Intel-VT), ~2005
* 'close holes' in x86 ISA
* modes: root/non-root (or 'host' and 'guest' mode)
* VM Control Structure
** per VCPU; 'walked' by hardware
* extended page tables and tagged TLB with VM ids
* multiqueue devices and interrupt routing
* security and management support
Also: => additional instructions to exercise the above features

44

How does device virtualization work?

For CPUs and memory...
-less diversity, ISA-level 'standardization'

For devices...
-high diversity
-lack of standard specification of device interface and behavior

3 key models for device virtualization (pre vf* [virtualization-friendly] hardware):
* passthrough model
* hypervisor-direct model
* split-device driver model

45

What a passthrough vs. a split-device model [re. device virtualization]?

*Passthrough model:*
* approach: VMM-level driver configures device access permissions
+VM provided with exclusive access to the device
+VM can directly access the device (VMM-bypass)
--device sharing difficult
--VMM must have exact type of device as what the VM expects
--VM migration tricky

*Hypervisor-Direct model:*
* approach:
** VMM intercepts all device accesses
** emulate device operation
** translate to generic I/O op
** traverse VMM-resident I/O stack
** invoke VMM-resident driver
+VM decoupled from physical device
+sharing, migration, dealing with device specifics
--latency of device ops
--device driver ecosystem complexities in hypervisor

*Split-Device Driver model:*
* approach:
=> device access control split between:
** frontend driver in guest VM (device API)
** backend driver in service VM (or host)
** modified guest drivers
=>[--]limited to paravirtualization guests
+eliminate emulation overhead
+allow for better management of shared devices
* model is still relevant today because we consolidate all device access in the service VM, where we can make better decisions and enforce stronger policies about how the device is shared

46

What’s the motivation for RPC?

RPC is intended to simplify the development of cross-address space and cross-machine interactions

common steps related to remote IPC => RPC; ex:
-client-server
-create and init sockets
-allocate and populate buffers
-include 'protocol' info
-copy data into buffers

benefits:
+higher-level interface for data movement and communication
+error handling
+hiding complexities of cross-machine interactions

47

What are the various design points that have to be sorted out in implementing an RPC runtime (e.g., binding process, failure semantics, interface specification … )? What are some of the options and associated tradeoffs?

RPC Requirements
1. client/server interactions
2. Procedure Call Interface => RPC
-sync call semantics
3. Type checking
-error handling
-packet bytes interpretation
4. Cross-Machine Conversion
-e.g. big/little endian
5. Higher-Level Protocol
-access control, fault tolerance...
-different transport protocols

RPC Design Choice Summary:
* Binding => how to find the server
* IDL => how to talk to the server; how to package data
* Pointers as args => disallow or serialize pointed data
* Partial Failures => special error notifications
design decisions for RPC Systems, e.g. Sun RPC, Java RMI


WHAT can the server do?
WHAT arguments are required for the various operations?
-WE NEED AN AGREEMENT!
WHY
-client-side bind decision
-runtime to automate stub generation
=>Interface Definition Language (IDL)

* an IDL [is] used to describe the interface the server exports:
** procedure name, arg and result types
** version #

* RPC can use IDL that is:
** language-agnostic [ex XDR in Sun RPC]
** language-specific [ex: Java in JavaRMI]
**JUST INTERFACE, NOT IMPLEMENTATION!**


*Binding and Registry:*
* client determines...
** WHICH server should it connect to? service name, version #, ...
** HOW will it connect to that server? IP address, network protocol, ...
* Registry == database of available services
** search for service name to find service (WHICH) and contact details (HOW)
** distributed
*** any RPC service can register
** machine specific:
*** for services running on same machine
*** clients must know the machine address
* needs naming protocol:
->registry provides the port # needed for connection
** exact match for "add"
** or consider "summation", "sum", "addition", ...

*What about Pointers?*
* ex: y points to a location in the caller's address space
* Solutions:
** NO POINTERS!
** serialize pointers; copy the referenced ("pointed-to") data structure into the send buffer

*Handling Partial Failures*
* when a client hangs... what's the problem?
** is the server down? service down? network down? message lost?
** timeout and retry => NO GUARANTEES!
* special RPC error notification (signal, exception...)
** catch all possible ways in which the RPC call can (partially) fail

48

What’s marshalling/unmarshaling?

*Marshalling*
* when client-stub creates a data buffer and populates it with the values of the arguments that are passed to the procedure call
** RPC runtime needs to hand a contiguous buffer to the socket for transmission
** so marshalling process takes care of this and places all the args into a buffer that will be passed to the sockets

1. variables i and array_j are somewhere in client address space
2. client makes call to rpc procedure rpc_array_add(i,j)
3. [At lowest level (RPC runtime) this will need to result in a message thats stored in some buffer that needs to be sent via socket API to some remote server, buffer has to be some contiguous location of bytes that includes the args as well as an ID for the procedure so that on the other end the server will know what to do and how the rest of the bytes need to be interpreted.]
buffer gets generated by the marshalling code: takes vars i and array_j and copies them into the buffer; serializes the args into a contiguous memory location. Arrays can be serialized in different ways;
ex: first place the size of the array and then place all of the elements of the array
ex2: just list all of the elements of the array and include some special char to denote the end of the array (like strings in C).


*Unmarshalling*
* when server stub extracts the args from the buffer [byte stream from the receive buffers] that the client sent and creates the data structs to hold the values

1. take a buffer provided by the network protocol and based on the input procedure and the data types that we know are required by that procedure, parse the rest of the byte stream from the buffer
2. extract the correct number of bytes and use those bytes to initialize the data structures that correspond to the argument types
3. as result of unmarshalling process, variables i and j are allocated somewhere in server address space and will be initialized to the values of whatever was placed in the message received by the server

* marshalling and unmarshalling routines don't have to be written by the developer; they're generated by the RPC system's compiler (e.g. rpcgen)
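
a hand-written sketch of what the generated marshalling code roughly does for rpc_array_add(i, array_j) — buffer layout and names are invented for illustration (real Sun RPC stubs use XDR encoding, e.g. big-endian ints):
```
#include <string.h>
#include <stdint.h>

size_t marshal_array_add(char *buf, int32_t proc_id, int32_t i,
                         const int32_t *array_j, int32_t len) {
    size_t off = 0;
    memcpy(buf + off, &proc_id, sizeof proc_id); off += sizeof proc_id;  /* which procedure    */
    memcpy(buf + off, &i,       sizeof i);       off += sizeof i;        /* scalar argument    */
    memcpy(buf + off, &len,     sizeof len);     off += sizeof len;      /* array length first */
    memcpy(buf + off, array_j,  len * sizeof *array_j);                  /* then the elements  */
    return off + len * sizeof *array_j;          /* total bytes to hand to the socket */
}
```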

49

How does an RPC runtime serialize and deserialize complex variable size data structures?

[with Sun XDR]
* RPC Runtime serializes the data by copying it into a buffer:
** with a variable-length array in C, the RPC runtime copies this into a struct with 2 members: a char * that contains the contents of the array, and an int that holds the size of the array
[translates a variable length array into a data structure with "len" and "val" fields]

* RPC Runtime deserializes the data by copying the members of the struct in the buffer to a C-struct with the specified types that the client and server code can use

* RPC runtime will also allocate space for the args on the client and server side if the client or server have not already done so
* [with Sun RPC at least] it also provides a cleanup function that the server stub can call when it's done with the RPC procedure to free the args and results vars

50

What’s specifically done [re. RPC runtime binding process, failure semantics, interface specification…, serialize and deserialize complex variable size data structures] in Sun RPC/XDR?

in Sun RPC/XDR:
* Binding => per-machine registry daemon
* IDL => XDR (for interface spec and for encoding)
* Pointers => allowed and serialized
* Failures => re-tries; return as much info as possible

* interface specified via XDR (.x files)
* rpcgen compiler => converts .x to language-specific stubs
* server registers with local registry daemon
* registry (per-machine)
** name of service, version, protocol(s), port #, ...
* binding creates handle
** client uses handle in calls
** RPC runtime uses handle to track per-client RPC state

Summarizing RPC Development
* From .x => header, stubs...
* Developer:
** server code:
*** implementation of square_proc_1_svc
** client side:
*** call squareproc_1()
** #include .h
** link with stub objects
* RPC Runtime does the rest:
** OS interactions, communication mgmt


* for variable length arrays: RPC Runtime translates a variable length array into a data structure with "len" and "val" fields

* RPC runtime will also allocate space for the args on the client and server side if the client or server have not already done so

* it also provides cleanup functions that the server stub can call when it's done with the RPC procedure to free the args and results vars
[xdr_free(), _free_result()]

51

What are some of the design options in implementing a distributed service?

DFS Models:
* client/server on different machines
* file server distributed on multiple machines
** replicated (each server: all files)
** partitioned (each server: part of files)
** both (files partitioned; each partition replicated)
* files stored on and served from all machines (peers)
** blurred distinction between clients and servers

Remote File Service:
_extreme 1: Upload/Download_
* like FTP, SVN
+local reads/writes at client
--entire file download/upload even for small accesses
--server gives up control

_extreme 2: True Remote File Access_
* every access to remote file, nothing done locally
+file access centralized, easy to reason about consistency
--every file operation pays network cost
--limits server scalability

_compromise: A More Practical Remote File Access (with Caching)_
1. allow clients to store parts of files locally (blocks)
+low latency on file operations
+server load reduced => is more scalable
2. force clients to interact with server (frequently)
+server has insights into what clients are doing
+server has control into which access can be permitted => easier to maintain consistency
--however... server more complex, requires different filesharing semantics

52

What are the tradeoffs associated with a stateless vs. stateful design?

*Stateless*
== keeps no state
* OK with the extreme models, but cannot support 'practical' model
--cannot support caching and consistency mgmt
--every request self-contained=>more bits transferred
+no resources are used on server side (CPU/mm)
+on failure, just restart

*Stateful*
== keeps client state
* needed for 'practical' model to track what is cached/accessed
+can support locking, caching, incremental operations
--on failure... need checkpointing and recovery mechanisms
--overheads to maintain state and consistency => depends on caching mechanism and consistency protocol

53

What are the tradeoffs (benefits and costs) associated with using techniques such as caching, replication, partitioning, in the implementation of a distributed service (think distributed file service).

*Caching*
* locally clients maintain portion of state (e.g. file blocks)
* locally clients perform operations on cached state (e.g. open, read, write)
* requires coherence mechanisms
** HOW [will other clients find out about changes]:
[*** SMP: write-update/write-invalidate]
*** DFS: client-server driven
** WHEN:
[*** SMP: on write]
*** DFS: on-demand, periodically, on open, ...
=>details depend on file sharing semantics

*Replication*
== each machine holds all files
+load balancing, availability, fault tolerance
--writes become more complex:
=>synchronously to all
=>or, write to one, then propagated to others
---replicas must be reconciled; e.g. voting...


*Partitioning*
== each machine has subset of files
+availability vs single server DFS
+scalability with file system size
+single file writes simpler
--on failure, lose portion of data
--load balancing harder; if not balanced then hot spots possible

ex: 3 server machines with 100 files each:
* with replication:
** can store 100 total files
** if 1 machine fails: you lose 0% of the files
* with partitioning:
** can store 300 total files
** if 1 machine fails: you lose 33% of the files

==>can combine both replication and partitioning: replicate each portion

54

The Sprite caching paper motivates its design based on empirical data about how users access and share files.

Do you understand how the empirical data translated into specific design decisions?

Do you understand what type of data structures were needed at the servers’ and at the clients’ side to support the operation of the Sprite system
(i.e., what kind of information did they need to keep track of,
what kinds of fields did they need to include for their per-file/per-client/per-server data structures).

_how the empirical data translated into specific design decisions_
* 33% of all file accesses are writes
=>caching OK; but write-through is not sufficient
* 75% of files are open < 0.5 sec; 90% of files are open < 10 sec
=>session semantics still too high overhead
* 20-30% of new data is deleted within 30 sec; 50% of new data deleted within 5 min; file sharing is rare!
=>write-back on close not really necessary
=>no need to optimize for concurrent access, but must support it

_design:_
[* 20-30% of new data is deleted within 30 sec]
=>cache with write-back
-every 30 sec write-back blocks that have NOT been modified for the last 30 sec
[* 33% of all file accesses are writes]
--when another client opens file => get dirty blocks

=>open goes to server; directories not cached
=>on "concurrent write" => disable caching

sequential write sharing == caching and sequential semantics

concurrent write sharing == no caching




_data structures_
*server, per file*
* readers
* writers [current writer, ex: w2]
* version
* cacheable [Y/N]

*client, per file*
* cache [Y/N]
* cached blocks
* timer for each dirty block
* version

*server, per-client*
* client connection info?

*server, per-server (metadata)??*
* open files
* ALL dirty blocks in its cache ??

*client, per-server*
* server connection info

*client, per-client (metadata)??*
* open files?
* addresses of ALL cached blocks??
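
a rough C-style sketch of the per-file state above [field names are illustrative, not Sprite's actual structures]:

/* server side, per file */
struct srv_file_state {
    int  num_readers;    /* clients with the file open for reading */
    int  writer;         /* id of the current/last writer (ex: w2), -1 if none */
    int  version;        /* bumped on close-after-write => detects stale client copies */
    int  cacheable;      /* Y/N: cleared on concurrent write sharing */
};

/* client side, per file */
struct cli_file_state {
    int     cache;           /* Y/N: is caching currently allowed */
    void  **cached_blocks;   /* the locally cached blocks */
    long   *dirty_timer;     /* per dirty block: last-modified time, for the 30 sec write-back */
    int     version;         /* compared with the server's version on open */
};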

ex:
*R1...Rn readers, w1 writer:*
-all open() go thru server
-all clients cache blocks
-writer keeps time stamps for each modified block
*...w2 sequential writer (sequential sharing)*
-server contacts last writer for dirty blocks
-if w1 has closed the file, the server updates the version; w2 can now cache the file
*...w3 concurrent writer (concurrent sharing)*
-server contacts last writer for dirty blocks
-since w2 hasn't closed the file => disable caching!

55

When sharing state, what are the tradeoffs associated with the sharing granularity?

* cache-line granularity? NO
--overheads too high for DSM
* variable granularity NO
--useful from programmer's perspective, but still potentially too fine-grained [many shared variables are only a few bytes, so the DSM system would still have too high of overheads]
* page granularity (OS-level) OK
* object granularity (language runtime) OK

=>[once you start increasing sharing granularity] beware of false sharing; e.g. x and y are on the same page!

56

For distributed state management systems (think distributed shared memory) what are the basic mechanisms needed to maintain consistency

– e.g., do you [know] why it is useful to use ‘home nodes’,

why do we differentiate between a global index structure to find the home nodes
and local index structures used by the home nodes to track information about the portion of the state they are responsible for.

Page-based DSM architecture:
* distributed nodes, each with own local mm contribution
* pool of pages from all nodes
* each page has ID ("home node"), PFN

if MRMW...
* need local caches for performance (latency)
* home (or manager) node drives coherence ops
* all nodes responsible for part of distributed memory (state mgmt)

"Home" node:
* keeps state: pages accessed, modifications, caching enabled/disabled, locked...
* current "owner" (owner may not equal home node)

Explicit replicas:
* for load balancing, performance, or reliability
* "home"/manager node controls management

=>each node contributes part of its mm pages to DSM
=>need local caches for performance (latency)
=>all nodes responsible for part of distributed memory
=>home node manages accesses and tracks page ownership
=>explicit replication possible for load balancing, performance, or reliability

DSM Metadata [indexing distributed state]
* Each page (object) has...
** address == node ID + PFN
** node ID == "home" node
* Global map (replicated)
** object (page) ID => manager node ID
** manager map available on each node!
** -->Global mapping table
*** object ID => index into mapping table => manager node
* Metadata for local pages (partitioned)
** per-page metadata is distributed across managers
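
a hedged C-style sketch of the indexing idea [names and sizes are illustrative]:

#define NUM_NODES 16                       /* illustrative */

/* a DSM address: which home node, plus the page frame number there */
struct dsm_addr {
    int      home_node;
    unsigned pfn;
};

/* global map, replicated on every node: object (page) id => manager node id */
static inline int manager_of(unsigned page_id)
{
    return page_id % NUM_NODES;            /* or an index into a replicated mapping table */
}

/* local metadata, partitioned: only the manager keeps the detailed per-page state */
struct page_meta {
    int owner;        /* current owner (may differ from the home node) */
    int cacheable;    /* caching enabled/disabled */
    int locked;       /* serializes coherence operations on this page */
};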

57

What’s a consistency model?

consistency model == agreement between memory (state) and upper software layers

"mem behaves correctly if and only if software follows specific rules"
* memory (state) guarantees to behave correctly...
** access ordering
** propagation/visibility of updates

58

What are the different guarantees that change in the different models we mentioned – strict, sequential, causal, weak…

*Strict Consistency*
== updates visible everywhere immediately
* in practice:
** even on single SMP, no guarantees on order without extra locking and synchronization
** in distributed system, latency and message reorder/loss make this even harder...
=>IMPOSSIBLE TO GUARANTEE!

*Sequential Consistency*
==
* memory updates from different processes may be arbitrarily interleaved
* ALL processes will see the same interleaving
* operations from same process always appear in the order they were issued


*Causal Consistency*
== the memory detects possible causal relationships between updates; writes that are causally related are guaranteed to be seen in that order
* "concurrent" writes [== writes that are not causally related; there's no relationship between them] => no ordering guarantees
* causally related writes => ordered
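
ex [illustrative]:
-P1: W_m1(1)
-P2: R_m1(1), then W_m2(2) => P2's write is causally after P1's write (P2 read m1 first)
-P3: once it reads m2 == 2, a later read of m1 must not return 0
--R_m2(2), R_m1(0) => violates causal consistency
--R_m1(0), ..., R_m1(1), R_m2(2) => OK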


*Weak Consistency*
[covers the models that introduce synchronization primitives in addition to pure reads/writes; they let you limit data movement and coherence overheads, but add some overhead to support the new sync operations]
* synchronization points == in addition to R and W, the system provides sync operations
** all updates prior to a sync point will be visible
** no guarantee what happens in between
* variations:
** single sync operation (sync)
** separate sync per subset of state (page)
** separate "entry/acquire" vs "exit/release" operations
+limit data movement and coherence ops
--maintain extra state for additional ops

59

Can you work through a hypothetical execution example and determine whether the behavior is consistent with respect to a particular consistency model?

...
TODO!!!

60

Do you have some ideas how would you go about implementing a distributed shared memory system?

* can implement DSM in hardware, software, or hybrid of both
** software:
+flexible
--performance
** hardware:
+performance
--complex to design and verify
** hybrid: balance btwn cost and complexity

* need to decide on a sharing granularity (has to be high enough to overcome overheads, but beware of false sharing!)

* have to decide on application access algorithm:
** SRSW
** MRSW
** MRMW

* achieve low latency thru...?
** migration? OK for SRSW but not in all cases
** replication? OK but might have too high overheads if you have many concurrent writes in system
** ...caching? OK but might have too high overheads if you have many concurrent writes in system

* need to decide on a consistency management protocol:
** push invalidations when data is written to...
** pull modification info periodically...
[either way, when they get triggered depends on consistency model]


PROBLEM: DSM must "intercept" accesses to DSM state
-to send remote messages requesting access
-to trigger coherence messages
=>overheads should be avoided for local, non-shared state (pages)
=>dynamically "engage" and "disengage" DSM when necessary

SOLUTION: Use hardware MMU support!
-trap in OS if mapping invalid or access not permitted
-remote address mapping=>trap and pass to DSM to send message
-cached content=>trap and pass to DSM to perform necessary coherence ops
-other MMU information useful (e.g. dirty page)
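
a minimal sketch of the MMU trick on Linux [illustrative only: the handler just grants access where a real DSM would message the page's home node and install the fetched contents]:

#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_sz;

static void dsm_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* round the faulting address down to its page boundary */
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(page_sz - 1);
    /* ...a real DSM would request the page / run coherence ops here... */
    mprotect((void *)page, page_sz, PROT_READ | PROT_WRITE);   /* "install" the page */
}

int main(void)
{
    page_sz = (size_t)sysconf(_SC_PAGESIZE);

    /* PROT_NONE => every access traps, i.e. DSM is "engaged" for this region */
    char *region = mmap(NULL, 4 * page_sz, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = dsm_fault;
    sigaction(SIGSEGV, &sa, NULL);

    region[0] = 'x';        /* faults once; handler grants access; the write then retries */
    printf("page 0 is now locally accessible: %c\n", region[0]);
    return 0;
}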

61

When managing large-scale distributed systems and services, what are the pros and cons with adopting a homogeneous vs. a heterogeneous design?

...
*Homogeneous*
Functionally homogeneous, each node...
* can do any processing step
+keeps front-end simple: doesn't mean that each node has all data; just each node can get to all data
--how to benefit from caching?

*Heterogeneous*
Functionally heterogeneous...
* different nodes, different tasks/requests
* data doesn't have to be uniformly accessible everywhere
+benefit of caching
--more complex FE
--more complex mgmt

*Scale-Out architecture*
* for scale: multiprocess, multi-node
1. "boss-worker": front-end distributes requests to nodes
2. "all equal": all nodes execute any possible step in request processing for any request [functionally homogeneous]
3. "specialized nodes": nodes execute some specific step(s) in request processing; for some request types [functionally heterogeneous]

62

Do you understand the history and motivation behind cloud computing, and basic models of cloud offerings?
Do you understand some of the enabling technologies that make cloud offerings broadly useful?

* Amazon provisioned a bunch of extra hardware resources for the holiday sale season.
** The resources were idle for the rest of the year.
** they "opened" access to their resources via web-based APIs
** clients [Animoto == poster child] could run their workloads on Amazon's hardware for a fee
=>AWS EC2 was born

Cloud Deployment Models:
* Public: 3rd party customers/tenants
* Private: leverage technology internally
* Hybrid (Public and Private): failover, dealing with spikes, testing
* Community: used by certain type of users

* shared resources:
** infrastructure [VMs, not physical machines] and software/services
* APIs for access and configuration:
** web-based, libraries, command-line...
* billing/accounting services:
** many models: spot, reservation, ...entire marketplace
** typically discrete quantities: tiny, medium, XL...
* managed by (cloud) provider

Separation of Responsibilities:
* on-premises: you manage everything
* Infrastructure as a Service:
** you manage Applications, Data, Runtime, Middleware, O/S
** they manage: Virtualization, Servers, Storage, Networking
* Platform as a Service:
** you manage: Applications, Data
** they manage: Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking
* Software as a Service: they manage everything

Cloud-Enabling Technologies:
* Virtualization
* Resource provisioning (scheduling): Mesos, YARN...
* Big Data processing (Hadoop, MapReduce, Spark...) and storage (Distributed FS - "append-only"; NoSQL, distributed in-memory caches...)
* Software-Defined
** ...networking
** ...storage
** ...data centers...
* monitoring => realtime log processing (Flume, Cloudwatch, Log Insight...)

63

Do you understand what about the cloud scales make it practical?

Why does Cloud Computing Work?
* Law of Large Numbers:
** per customer there is large variation in resource needs
** average across many customers is roughly consistent
* Economies of Scale:
** unit cost of providing resources or service drops at "bulk"

64

Do you understand what about the cloud scales make failures unavoidable?

With so many machines and so many potential places for failures to occur, failure itself is inevitable

65

1. Timeslices
On a single CPU system, consider the following workload and conditions:
* 10 I/O-bound tasks and 1 CPU-bound task
* I/O-bound tasks issue an I/O op once every 1 ms of CPU computing
* I/O ops always take 10 ms to complete
* Context switching overhead is 0.1 ms
* I/O devices have infinite capacity to handle concurrent I/O requests
* All tasks are long-running

Now, answer the following questions (for each question, round your answer to the nearest %):

1. what is the *CPU utilization* (%) for a round-robin scheduler where the timeslice is 20 ms?

2. what is the *I/O utilization* (%) for a round-robin scheduler where the timeslice is 20 ms?

1. CPU utilization % =

2. I/O utilization % =
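
one way to work it out [assuming each I/O-bound task yields after 1 ms of compute, so only the CPU-bound task uses its full 20 ms timeslice]:
* one "round" = 10 x (1 ms + 0.1 ms ctx switch) + (20 ms + 0.1 ms ctx switch) = 31.1 ms
* useful CPU time per round = 10 x 1 ms + 20 ms = 30 ms
=> CPU utilization = 30/31.1 ≈ 96%
* the first I/O op is issued at t = 1 ms and the last one completes at t = 10.9 + 10 = 20.9 ms, so some I/O is in flight for 19.9 ms of each round
=> I/O utilization = 19.9/31.1 ≈ 64%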

66

2. Linux O(1) Scheduler
Consider a Linux system with the following assumptions:
* uses the O(1) scheduling algorithm for time sharing threads
* must assign a time quantum for thread T1 with priority 110
* must assign a time quantum for thread T2 with priority 135

1. Which thread has a *"higher"* priority (will be serviced first)?

2. Which thread is assigned a *longer time quantum*?

3. Assume T2 has used its time quantum without blocking. What will happen to the value that represents its priority level when T2 gets scheduled again?
* lower/decrease
* higher/increase
* same

4. Assume now that T2 blocks for I/O before its time quantum expired. What will happen to the value that represents its priority level when T2 gets scheduled again?
* lower/decrease
* higher/increase
* same

1. T1 (priority 110): for timesharing tasks in the O(1) scheduler, a smaller value == "higher" priority

2. T1: the O(1) scheduler assigns longer time quanta to higher-priority (smaller-value) timesharing tasks

3. higher/increase: using the entire quantum without blocking marks T2 as CPU-bound, so its priority is lowered, i.e. the value goes up

4. lower/decrease: blocking for I/O marks T2 as interactive, so its priority is boosted, i.e. the value goes down

67

3. Hardware Counters
Consider a quad-core machine with a single memory module connected to the CPUs via a shared "bus". On this machine a CPU instruction takes 1 cycle and a memory instruction takes 4 cycles.

This machine has 2 hardware counters:
* counter that measures IPC
* counter that measures CPI

1. What does IPC stand for in this context?

2. What does CPI stand for in this context?

3. What is the highest IPC on this machine?

4. What is the highest CPI on this machine?

1. IPC == Instructions Per Cycle

2. CPI == Cycles Per Instruction

3.

4.

68

4. Synchronization
In a multiprocessor system, a thread is trying to acquire a locked mutex.

1. Should the thread spin until the mutex is released or block?

2. Why might it be better to spin in some instances?

3. What if this were a uniprocessor system?

1. it depends on how long the owner will hold the mutex: spin if the expected wait is short, block if it is long

2. spinning avoids the overhead of blocking (a context switch to block plus another to wake up); if the critical section is shorter than that overhead, spinning is cheaper

3. on a uniprocessor, spinning is wasted work: the owner cannot run (and release the mutex) while this thread spins, so the thread should block (or at least yield)

69

5. Spinlocks
For the following question, consider a multi-processor with write-invalidated cache coherence.
Determine whether the use of a dynamic (exponential backoff) delay has the *same, better, or worse performance* than a test-and-test-and-set ("spin on read") lock. Then explain why.

Make a performance comparison using each of the following metrics:
1. Latency
2. Delay
3. Contention

1. Latency:
dynamic (exponential backoff) delay ≈ SAME as test-and-test-and-set (spin on read)
[both try a single atomic test_and_set right away when the lock is free]


2. Delay:
dynamic (exponential backoff) delay is WORSE than test-and-test-and-set (spin on read)
[when the lock is released, a backed-off waiter may still be sitting out its delay, while a spin-on-read waiter notices almost immediately]


3. Contention:
dynamic (exponential backoff) delay is BETTER than test-and-test-and-set (spin on read)
[fewer processors retry the atomic at the same time, so fewer invalidations and less bus/interconnect traffic after each release]

70

6. Page Table Size
Consider a 32-bit (x86) platform running Linux that uses a single-level page table.
What are the *maximum number of page table entries* when the following page sizes are used?

1. regular (4kB) pages?

2. large (2MB) pages?

1. regular (4kB) pages: 2^32 / 2^12 = 2^20 = 1,048,576 entries (~1M)

2. large (2MB) pages: 2^32 / 2^21 = 2^11 = 2,048 entries

71

7. PIO

1. Considering I/O devices, what does PIO stand for?

2. List the steps performed by the OS or process running on the CPU when sending a network packet using PIO

1. Programmed Input/Output

2. ...
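
a sketch of the steps [assuming a simple NIC with a command register and data registers]:
* write a command to the device's command register requesting packet transmission
* copy the packet into the device's data registers, one CPU-word-sized chunk at a time
* repeat until the whole packet has been copied (e.g. a 1500 B packet with 8 B data registers needs 1 command write plus ~188 data writes)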

72

8. inode Structure:
Assume an inode has the following structure:

https://s3.amazonaws.com/content.udacity-data.com/courses/ud923/notes/ud923-final-inodes.png

i-node:
Mode
Link count
Uid
Gid
File size
Times
Addresses of first 12 disk blocks
Single indirect -> block of pointers to disk blocks
Double indirect -> block of pointers to single indirect blocks
Triple indirect -> block of pointers to double indirect blocks

Also assume that *each block pointer element is 4 bytes*

If a block on the disk is 4kB, then what is the *maximum file size* that can be supported by this inode structure?

...
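
worked out [4 kB blocks and 4 B pointers => 1024 pointers per indirect block]:
* direct: 12 blocks
* single indirect: 1,024 blocks
* double indirect: 1,024^2 = 1,048,576 blocks
* triple indirect: 1,024^3 = 1,073,741,824 blocks
=> total ≈ 1,074,791,436 blocks x 4 kB ≈ 4 TB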

73

9. RPC Data Types
An RPC routine get_coordinates() returns the N-dimensional coordinates of a data point, where each coordinate is an integer.

Write the elements of the C data structure that corresponds to the 3D coordinates of a data point.

...
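
a hedged sketch in C [names are illustrative]:

/* fixed 3-D case */
struct coordinates {
    int data[3];      /* x, y, z */
};

/* general N-dimensional case, as Sun RPC/XDR would express it:
 *   in the .x file:  int coords<>;
 * which rpcgen turns into the usual "len"/"val" pair: */
struct {
    u_int coords_len;
    int  *coords_val;
} coords;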

74

10. DFS Semantics
Consider the following timeline where 'f' is a distributed shared file and P1 and P2 are processes:
[img]

Other Assumptions:
* 't' represents the time intervals in which functions execute
* the 'w' flag means write/append
* the 'r' flag means read
* the original content of 'f' was "a"
* the read() function returns the entire contents of the file

For each of the following DFS semantics, what will be read --**the contents of 'f'** -- by P2 when t = 4s?

1. UNIX semantics

2. NFS semantics

3. Sprite semantics

1. Unix semantics

2. NFS semantics

3. Sprite semantics

75

12. Consistency Models
Consider the following sequence of ops by processors P1, P2, and P3 which occurred in a distributed shared memory system:
[img]

Notation
* R_m1(X) => X was read from memory location m1 (*does not indicate where it was stored*)
* W_m1(Y) => Y was written to memory location m1
* initially all memory is set to 0

1. Name all processors (P1, P2, or P3) that observe causally consistent reads.

2. Is this execution causally consistent?

1.

2.

76

12. Distributed Applications
You are designing the new image datastore for an application that stores users' images (like Picasa). The new design must consider the following scale:
* The current app has 30 million users
* Each user has on avg 2000 photos
* each photo is on avg 500 kB
* requests are evenly distributed across all images

1. Would you use replication or partitioning as a mechanism to ensure high responsiveness of the image store?

2. If you have 10 server machines at your disposal, and one of those crashes, what's the percentage of requests that the image store will not be able to respond to, if any?

1. Partitioning (bc it wouldn’t all fit on 1 machine)

2. 10% (if using partitioning, bc the images are evenly distributed across all machines)
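
back-of-the-envelope: 30 x 10^6 users x 2,000 photos x 500 kB ≈ 3 x 10^16 B ≈ 30 PB of images => far too much for one machine, so the data must be partitioned (and each partition could additionally be replicated for availability)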

77

scheduling algorithms

*FCFS*:
-schedules tasks in order of arrival
-uses runqueue: when task becomes ready it gets placed at end of Q; scheduler picks next task to execute from the front of the Q
-scheduler only needs to know the head of the Q and how to deQ tasks from the Q
-runqueue == queue (FIFO)

*SJF*:
-schedules tasks in order of their execution time
-runqueue == ordered queue
-or runqueue == tree

*SJF + Preemption*:
-tasks can be preempted off of the CPU before they're done executing
--can use heuristics based on history like windowed average (how long did the task run the last n times?) and job running time (how long did the task run last time) to help determine when to preempt a task now

*Priority Scheduling*:
-tasks have different priority levels
-run the highest priority thread next
-supports preemption
-also needs to know task's priority
--use multiple runqueues: one Q for each priority level
--or a tree ordered by priority
-PROBLEM: starvation - low priority task stuck in Q forever
--SOLN: use priority aging to increase a task's priority the longer it spends in the Q

*Round Robin Scheduling*:
-pick up 1st task from Q
-task may yield to wait on I/O op

*Round Robin with Priorities*:
-include preemption

*Round Robin with interleaving*:
-interrupt tasks to mix in all of the tasks in the system at once
-timeslicing


*MLFQ scheduler*
* can use MLFQ to accommodate tasks with different timeslice values
* ~like a queue of queues
** most I/O intensive tasks go in the highest priority queue, which has the shortest timeslice value
** most CPU intensive tasks go in the lowest priority queue, which has the longest timeslice value
* MLFQ != Priority Queues
** different treatment of threads at each level
** [incorporates] feedback

*O(1) scheduler in Linux*
* O(1) scheduler == constant time to select/add task, regardless of task count
-preemptive, priority-based
--realtime (0-99)
--timesharing (100-139)
---User processes:
----default=120, but can be adjusted with "nice value" (syscall) to be between -20 and 19
* similar to [borrows from] MLFQ scheduler:
-associates different timeslice values with different priority levels
-uses some feedback of how task behaved in the past to determine how to adjust its priority level in the future
* different from MLFQ scheduler:
-how it assigns the timeslice values to priorities
-how it uses the feedback
* O(1) scheduler runqueue == 2 arrays of tasks (each array has one queue per priority level):
-active [array]: used to pick the next task to run; a task stays here until it uses up its entire timeslice, then it moves to the expired array
-expired [array]: inactive tasks; when the active array becomes empty, the two array pointers are swapped and expired becomes the new active

*CFS scheduler*
-runqueue == Red-Black Tree
--ordered by "vruntime"
--vruntime == time spent running on the CPU
-always pick leftmost node (the task that has the least amount of time on the CPU)
-periodically increment vruntime
-compare to leftmost vruntime [and swap if the current task's vruntime > the leftmost node's; else keep running]
-vruntime progress rate depends on priority and niceness
-performance:
--select task [from runQ] = O(1)
--add task [to runQ] = O(logN)
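
a toy C sketch of the vruntime idea [not the kernel's implementation: the real runqueue is a red-black tree, the linear scan here just shows the ordering rule]:

struct task {
    const char   *name;
    unsigned long vruntime;   /* "virtual" time spent on the CPU */
    unsigned int  weight;     /* derived from priority/niceness */
};

/* charge the running task for delta nanoseconds of actual CPU time:
 * higher weight (higher priority) => vruntime grows more slowly */
void account(struct task *t, unsigned long delta)
{
    t->vruntime += delta / t->weight;
}

/* pick the task with the smallest vruntime (the "leftmost node") */
struct task *pick_next(struct task *tasks, int n)
{
    struct task *best = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}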

78

IPC definition, types

IPC == OS-supported mechanisms for interaction among processes (coordination & communication)
-message passing: e.g. sockets, pipes, message Qs
-memory-based IPC: shared memory, mem. mapped files...
-higher-level semantics
--files, RPC
-synchronization primitives

79

Types of Synchronization Constructs (P3L4)

*Mutexes*

*Condition Variables*

*Spinlocks*
-like a mutex:
--mutual exclusion
--lock and unlock (free)
-not like a mutex:
--when lock == busy: SPINS (vs blocked, with mutexes)
-basic sync construct, but also used in creating some more complex sync constructs
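
a sketch of a "spin on read" (test-and-test-and-set) lock using GCC atomic builtins:

typedef volatile int spinlock_t;             /* 0 == free, 1 == busy */

void spin_lock(spinlock_t *l)
{
    while (1) {
        while (*l)                           /* spin on the (cached) value: reads only */
            ;
        if (__sync_lock_test_and_set(l, 1) == 0)   /* atomic test_and_set; returns old value */
            return;                          /* old value was 0 => we own the lock */
    }
}

void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);                  /* atomically sets *l back to 0 */
}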

*Semaphores*
-common sync construct in OS kernels
-similar to a mutex...but more general
-==integer value -> count-based sync
--on init: assigned a max value (positive int) -> maximum count
--on try (wait): if nonzero => decrement and proceed -> counting semaphore (P proberen)
--if initialized with 1: semaphore == mutex (binary semaphore)
--on exit (post): increment (V verhogen)
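
a minimal POSIX example [initialized to 1, so it behaves like a mutex; a larger initial count lets that many threads proceed at once]:

#include <semaphore.h>
#include <stdio.h>

int main(void)
{
    sem_t sem;
    sem_init(&sem, 0, 1);    /* 0 == shared among threads of this process; initial count 1 */

    sem_wait(&sem);          /* P / proberen: decrement, or block while the count is 0 */
    printf("in the critical section\n");
    sem_post(&sem);          /* V / verhogen: increment, possibly waking a waiter */

    sem_destroy(&sem);
    return 0;
}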

*Reader/Writer Locks*
-specify the type of access (read/write) -> then lock behaves accordingly

*Monitors*
-== high level synchronization construct
-== programming style
-specify:
--shared resource
--entry procedure
--possible cond vars
-on entry: lock, check...
-on exit: unlock, check, signal

*serializers*
-easier to define priorities
-hide [from programmer] need for explicit signaling and cond vars

*path expressions*
-requires programmer to spec regex that captures correct sync behavior (ex: many reads OR single write)
-runtime makes sure that the way the ops that access shared resource are interleaved satisfy the regex

*barriers*
-like reverse of semaphore, ex:
--if semaphore allows n threads to proceed before it blocks,
--then barrier will block all threads until n threads arrive at this point

*rendezvouz points*
-waits for multiple threads to meet at that particular point in execution

*optimistic wait-free sync (RCU)*
-category for approaches that achieve concurrency without explicitly locking and waiting
-bet on the fact that there won't be any conflicts due to concurrent writes, so it's safe to allow reads to proceed concurrently
--ex: Read-Copy-Update (RCU) in Linux

80

Basic I/O Device Features

Control registers:
-command [Command]
-data transfers [Data]
-status [Status]

Micro-controller == device's CPU

on-device memory [Memory (DRAM or SRAM or both)]

other logic [Other Hardware-specific Chips]:
-e.g. analog to digital converters

81

CPU-Device Interconnect

Devices interface with the rest of the system via a controller that is part of the device packaging; the controller connects the device to the rest of the CPU complex via some sort of CPU-device interconnect

* Peripheral Component Interconnect (PCI)
** PCI Express (PCIe)
** ( > PCI-X > PCI)

* SCSI bus
* peripheral bus
* bridges handle differences

82

Types of I/O Devices

Block: disk
-read/write blocks of data
-direct access to arbitrary block

Character: keyboard
-get/put character

Network Devices
[somewhere between, looks more like a stream of data]

OS representation of a device == special device file
* UNIX-like systems:
** /dev
** tmpfs
** devfs

83

Pseudo-devices

Linux supports a number of pseudo ("virtual") devices that provide special functionality to a system.

ex:
/dev/null: accepts and discards all output (provides no output)
/dev/urandom: provides a variable-length string of pseudo random numbers

84

CPU-Device Interactions

access device registers == memory load/store

memory-mapped I/O:
-part of 'host' physical memory dedicated for device interactions
-Base Address Registers (BAR)

I/O port
-dedicated in/out instructions for device access
-target device (I/O port) and value in register
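
a hedged sketch of the in/out-instruction path on x86 Linux [0x3F8 is just the legacy COM1 data port, used purely as an example; requires root privileges]:

#include <sys/io.h>

int main(void)
{
    if (ioperm(0x3F8, 8, 1) != 0)    /* ask the kernel for access to this port range */
        return 1;
    outb('A', 0x3F8);                /* dedicated "out" instruction: write one byte to the port */
    return 0;
}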

Path from Device to CPU [models]:
*Interrupt:
--interrupt handling steps [overhead]
+can be generated as soon as possible
*Polling:
+when convenient for OS
-delay or CPU overhead

85

virtualization requirements

* present virtual platform interface to VMs
-virtualize CPU, memory, devices

* provide isolation across VMs
-preemption, MMU for addr translation and validation

* protect guest OS from Apps
-cannot run guest OS and apps at same protection level

* protect VMM from guest OS
-cannot run guest OS and VMM at same protection level

86

Memory Virtualization

*with full virtualization:*
-all guests expect contiguous physical memory starting at 0
-virtual vs physical vs machine addresses and PFNs [essentially 2 diff page tables]
-still leverages hardware MMU, TLB...
Option 1:
-guest page table: VA => PA
-hypervisor: PA=>MA
--too expensive!!
Option 2:
-guest page tables: VA=>PA
-hypervisor shadow PT: VA=>MA
-hypervisor maintains consistency; e.g. invalidate on ctx switch, write-protect guest PT to track new mappings...


*with paravirtualization:*
-guest aware of virtualization
-no longer strict requirement on contiguous memory starting at 0
-explicitly registers page tables with hypervisor
-can "batch" page table updates to reduce VM exits
-...other optimizations...

BOTH for full and paravirtualized memory virtualization:
=>overheads eliminated or reduced on newer platforms

87

RPC Steps

-1. *register:* server "registers" procedure, args types, location...

0. *bind:* client finds and "binds" to desired server

1. *call:* client makes RPC call; control passed to stub, client code blocks

2. *marshal:* client stub "marshals" arguments (serialize args into buffer)

3. *send:* client sends message to server

4. *receive:* server receives message; passes msg to server stub; access control

5. *unmarshal:* server stub "unmarshalls" args (extracts args and creates data structs)

6. *actual call:* server stub calls local procedure implementation

7. *result:* server performs operation and computes result of RPC operation

...similar steps on return

88

cloud computing requirements

* on-demand elastic resources and services
* fine-grained pricing based on usage
* professionally managed and hosted
* API-based access

Requirements for the Cloud
1. "fungible" resources
2. elastic, dynamic resource allocation methods
3. scale: management at scale, scalable resource allocation
4. dealing with failures
5. multi-tenancy: performance and isolation
6. security