Flashcards in Final Deck (107):
What kind of scheduling policies can a scheduler have?
* Assign tasks immediately: simple, first come first serve (FCFS)
* Simple tasks first: maximize throughput, Shortest Job First (SJF)
* Assign Complex Tasks First: max system utilization (CPU, devices, memory)
When does the scheduler run?
* When CPU becomes idle
* new task becomes ready (could have higher priority task in queue)
* timeslice expires (timer interrupt)
Design of run queue and scheduling algorithm are tightly coupled!
Describe priority scheduling
* Threads are scheduled by their priority, highest priority going first.
* This can be implemented with priority queues for each priority level (or you could use an ordered tree)
* Could lead to task starvation for low priority tasks
* This is prevented through priority aging. The longer the task is in the queue, the higher its priority. So it will eventually be run as its priority increases.
What is priority inversion?
When a low-priority thread acquires a lock and a higher-priority thread then runs and needs that lock, the high-priority thread is stuck waiting, so the low-priority thread's work effectively runs ahead of the higher-priority thread's - the priorities are inverted.
A solution to this is to temporarily boost the priority of the thread holding the mutex to the level of the higher priority thread so it can run until it releases the mutex. Then it would re-lower the priority of the should-be-low-priority thread. This means it's helpful to know which thread owns a mutex.
What is run to completion scheduling?
Threads are executed until completion unless they are preempted by something.
Describe round robin scheduling
Go through tasks in round robin order. If a task yields due to I/O, switch to the next task and round-robin the remaining tasks until the I/O completes.
Can also do round robin with interleaving which is known as time slicing.
timeslice = max amount of uninterrupted time to give a task (aka, time quantum)
task may run for less time than the timeslice (task must wait on I/O, so it switches; or a higher priority task becomes available)
The ideal time slice length is different for I/O bound vs CPU bound tasks.
What are the costs and benefits of timeslicing?
+ short tasks finish sooner
+ more responsive
+ lengthy I/O ops initiated sooner
- interrupt, schedule, context switch = overhead for each switch
Can minimize these overheads by keeping the timeslice time much >> context switch time
What are the ideal timeslice lengths for cpu bound and I/O bound workloads?
* CPU bound: longer timeslices are better
  * better throughput and lower avg. completion times, though higher avg. wait time
  * the user cares about throughput and completion time; wait time is hidden b/c there is no I/O
* I/O bound: smaller timeslices are better
  * I/O requests are able to get kicked off earlier
  * keeps both the CPU and I/O devices busier more often
  * better user-perceived performance
How do you calculate cpu utilization?
[cpu_running_time/(cpu_running_time + context_switching_overheads)] * 100
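As a quick sanity check, the formula can be evaluated for a couple of timeslice/overhead combinations (the numbers below are illustrative, not from the notes):

```python
def cpu_utilization(running_ms, overhead_ms):
    """Percent of time spent on useful work rather than switching."""
    return running_ms / (running_ms + overhead_ms) * 100

# 10 ms timeslice with a 0.1 ms context switch: one switch per slice.
print(round(cpu_utilization(10, 0.1), 1))
# 1 ms timeslice pays the same 0.1 ms switch cost ten times as often.
print(round(cpu_utilization(1, 0.1), 1))
```

This is the "timeslice >> context switch time" rule from the previous card made concrete: shrinking the slice tenfold drops utilization noticeably.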
Describe the multi-level feedback queue (MLFQ).
Have three run queues, but they are treated as a single queue data structure:
1) TS = 8ms, most I/O intensive, highest priority
2) TS = 16ms, medium I/O intensive (mix I/O and CPU)
3) TS = infinity (CPU intensive, lowest priority, FCFS)
+ get timeslicing benefits for I/O bound tasks
+ avoid timeslicing overhead for CPU bound tasks
Ultimately this becomes like a decision making pipeline
1. start in 8ms queue, if it yields before timeslice is up, good choice, keep it here!
2. if timeslice expires, move it to next queue. Repeat the previous step.
3. If timeslice expired again, put it in bottom queue.
Can always be bumped back up if they start yielding before timeslice expires.
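The demote-on-expiry / promote-on-yield rule above can be sketched as a tiny simulation (the timeslice values come from the card; the exact promotion rule is a simplified assumption):

```python
QUEUES = [8, 16, float("inf")]   # timeslice (ms) per level, from the card

def next_level(level, used_ms, yielded):
    """Demote a task that burned its full slice; promote one that yielded early."""
    if yielded:
        return max(level - 1, 0)                 # acting I/O bound: bump up
    if used_ms >= QUEUES[level]:
        return min(level + 1, len(QUEUES) - 1)   # acting CPU bound: push down
    return level

level = 0
level = next_level(level, 8, yielded=False)    # used the whole 8 ms -> level 1
level = next_level(level, 16, yielded=False)   # used 16 ms too -> level 2 (FCFS)
level = next_level(level, 3, yielded=True)     # started yielding -> back up
print("final level:", level)
```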
Describe the linux O(1) scheduler
P3L1: 18 Linux O(1) -- fill this out :)
Describe the Completely Fair Scheduler (CFS)
Scheduler for all non-real time tasks
* Runqueue = Red-Black Tree
* Tasks ordered by virtual runtime (time spent on cpu)
* Tasks on the left have less vruntime, so they need to be scheduled sooner
* Run the smallest-vruntime task until its new vruntime is greater than the next smallest vruntime (checked periodically)
* Once its vruntime has moved past the next smallest one, place the current task appropriately in the tree and start running the now-smallest-vruntime task
* For low priority tasks, vruntime progresses faster (artificially)
* For high priority tasks, vruntime progresses slower (artificially)
* Same tree used for all levels
* Task selection = O(1)
* Add task = O(log n)
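A toy sketch of the selection loop above, with a heap standing in for the red-black tree and a made-up weight factor standing in for the priority-based vruntime scaling (both are assumptions for illustration, not CFS's actual mechanism):

```python
import heapq

# (vruntime, name, weight); lower weight = higher priority = vruntime grows slower
runqueue = [(0.0, "A", 1.0), (0.0, "B", 2.0)]
heapq.heapify(runqueue)

def run_one(rq, slice_ms):
    vr, name, w = heapq.heappop(rq)   # leftmost node: smallest vruntime
    vr += slice_ms * w                # low-priority tasks age faster
    heapq.heappush(rq, (vr, name, w))
    return name

schedule = [run_one(runqueue, 10) for _ in range(4)]
print(schedule)
```

Because A's vruntime advances half as fast, A ends up with more CPU time, which is exactly the effect of the artificial vruntime scaling described above.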
What are the problems with the O(1) scheduler?
- performance of interactive tasks
- inactive list not executed until all tasks done in the active list
- fairness is not enforced in any way
Why is it important to try and schedule threads back onto the same CPU?
Cache-Affinity - likely to have everything you need in the cache so you will be more efficient.
If you swap CPUs, then you'll have a cold cache.
What is NUMA scheduling?
Non-Uniform Memory Access
* A CPU may have faster access to one memory bank than another.
* NUMA-aware scheduling means the scheduler knows this and tries to keep tasks on the CPU closest to where each task's memory lies.
What is hyperthreading?
* Multiple hardware supported execution threads
* This means there are multiple sets of registers needed for an execution context
* still 1 CPU but much faster context switches because all it has to do is switch the registers being used by the CPU (not switch what's in the registers, but which registers are being used)
Can even hide memory access (hundreds of cycle access) b/c a context switch to the other hyperthreaded execution context is on the order of cycles. So it can make sense to switch between the hyperthreaded executions contexts even during a memory access.
Why is it good to mix CPU and Memory bound tasks together on a single CPU?
Because you can get better overall utilization.
If both threads are CPU bound, only one can actually make progress at a time
If both threads are Mem bound, lots of wasted cycles:
T1: GXGGGXGGGX (CPU Bound)
T2: XGXXXGXXXG (Mem Bound)
How can the CPU scheduler make informed decisions about scheduling?
By use of hardware counters - there are typically several and they are different for different architectures.
Help to estimate kind of resources they need (CPU, Mem).
Can look at number of cache hits/misses from the HW counter to help guesstimate whether its a CPU or mem bound workload.
Why can't CPI be used as a scheduling technique?
Because real workloads may not have enough variation in CPI to determine the workload category. Workloads can have similar average CPI with high or low standard deviation, so it's not so cut and dried.
What is a TLB?
Translation Look Aside Buffer
* small cache that holds valid Virtual Address to Physical Address mappings
* Used because the VA to PA translation basically has to be done for every memory access
What does the hardware memory management unit (MMU) do?
- Translates VAs to PAs
- Reports faults (illegal access, permission violations)
- Holds registers: pointers to the page table; base and limit size, number of segments
- Cache: TLB
- Actual PA generation is done in hardware
Describe how page tables work.
* VA Pages -> Page Table Map -> DRAM
* Size of VA pages is the same as the size of DRAM page frames
* There is a page table for each process
* So on context switch, must switch to the right page table
* Virtual Address = [VPN | offset]
* VPN is offset into page table map
* Page Table Map = [PFN | valid bit]
* Physical Address = [PFN | offset]
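The VPN/offset split above can be shown concretely; the 4 kB page size and the tiny page table below are assumptions for illustration:

```python
PAGE_SIZE = 4096      # assumed 4 kB pages -> 12 offset bits
OFFSET_BITS = 12

# Tiny made-up page table: VPN -> (PFN, valid bit)
page_table = {0: (7, True), 1: (3, True), 2: (0, False)}

def translate(va):
    vpn = va >> OFFSET_BITS
    offset = va & (PAGE_SIZE - 1)
    pfn, valid = page_table[vpn]
    if not valid:
        raise RuntimeError("page fault: OS must bring the page in")
    return (pfn << OFFSET_BITS) | offset     # PA = [PFN | offset]

print(hex(translate(0x1ABC)))    # VPN 1 -> PFN 3, offset preserved
```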
What is allocation on first touch?
The DRAM will only actually allocate space for a page the first time it is referenced. Declaring it in the VA address space is great, but it won't be created in PA until you try to do some kind of access to it.
What flags exist for a page table entry?
* P-resent (valid/invalid)
* D-irty (written to)
* A-ccessed (whether it has been accessed for read or write)
* protection bits -> r, w, etc
How are page table sizes calculated?
* 32 bit system
* Page Table Entry (PTE) = 4 bytes (32 bit)
* Number of pages = 2^32 / Page Size
* A common page size is 4kB
* (2^32 / 2^12) * 4B = 4MB
We don't use a flat page table anymore, though.
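The arithmetic from the card, spelled out:

```python
va_bits = 32        # 32-bit virtual addresses
page_bits = 12      # 4 kB pages
pte_bytes = 4       # 4-byte page table entries

num_pages = 2 ** (va_bits - page_bits)   # 2^20 pages
table_bytes = num_pages * pte_bytes
print(table_bytes // 2 ** 20, "MB of page table per process")
```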
What are Hierarchical Page Tables?
* Outer Level: Page Table Directory
* elements are pointers to page tables
* Internal page tables exist only for valid virtual memory regions
* Now Virtual address looks like:
[outer page table index | internal page table index | offset]
* Could have additional layers using the same philosophy
* on malloc, new internal page table may be allocated
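A sketch of the two-level lookup described above; the 10/10/12 bit split on a 32-bit address is a common convention, assumed here for illustration:

```python
OUTER_BITS, INNER_BITS, OFFSET_BITS = 10, 10, 12   # assumed 10/10/12 split

# Outer page table directory: only valid regions get an inner table.
outer = {0: {5: 0x42}}    # outer index 0, inner index 5 -> PFN 0x42

def translate(va):
    out = va >> (INNER_BITS + OFFSET_BITS)
    inner = (va >> OFFSET_BITS) & ((1 << INNER_BITS) - 1)
    offset = va & ((1 << OFFSET_BITS) - 1)
    pfn = outer[out][inner]        # two lookups instead of one
    return (pfn << OFFSET_BITS) | offset

va = (5 << OFFSET_BITS) | 0x123    # outer 0, inner 5, offset 0x123
print(hex(translate(va)))
```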
P3L2: 8 Multi Level Page Tables
What are the pros/cons of a Hierarchical Page Table?
+ Reduced page table size
+ Likely more 'gaps' that are the same size as the page table
- More lookups required for the translation process since there are more levels
Describe inverted page tables
P3L2: 11 Inverted Page Tables
P3L2: 12 Segmentation
What are the pros/cons of large pages?
+ Fewer page table entries, smaller page tables, more TLB hits
- internal fragmentation => wastes memory
* page size is calculated as 2^[# offset bits]
What are some challenges with memory allocation?
* external fragmentation
* enough 'free' pages exist, but they are not contiguous so they can't be allocated
* When possible, the allocator coalesces free memory to create larger contiguous zones.
What is the buddy allocator?
* Subdivides the area in half over and over until it finds a memory region that is the size of the request
+ Aggregation works well and is fast
- Internal fragmentation can still exist because this only allocates by powers of 2
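The power-of-two rounding that causes the internal fragmentation above can be sketched as:

```python
def buddy_block_size(request):
    """Round a request up to the next power of two, as the buddy system does."""
    size = 1
    while size < request:
        size *= 2
    return size

for req in (33, 64, 100):
    blk = buddy_block_size(req)
    print(req, "->", blk, "byte block,", blk - req, "bytes wasted internally")
```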
What is the slab allocator?
Builds custom object caches on top of slabs [contiguously allocated physical memory].
Each slab is built for a specific size object, so freeing objects in slab is fine b/c they can be replaced by objects of the same size easily.
+ Internal fragmentation avoided
+ External fragmentation not really an issue
Linux uses a combination of both Slab and buddy Allocator.
Describe Demand Paging
When the valid bit is set to 0 in the page table (i.e., the page is not in memory), a request to access it becomes a trap and control passes to the OS. The OS issues an I/O request to retrieve that page from backing storage. The page is placed in DRAM, the page table entry is updated with the new PA corresponding to that VA, and the valid bit is set to 1. Control is then passed back to the thread, which executes the same memory request again.
When should pages be swapped out of physical memory onto disk and which pages should be swapped out?
* When memory usage is above a threshold (high watermark)
* When CPU usage is below a threshold (low watermark)
* pages that won't be used soon
  * LRU: uses the access bit to tell if a page has been accessed
* pages that don't need to be written out
  * dirty bit tracks if a page was modified
* won't swap non-swappable pages
* 'Second Chance' variation of LRU
What is Copy on Write (COW)?
When a new process is created we would copy the original address space. Instead, we map a new VA to the original page. Then we write protect that data. If only reads occur, then this is fine, we only need one copy of the information. If a write occurs, a trap is created and the data is copied to a new PA and the VA updated. This ensures we only pay the copy cost when we need to modify data.
What is checkpointing?
* Periodically save process state during execution
* Write protect entire PA and copy everything once
* Copy diffs of 'dirtied' pages for incremental checkpoints
* rebuild images of process from multiple diffs, or in background aggregate diffs into a single one
More @ P3L2: 22 Failure Management Checkpoint
What kinds of message passing IPCs are there?
Pipes - bytestream from one process to another
Message Queues - just what it sounds like
Sockets - send(), recv(), socket()
Describe Shared Memory IPC
OS Maps same PAs into VAs of two processes. May not have the same VAs in each process
+ System calls only for the setup phase
+ potentially reduced data copies (but not eliminated)
- must use explicit synchronization
- must have some kind of communication protocol
- shared buffer management - programmers responsibility
Describe Copy vs Map in message-based IPC vs Shared Memory IPC.
Copy (message-based IPC):
* CPU cycles to copy data to/from the port for each message
Map (shared memory IPC):
* CPU cycles to map memory into the address space
* CPU cycles to copy data into the shared channel
+ memory mapping is costly, but only needs to be done once; if used many times -> good payoff
+ can still perform well for one-time use, especially with large data transfers
In Windows, Local Procedure Calls (LPCs) work this way: if the data to be copied is small, a message-based IPC is used; if it's large, shared memory is used (copy once, then map that region into both processes' address spaces).
What are spin locks?
Like a mutex, except that instead of relinquishing the CPU when it encounters a locked spinlock, the thread just sits and checks over and over until it either acquires the lock or is preempted.
The most basic synchronization construct.
What is a semaphore?
Like a Traffic Light: STOP and GO
- similar to a mutex, but they are more general
- represented as a positive integer value -> maximum count
on try (wait):
- if non-zero -> decrement and proceed -> counting semaphore
- if zero -> block until the value becomes non-zero
on exit (post):
- increment the semaphore value (potentially waking a waiting thread)
A semaphore with a max value of 1 is like a mutex - a binary semaphore.
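Python's threading.Semaphore implements this wait/post (acquire/release) behavior; a small sketch with an illustrative worker (the counting logic is an assumption for the demo, not from the notes):

```python
import threading

sem = threading.Semaphore(2)     # at most 2 threads may proceed at once
lock = threading.Lock()
active = 0
peak = 0

def worker():
    global active, peak
    with sem:                    # wait/try: decrement, or block at zero
        with lock:
            active += 1
            peak = max(peak, active)
        with lock:
            active -= 1
    # exiting 'with sem' is the post: increment, possibly waking a waiter

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print("peak concurrent workers:", peak)
```

Eight threads run, but the semaphore's count of 2 caps how many are ever inside the guarded section at once.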
What is a reader/writer lock?
Locks that allow multiple access or single access at a time:
- read_lock/read_unlock --> multiple access (shared)
- write_lock/write_unlock--> single access (exclusive)
What are monitors?
They specify - high level sync. construct:
- shared resource
- entry procedure
- possible condition variables
What are no-write, write-through, and write-back caches?
no-write: The write is performed on memory only. The cached copy will be invalidated.
write-through: write is applied to the cache and the memory
write-back: Write is applied to cache, but the write to memory is done at some later time, for instance, when that cache line is evicted.
What is cache coherence?
Whether the caches are kept in sync.
non-cache-coherence (NCC): software must keep the caches in sync. Hardware doesn't do anything about it.
cache-coherence (CC): hardware will make sure the caches are all synchronized.
What is write-invalidate and write-update?
write-invalidate: If one mem location is changed in one cache, the same entry in another cache is invalidated.
+ Lower bandwidth
+ amortizes cost
write-update: When value is written to one cache, hardware updates it in another cache.
+ data immediately available to all CPUs b/c it's updated in their cache.
These are properties of the hardware - no option to choose in software.
Describe atomic operations.
* Multiple instructions which are un-interruptable.
* Atomics are always issued to the memory controller. They bypass the cache.
+ can be ordered and synchronized
- can take MUCH LONGER since they are using memory instead of cache.
- generates all the traffic necessary to now update all caches since the caches were initially bypassed
Describe pros/cons of test and set/test-test-and-set.
+ minimal latency (just the atomic)
+ potentially min delay (spinning continuously on atomic)
- processors go to memory on each spin (since it's an atomic)
test-and-test-and-set (spin on read/spin on cached value):
+ Latency is OK (slightly worse than test-and-set)
+ Delay is OK (slightly worse than test-and-set)
- Contention ... better but:
- NCC - no difference
- CC-Write update - OK
- CC-Write Invalidate - HORRIBLE
- contention due to atomics + caches invalidated == more contention
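The shape of the two spin loops compared above, with the atomic simulated by a plain method (a real implementation would use an atomic hardware instruction; this only illustrates the loop structure):

```python
class SimLock:
    """Simulated lock word; test_and_set stands in for the atomic instruction."""
    def __init__(self):
        self.value = 0
    def test_and_set(self):      # atomically: read old value, set to 1
        old, self.value = self.value, 1
        return old
    def release(self):
        self.value = 0

def spin_tas(lock):
    # test-and-set: every spin executes the atomic -> memory traffic each loop
    while lock.test_and_set():
        pass

def spin_ttas(lock):
    # test-and-test-and-set: spin on the cheap (cached) read, only attempt
    # the atomic once the lock looks free
    while True:
        while lock.value:
            pass
        if not lock.test_and_set():
            return

l = SimLock()
spin_tas(l)        # free lock: acquired on the first atomic
l.release()
spin_ttas(l)       # free lock: one plain read, then one atomic
print(l.value)
```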
Describe the pros and cons of spinlock with delay.
Add a delay after the thread notices the lock is free - i.e., after it's released.
+ Contention is improved
+ Latency is OK
- Delay is much worse
Start P3L5 0 I/O Management
What are device drivers?
- Responsible for all aspects of device access, management, and control
- Need a device driver for each device type
- Provided by device manufacturers per OS/version
- Each OS standardizes the interfaces to each device type
What categories are drivers grouped into?
Block devices (e.g., disk):
- read/write blocks of data
- direct access to an arbitrary block
Character devices (e.g., keyboard):
- get/put character
Network devices: stream
How are devices represented on the Linux OS?
via special device file
- On Unix-like systems, devices appear as files under the /dev directory
- handled by special file systems: tmpfs, devfs (not stored in the real file system)
How does the CPU communicate with devices?
Interrupts:
+ can be generated (and handled) as soon as possible
- interrupt handling steps cost overhead (handler execution, indirect cache pollution, setting/resetting interrupt masks, etc)
Polling:
+ handled when convenient for the OS
- delay or CPU overhead from repeated checks
What is Programmed I/O?
The CPU issues instructions by writing into the command registers of the device, and controls data movement by reading/writing the device data registers.
Example: NIC sending data
- 1500B packet, device data register=8 byte (or bus is only 8bytes)
- 1 access to device register to write command
- 188 accesses for data
-> total of 189 CPU accesses (store instructions)
Does not require special hardware
What is an alternative to Programmed I/O?
DMA - Direct Memory Access
- Needs special hardware support (DMA controller)
- CPU 'programs' the device
- via command registers
- via DMA controls
Example: NIC sending data
- write command to request packet transmission
- configure DMA controller with in-memory address and size of packet buffer (location + total amount of data to move)
- from CPU perspective, only 1 store instruction + 1 DMA configuration
-- Fewer steps, but the DMA config is more complex
- for smaller transfers, PIO is still better
For DMA, the data buffer must stay in physical memory until the transfer is complete - not swappable!
What does the typical device access call stack look like?
1. User process (send data) makes a system call
2. Kernel (form packet), in-kernel stack
3. Driver (write Tx request record, e.g., issue commands) - driver invocation
4. Device performs the request
- For some devices you can bypass the kernel entirely (OS bypass). The driver must then be a user-level driver or library.
- The OS still retains coarse-grained control
- Relies on device features such as having sufficient registers and demux capabilities, so the device knows which process to send data to (single device, many processes using it)
What is a virtual file system?
A layer that hides from applications the details of the underlying file system.
Underlying file systems could have remote servers, local file systems (of different types), etc.
file == the element on which the VFS operates
file descriptor == OS representation of a file (has ops like open, read, write, lock, close, etc)
inode == persistent representation of file, the 'index'
- list of all data blocks for file
- device, permissions, size
dentry == directory entry, corresponds to single path component (soft state, in memory only)
- /users/geoff -> /, /users, /users/geoff
- dentry cache
superblock == filesystem-specific information regarding the FS layout
- maintains overall map of FS
- inode blocks, data blocks, free blocks
More in P3L5.22/23
Describe a block group in ext2
Each Block Group:
* superblock: #inodes, #diskblocks, start of free blocks
* group descriptor: bitmaps, # free nodes, # directories
* bitmaps: tracks free blocks and inodes
* inodes: 1 to max_number, 1 per file
* data blocks: file data
What is an inode?
Index of all disk blocks corresponding to a file
- file: identified by an inode (inode == a unique number)
- inode: list of all blocks + other metadata
Example: index block = inode = 19
inode contains blocks:
- 9, 16, 1, 10, 25, -1, -1, -1
If needed more storage for file, FS allocates a free block and updates inode (list of blocks), replacing the next -1 with the block id.
+ easy to perform sequential or random access
- limit on file size
- 128B inode, 4 byte block ptr
- 32 addressable blocks x 1kB block
- 32 kB file size
What are inodes with indirect pointers?
- pointers to blocks
normally, blocks are pointed to by their direct pointer
The indirect pointer points to a block of pointers (instead of to a single block)
Could even have a block of blocks of pointers (double indirect pointers)
This can keep going...
+ small inode -> but address large files
- file access slows down due to traversal of the layered blocks (double indirect = 4 accesses per data block)
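The size limits implied by the numbers in the previous cards, computed (the single-indirect layout below is an assumed variation for illustration, not from the notes):

```python
BLOCK = 1024        # 1 kB blocks (from the card)
PTR = 4             # 4-byte block pointers
INODE = 128         # 128 B inode

direct_ptrs = INODE // PTR                 # 32 pointers fit in the inode
direct_only = direct_ptrs * BLOCK          # max file size, direct only
print(direct_only // 1024, "kB with direct pointers only")

# Assumed variation: trade one direct slot for a single-indirect pointer,
# which names a whole block full of pointers to data blocks.
indirect_ptrs = BLOCK // PTR               # 256 pointers per indirect block
max_blocks = (direct_ptrs - 1) + indirect_ptrs
print(max_blocks * BLOCK // 1024, "kB with one single-indirect pointer")
```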
What kind of disk access optimizations occur to improve file access overheads?
caching/buffering - reduce # disk accesses
- buffer cache in main memory
- read/write from cache
periodically flush to disk (fsync)
I/O scheduling - reduce disk head movement
- maximize sequential vs random access
(eg, write block 25, write block 17 -> write 17, 25)
Prefetching - increases cache hits
- leverages locality
(eg read block 17, ... also read 18 and 19)
Journaling/Logging - reduce random access
- 'describe' write in log: block, offset, value
- periodically apply update to proper disk locations
What is Virtualization?
An efficient, isolated, duplicate of the real machine.
Allows many full OS's to exist on the same machine and that OS thinks it owns the hardware. But maybe it doesn't get ALL the memory - eg, the system has 16GB but the VM only gets 2GB.
What are the goals of the virtual machine monitor (VMM)?
1. provide environment essentially identical with the original machine (Fidelity)
2. programs show at worst only minor decrease in speed (performance)
3. VMM is in complete control of system resources (safety and isolation)
What are the benefits of virtualization?
+ Consolidation - decrease cost - improve manageability
+ Migration - availability, reliability
+ Security (isolation)
+ Debugging (test new OS features)
+ Support for Legacy OSs (no longer needs specific hardware to run)
What is bare metal virtualization?
Hypervisor based; type 1
* Hypervisor Based, manages all hardware resources and supports execution of VMs
* Generally runs a Privileged Service VM to deal with devices and other configuration and management
* Xen: (open source or Citrix Xen Server)
* Dom0 = service VM - drivers run here
* DomUs = guest VMs
* ESX (VMWARE)
* many open API's
* Drivers in VMM
* Used to have Linux control core, now remote APIs
What is Hosted virtualization?
* Host OS owns all hardware
* special VMM module provides hardware interfaces to VM's and deal with VM context switching
Example: KVM (Kernel-Based VM)
* based on linux
* KVM kernel Module + QEMU for hardware virtualization
* Leverages Linux Open-Source community
What are the hardware protection levels and how are they implemented?
Ring 0: highest privilege (OS)
Ring 3: lowest privilege (apps)
With hardware virtualization support, root/non-root modes are added:
non-root mode (guest): ring 3 apps, ring 0 guest OS
root mode: ring 0 hypervisor
A VMExit switches to root mode; once the hypervisor is done, a VMEntry switches back to non-root mode.
What is trap-and-emulate in a hypervisor?
Non-privileged operations run at hardware speeds - just like a native OS.
Privileged operations trap to the hypervisor, which 'emulates' them if the access should be allowed. The hypervisor intervenes and performs the op on behalf of the guest OS, so the guest OS behaves as if it executed the operation itself.
What were the problems with trap and emulate?
Pre-2005 hardware didn't have root and non-root modes; it just had rings 0-3. The hypervisor ran in ring 0 and the guest OS in ring 1. There were 17 privileged instructions that, when executed from any ring other than ring 0, would fail silently rather than trap, so control never passed to the hypervisor in ring 0.
What is Binary Translation?
goal: full virtualization == guest OS not modified
approach: dynamic binary translation
1. inspect code blocks to be executed
2. if one of bad instructions is present, translate to alternate instruction that emulates the behavior (possibly even avoiding the trap)
3. Otherwise run at hardware speeds
Cache the translated blocks to amortize translation costs
What is paravirtualization?
The guest OS is modified so that it knows it's running virtualized
- it makes explicit calls to the hypervisor (hypercalls) for operations that would otherwise not trap
- behave similarly to system calls
How does memory virtualization work?
Full virtualization:
- Guest OS page table: VA => PA
- Hypervisor: PA => MA (machine addresses)
- Two translations per access is too slow!
- Instead, the hypervisor maintains a shadow page table (VA => MA) and keeps it in sync with the guest's page table
Paravirtualization:
- Guest is aware of virtualization
- No longer a strict requirement of contiguous physical memory starting at 0
- Explicitly registers its page tables with the hypervisor
- Can 'batch' page table updates to reduce VM exits
* These overheads are eliminated/reduced on newer platforms with hardware support!
Describe the difference in ease of virtualization between CPUs/Memory and other devices.
For CPUs/memory - less diversity at the instruction set architecture level; standardized interface.
For Devices - high diversity. lack of standard specification of device interface and behavior.
What are the 3 models for device virtualization?
Pass Through Model:
* VMM-Level driver configures device access permissions
+ VM provided with exclusive access to the device
+ VM can directly access the device (VMM-bypass)
- device sharing is difficult - Guest VM is mapped directly to a specific piece of hardware
- VMM must have exact type of device as what VM expects
- VM migration is tricky - b/c mapped to specific piece of hardware, new system would need to have that hardware. Can't take state with it.
Hypervisor Direct Model:
* VMM intercepts all device accesses
* Emulate device operation
* translate to generic I/O op, traverse VMM resident I/O stack, invoke VMM-resident driver
+ VM decoupled from physical device
+ sharing, migration, and dealing with device specifics become easier
- latency on device accesses
- device driver ecosystem integrated with hypervisor. Needs to support all drivers so it can support the operations and is exposed to driver bugs
Split-Device Driver Model
* device access control split between:
* front-end driver in guest VM (device API)
* back-end driver in service VM (or host)
- Applies only to paravirtualized guests (look this up?)
+ eliminates emulation overhead
+ allow for better management of shared devices
What properties must a VMM satisfy?
Efficiency - All innocuous instructions are executed by the hardware directly, with no intervention at all on the part of the VMM.
Resource Control - Must be impossible for arbitrary program to affect the system resources.
Equivalence - Any program K executing with a control program resident, performs in a manner indistinguishable from the case where the control program was absent and K had whatever freedom of access to privileged instructions the programmer had intended.
What is the purpose of RPC?
Intended to simplify the development of cross address space and cross machine interaction
+ higher level interface for data movement and communication
+ error handling is captured
+ hiding complexities of cross-machine interactions
What are the steps of the RPC Process?
-1. register: server 'registers' procedure, arg types, location...
0. Bind - client finds and 'binds' to desired server
1. call - client makes the RPC call; control passes to the stub, client code blocks
2. marshal - client stub 'marshals' arguments (serialize args into a buffer)
3. send - client sends message to server over protocol agreed upon in binding
4. receive - server receives message; passes msg to server-stub
5. unmarshal - server stub 'unmarshals' args (extracts args and creates data structures)
6. actual call - server stub calls local procedure implementation
7. result - server performs operation and computes result of RPC operation
Similar steps in return
What is an interface definition language (IDL)?
used to describe interface the server exports
- procedure name, arg, result types
- version #
RPC can use IDL that is:
- language agnostic (XDR in Sun RPC)
- language specific (Java in Java RMI)
** These are ONLY an interface specification, not an implementation **
What is marshaling?
Encodes the data into some agreed upon format so it can be interpreted on the other side.
Marshaling/UnMarshaling routines are provided by the RPC system compiler
What is binding?
- Which server it should connect to (service name, version number)
- How it will connect (IP address, network protocol...)
There is generally a 'registry' that servers register their services with and clients can query that registry to get the info necessary to find a server that will provide the service they request.
How can pointers be treated in RPC?
1. No pointers!
2. Serialize pointer content - copy referenced data (pointed to) structure to send buffer
In what ways can a distributed file system be organized?
- Client/Server on different machines
- File server distributed on multiple machines
- replication (each server: all files)
- partitioned (each server: part of files) -- more scalable than the replicated model
- both (files partitioned; each partition replicated)
- files stored on and served from all machines
- blurred distinction b/t clients and servers
Describe the ways a remote file service can be implemented (in the extremes).
1. Upload/download model: the client downloads the entire file, operates on it locally, then re-uploads it
+ local reads/writes at client
- entire file download/upload even for small accesses
- server gives up control
2. True remote file access
* every access to remote file, nothing done locally
+ file access centralized, easy to reason about consistency
- every file operation pays network cost
- limits scalability
The better solution is a compromise
What is the better, more compromised solution to a remote file service?
1. Allow clients to store parts of files locally (blocks)
+ Low latency on file operations
+ server load reduced -> more scalable
2. Force clients to interact w/ server, frequently
+ server has insights into what clients are doing
+ server has control into which accesses can be permitted (easier to maintain consistency)
- server more complex, requires different file sharing semantics
Compare and contrast stateless vs stateful servers
* Stateless: keeps no state
- OK w/ extreme models, cannot support 'practical' model
- cannot support caching and consistency management
- every request self-contained => more bits transferred
+ no resources used on server side (CPU/MM)
+ on failure, just restart (b/c stateless)
Stateful: Keeps client state
Needed for the 'practical' model to track what is cached/accessed
+ can support locking, caching, incremental operations (guaranteed consistency)
- On failure ... need checkpointing and recovery mechanisms
- Runtime overheads to maintain state and consistency => depends on caching mechanism and consistency protocol
What possible file sharing semantics exist?
Unix Semantics => every write visible immediately
Session Semantics (b/t open - close => session)
* write-back on close(), update on open()
* easy to reason, but may be insufficient
Periodic Updates:
- client writes back periodically => clients have a 'lease' on cached data (not necessarily exclusive)
- server invalidates periodically => provides bounds on inconsistency
- augment w/ flush()/sync() API
Immutable Files => never modify; new files are created instead
Transactions => all changes are atomic
Compare and contrast DFS Replication vs Partitioning
Replication: Each Machine holds all files
+ load balancing, availability, fault tolerance
- writes become complex (synch. to all, write to one and propagate to others)
- replicas must be reconciled
Partitions: each machine has subset of files
+ availability vs single server DFS
+ scalability w/ file system size
+ single file writes simpler (write localized to single machine)
- on failure, lose portion of data
- load balancing hard. if not balanced, hot spots possible
Can combine to replicate each partition
What are the NFS semantics for caching and locking?
- Session-Based (non-concurrent)
- Periodic Updates
- default: 3 sec for files, 30 sec for dirs
NFSv4 - delegation to client for a period of time (avoids the update checks above)
- NFSv4 - also 'share reservation' (reader/writer lock)
What are the benefits of distributed shared memory (DSM)?
+ lower cost
- slower access times
~ RDMA is reducing access times
Compare/contrast hardware vs software DSM
Hardware DSM:
- dedicated NICs translate memory accesses into messages and pass them to the appropriate node
- very expensive, but fast! High-end/supercomputing only
Software DSM:
- software detects which memory is local vs remote and sends messages to fetch remote memory
- accepts messages from other nodes and performs memory operations on their behalf
What are the various sharing granularities in DSM?
Cache line granularity
- coherence overheads too high for DSM
Variable granularity
- too fine-grained: many small variables => too much overhead per transfer
Page granularity (OS level)
- OS maps pages of virtual memory to physical memory
- 4 kB is a common page size
Object granularity (language runtime)
- OS doesn't need to be modified, but only usable from languages/runtimes that support DSM
Larger granularities amortize the cost of remote access
What is false sharing in DSM?
When two variables sit on the same page and two nodes each concurrently write to a different one of them, neither caring about the other's variable.
page = [x y]
node 1 only cares about x
node 2 only cares about y
but from the DSM perspective, they are shared, and will interpret it as concurrent write access to the same piece of memory (page). This will trigger all overheads associated with coherence, invalidation, etc, even though the two nodes don't care about the other piece of memory that's being modified.
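The effect is just address arithmetic: two unrelated variables whose addresses fall in the same page become one shared unit from the DSM layer's point of view. The addresses and 4 kB page size below are made up for illustration; padding the layout so each variable sits on its own page is one common way to avoid the problem.

```python
# Sketch of false sharing at page granularity.

PAGE_SIZE = 4096

def page_of(addr):
    return addr // PAGE_SIZE

addr_x = 0x1000   # node 1 only writes x
addr_y = 0x1008   # node 2 only writes y

# Same page => DSM sees concurrent writes to one shared unit (false sharing).
assert page_of(addr_x) == page_of(addr_y)

# Padding so each variable gets its own page removes the false sharing.
addr_y_padded = addr_x + PAGE_SIZE
assert page_of(addr_x) != page_of(addr_y_padded)
```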
What types of applications use DSM?
single read single writer (SRSW)
multiple readers single writer (MRSW)
multiple readers multiple writers (MRMW)
What is migration vs replication, and when would you use each?
Migration:
- move data from one node to another for access
- good for single reader/single writer (SRSW)
Replication:
- replicate (cache) data on each node that wants to access it
- good for multiple readers/multiple writers
- requires consistency management; can control this overhead by limiting the number of replicas
What are the DSM consistency management options?
Push invalidations when data is written to ...
- proactive, eager, pessimistic
Pull modification info periodically ...
- on-demand (reactive), lazy, optimistic
When these methods get triggered depends on the consistency model
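The two options can be contrasted in a few lines. This is a hedged sketch with invented helper names, not a real DSM protocol: `push_write` eagerly invalidates every cached copy at write time, while `pull_read` lazily re-fetches from the home node only when a reader actually needs the value.

```python
# Push (proactive/eager/pessimistic) vs pull (reactive/lazy/optimistic).

class Replica:
    def __init__(self):
        self.cache = {}

def push_write(home, replicas, key, value):
    """Write, then immediately invalidate every cached copy."""
    home[key] = value
    for r in replicas:
        r.cache.pop(key, None)        # eager invalidation on write

def pull_read(home, replica, key):
    """Fetch from the home node only when the value is actually read."""
    if key not in replica.cache:      # cache miss => pull on demand
        replica.cache[key] = home.get(key)
    return replica.cache[key]

home = {}
r = Replica()
push_write(home, [r], "x", 1)
assert "x" not in r.cache             # invalidated eagerly by the writer
assert pull_read(home, r, "x") == 1   # re-fetched lazily by the reader
```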
What are the major points of the DSM architecture?
- each node contributes part of its memory pages to DSM
- all nodes responsible for part of the distributed memory
- home node manages access and tracks page ownership
- explicit replication possible for load balancing, performance, or reliability
What metadata is maintained in DSM?
Each page (object) has:
- address == node ID + page frame number
- node ID == "home" node
Global map (replicated):
- object (page) id => manager node id
- manager map available on each node!
Metadata for local pages (partitioned):
- per-page metadata is distributed across managers
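A minimal sketch of this addressing scheme: a global address packs the home node ID with a page frame number, and a replicated manager map lets any node find a page's manager without asking around. The bit widths and the modulo mapping are assumptions for illustration only.

```python
# Sketch of DSM addressing and the replicated manager map.

NUM_NODES = 4
FRAME_BITS = 20   # assumed width of the page-frame-number field

def make_address(home_node, frame):
    # address == node ID + page frame number
    return (home_node << FRAME_BITS) | frame

def split_address(addr):
    return addr >> FRAME_BITS, addr & ((1 << FRAME_BITS) - 1)

def manager_of(page_id):
    # Replicated on every node, so any node can locate a page's manager
    # locally. A simple modulo scheme stands in for the real map.
    return page_id % NUM_NODES

addr = make_address(home_node=2, frame=0x1A)
assert split_address(addr) == (2, 0x1A)
assert manager_of(7) == 3
```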
What is strict consistency?
updates visible everywhere immediately and ordering of updates must be preserved
In practice though ...
- due to latency and message loss/reordering, impossible to guarantee; only a theoretical model
What is sequential consistency?
memory updates from different processors may be arbitrarily interleaved.
All processors should see the same interleaving.
Operations from the same process always appear in the order they were issued.
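The two rules above can be made concrete by enumeration: any interleaving is legal as long as each process's operations stay in program order, and all processors must observe the same one. A sketch for two 2-operation processes:

```python
# Enumerate the sequentially consistent interleavings of two processes.

from itertools import permutations

p1 = ["w1a", "w1b"]   # P1's ops, in program order
p2 = ["w2a", "w2b"]   # P2's ops, in program order

def preserves_program_order(seq, prog):
    # The ops of one process, filtered out of the interleaving,
    # must appear exactly in their original program order.
    return [op for op in seq if op in prog] == prog

legal = {
    seq for seq in permutations(p1 + p2)
    if preserves_program_order(seq, p1) and preserves_program_order(seq, p2)
}

# 4!/(2! * 2!) = 6 ways to interleave while keeping both program orders.
assert len(legal) == 6
```

Sequential consistency permits any one of these 6 interleavings; it only forbids a run where, say, P3 observes one of them and P4 observes another.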
What is causal consistency?
Causally related writes will maintain order
The writes from a single processor cannot be arbitrarily reordered.
For 'concurrent' writes (non-causal, unrelated), no guarantees about order.
What is weak consistency?
Uses synchronization primitives - now there are R, W, and Sync operations.
When P1 makes a write and then calls sync, P2 won't see those changes until it too calls sync; at that point it also picks up any other updates that have occurred.
- Single sync operation (sync)
- separate sync per subset of state (page)
- separate "entry/acquire" vs "exit/release"
- acquire = acquire updates made by others
- release = release all the updates made by us to others
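The P1/P2 scenario above can be sketched directly: writes stay in the writer's private view until it syncs (release), and a reader sees them only after it syncs too (acquire). The `SharedStore`/`Process` classes are invented for illustration.

```python
# Sketch of weak consistency: R/W plus an explicit sync operation.

class SharedStore:
    def __init__(self):
        self.committed = {}       # updates visible to anyone who syncs

class Process:
    def __init__(self, store):
        self.store = store
        self.local = {}           # this process's private view
        self.pending = {}         # writes not yet released

    def write(self, key, value):
        self.local[key] = value
        self.pending[key] = value

    def read(self, key):
        return self.local.get(key)

    def sync(self):
        self.store.committed.update(self.pending)  # release our updates
        self.pending = {}
        self.local.update(self.store.committed)    # acquire others' updates

store = SharedStore()
p1, p2 = Process(store), Process(store)
p1.write("x", 42)
p2.sync()
assert p2.read("x") is None   # P1 hasn't synced yet
p1.sync()
assert p2.read("x") is None   # P2 must also sync to see it
p2.sync()
assert p2.read("x") == 42     # visible only after both have synced
```

Splitting `sync()` into separate acquire (the `local.update` step) and release (the `committed.update` step) operations gives the entry/exit variant in the last bullet.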
What are the two main types of internet architectures for internet services?
=> Functionally homogeneous: each node does the same job (like boss-worker where each worker handles the entire request)
+ keeps front end simple
- how to benefit from caching? No locality benefits.
=> Functionally heterogeneous: each node performs a specific function
+ benefit of locality and caching
- more complex front end (which request should go to which node?)
- more complex management (more requests of a certain type => can't just add any old node, need more of that specific node type)
- some nodes can end up as hot spots with long request queues
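The extra front-end complexity in the second design amounts to a routing table: the front end must know which pool of specialized nodes handles each request type. The pool names and request types below are invented for illustration.

```python
# Sketch of a front end for a functionally heterogeneous cluster.

pools = {
    "browse":   ["nodeA", "nodeB"],  # nodes specialized for browsing
    "checkout": ["nodeC"],           # nodes specialized for checkout
}

counters = {t: 0 for t in pools}     # per-type round-robin position

def route(request_type):
    nodes = pools[request_type]      # front end must know the mapping
    node = nodes[counters[request_type] % len(nodes)]
    counters[request_type] += 1
    return node

assert route("browse") == "nodeA"
assert route("browse") == "nodeB"
assert route("checkout") == "nodeC"  # single-node pool: hot spot risk
```

Scaling under load also follows from the table: a burst of checkout requests can only be absorbed by adding nodes to the `checkout` pool, not just any node.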