Data Intensive Ch8 - Trouble with distributed systems Flashcards

1
Q

Context map

A

Process pauses
Unbounded network delays
Clock errors

2
Q

Chapter Summary

A

Problems in distributed systems

  • a packet sent over the network may be lost or delayed for an unknown amount of time; the reply may be lost too – you have no idea whether the request was received
  • a node’s clock may be out of sync with other nodes’ clocks
  • a process can be paused for a significant amount of time at any point in its execution (e.g. a stop-the-world GC); the node can be declared dead and come back to life later

Partial failures are a defining characteristic of distributed systems.
Whenever network communication is involved, it may:
- occasionally fail
- randomly go slow
- not respond at all

Design for failure - tolerate partial failures

Detecting failure is hard – timeouts are one way, but not ideal. Dealing with a node that was declared dead but is not really dead is difficult.

After fault detection, decisions must be made based on a quorum of nodes. With no global variable, no shared memory and no shared state, consensus is a hard problem.

Working with distributed systems is fundamentally different from writing software for a single computer – there are lots of new ways for things to go wrong.

3
Q

Partial failure

A

In a single-machine environment there is no reason for things to be flaky: either it works or it doesn’t.
Operations are deterministic.
Memory corruption leads to total system failure (kernel panic, blue screen of death, failure to boot)

In distributed systems some parts can be broken in unpredictable ways while others are fine.
They are NONDETERMINISTIC.
Information traveling over the network takes an UNKNOWN amount of time to reach its destination, so we don’t know whether an operation SUCCEEDED or NOT

4
Q

Difference between supercomputer (HPC) and cloud computing (or distributed computing)

A

High-performance computing systems with thousands of CPUs are designed so that if one node fails, the entire cluster fails.
Failure of a part escalates to a total system crash.
Jobs periodically write checkpoints.
If a failure occurs, the job is restarted from the last checkpoint.
HPC is more similar to a single-node computer.

For cloud computing:

  • no batch-job-like workloads – online, low-latency, serving users at any time
  • high availability is important
  • nodes are built from lower-cost components
  • cloud networks are based on IP and Ethernet; HPC uses specialized network topologies – meshes, toruses
  • the bigger the system (more parts), the more likely one part is broken; it’s safe to assume there is always something broken, so the strategy of escalating any failure to the whole system won’t work
  • rolling upgrades and a cattle-like approach let the service keep serving users without interruption
  • keeping data geographically close to users reduces latency, but communication then goes over the unreliable, slow internet; HPC assumes nodes are close together
5
Q

Unreliable network - what could go wrong

A
  • request is lost (broken network cable)
  • request is waiting in a queue (recipient is overloaded)
  • remote node has crashed
  • remote node is temporarily not responding (process pause)
  • remote node processed the request but the response was lost or delayed
6
Q

Distributed systems are … systems

A

Shared-nothing – a bunch of machines connected by a network

One machine cannot access another’s memory or disk except by making a network request

7
Q

Network partition

A

AKA netsplit

Condition in which one part of the network is cut off from the rest due to a network fault

8
Q

Detecting faults

A

The only way to know whether a request was successful is to receive a positive response.
If anything goes wrong, we cannot assume we will get a response back.
We can try app-level retries and timeouts – eventually the node must be declared dead (e.g. to stop sending more requests to it)

Some feedback about node’s failure:

  • no process is listening on the app’s port – the OS returns RST/FIN TCP packets (but if the node crashed while handling the request, there is no way to know how much it processed)
  • the process crashed or was killed but the OS is still running – a script can notify other nodes about the crash
  • the management interface of a network switch can be queried to detect link failures at the hardware level (e.g. the machine was powered down)
  • a router may send ICMP Destination Unreachable (but the router is not an oracle)
9
Q

How long should the timeout be

A

No simple answer.
Long timeout – long wait until a node is declared dead -> users will have a poor experience
Short timeout – faster fault detection -> high risk of false positives on temporary slowdowns (load spikes)
-> some operations might be retried and done twice
-> other nodes taking over the responsibility might get overloaded (cascading failures)

Assuming a system in which packets are either delivered within time “d” or lost, and a non-failed node always handles a request within time “r”, a reasonable timeout is 2d + r.
Problem – asynchronous networks have UNBOUNDED DELAYS.
Packets are delivered as quickly as possible, but there is no upper limit on delivery time.
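The 2d + r bound is trivial arithmetic; a sketch with assumed values (the function name and the numbers are illustrative, not from the book):

```python
# Sketch of the 2d + r timeout bound, with assumed values in milliseconds.
# d = maximum one-way network delay, r = maximum request-handling time.

def timeout_bound(d_ms: float, r_ms: float) -> float:
    """Worst-case round trip: request travel (d) + handling (r) + reply travel (d)."""
    return 2 * d_ms + r_ms

# With d = 100 ms and r = 50 ms, any response should arrive within 250 ms.
print(timeout_bound(100, 50))  # → 250
```

In an asynchronous network no such d exists, which is exactly why this bound cannot be relied on in practice.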

10
Q

Most common reason of network packet delays

A

Queueing
As with traffic jams – travel times vary due to congestion

  • several nodes try sending packets to the same node at the same time; the network switch must queue them and feed them into the destination link one by one. If the queue overflows, the packet is dropped and must be resent
  • the packet reaches the node but all CPU cores are busy – the request waits in a queue until the app handles it
  • the VM is paused (so other VMs can use the CPU) – no data is consumed from the network, so it must be buffered by the VM monitor
  • TCP flow control (backpressure) – a node limits its own rate of sending to avoid overloading the network link/receiver; extra queueing at the sender side
  • TCP automatic retransmission – if a packet is not acknowledged within a timeout (based on the observed round-trip time), TCP retransmits it, which adds extra delay
11
Q

How to choose timeout values?

A

Experimentally – measure the distribution of network round-trip times over

  • an extended period of time
  • many different machines

Continually measure response times and their variability (jitter) to automatically adjust timeouts according to the observed response-time distribution. Similar to TCP retransmission timeouts.

Phi Accrual failure detector (used in Akka and Cassandra)
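A minimal sketch of adapting the timeout to the observed RTT distribution (mean plus a multiple of the standard deviation over a sliding window). The Phi Accrual detector is considerably more sophisticated; the class name and all constants here are illustrative assumptions:

```python
# Sketch: timeout = mean + k * stddev over a sliding window of RTT samples.
from collections import deque
import statistics

class AdaptiveTimeout:
    def __init__(self, window: int = 100, k: float = 4.0, floor_ms: float = 10.0):
        self.samples = deque(maxlen=window)  # recent round-trip times, in ms
        self.k = k                           # how many stddevs of headroom
        self.floor_ms = floor_ms             # never time out faster than this

    def record(self, rtt_ms: float) -> None:
        """Feed in each observed round-trip time."""
        self.samples.append(rtt_ms)

    def timeout_ms(self) -> float:
        if len(self.samples) < 2:
            return self.floor_ms * 10  # conservative default until we have data
        mean = statistics.fmean(self.samples)
        std = statistics.stdev(self.samples)
        return max(self.floor_ms, mean + self.k * std)
```

High jitter widens the timeout automatically, trading slower fault detection for fewer false positives – the same trade-off the card describes.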

12
Q

Why network cannot have predictable delays?

A

The telephone network is a circuit-switched network:
a fixed bandwidth is allocated per call, along the entire route between the two callers, for the entire duration of the call.
The bandwidth is small (around 16 bits of audio every 250 microseconds).
No queueing means that even with many hops there is a maximum end-to-end latency – BOUNDED DELAY.

Ethernet and IP are packet-switched network protocols:
TCP uses whatever bandwidth is available.
Idle TCP connections DO NOT USE any bandwidth at all.
TCP allows VARIABLE-SIZED blocks of data.
The rate of data transfer adapts to the available network capacity.

Why – optimized for BURSTY TRAFFIC.
Circuits are good for audio/video calls – a constant, predictable number of bits per second for the duration of the call.
Web pages, emails, file downloads – no fixed bandwidth requirement -> latency is the priority, complete ASAP.

The TCP way is cheaper – better resource utilization.

13
Q

Ways to emulate circuit switching on packet-switched networks

A
Quality of service (prioritization and scheduling of packets)
Admission control (rate-limiting of senders)

Goal: statistically bounded delay

14
Q

What are time and clocks used for in apps?

A
  • detecting timeouts
  • measuring statistics
    • 99th percentile response time
    • user engagement on a particular site
    • requests per second, averaged over the last X minutes
  • publish date of a resource
  • cache expiry date
  • when did an error occur

Measuring duration vs points in time

15
Q

Why is time tricky in distributed systems?

A

Communication is not instantaneous:
a message travels across the network from one machine to another,
so the time when a message is received is always later than the time it was sent.
The delay is variable.
Each machine has its own, IMPERFECT clock (dedicated hardware).
Clocks can run slightly faster or slower than each other.

This makes it difficult to
- order events
- measure durations
based on the time reported by the machines involved.

16
Q

What are 2 kinds of clocks?

A

Time-of-day clock

  • for measuring wall-clock time
  • usually synchronized with NTP
  • synchronization may cause a “reset” to a point in the past
  • a no-no for measuring elapsed time

Monotonic clock

  • for measuring durations
  • always moves forward (no jumps back in time)
  • its absolute value is meaningless
  • absolute values must not be compared between machines
  • typically a separate timer per CPU socket
  • NTP may adjust the FREQUENCY at which it moves forward – CLOCK SLEWING
  • resolution of microseconds or finer

They differ in purpose.
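The two kinds map directly onto Python’s `time.time()` (time-of-day) and `time.monotonic()` (monotonic). A small sketch of why durations should use the monotonic clock:

```python
# time.time() can jump backward if NTP resets the clock;
# time.monotonic() is guaranteed never to go backward.
import time

start = time.monotonic()           # monotonic: safe for measuring durations
work = sum(range(100_000))         # some work to time
elapsed = time.monotonic() - start
assert elapsed >= 0                # guaranteed, even if NTP adjusts the clock

wall = time.time()                 # time-of-day: meaningful as a point in time,
                                   # but a difference of two readings may be
                                   # negative after a clock reset
```

Comparing `time.monotonic()` values taken on two different machines is meaningless, exactly as the card warns.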

17
Q

What’s NTP?

A

Network Time Protocol

Clients adjust their clocks according to the time reported by a pool of servers.

18
Q

Clock sync issues

A

Clock drift, clock resets and jumps in time, VM pauses, leap-second handling, malicious NTP servers
- a quartz clock is inaccurate – it drifts (runs faster or slower)
Google assumes 200 parts per million drift (≈17 seconds of drift for a clock synced once a day)
- if the clock differs too much from NTP it may refuse to sync, or be FORCIBLY RESET – the app will observe a jump backward/forward in time
- a firewall may block a node from reaching NTP, and the clock skew can go unnoticed for some time
- sync is susceptible to network delay like any network request (a minimum error of about 35 ms is achievable when syncing over the internet)
- an NTP server may be wrong/misconfigured/malicious
A node should ask multiple NTP servers
- leap seconds – a minute may be 59 or 61 seconds long -> may cause crashes
SMEARING – a technique that applies the leap second gradually over the course of a day – not all NTP servers support it
- VMs have virtualized hardware clocks – when a VM is paused, the app will observe a sudden jump in time after it’s resumed
- software running on untrusted devices (mobiles, embedded) – users can set their clocks arbitrarily

19
Q

What to do if software requires synchronized clock?

A

MINIMIZE THE DAMAGE of clock drift:
MONITOR clock offsets between ALL machines.
A node whose clock drifts too far from the others should be declared dead and removed from the cluster.

20
Q

Timestamp based replication problems

A

Example: Figure 8-3, page 292.
Even with a skew of only 3 ms, when the node with the lagging clock replicates a newly written value, it may attach a timestamp lower than that of the last write on the fast-clock node – and the newer write is silently dropped.

Issues with last-write-wins (LWW) conflict resolution based on timestamps:

  • a fast-clock node may prevent other nodes from overwriting its values until the clock skew has elapsed on the slow-clock node
  • cannot distinguish sequential writes in quick succession from truly concurrent ones
  • two nodes can generate writes with the same timestamp (especially with a low-resolution clock – milliseconds)
    • using a tiebreaker value (a random number, for example) violates causality
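The clock-skew problem with LWW fits in a few lines (timestamps, values, and the helper name are made up for illustration):

```python
# Sketch: last-write-wins silently dropping the causally newer write
# when the writer's clock lags.

def lww_merge(stored, incoming):
    """Keep whichever (timestamp, value) pair has the larger timestamp."""
    return incoming if incoming[0] > stored[0] else stored

# Node B's clock lags node A's by a few milliseconds.
write_a = (1000, "x=1")  # written first, on fast-clock node A
write_b = (997,  "x=2")  # written AFTER write_a, on slow-clock node B

final = lww_merge(write_a, write_b)
# The causally newer write "x=2" loses: final == (1000, "x=1")
```

This is why LWW with wall-clock timestamps can drop data without reporting any error.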
21
Q

Using clocks for DB snapshot isolation level

A

Generating monotonically increasing transaction IDs across all partitions is expensive when there are lots of small, rapid transactions (e.g. Twitter’s Snowflake generates only approximately-ordered IDs).

We can use timestamps if we know the uncertainty of the clock:
interval timestamps with an earliest and a latest bound.
If two intervals do not overlap – we’re good to go, the order is clear.
Otherwise the order is unknown.

Google Spanner does this (TrueTime); its datacenters have GPS receivers/atomic clocks.
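A hedged sketch of the interval-timestamp comparison (the `Interval` type and the function name are illustrative, not Spanner’s actual API):

```python
# Each timestamp is an interval [earliest, latest] covering the clock uncertainty.
from typing import NamedTuple, Optional

class Interval(NamedTuple):
    earliest: float
    latest: float

def definitely_before(a: Interval, b: Interval) -> Optional[bool]:
    """True/False if the order is certain, None if the intervals overlap."""
    if a.latest < b.earliest:
        return True   # all of a happened before any of b
    if b.latest < a.earliest:
        return False  # all of b happened before any of a
    return None       # overlapping uncertainty: order unknown

print(definitely_before(Interval(1.0, 2.0), Interval(3.0, 4.0)))  # → True
print(definitely_before(Interval(1.0, 3.5), Interval(3.0, 4.0)))  # → None
```

Spanner resolves the `None` case by deliberately waiting out the uncertainty interval before committing.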

22
Q

What is lease?

A

Similar to a lock with a timeout.
Nodes can use one to determine which of them is the leader.
Must be renewed periodically.
When a node fails, it stops renewing; after the lease expires, another node can take it over.
Example: Kinesis leases for shard processing.

23
Q

Lease validation/renewal issues

A
while True:
  request = get_incoming_request()
  # renew the lease if it is within 10 seconds of expiring
  if self.lease.expiry - current_millis() < 10000:
    self.lease = self.lease.renew()
  if self.lease.valid():
    process(request)

Issues:

  1. relies on synchronized clocks (the expiry time of the lease is set by a different machine)
  2. the code assumes little time passes between checking the time and processing the request. A 10-second buffer is usually fine, BUT what if the thread stops for 11 seconds? The lease will have expired, and another node may already have taken it over.
24
Q

What is thrashing

A

Swapping pages in and out of memory while getting little actual work done.
Better to disable paging on server machines!

25
Q

Why is it a bad idea to rely on very little time passing between 2 lines of code?

A

Nasty process pauses and thread preemption:
GC, surprise IO (classloading, memory swapping), client-device suspension, VM/thread context switches, live VM migration
- stop-the-world GC in the JVM
- a VM can be suspended and resumed at any time, for any duration (LIVE MIGRATION of a VM from one host to another without rebooting – save the VM state and restore it elsewhere)
- client devices can be suspended and resumed arbitrarily (closing a laptop lid, locking the phone)
- context switch to another thread, or the hypervisor switches to another VM – the current thread can be suspended at any point in the code; under heavy load, pauses can be long
- the app performs synchronous IO – the thread is paused; some IO is implicit
(e.g. the Java classloader lazily loading class files on first use)
- OS swapping to disk (paging) – a simple memory access can lead to IO
- the Unix SIGSTOP signal (Ctrl+Z); an ops engineer may send it accidentally :D

26
Q

Why is it hard to write multi-threaded-style code in a distributed system?

A

No shared memory – only network communication.
Cannot use mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, etc.
Nodes must assume arbitrary execution pauses.

27
Q

What is real-time OS?

A

Real-time means the system is designed and tested to meet timing guarantees under all circumstances (no process pause is welcome in an airbag-release system).
Software must respond within a specified deadline or it may cause a failure (watchdog?).

All functions must document their worst-case execution time. Dynamic memory allocation is usually restricted.

Lower throughput, due to the timing-guarantee constraints.
Expensive.

28
Q

How to limit impact of GC?

A

Treat GC as a brief planned outage.
The runtime can notify the application that a GC pause will soon be required.
The load balancer should take the node out of rotation until GC is done.

Another idea: use the GC only for short-lived objects, and restart the app periodically to clear stale long-lived objects.

29
Q

The truth is defined by majority meaning

A

Asymmetric network fault – a node can receive all messages sent to it but is unable to respond. After a timeout, the other nodes declare it dead even though it is working.
A node is paused due to GC, declared dead, and then comes back; the other nodes resume talking to it, but the resumed node does not realize how much time has passed.
Nodes can’t trust their own judgment – it is prone to errors.
Holding a lease and performing a lease-requiring operation after a long GC pause is an example.
Fig 8-4, p. 302
A quorum is needed – if a quorum declares a node dead, then it is dead and stays dead.

30
Q

Fencing tokens

A

Solve the issue of nodes falsely believing they still hold a lease.
The server granting the lock/lease returns a fencing token – a number that increases monotonically with every grant.

Requirement – clients must pass the fencing token along with their write requests.
Once a larger fencing token has been observed, writes carrying a lower one are rejected.

ZooKeeper as a lock service: the zxid or cversion can be used as a fencing token.
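The rejection rule can be sketched as a storage service that remembers the highest token seen so far (a toy illustration under assumed names, not ZooKeeper’s implementation):

```python
# Sketch: storage-side fencing. Writes carrying a token lower than the
# highest one seen so far come from a stale lease holder and are rejected.
class FencedStorage:
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.max_token:
            return False  # stale lease holder (e.g. woke up after a GC pause)
        self.max_token = max(self.max_token, token)
        self.data[key] = value
        return True

store = FencedStorage()
assert store.write(33, "file", "a")      # old leader, token 33
assert store.write(34, "file", "b")      # new leader, token 34
assert not store.write(33, "file", "c")  # old leader resumes: rejected
```

Note the check happens at the storage service, not at the clients – the whole point is that a paused client cannot be trusted to know its lease expired.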

It is assumed that nodes are UNRELIABLE but HONEST.
Byzantine Generals Problem – some nodes cannot be trusted; the goal is still to reach consensus.
“Byzantine” in the sense of excessively complicated.

31
Q

Which systems need to be Byzantine Fault Tolerant?

A

BFT – the system continues to operate correctly even when there are Byzantine (lying) nodes.
Aerospace systems – CPU registers or memory can be corrupted by radiation, which leads to unpredictable behavior.

Systems with multiple participating organizations – some might attempt to cheat. Peer-to-peer networks like Bitcoin: mutually untrusting parties must agree on whether a transaction happened or not.

32
Q

Protection against weak forms of “lying”

A

Checksums in the application-level protocol (extra protection against corrupted packets, on top of what TCP/UDP already provide)

User input sanitization:
range-check values, cap string sizes, prevent SQL injection

NTP clients should use multiple servers:
estimate the error for each, discard outliers, and take the majority

33
Q

UDP

A

User Datagram Protocol
No ACK, no retransmissions
Video streaming – there is no point in resending audio from the past

34
Q

Three system models with regard to timing

A

Synchronous – bounded network delay, bounded process pauses, bounded clock drift; not realistic

Partially synchronous – behaves synchronously most of the time, but can occasionally exceed the bounds on network delay, pauses, and clock drift.
A realistic model for many systems – most of the time the network behaves.

Asynchronous – no timing assumptions, no clocks, no timeouts; super restrictive

35
Q

Types of node failures

A

Crash-stop faults – a node can fail only by crashing; if it stops responding, it is gone forever

Crash-recovery faults – a node can crash at any moment and may start responding again after an unknown time. Assumes stable (nonvolatile) storage is preserved across crashes, while in-memory state is lost

Byzantine faults – nodes can do anything, including tricking and deceiving other nodes

36
Q

Algorithm correctness in distributed systems – safety and liveness properties

A

Safety – nothing bad happens (for fencing tokens: uniqueness and a monotonically increasing sequence).
If a safety property is violated, we can point at the particular point in time when it was broken; the violation cannot be undone.

Liveness – something good EVENTUALLY happens.
It may not hold at some point in time, but it can still be satisfied in the future (after sending a request, we may receive a response later).