Data Intensive Ch9 - Consistency and Consensus Flashcards
Context map
Timestamp ordering
Lamport timestamps
Causal ordering
Total order broadcast
Distributed transactions
Linearizability
Compare-and-set
Global constraints
Failure detectors - membership services
Summary
Linearizability - consistency model
Goal - make replicated data appear as though there was only a single copy; make all ops appear as if they acted on that copy atomically
Totally ordered, single timeline
Like a variable in a single-threaded program; slow; susceptible to network delays
Causality - imposes ordering on events - what happened before what; cause and effect
Weaker consistency model - some things can be concurrent
Timeline with branching & merging
Less sensitive to network problems - less coordination overhead
Can be captured using Lamport timestamps
Causality is not sufficient for ensuring e.g. username uniqueness
Consensus required
Achieving consensus - deciding something in a way all nodes agree on what was decided - decision is irrevocable
Many problems are reducible to consensus
- > Linearizable compare-and-set (atomically decide whether to set the value, based on whether the current value equals the expected one; see the sketch after this list)
- > Atomic distributed trx commit
- > Total order broadcast (decide order in which to deliver messages)
- > Lock/lease (decide who acquired one)
- > membership/coordination service (given failure detector like timeout decide which node is alive)
- > Uniqueness constraint (decide who wrote the value first and who should abort)
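Side note: a minimal single-process sketch of what the compare-and-set operation decides (names are hypothetical; a lock stands in for the atomicity a real linearizable store must provide):

```python
import threading

class Register:
    """In-memory register, only to illustrate compare-and-set semantics."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # Atomically: write `new` only if the current value equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# e.g. claiming a username: only the first claim succeeds
reg = Register()
assert reg.compare_and_set(None, "alice") is True    # first claim wins
assert reg.compare_and_set(None, "bob") is False     # later claim is rejected
```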
Consensus is easy if there is only a single node, or if only a single node is allowed to decide. A single-leader database is an example. The leader is a SPOF; its failure can be handled by:
- waiting for the leader to recover (what if it doesn't? termination property violated - the system is blocked forever)
- Consensus by act of God -> manual fail over - admin chooses new leader and reconfigures the cluster
Humans are SLOW
- Algorithmically auto select new leader - CONSENSUS ALG
Single leader DB provides linearizability for writes without consensus but requires it to maintain its leadership!
Leader “KICKS THE CAN DOWN THE ROAD”
Outsourcing consensus, failure detection & membership service using ZooKeeper and similar
Easier than implementing custom algorithms able to withstand: partial failure (network packet lost), process pauses, clock drifts
Not all systems require consensus - leaderless, multi-leader rely on handling conflicts instead
Causality is enough; no need for linearizability
Idea behind building fault-tolerant systems
Find general-purpose abstractions with useful guarantees, implement them once and let apps rely on them.
Database transactions as an example - apps can pretend there are no crashes (atomicity), nobody accesses the db at the same time (isolation), storage devices are reliable (durability)
Eventual consistency revisited
For replicated databases - if you stop writing to DB for some UNSPECIFIED length of time then all reads will finally return the same value
Weak guarantee - no guarantees when replicas converge
No guarantees for reading your own writes
Problems appear only when there is high concurrency or network issues.
Distributed consistency vs transaction isolations
Trx isolation primarily about avoiding race conditions due to concurrently executing trx
DC - coordinating state of replicas in the face of delays and faults
Linearizability
Atomic consistency, strong consistency, immediate consistency, external consistency
Idea: All clients have the same view of the data as if there is only one replica. No worries about replication lag
Constraint - after any read has returned a new value ALL following reads from any replica must also return the new value
Imagine there is some point in time (between start and end of the write) where value of x flips from old to the new value everywhere
Linearizability vs Serializability
Different guarantees
Serializability - trx isolation property
Every trx may read/write many OBJECTS
Trx behave as though they had executed in SOME serial order. The serial order may not be the order in which trx were actually run.
Linearizability - guarantees on reads and writes of an INDIVIDUAL object
Does not group operations into trx
Does not prevent write skews
2PL, literal serial execution are typically linearizable as well
SSI is not - reads are made from consistent snapshot
Snapshot does not include more recent writes than the snapshot
Where do we need linearizability
Locking/leader election
Only single leader allowed (no split brain)
Once lock is acquired all reads must acknowledge that
Uniqueness and other constraints
Same as acquiring a lock for given name
Bank account cannot go below zero
Cross-channel timing dependencies
Image resizer - a message queue tells the resizer to fetch an image from storage; the message (a faster channel) may arrive before the new version of the image is visible in storage, so the resizer reads a stale image.
Which replication methods are linearizable?
Single leader - potentially yes - reads made from the leader or from synchronously updated followers. If snapshot isolation is used then no.
If delusional leader continues to serve requests - likely to violate
Async replication, failover - committed writes can be lost - violates durability and linearizability
Consensus - we can implement linearizable storage using it
Multi-leader - nope, concurrently process writes on multiple nodes; async replication to other nodes; this produces conflicts
Leaderless - probably not - even quorum reads are not necessarily linearizable
A strict quorum (w + r > n) does not help because of variable network delays: one reader can get the new value from a replica that has already applied the write, while a later reader hits 2 replicas that haven't received the update yet and still sees the old value
Fig 9-6 p334
How to make Dynamo-style quorums linearizable?
Reader must perform a sync read repair
Writer must read the latest state of a quorum of nodes before sending its writes
Cost - reduced performance (rough sketch below)
p335
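A rough sketch of such a quorum read with synchronous read repair, assuming a hypothetical replica API get(key) -> (version, value) and put(key, version, value) with monotonically increasing versions (real systems also need timeouts, error handling and proper conflict resolution):

```python
def quorum_read(replicas, key, r):
    """Read from r replicas and synchronously repair stale ones before returning."""
    responses = [(replica, *replica.get(key)) for replica in replicas[:r]]
    # Pick the response with the highest version number.
    _, latest_version, latest_value = max(responses, key=lambda resp: resp[1])

    # Synchronous read repair: update stale replicas BEFORE returning,
    # so that no later reader can still observe the older value from them.
    for replica, version, _ in responses:
        if version < latest_version:
            replica.put(key, latest_version, latest_value)

    return latest_value
```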
Why is CAP theorem unhelpful?
The definition “Consistency, Availability, Partition (tolerance) - pick 2 out of 3” is misleading
Partitioning is a network fault which is not chosen - it just happens
If the network is working correctly, a system can provide both consistency (linearizability) and total availability; when the network fails, you have to choose one of the two.
Recommended to avoid CAP
Why are RAM and multicore CPU not linearizable system?
One thread writes a variable to memory and another thread reads it shortly afterwards - the read can be stale due to CPU caches.
Trade-off for LATENCY
Linearizability is dropped for performance, not fault tolerance.
Linearizability is slow always, not only during a network fault.
Using CAP here makes no sense
Consistency is dropped, yet we don't expect the system to remain available if there is a partition (a CPU disconnected from the rest of the system)
Causal consistency
System obeys the ordering imposed by causality
cause comes before effect, a message is sent before it is received, a question is asked before it is answered
Causal order - what happened before what
Examples:
Snapshot isolation - a consistent snapshot is causally consistent: if you see some value, you also see everything that causally precedes it
Causal order is a partial order
For 2 concurrent events, neither is greater nor lower than the other
Strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures!
Which kind of order is linearizability?
Total order
If the system behaves as if there is a single copy of the data and all operations are atomic, then for any 2 operations one must have happened before the other
There is no concurrency in linearizable datastore as there is always a single, linear timeline
Linearizability vs causality
Linearizability implies causality
Capturing causal dependencies
If a replica processes an operation, it must ensure ALL causally preceding operations (those that happened before it) have already been processed; otherwise processing must be postponed.
The system needs a way to know which values a node had seen when it made a given write.
If a node had seen X when it wrote Y, then X happened before Y (X is causally related to Y)
Causal dependencies must be tracked across the entire database (all objects), not just a single object.
Version vectors can be used for this
The DB needs to know which version of the data was read by the application; the snapshot's version number is passed along on a write.
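A small sketch of how version vectors distinguish causal order from concurrency, assuming a vector is a dict mapping node ID to that node's counter (illustrative names, not the book's code):

```python
def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two version vectors."""
    nodes = set(a) | set(b)
    a_leq_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_leq_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_leq_b and b_leq_a:
        return "equal"
    if a_leq_b:
        return "a happened before b"   # b causally depends on a
    if b_leq_a:
        return "b happened before a"
    return "concurrent"                # neither dominates: a genuine conflict

print(compare({"n1": 2, "n2": 1}, {"n1": 2, "n2": 3}))  # a happened before b
print(compare({"n1": 3, "n2": 1}, {"n1": 2, "n2": 3}))  # concurrent
```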
Sequence number ordering
Causality is a THEORETICAL concept
Actual keeping track of all causal dependencies can be an overkill
Clients usually read lots of data before writing something -> unclear which data write depends on: all? only some?
Sequence numbers/timestamps (not time-of-day, but logical-clock ones) provide a better way.
Algorithm to generate sequence of numbers that identify operations (counter incrementing every op executed)
Sequence numbers provide total order (one is always comparable to another)
Sequence numbers are consistent with causality -> if A happened before B then A's sequence number must be lower than B's. For concurrent writes the order is arbitrary.
Single-leader DBs - replication log defines total order of writes
When follower applies writes in replication-log order then followers are always CAUSALLY CONSISTENT (even if lagging behind)
How to generate sequence numbers for operations in leaderless or multi-leader environment?
Each node generates seq number independently. Some bits are reserved for unique node ID.
Timestamp from time-of-day clock can be attached to each operation. With sufficient resolution they might provide total order of operations.
Preallocation of blocks of sequence numbers. Node A might claim sequence numbers from 1-1000 and B from 1001 to 2000
These perform and scale better than pushing everything through a single-leader bottleneck BUT the numbers they generate are not consistent with causality (toy sketch of the node-ID approach below)
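Toy sketch of the reserved-node-ID-bits approach (constants are assumptions; the comment notes why the resulting order is not causal):

```python
NODE_ID_BITS = 10  # assumption: up to 1024 nodes, lowest 10 bits carry the node ID

class LocalSequencer:
    """Each node generates unique numbers on its own: (local counter << bits) | node_id.
    Numbers are unique, but their order is NOT consistent with causality,
    because a busier node's counter races ahead of a quieter node's."""
    def __init__(self, node_id: int):
        assert 0 <= node_id < (1 << NODE_ID_BITS)
        self.node_id = node_id
        self.counter = 0

    def next(self) -> int:
        self.counter += 1
        return (self.counter << NODE_ID_BITS) | self.node_id
```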
Causality problem with distributed sequence generators
Each node may process different number of operations per second
Imagine 2 nodes - one generates odd and another even numbers sequence.
Comparing an odd-numbered op with an even-numbered one says nothing about which happened first - the busier node's counter simply races ahead.
Timestamps are prone to clock-skew
Block allocator case - one operation may be assigned a number from a higher block than a causally later operation, so the later op ends up with the smaller number.
Lamport timestamps
Purpose: Generating sequence numbers consistent with causality
(node counter, node ID) - node ID guarantees uniqueness
Every node keeps track of max counter value they have seen so far; max value is included on every request
If node receives request/response with greater counter value they update their own counter
Fig 9-8, p346
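A compact Lamport clock sketch following the description above (method names are made up):

```python
class LamportClock:
    """(counter, node_id) timestamps whose order is consistent with causality."""
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0                      # max counter value seen so far

    def next_timestamp(self):
        """Local operation or send: bump the counter and stamp the operation."""
        self.counter += 1
        return (self.counter, self.node_id)   # counter is included on every request

    def observe(self, received_counter: int):
        """On receiving a request/response, adopt the greater counter value."""
        self.counter = max(self.counter, received_counter)

# Total order: compare counters first, break ties with the node ID (tuple comparison).
assert (5, 1) < (6, 0) and (5, 1) < (5, 2)
```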
Version vectors vs Lamport timestamps
Version vectors - can distinguish whether 2 operations are concurrent or whether one causally depends on the other
Lamport timestamps - enforce a total ordering, from which it is impossible to tell whether 2 ops are concurrent or causally dependent
Why is total ordering not enough to solve e.g. the uniqueness problem?
Total ordering only lets you determine the winner AFTER the fact
Given a request to create a user, a node cannot tell whether there is a concurrent attempt with the same username unless it asks all other nodes - which is obviously not fault tolerant.
Total order broadcast
Protocol for exchanging messages between nodes.
Safety properties required:
- reliable delivery -> no message is lost; if a message is delivered to one node then it must be delivered to all
- totally ordered delivery -> messages are delivered to every node in the same order
STATE MACHINE REPLICATION
Useful for DB replication - each message = a write to the db
Every replica processes writes in the same order - they remain consistent with each other (aside from replication lag)
It’s a way of creating a log (replication, trx, write ahead log etc). Delivering a message = appending to log
All nodes must deliver the same messages in the same order; all nodes can read the log and see the same sequence of messages
Useful for implementing a lock service for fencing tokens
Sequence number can be fencing token itself
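A tiny sketch of fencing with that sequence number, assuming the token is the log position of the lock-acquisition message and the protected resource remembers the highest token it has seen (hypothetical API):

```python
class FencedResource:
    """Rejects writes that carry a fencing token older than one already seen."""
    def __init__(self):
        self.max_token_seen = -1
        self.value = None

    def write(self, fencing_token: int, value) -> bool:
        if fencing_token < self.max_token_seen:
            return False                  # stale lock holder (e.g. woke up from a long pause)
        self.max_token_seen = fencing_token
        self.value = value
        return True
```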
Linearizable storage with total order broadcast
TOB is async - messages are guaranteed to be delivered reliably in a fixed order, but EVENTUALLY (no guarantee of when)
TOB can be used to implement linearizable WRITE
Unique username example of implementing linearizable compare-and-set using total order broadcast
1. Append a message to the log
This is tentative reservation for the username
2. Read the log until reservation message is delivered back
3. Check for any messages claiming the same username
- if it's my message - reservation successful
- otherwise - abort
All messages are delivered in the same order so in case of many concurrent writes - all nodes would agree which came first
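A sketch of steps 1-3 over an assumed total order broadcast handle (log.append and log.read_from are hypothetical; read_from yields messages in the agreed delivery order):

```python
import uuid

def claim_username(log, username: str) -> bool:
    """Linearizable uniqueness check (compare-and-set) built on total order broadcast."""
    my_claim = {"id": uuid.uuid4().hex, "username": username}
    log.append(my_claim)                      # 1. tentative reservation appended to the log

    for msg in log.read_from(0):              # 2. read messages back in total order
        if msg.get("username") != username:
            continue
        # 3. the first claim for this username in the total order wins
        if msg["id"] == my_claim["id"]:
            return True                       # our reservation came first
        return False                          # someone else's claim was delivered earlier
    return False                              # not reached once our own append is delivered
```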