Data Intensive Ch9 - Consistency and Consensus Flashcards

1
Q

Context map

A

Timestamp ordering
Lamport timestamps

Causal ordering

Total order broadcast

Distributed transactions

Linearizability
Compare-and-set

Global constraints

Failure detectors - membership services

2
Q

Summary

A

Linearizability - consistency model
Goal - make replicated data appear as though there was only a single copy; make all ops appear as if they acted on that copy atomically
Totally ordered, single timeline
Like a variable in a single-threaded program; slow; susceptible to network delays

Causality - imposes ordering on events - what happened before what; cause and effect
Weaker consistency model - some things can be concurrent
Timeline with branching & merging
Less sensitive to network problems - less coordination overhead
Can be captured using Lamport timestamps

Causality alone is not sufficient for ensuring e.g. username uniqueness
Consensus required

Achieving consensus - deciding something in a way all nodes agree on what was decided - decision is irrevocable

Many problems are reducible to consensus

  • > Linearizable compare-and-set (atomically decide whether to set the value, based on whether the current value equals the one passed in)
  • > Atomic distributed trx commit
  • > Total order broadcast (decide order in which to deliver messages)
  • > Lock/lease (decide who acquired one)
  • > membership/coordination service (given failure detector like timeout decide which node is alive)
  • > Uniqueness constraint (decide who wrote the value first and who should abort)

Consensus is easy in a single-node environment, or when only a single node is allowed to decide. A single-leader database is an example. The leader is a SPOF; this can be handled by:
- waiting for the leader to recover (what if they don’t? termination property violated - system blocked forever)
- Consensus by act of God -> manual fail over - admin chooses new leader and reconfigures the cluster
Humans are SLOW
- Algorithmically auto select new leader - CONSENSUS ALG

Single leader DB provides linearizability for writes without consensus but requires it to maintain its leadership!
Leader “KICKS THE CAN DOWN THE ROAD”

Outsourcing consensus, failure detection & membership service using ZooKeeper and similar
Easier than implementing custom algorithms able to withstand: partial failure (network packet lost), process pauses, clock drifts

Not all systems require consensus - leaderless, multi-leader rely on handling conflicts instead
In such systems causality is enough; no need for linearizability

3
Q

Idea behind building fault-tolerant systems

A

Find general-purpose abstractions with useful guarantees, implement them once and let apps rely on them.
Database transactions as an example - apps can pretend there are no crashes (atomicity), nobody accesses the DB at the same time (isolation), and storage devices are reliable (durability)

4
Q

Eventual consistency revisited

A

For replicated databases - if you stop writing to the DB and wait for some UNSPECIFIED length of time, then eventually all reads will return the same value

Weak guarantee - no guarantees when replicas converge
No guarantees for reading your own writes
Problems tend to surface only under high concurrency or when there are network issues.

5
Q

Distributed consistency vs transaction isolation

A

Trx isolation is primarily about avoiding race conditions due to concurrently executing trx
DC - coordinating state of replicas in the face of delays and faults

6
Q

Linearizability

A

Also known as: atomic consistency, strong consistency, immediate consistency, external consistency
Idea: All clients have the same view of the data as if there is only one replica. No worries about replication lag

Constraint - after any read has returned a new value ALL following reads from any replica must also return the new value
Imagine there is some point in time (between the start and end of the write) where the value of x atomically flips from the old to the new value everywhere

7
Q

Linearizability vs Serializability

A

Different guarantees

Serializability - trx isolation property
Every trx may read/write many OBJECTS
Trx behave as though they had executed in SOME serial order. That serial order may not be the order in which the trx were actually run.

Linearizability - guarantees on reads and writes of an INDIVIDUAL object
Does not group operations into trx
Does not prevent write skews

2PL, literal serial execution are typically linearizable as well

SSI is not - reads are made from a consistent snapshot, and the snapshot does not include writes more recent than the snapshot itself

8
Q

Where do we need linearizability

A

Locking/leader election
Only single leader allowed (no split brain)
Once the lock is acquired, all subsequent reads must see that it is held

Uniqueness and other constraints
Same as acquiring a lock for given name
Bank account cannot go below zero

Cross-channel timing dependencies
Image resizer - a message on a queue tells it to fetch an image from storage; without linearizability the fetch may happen before the new version of the image is visible in storage, so the old version gets resized.

9
Q

Which replication methods are linearizable?

A

Single leader - potentially yes - if reads are made from the leader or from synchronously updated followers. If snapshot isolation is used, then no.
If a delusional leader (one that wrongly believes it is still the leader) continues to serve requests - linearizability is likely to be violated
Async replication, failover - committed writes can be lost - violates durability and linearizability

Consensus - we can implement linearizable storage using it

Multi-leader - nope, concurrently process writes on multiple nodes; async replication to other nodes; this produces conflicts

Leaderless - probably not - even quorum reads are not really linearizable
A strict quorum (w + r > n) does not help because of variable network delays: one reader can see the new value from the replica that received the write first, while a later reader hits the two replicas that haven’t received it yet and sees the old value
Fig 9-6 p334

10
Q

How to make Dynamo-style quorums linearizable?

A

Reader must perform a sync read repair
Writer must read the latest state of a quorum of nodes before sending its writes
Cost - reduced performance
p335
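A minimal sketch of the reader side (the Node class and version numbers are illustrative assumptions, not a real Dynamo-style API):

```python
# Sketch of a quorum read with synchronous read repair: stale replicas are
# repaired BEFORE the read returns, so a later reader cannot see an older value.

class Node:
    def __init__(self):
        self.version, self.value = 0, None

    def read(self):
        return self.version, self.value

    def write(self, version, value):
        if version > self.version:          # keep only the newer version
            self.version, self.value = version, value


def quorum_read(replicas, r):
    contacted = replicas[:r]
    responses = [(node,) + node.read() for node in contacted]
    newest = max(responses, key=lambda resp: resp[1])        # highest version wins
    _, newest_version, newest_value = newest
    for node, version, _ in responses:
        if version < newest_version:
            node.write(newest_version, newest_value)         # sync read repair
    return newest_value
```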

11
Q

Why is CAP theorem unhelpful?

A

The definition “Consistency, Availability, Partition (tolerance) - pick 2 out of 3” is misleading
Partitioning is a network fault which is not chosen - it just happens
If the network is working correctly, the system can provide consistency (linearizability) and total availability, but when the network fails you must choose one of the two.

Recommended to avoid CAP

12
Q

Why are RAM and multicore CPUs not a linearizable system?

A

One thread writes a variable to memory and another thread reads it shortly afterwards - the read can be stale because each core has its own cache.
Trade-off made for LATENCY
Linearizability is dropped for performance, not for fault tolerance.

Linearizability is slow always, not only during a network fault.

Using CAP here makes no sense
C is dropped, yet we don’t expect the system to be available if there is a partition (a CPU disconnected from the rest of the system)

13
Q

Causal consistency

A

System obeys the ordering imposed by causality
a cause comes before its effect, a message is sent before it is received, a question is asked before it is answered
Causal order - what happened before what

Examples:
Snapshot isolation - a consistent snapshot is causally consistent

Causal order is a partial order
For 2 concurrent events, neither is greater nor lower than the other

Strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures!

14
Q

Which kind of order is linearizability?

A

Total order
IF the system behaves as if there is a single copy of the data and all operations are atomic, then for any 2 operations one must have happened before the other

There is no concurrency in linearizable datastore as there is always a single, linear timeline

15
Q

Linearizability vs causality

A

Linearizability implies causality

16
Q

Capturing causal dependencies

A

If a replica processes an operation, then it must ensure that ALL causally preceding operations (those which happened before) have already been processed. Otherwise processing must be postponed.

The system needs a way to know which values a node had seen when it made a given write.
If a node had seen X when it wrote Y, then X is causally related to Y (X happened before Y)
Causal dependencies must be tracked across the entire database (all objects), not just a single object.
Version vectors can be used for this

The DB needs to know which version of the data was read by the application; the snapshot’s version number is passed back to the DB on a write.
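A minimal version-vector sketch (the dict-per-node representation is an assumption; real systems also keep per-object vectors):

```python
# Version vector bookkeeping: one counter per node. Comparing vectors tells you
# whether one write happened before another or whether they are concurrent.

def vv_merge(a, b):
    """Pointwise maximum of two version vectors."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happened_before(a, b):
    """True if every counter in a is <= the corresponding counter in b, and a != b."""
    keys = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in keys) and a != b

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)

# Node "A" wrote with {A:1}; node "B" wrote {B:1} without having seen it.
print(concurrent({"A": 1}, {"B": 1}))               # True  -> conflict to resolve
print(happened_before({"A": 1}, {"A": 1, "B": 1}))  # True  -> causally ordered
```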

17
Q

Sequence number ordering

A

Causality is a THEORETICAL concept
Actually keeping track of all causal dependencies can be overkill
Clients usually read lots of data before writing something -> it is unclear which data the write depends on: all of it? only some?

Sequence numbers/timestamps (from a logical clock, not a time-of-day clock) provide a better way.
An algorithm generates a sequence of numbers that identify operations (a counter incremented for every operation executed)

Sequence numbers provide total order (one is always comparable to another)
Sequence numbers can be made consistent with causality -> if A happened before B, then A’s sequence number must be lower than B’s. For concurrent writes the order is arbitrary.

Single-leader DBs - replication log defines total order of writes
When follower applies writes in replication-log order then followers are always CAUSALLY CONSISTENT (even if lagging behind)

18
Q

How to generate sequence numbers for operations in leaderless or multi-leader environment?

A

Each node generates seq number independently. Some bits are reserved for unique node ID.

Timestamp from time-of-day clock can be attached to each operation. With sufficient resolution they might provide total order of operations.

Preallocation of blocks of sequence numbers. Node A might claim sequence numbers from 1-1000 and B from 1001 to 2000

These perform and scale better than pushing everything through a single-leader bottleneck, BUT the generated sequence numbers are not consistent with causality (see sketch below)
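A sketch of the first option, reserving a few low-order bits for a node ID (the bit widths are arbitrary assumptions):

```python
# Per-node sequence numbers: counter in the high bits, node ID in the low bits.
# Unique and totally ordered, but NOT consistent with causality - a busier node's
# counter races ahead regardless of which operation actually happened first.

NODE_ID_BITS = 10          # assumption: up to 1024 nodes

class NodeSequencer:
    def __init__(self, node_id):
        assert 0 <= node_id < (1 << NODE_ID_BITS)
        self.node_id = node_id
        self.counter = 0

    def next(self):
        self.counter += 1
        return (self.counter << NODE_ID_BITS) | self.node_id

a, b = NodeSequencer(0), NodeSequencer(1)
print(a.next(), b.next())   # 1024, 1025 - unique, but says nothing about causality
```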

19
Q

Causality problem with distributed sequence generators

A

Each node may process a different number of operations per second
Imagine 2 nodes - one generates odd and the other even sequence numbers.
Comparing an odd-numbered op with an even-numbered one does not tell you which causally happened first (the busier node’s counter runs ahead).

Timestamps are prone to clock-skew

Block allocator case - one operation may be assigned a number from a higher block even though it causally happened before another operation that got a number from a lower block.

20
Q

Lamport timestamps

A

Purpose: Generating sequence numbers consistent with causality
(counter, node ID) pair - the node ID breaks ties and guarantees uniqueness
Every node keeps track of the max counter value it has seen so far; that max value is included on every request
If a node receives a request/response with a greater counter value, it immediately bumps its own counter to that value

Fig 9-8, p346
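A minimal Lamport clock sketch (message format and node names are assumptions):

```python
# Lamport clock: every node keeps the max counter it has seen; timestamps are
# (counter, node_id) pairs compared lexicographically, giving a total order
# that is consistent with causality.

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Called for every local operation and before sending a message."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """Called when a request/response carrying a timestamp arrives."""
        counter, _ = timestamp
        self.counter = max(self.counter, counter)
        return self.tick()

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()          # (1, "A")
t2 = b.observe(t1)     # (2, "B") - B fast-forwards past A's counter
assert t1 < t2         # the causally later operation gets the greater timestamp
```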

21
Q

Version vectors vs Lamport timestamps

A

First - can distinguish whether 2 operations are concurrent or whether one is causally dependent on the other
Second - enforce a total ordering, but from the ordering alone it’s impossible to tell whether 2 ops are concurrent or causally dependent

22
Q

Why is total ordering not enough to solve e.g. the uniqueness problem?

A

Total ordering only lets you determine the winner AFTER the fact
Given a request to create a user, a node cannot tell whether there is a concurrent attempt with the same username unless it asks all the other nodes - which is obviously not fault tolerant.

23
Q

Total order broadcast

A

Protocol for exchanging messages between nodes.
Safety properties required:
- reliable delivery -> no message is lost; if it’s delivered to one node, it must be delivered to all
- totally ordered delivery -> messages are delivered to every node in the same order

STATE MACHINE REPLICATION
Useful for DB replication - each message = a write to the DB
Every replica processes writes in the same order - they remain consistent with each other (aside from replication lag)

It’s a way of creating a log (replication log, trx log, write-ahead log etc). Delivering a message = appending to the log
All nodes must deliver the same messages in the same order; all nodes can read the log and see the same sequence of messages

Useful for implementing a lock service for fencing tokens
Sequence number can be fencing token itself
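A sketch of state machine replication; the `log` list here just stands in for a sequence of messages delivered by total order broadcast:

```python
# Every replica applies the same deterministic writes in the same (totally
# ordered) log order, so all replicas converge to the same state.

def apply_log(log):
    """Replay a totally ordered log of writes into a key-value state."""
    state = {}
    for op in log:                 # delivery order == log order on every replica
        state[op["key"]] = op["value"]
    return state

log = [
    {"key": "x", "value": 1},
    {"key": "y", "value": 2},
    {"key": "x", "value": 3},
]
# Any replica replaying this log ends up with the same state.
assert apply_log(log) == {"x": 3, "y": 2}
```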

24
Q

Linearizable storage with total order broadcast

A

TOB is async - messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about WHEN (only eventually)

TOB can be used to implement linearizable WRITE

Unique username example of implementing linearizable compare-and-set using total order broadcast
1. Append a message to the log
This is a tentative reservation for the username
2. Read the log until your reservation message is delivered back to you
3. Check for any earlier messages claiming the same username
- if the first claim for the username is my own message - reservation successful
- otherwise - abort

All messages are delivered in the same order so in case of many concurrent writes - all nodes would agree which came first
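A sketch of this recipe; `append` and `deliveries` are placeholders for real total order broadcast primitives:

```python
# Linearizable "claim this username" (compare-and-set from absent to me) on TOB.
# append(msg) broadcasts msg; deliveries() yields ALL messages, including our own,
# in the single globally agreed order. Both are assumed primitives.

def claim_username(append, deliveries, username, my_node_id):
    my_msg = {"type": "claim", "username": username, "node": my_node_id}
    append(my_msg)                                   # 1. tentative reservation
    for msg in deliveries():                         # 2. read the log in delivery order
        if msg.get("type") == "claim" and msg["username"] == username:
            # 3. the first claim for this username in the total order wins
            return msg["node"] == my_node_id         # True = success, False = abort
```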

25
Q

How to make linearizable reads using total order broadcast

A
  1. Append a “sync” message to the log and wait for it to be delivered back, then perform the read - the read reflects the DB state as of the “sync” message’s position in the log (see the sketch after this list)
  2. If it’s possible to fetch the position of the latest log message (in a linearizable way) - query the position and wait for all entries up to that position to be delivered, then perform the read (this is the idea behind ZooKeeper’s sync() op).
  3. Make a read from a replica that is synchronously updated on writes (and is therefore guaranteed to be up to date)
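A sketch of option 1; `append`/`deliveries` are assumed TOB primitives, and applying delivered writes to `local_state` is assumed to happen elsewhere as messages arrive:

```python
# Linearizable read via a sync marker: append a unique marker, wait until it is
# delivered back, then read local state - every write ordered before the marker
# has been applied by that point.
import uuid

def linearizable_read(append, deliveries, local_state, key):
    marker = {"type": "sync", "token": uuid.uuid4().hex}
    append(marker)
    for msg in deliveries():
        if msg.get("type") == "sync" and msg.get("token") == marker["token"]:
            break          # caught up to the marker's position in the log
    return local_state.get(key)
```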
26
Q

Total order broadcast using linearizable storage

A

Assume linearizable integer register with increment-and-get operation (or atomic CAS)
For every message to be sent through TOB - bump the register (increment-and-get) and attach the returned value as the message’s sequence number

For a fault-tolerant system, such linearizable storage is not trivial to build
Both problems (TOB and linearizable storage) reduce to consensus
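A sketch of the sending side; `itertools.count` stands in for the linearizable increment-and-get register, which in reality would itself need to be fault tolerant:

```python
# Sequence numbers from a linearizable increment-and-get register. Receivers then
# deliver messages in sequence-number order (see the next card for the gap handling).
import itertools

_register = itertools.count(1)          # stand-in for a linearizable increment-and-get

def tob_send(message, network_send):
    seq = next(_register)               # gapless, totally ordered sequence number
    network_send({"seq": seq, "body": message})
```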

27
Q

Key difference between total order broadcast and timestamp ordering

A

Unlike e.g. Lamport timestamps, the sequence numbers in TOB form a CONTIGUOUS sequence (no gaps)
If node delivered msg 4 and received msg 6 then it KNOWS it must wait for msg 5
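A sketch of the delivery-side gap handling (the buffer and callback names are assumptions):

```python
# Because TOB sequence numbers have no gaps, a receiver can buffer out-of-order
# messages and deliver only when the next expected number has arrived.

class InOrderDelivery:
    def __init__(self, deliver):
        self.deliver = deliver          # callback invoked in sequence order
        self.expected = 1
        self.buffer = {}

    def on_receive(self, msg):
        self.buffer[msg["seq"]] = msg
        while self.expected in self.buffer:      # e.g. got 4 and 6: must wait for 5
            self.deliver(self.buffer.pop(self.expected))
            self.expected += 1
```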

28
Q

What is FLP result

A

The Fischer, Lynch & Paterson proof that no algorithm can be guaranteed to reach consensus if there is a risk that a node may crash (which is always the case in a distributed system)
Assumption - the asynchronous system model: no timeouts, no clocks, strictly deterministic algorithms. With timeouts or randomness allowed, consensus becomes solvable in practice.

29
Q

Why is it not sufficient to send a commit request to all nodes in distributed atomic commit?

A
  1. some nodes may detect a constraint violation/conflict - meaning abort is required. Other nodes might have already committed.
  2. Some commit requests may get lost in the network (and the node aborts after a timeout) while others get through
  3. Some nodes may crash before they actually handle commit
    Trx commit must be irrevocable - after commit it becomes visible to other trx!
30
Q

Two phase commit

A

Algorithm for achieving atomic trx commit across multiple nodes - all nodes either commit or abort

Available to apps in the form of XA transactions (supported e.g. by JTA)

2PC uses coordinator/trx manager component
Can be same process as app requesting trx or separate

Algorithm:
1. App reads and writes data on multiple DB nodes as usual (the participants of the trx)
2. When the app wants to commit, the coordinator begins PHASE 1 - it sends a PREPARE request to each participant
Participants should check whether they are able to commit (constraint violations etc)
3. Coordinator gathers all responses
- if all replied “yes” then coordinator sends PHASE 2 COMMIT request and commit actually takes place
- if any replied “no” then coordinator sends PHASE 2 ABORT request

Also called BLOCKING atomic commit - nodes can become stuck waiting for the coordinator to recover
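A sketch of the coordinator’s control flow (the participant interface and logging hook are simplified assumptions, not XA/JTA APIs):

```python
# 2PC coordinator sketch. Each participant exposes prepare/commit/abort; the
# coordinator must durably log its decision before starting phase 2.

def two_phase_commit(tx_id, participants, write_decision_to_log):
    # Phase 1: ask every participant whether it can definitely commit.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(tx_id))
        except Exception:                    # a lost request counts as a "no"
            votes.append(False)

    decision = "commit" if all(votes) else "abort"
    write_decision_to_log(tx_id, decision)   # commit point - decision is irrevocable

    # Phase 2: enforce the decision; in a real system these calls are retried forever.
    for p in participants:
        if decision == "commit":
            p.commit(tx_id)
        else:
            p.abort(tx_id)
    return decision
```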

31
Q

2PC atomicity

A

Detailed breakdown
1. App starts trx - gets globally unique trx id from coordinator
2. App begins a single-node trx on each participant and attaches the trx id - if anything goes wrong now, aborting is easy
3. When the app is ready to commit, the coord sends prepare tagged with the trx id - if any request fails or times out, abort is sent with the same trx id
4. When a participant receives prepare, it makes sure that it can DEFINITELY commit the trx under ALL CIRCUMSTANCES
Trx data is written to disk (so commit can be done even if power failure/no disk space happens)
Constraints and conflicts are checked
If it responds “yes”, the participant YIELDS the right to abort
5. When the coord has all the responses, it makes a DEFINITIVE decision whether to commit or abort
Decision is written to trx log on disk - COMMIT POINT
6. After securing the decision - commit or abort is sent out
If any request fails here - RETRY (forever until success!)
No going back - if a participant crashes, then after recovery it MUST accept and apply the coordinator’s request

2 points of NO RETURN

  • a participant says “yes” to prepare
  • the coord records its definitive decision
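A sketch of the participant side of prepare, showing the first point of no return (the storage layer is an assumed interface):

```python
# Participant sketch: answering "yes" to prepare means the transaction's writes
# and the prepared state are already durable, so commit can survive a crash.

class Participant:
    def __init__(self, storage):
        self.storage = storage                # assumed durable storage interface

    def prepare(self, tx_id):
        if not self.storage.constraints_ok(tx_id):
            return False                      # still allowed to say "no" here
        self.storage.flush_writes(tx_id)      # data durable before voting
        self.storage.mark_prepared(tx_id)     # point of no return: right to abort is yielded
        return True

    def commit(self, tx_id):
        # After a crash, prepared transactions stay "in doubt" until the
        # coordinator's decision arrives - the participant must not decide alone.
        self.storage.make_visible(tx_id)

    def abort(self, tx_id):
        self.storage.roll_back(tx_id)
```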
32
Q

What if coordinator fails?

A

Participant can only safely abort on its own before responding “yes” to prepare request
After that it MUST hear back from the coordinator
If coordinator crashes or network fails - trx is in the state “in doubt” or “uncertain”
A participant CANNOT abort unilaterally, because the coordinator might have decided to commit and other participants might have already committed
A participant CANNOT commit unilaterally, because some other participant could have voted “no”
Participant MUST wait for coord
Coord MUST save its decision to trx log BEFORE sending it out

33
Q

Fault tolerant consensus

A

One or more nodes may PROPOSE values
Algorithm DECIDES on ONE of those values
Properties:
Uniform Agreement - no two nodes decide differently
Integrity - no node decides twice
Validity - if a node decides value v, then v was proposed by SOME node (rules out trivial algorithms that always decide null)
Termination - every node that does not crash eventually decides some value

If no fault tolerance is needed - one node can be hardcoded as “dictator” (like coord in 2PC)
Hence the termination property - if a node fails, the other nodes are expected to reach a decision anyway
Consensus requires at least a majority of nodes to be running (or no quorum can be formed)
So the termination property assumes that fewer than half of the nodes have crashed

Examples:
- Viewstamped Replication VSR (TOB)
- Paxos (Multi-Paxos for TOB version)
- Raft (TOB)
- Zab (TOB)
Most of them decide on a sequence of values (making them total order broadcast algorithms - more efficient than doing repeated rounds of one-value-at-a-time consensus)
34
Q

Total order broadcast viewed as consensus

A

Repeated rounds of consensus - in each round, nodes propose the message they want to send next and decide on the next message to be delivered in the total order
Each decision = 1 message delivery

35
Q

How does a consensus algorithm elect a leader when there is no leader?

A

Every time there seems to be no leader, a vote is started among the nodes
Each election is given EPOCH NUMBER (which is totally ordered and monotonically increasing)
If there is a conflict between 2 leaders in 2 different epochs - leader with the higher epoch number prevails
Node that wants to become a leader must collect votes from a quorum of nodes
Node votes in favor of a proposal if it is not aware of any other leader with higher epoch
There are 2 rounds of voting in each epoch:
Elect a leader
Vote on leader’s proposal

Key insight: the quorums of those 2 votes must overlap - if a vote on a proposal succeeded, at least one of the nodes that voted for it must have also participated in the most recent leader election
If the vote on a proposal does not reveal any leader with a higher epoch, then the current leader can conclude it still holds leadership and can decide the proposed value
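A sketch of the proposal-round check (message formats and the follower interface are assumptions; real algorithms like Raft carry more state):

```python
# A leader's proposal is decided only if a quorum acknowledges it without
# reporting a higher epoch - otherwise a newer leader exists and it steps down.

def quorum(cluster_size):
    return cluster_size // 2 + 1

def propose(value, my_epoch, followers, cluster_size):
    """Return True if the value is decided, False if leadership was lost."""
    acks = 1                                         # the leader's own vote
    for follower in followers:
        reply = follower.handle_proposal(value, my_epoch)   # assumed interface
        if reply["highest_epoch_seen"] > my_epoch:
            return False                             # newer leader exists: step down
        if reply["ack"]:
            acks += 1
    return acks >= quorum(cluster_size)
```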

36
Q

Difference between voting in fault tolerant consensus to 2PC

A

Coordinator is not elected in 2PC
Consensus requires votes from a MAJORITY of nodes
2PC requires a “yes” from ALL participants
Consensus algorithms define a recovery process (nodes can get into a consistent state after a new leader is elected), unlike 2PC

37
Q

Limitation of consensus

A

Voting is similar to SYNC db REPLICATION
Requires a STRICT MAJORITY of nodes to operate
Most consensus algorithms assume FIXED set of nodes that vote
Uses TIMEOUTS as a FAILURE DETECTOR - when there is high variability in network delays, consensus can degrade into a continual election-fest

38
Q

What are ZooKeeper and etcd designed for primarily?

A

Hold small amounts of data that fit entirely in memory (written to disk for durability)
All the data is replicated using a fault-tolerant total order broadcast algorithm

39
Q

Which features does ZooKeeper provide?

A

Linearizable atomic operations
- can be used to implement lock/lease

Total ordering of operations
- can be used for implementing fencing token (prevent old leases from being used if process is paused)

Failure detection
- clients maintain a long-lived session on the ZooKeeper servers; heartbeats are exchanged periodically. ZooKeeper auto-releases all locks held by a session when it times out (ephemeral nodes)

Change notifications
- clients can watch for changes (like a new node joining the cluster, or node failures). Clients subscribe to notifications, so no polling is required

All in all - ZooKeeper provides a useful set of features for distributed coordination (see the sketch below)
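A sketch of taking a lock via ZooKeeper using the kazoo Python client; the connection string, path and identifier are placeholders, and the usage reflects kazoo’s documented lock recipe rather than a verified snippet:

```python
# ZooKeeper-backed lock with kazoo (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

lock = zk.Lock("/myapp/leader-lock", identifier="worker-1")
lock.acquire()            # blocks until this session holds the lock
try:
    ...                   # act as the lock holder / leader
finally:
    lock.release()
    zk.stop()
```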

40
Q

Example of using ZooKeeper in place of implementing in-house consensus

A

Allocating work to nodes
A partitioned resource (like a shard of a message stream in Kinesis). A new node joins the cluster and some work should be moved from existing nodes to the new one - rebalancing partitions.
If node is removed or has failed - same story.

How to do it with ZooKeeper:
combine atomic (linearizable) operations, ephemeral nodes (failure detection) and change notifications

Supposedly it’s not easy (even when using higher-level APIs like Apache Curator), but still easier than implementing fault-tolerant consensus from scratch (which reportedly has a poor success record)

Key idea: the application may grow to thousands of nodes, so majority voting across all of them would be too slow.
ZooKeeper, usually run on a FIXED NUMBER OF NODES (3 or 5), lets you OUTSOURCE the coordination work (consensus, ordering of operations, failure detection).

ZooKeeper is intended for SLOW-CHANGING DATA (which node is running on which IP, which partitions are assigned where)
A timescale of minutes or hours, not millions of changes per second (see the sketch below)
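A sketch of cluster membership with ephemeral nodes and change notifications, again using kazoo; the paths, IP value and rebalance() function are placeholders:

```python
# Ephemeral membership nodes + watches: failure detection and change
# notifications combined, no polling.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/myapp/workers")
# The ephemeral node disappears automatically if this worker's session dies,
# which is how the other nodes learn about the failure.
zk.create("/myapp/workers/worker-1", b"10.0.0.5", ephemeral=True)

def rebalance(workers):
    print("current workers:", workers)      # placeholder: reassign partitions here

@zk.ChildrenWatch("/myapp/workers")
def on_membership_change(children):         # fired on every join/leave
    rebalance(children)
```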