Data Intensive Ch5 - Replication Flashcards

1
Q

Replication

A

Keeping a copy of the same data on multiple machines connected via a network
Why?
- keep data geographically close to users (lowers LATENCY)
- allow the system to keep working even when facing a failure (increases AVAILABILITY)
- scale out the number of machines that can serve read requests (increases READ THROUGHPUT)

2
Q

Leaders and Followers

A

A replication approach.
The leader (aka master, primary) is the only node that accepts writes
A follower (aka slave, secondary) receives a stream of data changes from the leader and updates its local copy of the data. The writes must be applied in the same order as they happened on the leader.
Clients can read from the leader or from the followers.

3
Q

Sync vs Async replication

A

Sync - the leader blocks the client and waits until ALL followers ACK the change before confirming the write
Drawback:
If a follower has failed or is recovering, the leader cannot tell whether the data was saved (could be just a network issue, could be a node under heavy load), so it must keep waiting
That makes write latency unbounded
It's undefined how long the leader will wait, because it's unknown how long a failed follower will take to recover
Advantage:
once the write is ACKed by the leader, a read from any replica is guaranteed to be fresh

Async - the leader broadcasts the change and ACKs to the client before the followers have applied it
Drawback:
if the leader fails before the writes were replicated, they may get lost
Advantage:
the leader keeps accepting writes even if followers can't keep up with applying new changes

A hybrid (semi-synchronous) approach is also possible: some followers are replicated synchronously, the rest asynchronously (see the sketch below)
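A minimal Python sketch of the difference (a toy model, not how any real DB implements it; the Leader/Follower classes are made up for illustration):

class Follower:
    def __init__(self):
        self.log = []

    def apply(self, change):
        self.log.append(change)   # pretend this may be slow or may fail
        return True               # ack

class Leader:
    def __init__(self, followers, synchronous):
        self.log = []
        self.followers = followers
        self.synchronous = synchronous

    def write(self, change):
        self.log.append(change)
        if self.synchronous:
            # block until EVERY follower acks -> unbounded latency if one is down
            for f in self.followers:
                assert f.apply(change)
            return "ack (data is on all replicas)"
        else:
            # ack right away; replication happens in the background
            # (simplified here to a plain loop) -> writes can be lost on leader crash
            for f in self.followers:
                f.apply(change)
            return "ack (followers may still be behind)"

leader = Leader([Follower(), Follower()], synchronous=True)
print(leader.write({"key": "user:1", "value": "alice"}))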

4
Q

How to add a new follower in single-leader DB.

A

Take a consistent snapshot of the leader's database at some point in time
Ideally this happens without taking any locks on the DB
The new follower loads the snapshot

Then it asks for the STREAM of changes that happened AFTER the snapshot -> the snapshot must be associated with a position in the replication log

Once the follower has consumed the backlog of changes, it is caught up and ready to serve requests.

5
Q

Follower catchup recovery

A

The follower keeps a log of the data changes received from the leader on its local disk (similar in spirit to a write-ahead log)
If the follower is taken down, after a restart it knows from this log the point in time up to which it has applied changes
It reconnects to the leader and requests the stream of changes from that point onwards; once it has applied the backlog, it is caught up

6
Q

Failover with single-leader replication

A

-> Difficult problem because:
Some follower must be promoted to be the new leader
Clients must start sending writes to the new leader
The other followers must start replicating from the new leader

-> Manual - an admin does the switch
-> Automatic - adds challenges:
How to detect the failure? (could be an app crash (OOM, bug), a power cut, a network split…)
No foolproof method - timeouts and heartbeats are usually good enough.

How to elect the new leader?
- election by a majority of nodes
- appointment by a designated controller node
Ideally pick the node with the most up-to-date state
This is a consensus problem - ALL nodes must agree on which node the leader is

How to switch clients to the new leader?
Writes should be automatically rerouted to the new leader

The old leader must step down after recovery and become a follower replica

7
Q

Issues with leader failovers

A

Pending writes
The previous leader may hold writes that were not yet replicated to the new leader
In the meantime the new leader may have accepted conflicting changes
The pending writes cannot simply be applied anymore
Usually they are discarded, but this violates durability - so decide case-by-case
Caution is necessary if the data is used to coordinate a business process with an external system
GitHub incident
The newly promoted leader did not have an up-to-date value for an autoincrement counter
IDs were reused, and some users gained access to Redis cache entries made for records of other users created on the previous leader. Data leak.

Split brain
The previous leader, after recovery, still assumes it is the leader
Both nodes accept writes - data conflicts are bound to happen
Automatically shutting down a leader when more than one is detected is risky - if the mechanism is not carefully designed, both nodes can end up shut down

Failover timeout selection
If too short - frequent, unnecessary leader changes - expensive
If too long - recovery takes longer, and in the meantime writes cannot be accepted

8
Q

Types of replication log

A

Statement-based
The leader broadcasts database-query-language commands to the followers (SQL statements for a SQL DB, etc.)
Caveats
- nondeterministic functions like rand() or now()
The leader could theoretically replace nondeterministic output with concrete values, but this is still error-prone
- autoincrementing columns
Statements that use them (or depend on existing data) must execute in exactly the same order on every replica, which is limiting when there are concurrently executing transactions
- side effects from triggers, stored procedures

Write-ahead log shipping
The leader sends its WAL to the replicas
The WAL is usually a low-level byte representation of the record change (literally what the row looks like on disk)
It may be tightly coupled to the DB version, making mixed-version replicas (and thus zero-downtime upgrades) out of the question

Logical (row-based) log replication
Independent of the byte representation of the data
An insert can be a tuple of all columns' values
A delete can be the primary key of the deleted row
An update can be the primary key + the changed columns' values (see the sketch below)
Useful for Change Data Capture

Trigger-based
Useful if only a subset of the data is replicated
Useful if custom conflict-resolution logic needs to be added
A trigger can call our application code on events in a specified table (like a row being added to users)
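A sketch of what logical (row-based) log entries might look like, compared to a statement that is problematic to replicate verbatim (the field names are made up; real formats like MySQL's row-based binlog differ):

# A statement-based log ships the SQL text itself, which can be nondeterministic:
statement_entry = "INSERT INTO comments (id, body, created_at) VALUES (DEFAULT, 'hi', now())"

# A logical row-based log ships the concrete effect of the statement instead,
# with nondeterministic values (autoincrement id, now()) already resolved:
logical_entries = [
    {"op": "insert", "table": "comments",
     "row": {"id": 42, "body": "hi", "created_at": "2024-01-01T12:00:00Z"}},
    {"op": "update", "table": "users",
     "key": {"id": 7}, "changed": {"last_seen": "2024-01-01T12:00:01Z"}},
    {"op": "delete", "table": "sessions", "key": {"id": 1337}},
]

# A follower (or a Change Data Capture consumer) just applies these records in order.
for entry in logical_entries:
    print(entry["op"], entry["table"])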

9
Q

Replication lag and eventual consistency

A

Replication lag is how far the replicas are behind the leader

Sync replication does not scale well
The more followers, the higher the latency and the lower the reliability
More nodes in the system - higher probability that some node is failed at any moment

Async is the way to go then, but it brings eventual consistency
Basically, if the leader stops receiving new writes, then after some UNSPECIFIED time the followers will EVENTUALLY catch up and become CONSISTENT with the leader
"Eventually" is vague - it could be on the order of milliseconds, or of minutes (failures etc.)

10
Q

What is read-after-write consistency problem?

A

AKA reading your own writes
Violations result in poor UX
Symptoms:
right after writing some data, the user reads from a replica that does not have that data yet
Example: a user posts a comment on Facebook, then the comment is invisible to its author on some page

Remedies:
Read data modified by the current user only from the leader
Applicable to specific kinds of data, like a user profile that is always edited by its owner
Ineffective if a lot of the data is potentially editable by the user
-> Can use a heuristic: for X minutes after the user makes an edit, read only from the leader
-> Monitor replication lag and disable reads from replicas that are more than X behind

The client passes the timestamp of its last write to the system (see the sketch after this card)
Routing selects a replica that is up to date at least up to the passed timestamp
Could also block the read until a replica catches up to the timestamp, but that is obviously asking for latency problems
The timestamp can be logical or wall-clock (logical is better)
-> Cross-device consistency is hard - timestamps from different devices are not synchronized with each other
The timestamps would need to be centralized
-> Multi-datacenter replicas
One device could connect to a different datacenter than another
Each datacenter has a different set of leader/followers
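A minimal sketch of the "route by last-write timestamp" remedy (hypothetical Replica objects with an applied-position counter; logical timestamps assumed):

# Each replica reports how far it has applied the leader's replication log.
class Replica:
    def __init__(self, name, applied_position):
        self.name = name
        self.applied_position = applied_position

def choose_replica(replicas, leader, last_write_position):
    # Prefer any follower that has already applied the client's last write;
    # fall back to the leader (or block/wait) if none has caught up yet.
    for r in replicas:
        if r.applied_position >= last_write_position:
            return r
    return leader

leader = Replica("leader", applied_position=120)
followers = [Replica("f1", 118), Replica("f2", 121)]
print(choose_replica(followers, leader, last_write_position=120).name)  # -> f2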

11
Q

What is monotonic reads problem?

A

Symptoms:
the user experiences going back in time
Data already observed seemingly disappears on a subsequent read
Example: a user views a post on FB, but after a refresh it's gone, because the read went to a replica that does not have it yet
How to counteract:
one user session should always read from the same replica (e.g. choose the replica by hashing the user ID - see the sketch below)
On replica failure, re-routing to another replica is necessary

TLDR
Monotonic reads -> never read data older than what was already observed
Return a logical timestamp on reads and always read data at least that fresh, or…
Couple the user session with a replica
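A sketch of coupling a session with a replica by hashing the user ID (any stable hash works; modulo the replica count is the simplest scheme):

import hashlib

def replica_for_user(user_id, replicas):
    # Stable hash -> the same user always reads from the same replica,
    # so they never see data "go back in time" due to switching replicas.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
print(replica_for_user("user-42", replicas))
# If that replica fails, the request must be re-routed (monotonicity may then break).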

12
Q

What is consistent prefix reads problem?

A

A violation of causality due to replication lag
Symptoms:
the user sees the answer before the question
Especially impactful with a partitioned (sharded) DB
Without partitioning we can assume ALL replicas apply the writes in the same order, so there is no way the answer-write is applied before the question-write.
When partitioning comes into play, if the answer is stored on a different partition than the question, it can become visible sooner.

13
Q

Multi-Leader Replication

A

AKA active/active, master-master
The leaders are followers of each other
Usually implemented cross-datacenter (each DC has an independent set of leader + followers)

Pros:

  • writes have lower latency, as there is a leader close to the user's geo-location
  • fault tolerance in the datacenter-failure scenario - each DC is independent; when the failed one recovers it just catches up
  • tolerance of network failures - with a single leader serving multiple datacenters, writes can get lost in cross-datacenter network requests (the cross-datacenter network is less reliable than the local intra-datacenter one)

Cons:
- concurrent, conflicting writes on different leaders
A conflict-resolution mechanism is required
- autoincrementing IDs and constraints are tricky

Clients with offline operation (mobile apps) are an example of multi-leader replication
In offline mode the client device becomes a local DB leader
Syncing back to the remote DB is a kind of replication
Replication lag can be on the order of days
Collaborative editing (Google Docs) also bears a lot of similarities to multi-leader replication

14
Q

Handling conflict resolution in multi-leader replication

A

TLDR:
Avoid conflicts - couple a session with one leader (e.g. the user's home datacenter)
Last write wins (by write ID, leader ID etc.) - converges, but writes may be silently lost (eventual consistency)

Waiting for ACKs from all replicas (leaders) before ACKing the client
- then you might as well use single-leader replication

Avoiding conflicts
For a given record, writes always go to one specific datacenter's leader (e.g. the user's home datacenter)
Does not work if the leader has to change (failure)
Does not work if the user changes geo-location and a different DC is now closer
Read-time conflict resolution - the conflicting versions are stored and the client resolves them manually (details below)

Converging towards a consistent state
All writes get a timestamp/ID

Select which conflicting write wins (risk of discarding some writes):
- the write with the greater ID wins (LWW)
- writes from the replica with the greater ID win
Alternatively:
- Store the conflict information in a data structure and let the user resolve the conflict - no data loss
- Merge the data (like concatenating two strings in alphabetical order)

Conflict-resolution logic in replication tools:
On write
The DB runs a script when a conflict is detected
The script runs in the background - no user feedback possible
Bucardo - a Perl script

On read
Conflicting values are stored
On read, the multiple versions are presented to the user in the app
The user must resolve the conflict manually and write the result back
CouchDB works this way
Issue with transactions: a transaction may save multiple records atomically, but conflict resolution is done per record, in isolation

15
Q

Automatic conflict resolution

A

Still an immature feature

-> Conflict-free replicated data types (CRDTs) in Riak 2.0
Special data structures that can be concurrently edited and merge automatically (see the sketch below)
Two-way merge
-> Mergeable persistent data structures
Track history explicitly, like Git
Three-way merge
-> Operational transformation
Used in Google Docs
Designed for concurrent editing of an ordered list of items, such as the characters of a text document
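A sketch of the idea behind CRDTs using the simplest one, a grow-only counter; this is illustrative, not Riak's actual implementation:

# Grow-only counter CRDT: each replica increments only its own slot,
# merging is an element-wise max, so merges commute and all replicas converge.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                      # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("dc-a"), GCounter("dc-b")
a.increment(3); b.increment(2)                # concurrent edits on two leaders
a.merge(b); b.merge(a)                        # two-way merge, in any order
print(a.value(), b.value())                   # both converge to 5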

16
Q

Multileader replication topologies

A

The communication paths along which writes propagate between leaders
Circular - prone to failures (a single node failing breaks the ring)
Star - also prone to failures (the central node is a single point of failure)
All-to-all - more fault-tolerant, but writes may arrive in different orders at different leaders (causality issues)

17
Q

Leaderless replication

A
Examples:
Amazon's Dynamo
Riak
Cassandra
Voldemort

Quorum reads and writes - clients read from and write to many replicas at the same time
Version numbers are used to tell which read result is non-stale

Quorum consistency condition
w + r > n
n - number of nodes (replicas) where a given record is stored, usually an odd number
w - minimum number of successful write ACKs
r - minimum number of successful read responses
w < n - writes still succeed when a node is unavailable
r < n - reads still succeed when a node is unavailable
(see the sketch below)

Edge cases
With a sloppy quorum, the w writes and the r reads may go to disjoint sets of replicas
Concurrent writes - hard to tell which came first (concurrent writes can arrive in a different order at each replica; a safe bet is to merge the writes)
A write succeeds on some replicas but on fewer than w - it is not rolled back where it did succeed
A replica carrying a new value crashes and is restored from a replica that has not yet seen that value - the value ends up on fewer than w replicas

Problem: replication lag cannot be easily measured, because writes have no deterministic order
You cannot compare replication-log positions as in single-leader replication, where followers apply writes in exactly the same order as the leader
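A sketch of the quorum condition and of picking the freshest value from an r-replica read (versions are simple integers here; real systems use version vectors):

def is_strict_quorum(n, w, r):
    # With w + r > n, the set of nodes written to and the set read from must overlap,
    # so at least one read response contains the latest acknowledged write.
    return w + r > n

# n = 3 replicas, each holding (version, value) for some key:
replicas = [(7, "new"), (6, "stale"), (7, "new")]

def quorum_read(replicas, r):
    responses = replicas[:r]                      # pretend these r nodes answered first
    return max(responses, key=lambda vv: vv[0])   # highest version wins

print(is_strict_quorum(n=3, w=2, r=2))            # True
print(quorum_read(replicas, r=2))                 # (7, 'new')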

18
Q

Leaderless replication - ensuring eventual consistency

A

Read repair
When a client reads from multiple nodes in parallel, it can detect replicas holding a stale record and "repair" the value by writing the newer one back (see the sketch below)

Anti-entropy process
A background process that compares the data between replicas and copies missing/stale data from one to the other
The write order is not preserved
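A sketch of read repair over versioned replicas (toy in-memory "replicas" as dicts; not any particular database's wire protocol):

# Each replica is a dict: key -> (version, value).
replicas = [
    {"cart:1": (7, ["milk", "eggs"])},
    {"cart:1": (6, ["milk"])},            # stale replica
    {"cart:1": (7, ["milk", "eggs"])},
]

def read_with_repair(replicas, key):
    responses = [(r, r.get(key)) for r in replicas]
    # pick the response with the highest version number
    newest = max((v for _, v in responses if v is not None), key=lambda vv: vv[0])
    for replica, v in responses:
        if v is None or v[0] < newest[0]:
            replica[key] = newest             # write the fresh value back to the stale node
    return newest[1]

print(read_with_repair(replicas, "cart:1"))
print(replicas[1]["cart:1"])   # the stale replica has been repaired to version 7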

19
Q

Sloppy quorum, hinted handoff

A

Reads and writes still require "r" and "w" successful responses
But these may come from nodes outside the designated "home" set of n nodes for the value - it is acceptable to accept a write on a non-designated node
It's like sleeping over at the neighbour's if you lose your flat keys

Hinted handoff - once the proper destination node is back up, the data temporarily stored on the stand-in node is forwarded to it ASAP

20
Q

Last write wins

A

Conflict-resolution method
Attach a "timestamp" (logical or time-of-day) to each write
On conflict, always pick the value with the greatest timestamp and discard those with lesser timestamps
Good for caches, where it is acceptable to lose some writes (see the sketch below)
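A tiny sketch of LWW resolution (each write carries a timestamp; ties are broken arbitrarily here by comparing values, which hints at why LWW can silently drop data):

def lww(conflicting_writes):
    # Each write is (timestamp, value); keep only the one with the greatest timestamp.
    return max(conflicting_writes)

writes = [(1700000001, "blue"), (1700000003, "green"), (1700000002, "red")]
print(lww(writes))   # (1700000003, 'green') - the other two writes are silently discarded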

21
Q

Happens before relationship

A

A happens before B if B knows about A, depends on A, or builds upon A in some way
Two operations are concurrent if neither happens before the other (neither knows about the other)
So for any two operations there are three possibilities:
A -> B
B -> A
A || B

When do we call two operations concurrent?
They need not literally overlap in time -> clocks are unreliable in distributed systems
They just need to be unaware of each other (physical time is irrelevant)
Relativity analogy: information cannot travel faster than the speed of light, so two events occurring some distance apart cannot affect each other if the time between them is shorter than the time light needs to travel between them
Network issues extend the window during which two events can remain unaware of each other

22
Q

Capturing happens before relationship

A

General algorithm
1. The server maintains a version number per key
2. On each write the version number is incremented; all concurrent versions (siblings) are stored
3. On a read, the client gets all values that have not been overwritten, plus the latest version number
4. On a write, the client must send the version number from its last read and must merge all the non-overwritten values it received into one value
5. When the server receives a write with a given version number, it can overwrite all values with that version number and below (they were merged together into the current write)
Values with higher version numbers must be kept, as they are concurrent with the incoming write. The latest version number is bumped
(see the sketch after this card)

Cons:
extra client work
Automatic merging - a union of the sibling records with deduplication -> this makes deletes tricky, as removed records can come back to life
Deletes are therefore done via tombstone markers

Basket example for a single replica (p. 188-189):
K1, K2 - clients
0. The basket is empty, version is 0
1. K1 adds milk and writes [milk] to the server with version 1
2. K2 adds eggs and is not aware of K1's change
The server saves [eggs] as a separate value with version 2
A read now returns
[eggs] v2
[milk] v1
3. K1, unaware of K2's change, adds flour to the basket
The write [milk, flour] based on v1 is made
The server overwrites version 1 with [milk, flour] and sets the new version to 3
A read now returns
[milk, flour] v3
[eggs] v2
4. K2 adds ham and merges together the [eggs] and [milk] values it saw earlier
Writes [ham, eggs, milk] based on v2
The server overwrites version 2 ([eggs]) with [ham, eggs, milk] and sets this record's version to 4
A read now returns
[ham, eggs, milk] v4
[milk, flour] v3
And so on and so forth…
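A sketch of the general algorithm above for a single replica (the class and method names are made up); running the basket example against it reproduces the states listed:

class SiblingStore:
    """One key on one replica: a version counter plus the values not yet overwritten."""
    def __init__(self):
        self.version = 0
        self.values = {}                       # version -> value (the current siblings)

    def read(self):
        # all non-overwritten values, plus the latest version number
        return dict(self.values), self.version

    def write(self, value, based_on_version):
        # the client merged everything it saw up to based_on_version, so those values
        # can be dropped; higher versions are concurrent with this write and must stay
        self.values = {v: val for v, val in self.values.items() if v > based_on_version}
        self.version += 1
        self.values[self.version] = value
        return self.version

basket = SiblingStore()
v1 = basket.write(["milk"], based_on_version=0)                  # K1, step 1
v2 = basket.write(["eggs"], based_on_version=0)                  # K2, step 2 (concurrent)
print(basket.read())                                             # {1: ['milk'], 2: ['eggs']}, 2
v3 = basket.write(["milk", "flour"], based_on_version=v1)        # K1, step 3
v4 = basket.write(["ham", "eggs", "milk"], based_on_version=v2)  # K2, step 4
print(basket.read())   # {3: ['milk', 'flour'], 4: ['ham', 'eggs', 'milk']}, 4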

23
Q

Version vectors

A

A collection of version numbers, one per replica.
Returned to the client on reads.
Sent back by the client on writes.
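A sketch of comparing two version vectors to decide whether one write supersedes the other or they are concurrent siblings (a vector is a dict of replica_id -> counter):

def compare(a, b):
    keys = set(a) | set(b)
    a_ge = all(a.get(k, 0) >= b.get(k, 0) for k in keys)   # a has seen everything b has
    b_ge = all(b.get(k, 0) >= a.get(k, 0) for k in keys)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a supersedes b"        # b can be overwritten
    if b_ge:
        return "b supersedes a"
    return "concurrent"                # keep both as siblings, merge later

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))   # a supersedes b
print(compare({"r1": 2},          {"r2": 3}))            # concurrent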

24
Q

Context map

A

todo

25
Q

Summary

A

todo

26
Q

What question is useful to ask when considering a system where eventual consistency is present?

A

What happens if the replication lag grows to the order of minutes?