Chapter 5 - Replication Flashcards by Don C

What is replication?

Keeping the same data on multiple machines that are connected via a network

What are the reasons one may want to replicate data?

Reduce latency by keeping data geographically close to users
Increase availability, as the system can continue to work even if some parts fail
Increase read throughput by scaling out the number of machines that can serve read queries

What are the three main approaches to replicating changes between nodes?

Single leader
Multi-leader
Leaderless replication

What is a replica?

A node/server that stores a copy of the database.
Every write needs to be processed by every replica; otherwise the replicas no longer contain the same data.

What is leader-based replication?

One replica is designated the leader
All client write queries go to the leader
The other replicas are followers
When the leader writes new data to its local storage, it also sends the data change to its followers as part of the change stream (replication log)
Clients can read from any replica
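
A minimal sketch of the flow above, assuming hypothetical in-memory `Replica` and `Leader` classes (the names and the dict-based storage are illustrative, not taken from any real database):

```python
class Replica:
    """Hypothetical in-memory replica: a dict plus the log of applied changes."""

    def __init__(self):
        self.data = {}
        self.log = []

    def apply(self, change):
        key, value = change
        self.data[key] = value
        self.log.append(change)

    def read(self, key):
        # Clients can read from any replica, leader or follower.
        return self.data.get(key)


class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        change = (key, value)
        self.apply(change)              # write to local storage first
        for follower in self.followers:
            follower.apply(change)      # then forward the change to each follower


followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("title", "A")              # all writes go through the leader
```

Followers never accept writes directly in this sketch; they only apply what the leader forwards.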

What are synchronous and asynchronous replication?

Synchronous: The leader waits for the follower to confirm it has received the write before reporting success to the user
Asynchronous: The leader sends the message to the follower replica but does not wait for a response
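
The difference can be sketched as follows, assuming one synchronous and one asynchronous follower. The `outbox` queue standing in for the network, and all class names, are assumptions made for illustration:

```python
from collections import deque


class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgement sent back to the leader


class Leader:
    def __init__(self, sync_follower, async_follower):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_follower = async_follower
        self.outbox = deque()  # changes not yet delivered to the async follower

    def write(self, key, value):
        self.data[key] = value
        # Synchronous: wait for the follower's confirmation before reporting success.
        acked = self.sync_follower.apply(key, value)
        # Asynchronous: queue the change and move on without waiting.
        self.outbox.append((key, value))
        return acked

    def flush(self):
        # Simulates the async changes eventually arriving.
        while self.outbox:
            self.async_follower.apply(*self.outbox.popleft())


sync_f, async_f = Follower(), Follower()
leader = Leader(sync_f, async_f)
ok = leader.write("x", 1)
```

Right after `write` returns, `sync_f` already holds the value while `async_f` only receives it once `flush()` runs, which is the replication lag the later cards discuss.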

What are the disadvantages of synchronous replication?

Synchronous replication may slow down the entire system if the follower is recovering from a failure, the system is near capacity, or there are networking problems
It is impractical for all followers to be synchronous: any single node outage would cause the system to grind to a halt

What are the advantages of synchronous replication?

Synchronous replication guarantees the follower has an up-to-date copy of the data, consistent with the leader
A synchronous follower can be promoted to leader if the leader fails

What are the advantages of asynchronous replication?

The leader can continue to process writes even if all followers are down

How can we add new followers in leader-based replication?

Take a consistent snapshot of the leader's database without taking a lock on the database (most DBs have this feature)
Copy snapshot to follower node
Follower requests all data changes that have happened since the snapshot was taken
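
The three steps above can be sketched like this, assuming the snapshot is tagged with its position in the leader's replication log. The `Leader` and `Follower` classes and their method names are hypothetical:

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # A consistent copy of the data plus the log position it corresponds to.
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        self.data = snapshot_data   # step 2: snapshot copied to the follower
        self.position = position

    def catch_up(self, leader):
        # Step 3: request and apply every change made since the snapshot.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
snap_data, snap_pos = leader.snapshot()   # step 1
leader.write("b", 2)                      # writes continue after the snapshot
leader.write("a", 3)
follower = Follower(snap_data, snap_pos)
follower.catch_up(leader)
```

Because the follower remembers its log position, calling `catch_up` again after a restart would fetch only the changes it missed.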

How can we handle node outages for followers in leader-based replication?

Once the follower has restarted, it checks its log for the last transaction it processed
The follower can then request all the data changes that occurred since then
After catching up, it can continue receiving a stream of data changes as before

How can we handle node outages for leader in leader-based replication?

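Failover: a follower is promoted to leader, chosen either by an election among the replicas or by a previously elected controller node; clients and the other followers are then reconfigured to use the new leader
With asynchronous replication the new leader may be missing some of the old leader's writes, and there is no easy way to recover those unreplicated writes (they are commonly discarded)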

What is statement-based replication?

The leader logs every write request (statement) that it executes
The leader sends that statement log to its followers
For relational databases this means every literal SQL statement (INSERT, DELETE, UPDATE) is forwarded to followers
The followers parse and execute each statement as if it had been received from a client

What are the potential pitfalls of statement-based replication?

Statements that call non-deterministic functions such as NOW() or RAND() are likely to generate a different value on each replica
If statements use autoincrementing columns or depend on existing data, they must be executed in exactly the same order on each replica
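
A small illustration of the first pitfall, assuming each replica has its own random-number state. The function `execute_statement` is a hypothetical stand-in for running `UPDATE t SET token = RAND()`, and the dicts stand in for databases:

```python
import random


def execute_statement(db, rng):
    # Stands in for: UPDATE t SET token = RAND()
    db["token"] = rng.random()


leader_db, follower_db = {}, {}
# Each replica evaluates RAND() independently (modelled here by different seeds).
execute_statement(leader_db, random.Random(1))
execute_statement(follower_db, random.Random(2))
diverged = leader_db["token"] != follower_db["token"]

# One common workaround: the leader evaluates the function once and ships
# the concrete value to the followers instead of the original statement.
follower_fixed = {}
follower_fixed["token"] = leader_db["token"]
```

Replaying the same statement gives the two replicas different tokens, while shipping the evaluated value keeps them identical.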

What is write-ahead log shipping?

Both log-structured storage engines and B-trees write every change to an append-only log on disk
The leader sends this log to its followers, which process it to build a copy of the exact same data structures as found on the leader

What are the disadvantages of write-ahead log shipping?

The write-ahead log contains details of which bytes were changed in which disk blocks
This makes replication closely coupled to the storage engine
It is typically not possible to run different versions of the database software on the leader and the followers

What is logical (row-based) log replication?

Uses a different log format for replication than for the storage engine
The logical log is a sequence of records describing the writes to database tables at the granularity of a row
Because it is decoupled from storage internals, it allows different nodes to run different database versions or even different storage engines
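
As a rough sketch, a logical log can be pictured as a list of row-level records that any follower can replay. The record layout below (`op`, `table`, `key`, `row`) is an assumed format for illustration, not the format of any particular database:

```python
def apply_record(tables, record):
    """Apply one row-level log record to a dict-of-dicts 'database'."""
    table = tables.setdefault(record["table"], {})
    if record["op"] in ("insert", "update"):
        # Inserts carry the new row; updates carry the new column values.
        table[record["key"]] = record["row"]
    elif record["op"] == "delete":
        # Deletes only need enough information to identify the row.
        table.pop(record["key"], None)
    else:
        raise ValueError(f"unknown operation: {record['op']}")


log = [
    {"op": "insert", "table": "users", "key": 1, "row": {"name": "Ann"}},
    {"op": "update", "table": "users", "key": 1, "row": {"name": "Anne"}},
    {"op": "insert", "table": "users", "key": 2, "row": {"name": "Bob"}},
    {"op": "delete", "table": "users", "key": 2},
]

follower_tables = {}
for record in log:
    apply_record(follower_tables, record)
```

Nothing in the records refers to disk blocks or bytes, which is why the follower is free to store the rows however it likes.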

What is trigger-based replication?

Lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system.
This custom application code or external process can then replicate the data change to another system

What is read-after-write or read-your-write consistency?

A guarantee that if a user writes a change to the database they will always see any updates they submitted themselves
If the same user accesses the service from several devices, cross-device read-after-write consistency also needs to be considered
Can be implemented by forwarding the reads of a user that has recently written to the leader or a sufficiently updated follower
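
One way to sketch that routing rule, assuming a fixed time window after each write during which the user's reads go to the leader. The `ReadRouter` class and the 60-second window are illustrative assumptions; timestamps are passed in explicitly to keep the example deterministic:

```python
class ReadRouter:
    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id, now):
        self.last_write[user_id] = now

    def choose_replica(self, user_id, now):
        last = self.last_write.get(user_id)
        if last is not None and now - last < self.window:
            return "leader"    # the user wrote recently: followers may lag behind
        return "follower"      # otherwise any follower is good enough


router = ReadRouter(window_seconds=60.0)
router.record_write("alice", now=100.0)
```

Here `router.choose_replica("alice", now=130.0)` picks the leader, while a read at `now=200.0`, or a read by a user who has not written, goes to a follower.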

What are monotonic reads?

A guarantee that if a user makes several reads in sequence, they won’t read older data after having previously read newer data
e.g. a user reads from one follower and sees 2 comments, then reads from another follower with more lag and sees only the 1st comment
Can be implemented by making sure each user always reads from the same replica
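
A minimal sketch of pinning each user to one replica, assuming the replica is chosen from a hash of the user ID. The function name and replica labels are made up for the example; `hashlib` is used because Python's built-in `hash()` for strings changes between processes:

```python
import hashlib


def replica_for(user_id, replicas):
    """Map a user ID to one replica, the same one on every call."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["follower-1", "follower-2", "follower-3"]
first = replica_for("user-42", replicas)
second = replica_for("user-42", replicas)
```

If the chosen replica fails, the user's reads have to be rerouted to another replica, at which point the guarantee can be briefly violated.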

What are consistent prefix reads?

A guarantee that if a sequence of writes happens in a certain order, anyone reading those writes will see them appear in the same order
If the database always applies writes in the same order, reads always see a consistent prefix; the anomaly is mainly a problem in partitioned databases

What is a multi-leader configuration?

There are multiple leaders in the database topology; each leader accepts writes and also acts as a follower to the other leaders
Within a single datacenter, the benefits rarely outweigh the added complexity

What are some disadvantages of multi-leader replication?

The same data may be concurrently modified in two different datacenters
Those write conflicts must be resolved

Describe a simple write conflict in a multi-leader database

A wiki page is being simultaneously edited by two users
User 1 changes the title from A to B
User 2 changes the title from A to C
Each user's change is successfully applied to their local leader, but when the changes are asynchronously replicated a conflict is detected

How can we avoid conflicts?

Method 1: Changes to a certain page, for example, are always sent to the same leader
Method 2: Last Write Wins
Method 3: Give each replica a unique ID, writes from the higher-numbered replica take precedence
Method 4: Record the conflict in an explicit data structure and write application code that resolves the conflict on read
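
Methods 2 and 3 can be sketched as simple "pick a winner" functions. The dict layout of a write (`value`, `timestamp`, `replica_id`) is an assumption for illustration, and both approaches silently discard the losing write:

```python
def last_write_wins(write_a, write_b):
    """Method 2: keep the write with the larger timestamp.

    Ties are broken by replica ID so every replica picks the same winner.
    """
    return max(write_a, write_b, key=lambda w: (w["timestamp"], w["replica_id"]))


def higher_replica_wins(write_a, write_b):
    """Method 3: the write from the higher-numbered replica takes precedence."""
    return max(write_a, write_b, key=lambda w: w["replica_id"])


write_b = {"value": "B", "timestamp": 105, "replica_id": 1}
write_c = {"value": "C", "timestamp": 100, "replica_id": 2}

lww_winner = last_write_wins(write_b, write_c)
id_winner = higher_replica_wins(write_b, write_c)
```

With these inputs the two methods disagree: last write wins keeps `B` (later timestamp), while the replica-ID rule keeps `C`. Either way one user's edit is lost, which is why Method 4 keeps both versions for the application to merge.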

What is a replication topology?

The communication path along which writes are propagated from one node to another

What is the all-to-all multi-leader replication topology?

Every leader sends its writes to every other leader

What is the circular multi-leader replication topology?

Each node receives writes from one node and forwards those writes, plus any writes of its own, to one other node

What is the star multi-leader replication topology?

One node is designated as the root node, which forwards writes to all other nodes

What problems may arise with the star and circular topology?

If just one node fails, it can interrupt the flow of replication messages between the other nodes

What problems may arise with the all-to-all topology?

Client A inserts a row into a table on leader 1
Client B updates that row on leader 3
Leader 2 may receive the writes in a different order and be asked to update a row that does not exist yet

What is leaderless replication?

Also known as Dynamo-style. Any node can process client write requests. A coordinator node may send write requests to other nodes on behalf of clients.

What are quorum writes and quorum reads?

Quorum write: Clients send their write requests to all/multiple replicas. If the number of nodes that respond successfully reaches a certain threshold (w), the write is considered successful.
Quorum read: Clients send read requests to several nodes in parallel; version numbers are used to determine which value is newer.
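
A minimal sketch of both operations, assuming each value is stored together with a version number and that a node which is down simply does not acknowledge. The `Node` class and the function names are hypothetical:

```python
class Node:
    def __init__(self):
        self.up = True
        self.store = {}  # key -> (version, value)

    def put(self, key, version, value):
        if not self.up:
            return False                      # no acknowledgement from a down node
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def get(self, key):
        return self.store.get(key)


def quorum_write(nodes, key, version, value, w):
    # Send the write to every replica and count acknowledgements.
    acks = sum(1 for node in nodes if node.put(key, version, value))
    return acks >= w


def quorum_read(nodes, key, r):
    reachable = [node for node in nodes if node.up]
    if len(reachable) < r:
        raise RuntimeError("fewer than r nodes responded")
    replies = [node.get(key) for node in reachable]
    found = [reply for reply in replies if reply is not None]
    if not found:
        return None
    # The reply with the highest version number is the newest value.
    return max(found, key=lambda reply: reply[0])[1]


nodes = [Node(), Node(), Node()]
quorum_write(nodes, "title", 1, "A", w=2)
nodes[2].up = False                           # node 2 misses the next write
second_ok = quorum_write(nodes, "title", 2, "B", w=2)
nodes[2].up = True
nodes[0].up = False                           # a different node is down for the read
latest = quorum_read(nodes, "title", r=2)
```

The read reaches one node holding version 2 and one stale node holding version 1, and the version number lets it return `B`.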

What is read repair?

Clients read from several nodes in parallel (quorum reads)
If the client sees that one of the responses is stale, it writes the newer value back to that replica
Works well for data that is frequently read
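
Read repair might be sketched as below, assuming each replica is a plain dict mapping a key to a `(version, value)` pair. The function name and data layout are assumptions made for the example:

```python
def read_with_repair(replicas, key):
    """Read from all given replicas, return the newest value, fix stale copies."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    present = [resp for _, resp in responses if resp is not None]
    if not present:
        return None
    newest = max(present, key=lambda resp: resp[0])   # highest version wins
    for replica, resp in responses:
        if resp is None or resp[0] < newest[0]:
            replica[key] = newest                     # write the newer value back
    return newest[1]


replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
value = read_with_repair(replicas, "k")
```

After the call all three replicas hold `(2, "new")`. Values that are never read are never repaired this way, which is why a background anti-entropy process is also useful.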

What is an anti-entropy process?

A background process that looks for differences in data and copies missing data from one replica to another

What is the quorum condition?

If there are n replicas
Every write must be confirmed by w nodes to be considered successful
And we must query at least r nodes for each read
As long as w + r > n, we expect to get at least one up-to-date value when reading
Intuition: the set of nodes written and the set of nodes read must overlap
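
The overlap argument can be checked by brute force for small clusters. This sketch (the function name is made up) enumerates every possible write set of size w and read set of size r and tests whether they always share a node:

```python
from itertools import combinations


def always_overlap(n, w, r):
    """True if every w-node write set intersects every r-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


typical = always_overlap(n=3, w=2, r=2)    # w + r = 4 > 3
too_small = always_overlap(n=3, w=1, r=2)  # w + r = 3, not > 3
```

For n = 3, the common choice w = r = 2 always overlaps, while w = 1, r = 2 can miss the one node that took the write.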

How can stale values be returned even if the quorum condition is met?

Two writes occur concurrently, especially if last write wins is used
A write happens concurrently with a read
A write succeeds on some replicas but fails on others and is not rolled back, so subsequent reads may or may not return its value
A node carrying the new value fails and is restored from a replica carrying the old value, breaking the quorum condition

What is a sloppy quorum and hinted handoff?

The client cannot connect to the usual n nodes on which the data is stored
The data can be written to any w reachable nodes, which may include nodes that are not where the data is usually stored (sloppy quorum)
Once the network interruption is fixed, that data is sent back to the usual n nodes (hinted handoff)
Useful for increasing write availability

What are concurrent operations? (tricky)

Two operations that are unaware of each other: there is no happens-before relationship between them.
e.g. User 1 changes title A to B, User 2 changes title A to C

Describe a versioning algorithm to capture the happens-before relationship and deal with concurrent writes

Page 187-188 haha

(40 cards)