Consistent Hashing Flashcards

(16 cards)

1
Q

What is data partitioning? What is data replication? How do they differ?

A

Data partitioning distributes the data across a set of servers or instances. Data replication involves copying the data to other servers or instances.

Partitioning improves performance and scalability, replication improves availability and durability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What efficiently solved the problem of data partitioning and replication?

A

Consistent hashing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the two primarily challenges of data partitioning?

A

Determining where/which node data will be stored on and supporting scaling or rebalancing based on the number of nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What’s the most naive approach for partitioning data?

A

Round-robin partitioning; Hashing the data to map a given key to a number and performing a module operation (by the number of nodes) to determine where to distribute it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How does consistent hashing play a role in data partitioning? How does it differ from a basic hashing approach?

A

Instead of hashing data to a specific value, it works on ranges of values, which will ensure very little key changes during rebalancing operations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What’s an example of how consistent hashing works? Can you walk through it?

A

If we think of consistent hashing as a ring-like structure with each section of that ring corresponds to a range of values, when we hash the data we would assign it to the node of the range it falls into.

In cases where the number of nodes changed, it would be allocated to the next available range.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Since nodes in a distributed system can frequently be created and removed, how does consistent hashing solve this?

A

It uses virtual nodes as opposed to physical ones as physical nodes require a rebalance during these changes.

It ensures that each physical node supports multiple ranges to allow an even distribution regardless of the number of nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are problems associated with a purely manual or fixed division for node ranges relayed to hashing?

A

Addition/removal of nodes forces token recalculation which can be expensive.

Large node ranges can lead to hotspotting or overloaded nodes.

Rebuilding nodes can be challenging as the only place data is replicated on the replicas for a given node.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How do virtual nodes (vnodes) differ from physical nodes?

A

Instead of using static ranges to assign data to nodes, virtual nodes have ranges that are assigned to the nodes instead which can assign data.

This can allow the cluster to remain balanced as new nodes will be assigned virtual nodes from the other available nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How do virtual nodes differ depending on their corresponding physical machine or nodes?

A

Larger, more powerful machines can assign a high number of ranges, whereas less powerful machines can have smaller subranges.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

How do virtual nodes help prevent hotspotting?

A

Since assignment isn’t based off of a single large range, but instead smaller ranges for each node, it can help ensure that data is more evenly distributed.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What property do highly available and durable systems use in conjunction with consistent hashing? How can it be configured best?

A

Replication, preferably one where the replication factor of a n, where n is the number of nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is replication factor? Can you provide an example?

A

Replication factor is the number of copies in of each piece of data in a system. For example, if the replication factor is three, that means there are three copies each stored on different nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How is replication generally handled in a distributed system? How does this differ in consistent and eventually consistent systems?

A

Typically a coordinator node will be responsible for ensuring how the data is replicated to the number of target replicas. In conssitent systems, it uses a synchronous or semi-synchronous process, in eventually consistent it can be asynchronous.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What important factor can eventual consistently help achieve?

A

It can help achieve high availability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are two technologies that leverage consistent hashing?

A

Amazon Dynamo and Apache Cassandra both use it to distribute and replicate data.