Designing Data-Intensive Applications Flashcards

Question

What is a partition key?

Answer 1

A partition key is a field or set of fields in a dataset that is used to determine which partition a given record belongs to.

Answer 2

A partitioning function is a function that takes a partition key as input and outputs the partition number or ID that the record belongs to.

Answer 3

Data skew is a situation where some partitions in a distributed system receive a disproportionate amount of data or traffic, causing performance problems or even system failure.

Answer 4

A hot spot is a partition or node in a distributed system that receives a disproportionately large amount of traffic or load, causing performance problems or even system failure.

Answer 5

Consistent hashing is a hashing technique that lets a distributed system add or remove nodes while remapping only a small fraction of keys, instead of reassigning nearly all of them.
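
The idea can be illustrated with a small hash ring. This is a minimal sketch, not a production implementation: the class name `HashRing`, the helper `stable_hash`, the use of MD5, and the choice of 100 virtual nodes per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Hash a string to an integer that is stable across runs (unlike built-in hash())."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []        # sorted list of (position, node) pairs
        self._positions = []   # positions only, kept parallel to _ring for bisect
        for node in nodes:
            self.add_node(node)

    def _rebuild(self):
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    def add_node(self, node):
        # Each node is placed at many points on the ring to even out the load.
        for i in range(self.vnodes):
            self._ring.append((stable_hash(f"{node}#{i}"), node))
        self._rebuild()

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]
        self._rebuild()

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first ring position after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._positions, stable_hash(key))
        return self._ring[idx % len(self._ring)][1]
```

When a node is added to such a ring, the only keys that change owner are the ones the new node takes over; every other key stays where it was.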

Answer 6

Hash partitioning is a partitioning scheme where a hash function is applied to a partition key, and the resulting hash value determines the partition number or ID.
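
A minimal sketch of hash partitioning, assuming string keys and a fixed number of partitions. The function name `partition_for` and the choice of SHA-256 are illustrative assumptions; any hash that is stable across processes would serve.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition number in [0, num_partitions)."""
    # Use a hash that is stable across processes; Python's built-in hash()
    # is randomized per process for strings, so it is unsuitable here.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note that with plain modulo, changing `num_partitions` reassigns most keys, which is the problem consistent hashing is meant to reduce.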

Answer 7

Range partitioning is a partitioning scheme where data is partitioned based on a range of values in a partition key, such as a timestamp or a numeric range.
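
A minimal sketch of range partitioning, assuming string keys and three hand-picked boundaries. The names `BOUNDARIES` and `range_partition` and the boundary values are hypothetical.

```python
import bisect

# Hypothetical split points. They define four partitions:
#   0: keys < "g"    1: "g" <= keys < "n"    2: "n" <= keys < "t"    3: keys >= "t"
BOUNDARIES = ["g", "n", "t"]


def range_partition(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    # bisect_right counts how many boundaries are <= key, which is exactly
    # the index of the range the key falls into.
    return bisect.bisect_right(BOUNDARIES, key)
```

Because adjacent keys land in the same partition, range scans are cheap, but a sequentially increasing key such as a timestamp sends all new writes to one partition.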

Answer 8

List partitioning is a partitioning scheme where each partition is assigned an explicit list of partition-key values (for example, a set of country codes), rather than a hash value or a contiguous range.
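
A minimal sketch of list partitioning, assuming the partition key is a country code. The partition names, the value lists, and the fallback partition `"other"` are all hypothetical.

```python
# Hypothetical mapping from partition name to the explicit values it holds.
PARTITION_LISTS = {
    "europe": {"DE", "FR", "ES"},
    "americas": {"US", "CA", "BR"},
}


def list_partition(country_code: str, default: str = "other") -> str:
    """Return the partition whose value list contains `country_code`."""
    for name, values in PARTITION_LISTS.items():
        if country_code in values:
            return name
    # Values that appear in no list go to a catch-all partition.
    return default
```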

Answer 9

A transaction is a sequence of operations that are executed as a single unit of work.

Answer 10

The ACID properties: Atomicity, Consistency, Isolation, and Durability.

Answer 11

Atomicity means that a transaction should be treated as a single, indivisible unit of work, so either all of its operations are executed or none of them are.
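
Atomicity can be observed with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `accounts` table, the `transfer` function, and the starting balances are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing unit."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        # Credit first, so a failing debit shows the credit being undone.
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
```

If the debit violates the `CHECK` constraint, `sqlite3.IntegrityError` is raised and the earlier credit is rolled back with it, so neither balance changes.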

Answer 12

Consistency in transactions means that a transaction must leave the database in a valid state: the invariants and constraints defined by the application still hold after it completes.

Answer 13

Isolation means that transactions should be executed in a way that does not interfere with each other, even if they are executed at the same time.

Answer 14

Durability means that once a transaction is committed, the data it wrote will not be lost, even in the event of a crash or system failure.

Answer 15

A serializable transaction is one that runs under serializable isolation: even when many transactions execute concurrently, the result is the same as if they had run one at a time in some serial order. This prevents the race conditions that concurrent execution could otherwise introduce and keeps the data consistent.

Answer 16

A concurrent transaction is a transaction that is executed at the same time as other transactions.

Answer 17

A transaction log is a record of all the operations performed by transactions, used to ensure durability and recoverability in the event of a system failure or crash.
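
A minimal sketch of the idea: an append-only log that is written before the in-memory state is updated and replayed on startup. The class name `LoggedStore` and the JSON-lines record format are assumptions, and the sketch does not handle a partially written final record.

```python
import json
import os


class LoggedStore:
    """Hypothetical key-value store that recovers its state from a log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Rebuild the in-memory state by re-applying every logged write in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Append to the log and force it to disk before applying the write,
        # so an acknowledged write survives a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value
```

Creating a second `LoggedStore` on the same path simulates a restart: it reads the log and ends up with the same contents.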

Answer 18

A distributed system is a computer system made up of multiple parts that are located in different places, and work together to provide a service or run an application.

Answer 19

Network latency is the delay or time it takes for data to travel from one node to another in a distributed system.

Answer 20

Caching, pre-fetching, and data replication are all techniques that can help reduce the impact of network latency in a distributed system.

Answer 21

A node failure is when a component or part of a distributed system stops working, either due to hardware or software issues.

Answer 22

Redundancy, replication, and failure detection are all techniques that can help maintain system availability and prevent data loss in the event of a node failure.

Answer 23

Consistency is the property of a distributed system where all nodes see the same data at the same time, regardless of which node they access.

Answer 24

Eventual consistency is a weaker form of consistency where all nodes eventually see the same data, but there may be a delay or inconsistency in the meantime.

Answer 25

Consensus is the process of agreeing on a value or decision among a group of nodes in a distributed system.

Answer 26

A consensus algorithm is a set of rules or procedures that nodes in a distributed system use to agree on a value or decision.

Answer 27

The CAP theorem states that a distributed system cannot simultaneously guarantee all three of consistency, availability, and partition tolerance. Because network partitions cannot be ruled out in practice, the real trade-off is between consistency and availability while a partition is in progress.

Answer 28

Batch processing means running a set of tasks or jobs over a bounded collection of input data all at once, rather than handling each item interactively or in real time as it arrives. The results become available only after the batch has completed.
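
A minimal sketch of a batch job, assuming the input is a fixed list of text lines. The function name `run_batch_job` and the word-count task are illustrative choices.

```python
from collections import Counter


def run_batch_job(lines):
    """Count word occurrences across the whole input, then return the result."""
    counts = Counter()
    for line in lines:
        # Normalize case and split on whitespace.
        counts.update(line.lower().split())
    # Nothing is reported until every line has been processed.
    return counts
```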

Answer 29

A batch job is a unit of work that is executed as part of a batch processing system, typically involving data processing or analysis.

Answer 30

A stream processor is a component of a stream processing system that is responsible for analyzing and processing data in real-time, as it flows through the stream.
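
A minimal sketch of a stream processor, assuming events arrive as an iterable of hashable values. The generator `running_counts` is a hypothetical example that emits an updated count after each event instead of waiting for the input to end.

```python
from collections import defaultdict


def running_counts(events):
    """Yield (event, count_so_far) as each event arrives."""
    counts = defaultdict(int)
    for event in events:
        counts[event] += 1
        # Emit a result immediately; the stream may never end.
        yield event, counts[event]
```

Because it is a generator, it works on unbounded input: each result is produced as soon as its event has been seen.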

(54 cards)