Designing Data-Intensive Applications Flashcards

1
Q

What are the three main goals of a data-intensive application?

A

Reliability, scalability, and maintainability.

2
Q

What is reliability in the context of a data-intensive application?

A

The ability of the application to continue functioning correctly, even in the face of errors, failures, or unexpected inputs.

3
Q

What is scalability in the context of a data-intensive application?

A

The ability of the application to handle increased load or demand, by adding more resources or by distributing the workload across multiple machines.

4
Q

What is maintainability in the context of a data-intensive application?

A

The ease with which the application can be modified, updated, or fixed over time, without introducing errors or breaking existing functionality.

5
Q

What are the two main types of storage systems?

A

Disk-based storage and memory-based storage.

6
Q

What is disk-based storage?

A

Disk-based storage uses hard disk drives (HDDs) or solid-state drives (SSDs) to store data persistently on disk.

7
Q

What is memory-based storage?

A

Memory-based storage keeps data in volatile memory (RAM), which is much faster than disk but more expensive per byte. The data is lost on a crash or power failure unless it is also persisted elsewhere.

8
Q

What are some common disk-based storage systems?

A

Relational databases, file systems, and key-value stores.

9
Q

What are some common memory-based storage systems?

A

In-memory databases, distributed caches, and message brokers.

10
Q

What are some trade-offs between disk-based and memory-based storage?

A

Memory-based storage is faster but more expensive and less durable than disk-based storage. Disk-based storage is slower but more affordable and durable.

11
Q

What is a B-tree?

A

A B-tree is a type of data structure used for indexing and searching data in disk-based storage systems.

12
Q

What is a hash index?

A

A hash index is a type of index that uses a hash function to map keys to addresses in memory or on disk.
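
As a rough illustration of the idea (a minimal sketch, not a production design): the class below keeps an in-memory hash map from each key to the byte offset of its latest record in an append-only file. The class name `HashIndexedLog`, the comma-separated record format, and the file name are all assumptions made for this example; keys must not contain commas and values must not contain newlines.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only log file plus an in-memory hash index of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(path, "ab").close()  # create the file if it does not exist yet

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()  # the new record starts at the current end of file
            f.write(record)
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record, no scan needed
            line = f.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value


path = os.path.join(tempfile.mkdtemp(), "data.log")
db = HashIndexedLog(path)
db.put("user:1", "Ada")
db.put("user:1", "Grace")  # newer record is appended; the index now points at it
latest = db.get("user:1")
```

Here Python's `dict` plays the role of the hash table, so a lookup costs one hash probe plus one disk seek.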

13
Q

What is column-oriented storage?

A

Column-oriented storage is a method of storing data where each column of a table is stored separately, instead of storing entire rows of a table together.
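
To make the layout difference concrete, here is a small sketch with made-up sales data. The table contents and the helper name `to_columns` are assumptions for illustration only.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"date": "2024-01-01", "product": "apple", "quantity": 3},
    {"date": "2024-01-01", "product": "pear", "quantity": 5},
    {"date": "2024-01-02", "product": "apple", "quantity": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: all values of one column are stored together,
# and the i-th entry of every column belongs to the same row.
columns = to_columns(rows)

# An aggregation over one column touches only that column's values.
total_quantity = sum(columns["quantity"])
```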

14
Q

What are some advantages of column-oriented storage?

A

Column-oriented storage can be more efficient for analytic queries, such as aggregations over many rows, because the database reads only the columns that are relevant to the query. It can also be more space-efficient, because the values within a single column tend to be similar and therefore compress well.

15
Q

What is encoding?

A

Encoding is the process of representing data in a format that can be stored, transmitted, or processed by a computer.
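
A small example, using JSON as one possible encoding (the record contents are made up):

```python
import json

record = {"user_id": 42, "name": "Ada", "interests": ["databases", "hiking"]}

# Encoding: in-memory object -> sequence of bytes that can be stored or sent.
encoded = json.dumps(record).encode("utf-8")

# Decoding: bytes -> in-memory object again.
decoded = json.loads(encoded.decode("utf-8"))
```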

16
Q

What is schema evolution?

A

Schema evolution is the process of changing the structure or format of stored data over time, while preserving the ability to read and write data in both the old and new formats.
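
One way this can look in practice, sketched with JSON records. The two schema versions, the `email` field, and the function names are hypothetical: version 2 adds an optional `email` field to version 1.

```python
import json

# Hypothetical: schema v2 adds an optional "email" field with a default value.
V2_DEFAULTS = {"email": None}


def decode_user_v2(raw):
    """New reader: fills in defaults for fields that old writers never wrote
    (backward compatibility)."""
    record = json.loads(raw)
    for field, default in V2_DEFAULTS.items():
        record.setdefault(field, default)
    return record


def decode_user_v1(raw):
    """Old reader: keeps the fields it knows and ignores anything newer
    (forward compatibility)."""
    record = json.loads(raw)
    return {"id": record["id"], "name": record["name"]}


old_data = '{"id": 1, "name": "Ada"}'
new_data = '{"id": 2, "name": "Bob", "email": "bob@example.com"}'
```

Both readers can handle both records, which is what allows old and new code to run side by side during a rolling upgrade.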

17
Q

What is replication?

A

Replication is the process of copying data from one node in a distributed system to one or more other nodes, in order to improve availability, performance, and/or durability.

18
Q

What are some common replication topologies?

A

Single-leader (master-slave) replication, where one node (the leader, or master) accepts writes and replicates them to one or more other nodes (the followers, or slaves). Multi-leader (multi-master) replication, where several nodes accept writes and replicate them to each other. Leaderless replication, where all replicas are equal and any of them can accept writes directly.

19
Q

What is a replica?

A

A replica is a copy of data that has been replicated from one node to another.

20
Q

What is a quorum?

A

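A quorum is the minimum number of replicas that must agree on a value or decision for the system to make progress. In leaderless replication with n replicas, requiring w write acknowledgements and r read responses such that w + r > n ensures that every read overlaps with the latest successful write.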
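
A sketch of the usual quorum condition for reads and writes, where n is the number of replicas, w the number of write acknowledgements required, and r the number of read responses required. The function names and the (version, value) response format are assumptions for this example.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read set must share at least one replica with every
    write set, i.e. w + r > n."""
    return w + r > n


def read_latest(responses):
    """Given (version, value) pairs from the replicas that answered a read,
    return the value carrying the highest version number."""
    _, value = max(responses, key=lambda response: response[0])
    return value
```

With n = 3, w = 2, r = 2, the read and write sets overlap, so at least one response contains the newest version and `read_latest` can pick it out.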

21
Q

What is the CAP theorem?

A

The CAP theorem states that in a distributed system, it is impossible to simultaneously provide all of the following guarantees: Consistency, Availability, and Partition tolerance.

22
Q

What is the P in the CAP theorem?

A

The P in the CAP theorem stands for Partition tolerance, which means that the system can continue to function correctly even when some nodes are unable to communicate with each other.

23
Q

What is partitioning?

A

Partitioning is the process of splitting a large dataset into smaller subsets called partitions, which can be distributed across multiple nodes in a distributed system.

24
Q

What are some common partitioning schemes?

A

Hash partitioning, range partitioning, and list partitioning.

25
Q

What is a partition key?

A

A partition key is a field or set of fields in a dataset that is used to determine which partition a given record belongs to.

26
Q

What is a partitioning function?

A

A partitioning function is a function that takes a partition key as input and outputs the partition number or ID that the record belongs to.

27
Q

What is data skew?

A

Data skew is a situation where some partitions in a distributed system receive a disproportionate amount of data or traffic, causing performance problems or even system failure.
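
A toy demonstration of skew, assuming a workload in which one made-up key ("celebrity") accounts for 90% of the events. The helper name `crc_partition` and the choice of CRC32 as the hash are assumptions for illustration.

```python
import zlib
from collections import Counter


def crc_partition(key: str, num_partitions: int) -> int:
    """Assign a key to a partition using a deterministic hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# 90 events for one very popular key, 10 events spread over other keys.
keys = ["celebrity"] * 90 + [f"user{i}" for i in range(10)]

load = Counter(crc_partition(k, 4) for k in keys)
hottest_partition, hottest_count = load.most_common(1)[0]
```

Hashing spreads distinct keys evenly, but all events for the same key still land on the same partition, so that partition ends up with at least 90 of the 100 events.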

28
Q

What is a hot spot?

A

A hot spot is a partition or node in a distributed system that receives a disproportionately large amount of traffic or load, causing performance problems or even system failure.

29
Q

What is consistent hashing?

A

Consistent hashing is a hashing technique that allows a distributed system to add or remove nodes without having to remap all the partition assignments.
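
A minimal sketch of one common form of consistent hashing, a hash ring with virtual nodes. The class name `HashRing`, the use of MD5, and the default of 100 virtual nodes per physical node are assumptions for this example, not a description of any particular database.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns several points ("virtual nodes") on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        positions = [pos for pos, _ in self._ring]
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect(positions, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, the only keys that move are those now owned by the new node; every other key keeps its previous assignment.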

30
Q

What is hash partitioning?

A

Hash partitioning is a partitioning scheme where a hash function is applied to a partition key, and the resulting hash value determines the partition number or ID.
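
As a sketch, a partitioning function for this scheme might look like the following. The function name and the choice of MD5 are assumptions; any hash that is stable across processes and machines would serve.

```python
import hashlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Return the partition number for a key: hash it, then take the
    result modulo the number of partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Python's built-in `hash()` is deliberately avoided here because, for strings, its result can differ between processes, which would send the same key to different partitions.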

31
Q

What is range partitioning?

A

Range partitioning is a partitioning scheme where data is partitioned based on a range of values in a partition key, such as a timestamp or a numeric range.
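
A sketch of range partitioning over string keys. The split points in `BOUNDARIES` are made up for the example.

```python
import bisect

# Hypothetical split points: keys below "g" go to partition 0,
# "g" up to (not including) "n" to partition 1, "n" up to "t" to
# partition 2, and everything from "t" onward to partition 3.
BOUNDARIES = ["g", "n", "t"]


def range_partition(key: str) -> int:
    """Return the index of the key range that contains the key."""
    return bisect.bisect_right(BOUNDARIES, key)
```

Because neighbouring keys land in the same partition, a range scan only needs to visit the partitions whose ranges overlap the query.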

32
Q

What is list partitioning?

A

List partitioning is a partitioning scheme where each partition is assigned an explicit list of partition key values (for example, a set of country codes), rather than being determined by a hash function or a contiguous range.

33
Q

What is a transaction?

A

A transaction is a sequence of operations that are executed as a single unit of work.

34
Q

What are the properties of a transaction?

A

The ACID properties: Atomicity, Consistency, Isolation, and Durability.

35
Q

What is atomicity?

A

Atomicity means that a transaction should be treated as a single, indivisible unit of work, so either all of its operations are executed or none of them are.
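
A small demonstration using Python's built-in `sqlite3` module. The `accounts` table, the balances, and the `transfer` helper are made up for the example; the CHECK constraint is there only to force the second statement to fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the debit would make the balance negative, the constraint fails and the credit that was already applied is rolled back with it.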

36
Q

What is consistency?

A

Consistency in transactions means that any changes made to a database by a transaction should make sense and not violate any rules or constraints set by the application.

37
Q

What is isolation?

A

Isolation means that transactions should be executed in a way that does not interfere with each other, even if they are executed at the same time.

38
Q

What is durability?

A

Durability means that once a transaction is committed, its effects are permanent and will not be lost, even in the event of a system failure.

39
Q

What is a serializable transaction?

A

A serializable transaction runs under the strongest isolation guarantee: even though many transactions may execute concurrently, the result is the same as if they had executed one at a time, in some serial order. This prevents the race conditions that concurrent execution could otherwise cause and keeps the data consistent.

40
Q

What is a concurrent transaction?

A

A concurrent transaction is a transaction that is executed at the same time as other transactions.

41
Q

What is a transaction log?

A

A transaction log is a record of all the operations performed by transactions, used to ensure durability and recoverability in the event of a system failure or crash.
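
A much-simplified sketch of the idea: every write is appended to a log file and flushed to disk before the in-memory state is updated, and the log is replayed on startup. The class name `LoggedStore` and the JSON-lines log format are assumptions for this example; real databases use more compact formats and also handle partially written entries.

```python
import json
import os
import tempfile


class LoggedStore:
    """Key-value store that recovers its state by replaying a log file."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Recovery: re-apply every logged operation in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def set(self, key, value):
        # Log first, and force it to disk, before changing in-memory state.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedStore(log_path)
store.set("x", 1)
store.set("x", 2)

# Simulate a restart: a fresh instance rebuilds its state from the log.
recovered = LoggedStore(log_path)
```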

42
Q

What is a distributed system?

A

A distributed system is a system made up of multiple computers (nodes) that communicate over a network and work together to provide a service or run an application.

43
Q

What is network latency?

A

Network latency is the delay or time it takes for data to travel from one node to another in a distributed system.

44
Q

What are some techniques for dealing with network latency?

A

Caching, pre-fetching, and data replication are all techniques that can help reduce the impact of network latency in a distributed system.

45
Q

What is a node failure?

A

A node failure is when a component or part of a distributed system stops working, either due to hardware or software issues.

46
Q

What are some techniques for dealing with node failures?

A

Redundancy, replication, and failure detection are all techniques that can help maintain system availability and prevent data loss in the event of a node failure.

47
Q

What is consistency in a distributed system?

A

Consistency is the property of a distributed system where all nodes see the same data at the same time, regardless of which node they access.

48
Q

What is eventual consistency?

A

Eventual consistency is a weaker form of consistency where all nodes eventually see the same data, but there may be a delay or inconsistency in the meantime.

49
Q

What is consensus?

A

Consensus is the process of agreeing on a value or decision among a group of nodes in a distributed system.

50
Q

What is a consensus algorithm?

A

A consensus algorithm is a set of rules or procedures that nodes in a distributed system use to agree on a value or decision.

51
Q

What is the CAP theorem?

A

The CAP theorem is a principle that states that in a distributed system, it is impossible to simultaneously achieve all of the following: consistency, availability, and partition tolerance. Therefore, trade-offs must be made when designing distributed systems.

52
Q

What is batch processing?

A

Batch processing is the practice of running jobs over a large, bounded set of accumulated input all at once and producing the results when the job finishes, rather than processing each item in real time or interactively as it arrives.

53
Q

What is a batch job?

A

A batch job is a unit of work that is executed as part of a batch processing system, typically involving data processing or analysis.

54
Q

What is a stream processor?

A

A stream processor is a component of a stream processing system that is responsible for analyzing and processing data in real-time, as it flows through the stream.
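
To contrast the two styles, here is a toy word count written both ways. The function names are made up, and a Python generator stands in for a real stream processing framework.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch style: consume the whole bounded input, then return one result."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def stream_word_count(lines):
    """Stream style: update state per event and emit a result after each one."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        yield dict(counts)  # snapshot of the running counts so far
```

The batch version produces nothing until all input has been read, while the stream version yields an up-to-date result after every incoming line and could run over an unbounded input.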