Data Engineering & Databases Flashcards

1
Q

What does a Database do?

A

Primarily 2 things: stores and retrieves data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the 2 main philosophies for primary data storage

A
  1. Log-structured

2. Update in place

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Log

A

An append only sequence of records

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Index

A

Additional metadata to help locate specific data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

SSTable

A

Sorted String Table

- A file in which data is sorted by key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

LSM-tree

A

Log-structured Merge Tree

Fundamental idea

  • keep a cascade of SSTables that are merged in the background
  • turns random writes into sequential writes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is the DB write path when using SSTable / LSM-tree

A
  • Write to a memtable - in-memory balanced tree
  • Also write to unsorted log for crash recovery
  • When the memtable gets big enough (several MB) disk as SSTable file
  • While SSTable is written, writes continue on new memtable
  • Background merge & compaction to combine SSTable files
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the DB read path for a DB implemented with SSTables / LSM-trees

A
  • Look for key in memtable
  • Look in most recent disk segment
  • Look in next older disk segment
  • (Sped up by using Bloom filters to determine if it doesn’t exists, which are the most expensive reads)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

LSM-tree Compaction & Merge Strategies

A

Size-tiered - newer, smaller SSTables merged into older, larger SSTables
(HBase, supported by Cassandra, Scylla)

Levels - key range split into smaller SSTables, older data is moved into separate levels
(Used by LevelDB, RocksDB, supported by Cassandra, Scylla)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

B-Tree

A

Standard index implementation in almost all relational DBs (and many non-relational)

  • Break DB into fixed-size pages (trad 4KB), and read/write one page at a time
  • Each page contains keys and references to child pages (using on-disk “pointers”)
  • 1st page is the root (search always begins here)
  • Each child has a continuous range of keys, keys between the references indicate boundaries of those ranges
  • Leaf pages contain individual keys and values inline (or in references pages)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Branching factor

A

The number of references to child pages in one B-tree page

  • Typical factor is several hundred
  • Very large DBs can be stored with few levels (256 TB stored in 4 levels of 4KB w branching factor of 500)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

WAL

A

Write-ahead Log
- Append-only file where every B -tree modification is written before its written to the tree (in case of failure during hardware write, which could cause an orphan page with no parent)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Writes/Reads for LSM-tree vs B-tree

A

LSM-tree typically faster for writes

B-tree typically fasters for reads

How well did you know this?
1
Not at all
2
3
4
5
Perfectly