Data Intensive Ch3 - Storage & Retrieval Flashcards

1
Q

Full-text-search-engine secret sauce

A

Term dictionary
Each term from query is looked up in reverse dict which maps term to doc ids that contain it
Key-value store of word to IDs (postings list)
Term dictionary is kept in a SSTable like structure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Bloom filter

A

In memory structure

Approximates content of key set to quickly tell if key exists in SSTable

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

SSTables

A

Sorted string table
AKA LSM-tree (Log Structured Merge Tree) + SST
Log structured as append only segments
Merge tree because of merging segments and traveling from latest to oldest segments

Old segments are immutable
Append-only sequential writes = high throughput
Each segment contains keys in sorted order

Index is a hashmap key-offset
Keys are sparse (every N kilobytes of keys. So if segment contains records for keys a, b, c, d, e, f, g and each record is 2kB and we store index. every 6kB then hash map contains offsets for a, d and g keys)

Latest segment is stored in memory only (memtable using red black tree fex)
When reading memtable is read first and if key is not found segments are checked in latest-oldest order one-by-one until hit or miss if key is not found

Merging and compaction can be run periodically (merge sort of adjacent segments)
Compaction of segments for which key is not stored?
Range queries can be done

Cons: key lookup which does not exist takes long (go over all segments in vain)

For durability in case of failure - write-ahead-log is used so data can be recovered

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

ETL

A

extract transform load

(Usually periodic) Process of getting data from different data sources into data warehouse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Star & Snowflake schemas

A

Data model optimized for analytics in data warehouse

Consists of “central” table of facts
Each record is a business relevant event in time like purchase of a product
Some columns are regular attributes like price
Other are references to “dimension” tables

Dimension table answer question who, what, when, where, how, why
Example: sold product (with its description, category etc) or buyer. special deal offer etc
Date and time can also be a dimension - this allows to add metadata about holidays etc easily

Structure - fact table in the middle and dimensions protrude outwards around like a star
Snowflake - same as star except dimensions can have subdimensions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Column oriented storage

A

OLAP system DBs with star/snowflake data model queries usually refer to small subset of columns and the rest is ignored
Number of rows is usually of high orders of magnitude
Keeping rows close together is inefficient - large rows, lots of IO read where most of the read data is discarded, lots of scanning over disk as multiple row fetches are required
Idea - keep data column-wise close together
Each column is stored in the same order

Enables column compression
Enums can be turned into bitmaps (bit equals 1 means row has a value coded by ith index)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Materialized aggregates

A

Most queries of data warehouse revolve around count, sum, avg, max functions
Most frequent ones can be pre-computed over some possible dimensions and cached into materialized view
Each data change requires refresh of the view

How well did you know this?
1
Not at all
2
3
4
5
Perfectly