Distributing Databases Flashcards
What are the requirements for building a distributed relational database system?
- Local autonomy: sites should operate INDEPENDENTLY, and not interfere with one another. No site should rely on another
- No centralization: no single site failure should break the system. No one central site should be crucial for everything!
- Continuous Operation: an aspiration - the system is AVAILABLE most of the time, and reliable - when it falls over, it should recover gracefully.
- Location Independence: user need not know where the data is located. Data can be moved between sites without changing functionality.
- Partition Independence: the user does not need to know how the data is partitioned; data can be moved from one partition to another. Query optimization can still benefit from partition knowledge (e.g. knowing where names M-Z are stored), but it is not necessary.
- Replication Independence: the user need not know that data is replicated; the system keeps the copies in sync transparently. (Note that distribution usually duplicates data - exactly what we were trying to avoid!)
- Execute queries close to the data: moving data around is expensive, so process it where it lives and generate smaller results before moving them (MapReduce takes advantage of this reduction by moving the processing close to the data).
- Distribution among different DBMSs should be possible: an aspiration
What does “distribution making no difference to the user” mean?
- Hardware/OS independence
- Network independence
What is another word for partitioning a database?
Fragmentation
What is “vertical partitioning”?
Breaking a table into groups of columns (like in normalization) before distributing it, to reduce duplication.
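A minimal Python sketch of the idea (table and column names are invented): each fragment keeps the primary key, so the original row can be rebuilt by joining the fragments on it.

```python
# Vertical partitioning, illustrated with plain dicts (not a real DBMS).
row = {"id": 7, "name": "Ada", "email": "ada@example.com", "bio": "..."}

# Hot, frequently-read columns go in one fragment...
frag_core = {"id": row["id"], "name": row["name"]}
# ...bulky, rarely-read columns in another.
frag_extra = {"id": row["id"], "email": row["email"], "bio": row["bio"]}

# Rebuilding the row = joining the fragments on the shared key "id".
rebuilt = {**frag_core, **frag_extra}
assert rebuilt == row
```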
Why use vertical partitioning?
When some columns are used more than others, or some are read/written to far more often than the rest.
What is “horizontal partitioning”?
Breaking a table into sets of rows before distributing it, e.g. by hashing the key or by key ranges.
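A hedged Python sketch of both approaches (the site count and the A-L/M-Z split are invented for illustration):

```python
import hashlib

NUM_SITES = 4

def hash_site(key: str) -> int:
    # Hash partitioning: a stable hash of the key picks one of NUM_SITES.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SITES

def range_site(name: str) -> int:
    # Range partitioning: e.g. names A-L on site 0, names M-Z on site 1.
    return 0 if name[:1].upper() <= "L" else 1
```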
What do partitioning methods depend on?
Foreign Keys and how the data will be JOINed later. Also the column definitions and the “catalogue”: a relation is not just the data!
Catalogue management has to be distributed as well. Any catalogue changes have to be propagated to every site to maintain the integrity of the database.
How to ensure distributed DBs have recovery control?
Transactions can be hardened with a two-phase commit (2PC) protocol: every node locks its resources for the duration of the operations and only commits when every node has confirmed it is ready. One node is assigned as the “coordinator” for the duration of the transaction. Even so, it is hard to ensure ACID properties with distributed databases.
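A minimal sketch of the protocol’s shape, assuming a hypothetical participant interface (prepare/commit/rollback):

```python
class Participant:
    """Hypothetical node taking part in a distributed transaction."""
    def __init__(self, ok: bool = True):
        self.ok = ok
    def prepare(self) -> bool:
        # Phase 1: lock resources and vote; True means "ready to commit".
        return self.ok
    def commit(self) -> None:
        print("committed")
    def rollback(self) -> None:
        print("rolled back")

def two_phase_commit(participants):
    """Run by the node acting as coordinator for this transaction."""
    # Phase 1 (voting): every node must confirm it can commit.
    if all(p.prepare() for p in participants):
        # Phase 2 (completion): everyone is happy, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the transaction on every node.
    for p in participants:
        p.rollback()
    return False

two_phase_commit([Participant(), Participant(ok=False)])  # aborts: rolls back
```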
What is Brewer’s conjecture?
Three goals (CAP) that are in tension in a distributed database.
1. Consistency: all parts of the DB should converge to a consistent state (a weaker version of ACID consistency).
2. Availability: every request should result in a response AT SOME POINT eventually, even if it’s an error.
3. Partition Tolerance: if a network fault splits the network into two subnets, the database should keep running, then recover (consistency) once the partition heals.
Brewer’s Conjecture states these 3 cannot be fully satisfied all at once. You have to prioritize some over the others.
Which to prioritize depends on the particular use case/context.
What is a key-value database?
A table with just 2 columns: a ‘key’ column (the INDEX) and a ‘value’ column.
Does a key-value database have JOINs or Foreign Keys (integrity constraints)?
NO
What really are key-value pairs in terms of data structures?
Associative arrays (dict)
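In Python terms (a toy illustration, not a real store):

```python
store = {}                               # the whole "database" is one dict
store["user:42"] = '{"name": "Ada"}'     # put
value = store.get("user:42")             # get by key (the only index)
store.pop("user:42", None)               # delete
```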
Is there an explicit connection between 2 entries in a key-value DB?
No
Is a key-value database good for sophisticated retrieval?
No
How are things retrieved in a key-value database?
Just ‘walk over’ the whole database.
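For example (toy data), any filter on values degenerates into a scan of every entry:

```python
store = {"user:1": "Ada", "user:2": "Grace", "doc:9": "Ada's notes"}

# No secondary index on values, so we must walk the whole store.
matches = {k: v for k, v in store.items() if "Ada" in v}
# {'user:1': 'Ada', 'doc:9': "Ada's notes"}
```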
Why use a key-value database?
- Already vertically fragmented
- Very easy to distribute
- Very easy to parallelize
- Each row is already completely separated from another row!
- Partitioning is ALWAYS horizontal, since the data is already vertically fragmented
How should processing happen in a key-value database?
Raw processing should happen very close to where the data is stored.
What are the stages of processing data from a key-value database?
MapReduce: a MAP stage followed by a REDUCE stage.
MapReduce - how is information first processed?
MAP: Gather information from the base table - to get something intermediate. Analogous to any JOINs and WHERE statements –> producing an intermediate table.
MapReduce - what is the second stage of processing?
REDUCE: Aggregate, sort, do something with the intermediate results. E.g. GROUP BY, SELECT, COUNT/AVG/SUM etc.
How/where is the MAP phase of Map-Reduce executed?
- Close to the data (direct access)
- Loops over all the data (or uses an index to iterate over parts of it) and transforms it (e.g. representing a word as 1) into a form useful to the reducer
- Outputs a new key-value set
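A word-count mapper is the classic illustration; this sketch assumes the input is (doc_id, text) pairs:

```python
def map_words(doc_id: str, text: str):
    # Transform the raw value into a new key-value set:
    # each word becomes a key, mapped to the placeholder count 1.
    for word in text.lower().split():
        yield (word, 1)

list(map_words("doc1", "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```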
How/where is the REDUCE phase of Map-Reduce executed?
- Doesn’t have to happen close to the data
- Carried out by reducer-workers
- Summarize data based on keys, using equivalents of SQL GROUP BY aggregates –> aggregates and summarizes based on the KEY of the new dataset (e.g. summing the 1s that words were mapped to, to output the word count for each text)
- Output is collated & sent to users
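Continuing the word-count sketch, the reducer groups the mapper’s output by key and aggregates (here, a SUM):

```python
from collections import defaultdict

def reduce_counts(mapped_pairs):
    # Equivalent of GROUP BY word, SUM(count).
    totals = defaultdict(int)
    for word, count in mapped_pairs:
        totals[word] += count
    return dict(totals)

reduce_counts([("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)])
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```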
How can map-reduce be parallelized?
We know the reduce step is going to be based on a key. Reducer-workers are assigned based on the key (e.g. a hash function decides which keys go to which worker).
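A sketch of that routing (the worker count is invented; a real system would use a stable hash rather than Python’s per-process hash()):

```python
NUM_REDUCERS = 3

def reducer_for(key: str) -> int:
    # The same key always lands on the same worker, so each worker can
    # aggregate its share of the keys independently of the others.
    return hash(key) % NUM_REDUCERS
```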
List 9 advantages of MapReduce and key-value databases.
- Easily parallelizable
- Fault tolerance –> can easily distribute processes to other nodes, easy to recover from failure of processing nodes (although there’s often a coordinator node, and some risk of centralization)
- Very scalable
- Flexible: can handle different kinds of data structures
- Cost-effective: doesn’t need expensive systems
- Very easy to distribute/distribute processing: it is designed for distributed data
- Easy to model
- Allows you to control the data that moves from one place to another: can decide where to do the processing.
- Privacy: map can run in a private space, and only data that is safe to share (e.g. anonymized) is sent to the reducers