Distributing Databases Flashcards
What are the requirements for building a distributed relational database system?
- Local autonomy: sites should operate INDEPENDENTLY, and not interfere with one another. No site should rely on another
- No centralization: no single site failure should break the system. No one central site should be crucial for everything!
- Continuous Operation: an aspiration - the system is AVAILABLE most of the time, and reliable - when it falls over, it should recover gracefully.
- Location Independence: user need not know where the data is located. Data can be moved between sites without changing functionality.
- Partition Independence: the user does not need to know how the data is partitioned; data can be moved from one partition to another. Query optimization can still benefit from partition knowledge (e.g. knowing where names M-Z are stored), but it is not necessary.
- Replication Independence: the user need not know that data is replicated; the system keeps the copies in sync transparently. (Note that distribution usually duplicates data - exactly what we were trying to avoid!)
- Execute queries close to the data: moving data around is expensive, so process it where it lives and generate smaller results before moving them (MapReduce takes advantage of this reduction by moving the processing close to the data).
- Distribution among different DBMSs should be possible: an aspiration
What does “distribution making no difference to the user” mean?
- Hardware/OS independence
- Network independence
What is another word for partitioning a database?
Fragmentation
What is “vertical partitioning”?
Breaking a table into groups of columns (like in normalization) before distributing it, to reduce duplication.
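A minimal Python sketch of the idea (table and column names are invented): each fragment keeps the primary key, so the original row can be rebuilt by joining the fragments on it.

```python
# Vertical partitioning, illustrated with plain dicts (not a real DBMS).
row = {"id": 7, "name": "Ada", "email": "ada@example.com", "bio": "..."}

# Hot, frequently-read columns go in one fragment...
frag_core = {"id": row["id"], "name": row["name"]}
# ...bulky, rarely-read columns in another.
frag_extra = {"id": row["id"], "email": row["email"], "bio": row["bio"]}

# Rebuilding the row = joining the fragments on the shared key "id".
rebuilt = {**frag_core, **frag_extra}
assert rebuilt == row
```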
Why use vertical partitioning?
When some columns are used more than others, or some are read/written to far more often than the rest.
What is “horizontal partitioning”?
Breaking a table into sets of rows before distributing it, e.g. by hashing the key or by key ranges.
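A hedged Python sketch of both approaches (the site count and the A-L/M-Z split are invented for illustration):

```python
import hashlib

NUM_SITES = 4

def hash_site(key: str) -> int:
    # Hash partitioning: a stable hash of the key picks one of NUM_SITES.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SITES

def range_site(name: str) -> int:
    # Range partitioning: e.g. names A-L on site 0, names M-Z on site 1.
    return 0 if name[:1].upper() <= "L" else 1
```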
What do partitioning methods depend on?
Foreign Keys and how the data will be JOINed later. Also the column definitions and the “catalogue”: a relation is not just the data!
Catalogue management has to be distributed as well. Any catalogue changes have to be propagated to every site to maintain the integrity of the database.
How to ensure distributed DBs have recovery control?
Transactions can be hardened with a two-phase commit (2PC) protocol: every node locks its resources for the duration of the operations and only commits when every node has confirmed it is ready. One node is assigned as the “coordinator” for the duration of the transaction. Even so, it is hard to ensure ACID properties with distributed databases.
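A minimal sketch of the protocol’s shape, assuming a hypothetical participant interface (prepare/commit/rollback):

```python
class Participant:
    """Hypothetical node taking part in a distributed transaction."""
    def __init__(self, ok: bool = True):
        self.ok = ok
    def prepare(self) -> bool:
        # Phase 1: lock resources and vote; True means "ready to commit".
        return self.ok
    def commit(self) -> None:
        print("committed")
    def rollback(self) -> None:
        print("rolled back")

def two_phase_commit(participants):
    """Run by the node acting as coordinator for this transaction."""
    # Phase 1 (voting): every node must confirm it can commit.
    if all(p.prepare() for p in participants):
        # Phase 2 (completion): everyone is happy, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the transaction on every node.
    for p in participants:
        p.rollback()
    return False

two_phase_commit([Participant(), Participant(ok=False)])  # aborts: rolls back
```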
What is Brewer’s conjecture?
Three goals (CAP) that are in tension in a distributed database.
1. Consistency: all parts of the DB should converge to a consistent state (a weaker version of ACID consistency).
2. Availability: every request should result in a response AT SOME POINT eventually, even if it’s an error.
3. Partition Tolerance: if a network fault splits the network into two subnets, the database should keep running, then recover (consistency) once the partition heals.
Brewer’s Conjecture states these 3 cannot be fully satisfied all at once. You have to prioritize some over the others.
Which to prioritize depends on the particular use case/context.
What is a key-value database?
A table with just 2 columns: a ‘key’ column (the INDEX) and a ‘value’ column.
Does a key-value database have JOINs or Foreign Keys (integrity constraints)?
NO
What really are key-value pairs in terms of data structures?
Associative arrays (dict)
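In Python terms (a toy illustration, not a real store):

```python
store = {}                               # the whole "database" is one dict
store["user:42"] = '{"name": "Ada"}'     # put
value = store.get("user:42")             # get by key (the only index)
store.pop("user:42", None)               # delete
```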
Is there an explicit connection between 2 entries in a key-value DB?
No
Is a key-value database good for sophisticated retrieval?
No
How are things retrieved in a key-value database?
Just ‘walk over’ the whole database.
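For example (toy data), any filter on values degenerates into a scan of every entry:

```python
store = {"user:1": "Ada", "user:2": "Grace", "doc:9": "Ada's notes"}

# No secondary index on values, so we must walk the whole store.
matches = {k: v for k, v in store.items() if "Ada" in v}
# {'user:1': 'Ada', 'doc:9': "Ada's notes"}
```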
Why use a key-value database?
- Already vertically fragmented
- Very easy to distribute
- Very easy to parallelize
- Each row is already completely separated from another row!
- Partitioning is ALWAYS horizontal, since the data is already vertically fragmented
How should processing happen in a key-value database?
Raw processing should happen very close to where the data is stored.
What are the stages of processing data from a key-value database?
MapReduce: a MAP stage followed by a REDUCE stage.
MapReduce - how is information first processed?
MAP: Gather information from the base table - to get something intermediate. Analogous to any JOINs and WHERE statements –> producing an intermediate table.
MapReduce - what is the second stage of processing?
REDUCE: Aggregate, sort, do something with the intermediate results. E.g. GROUP BY, SELECT, COUNT/AVG/SUM etc.
How/where is the MAP phase of Map-Reduce executed?
- Close to the data (direct access)
- Loops over all the data (or uses an index to iterate over parts of it) and transforms it (e.g. representing a word as 1) into a form useful to the reducer
- Outputs a new key-value set
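A word-count mapper is the classic illustration; this sketch assumes the input is (doc_id, text) pairs:

```python
def map_words(doc_id: str, text: str):
    # Transform the raw value into a new key-value set:
    # each word becomes a key, mapped to the placeholder count 1.
    for word in text.lower().split():
        yield (word, 1)

list(map_words("doc1", "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```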
How/where is the REDUCE phase of Map-Reduce executed?
- Doesn’t have to happen close to the data
- Carried out by reducer-workers
- Summarize data based on keys, using equivalents of SQL GROUP BY aggregates –> aggregates and summarizes based on the KEY of the new dataset (e.g. summing the 1s that words were mapped to, to output the word count for each text)
- Output is collated & sent to users
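Continuing the word-count sketch, the reducer groups the mapper’s output by key and aggregates (here, a SUM):

```python
from collections import defaultdict

def reduce_counts(mapped_pairs):
    # Equivalent of GROUP BY word, SUM(count).
    totals = defaultdict(int)
    for word, count in mapped_pairs:
        totals[word] += count
    return dict(totals)

reduce_counts([("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)])
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```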
How can map-reduce be parallelized?
We know the reduce step is going to be based on a key. Reducer-workers are assigned based on the key (e.g. a hash function decides which keys go to which worker).
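A sketch of that routing (the worker count is invented; a real system would use a stable hash rather than Python’s per-process hash()):

```python
NUM_REDUCERS = 3

def reducer_for(key: str) -> int:
    # The same key always lands on the same worker, so each worker can
    # aggregate its share of the keys independently of the others.
    return hash(key) % NUM_REDUCERS
```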
List 9 advantages of MapReduce and key-value databases.
- Easily parallelizable
- Fault tolerance –> can easily distribute processes to other nodes, easy to recover from failure of processing nodes (although there’s often a coordinator node, and some risk of centralization)
- Very scalable
- Flexible: can handle different kinds of data structures
- Cost-effective: doesn’t need expensive systems
- Very easy to distribute/distribute processing: it is designed for distributed data
- Easy to model
- Allows you to control the data that moves from one place to another: can decide where to do the processing.
- Privacy: map can run in a private space, and only data that is safe to share (e.g. anonymized) is sent to the reducers