The NoSQL Ecosystem Flashcards Preview

The Architecture of Open Source Applications: Elegance, Evolution, and a Few Fearless Hacks > The NoSQL Ecosystem > Flashcards

Flashcards in The NoSQL Ecosystem Deck (87):
1

NoSQL: definition, according to the community

Not Only a SQL interface, referring to providing an alternative rather than a wholesale replacement for SQL

2

SQL's expressiveness makes it challenging to ___

reason about the cost of each query, and thus the cost of a workload

3

Application developers may find using relational data models to be challenging because ___

it may not be perfect for modeling every kind of data (e.g., lists, queues, sets)

4

If relational data grows past the capacity of one server, then ___

the tables in the database will have to be partitioned across computers, leading to denormalization

5

Complex query logic is typically left to the application, resulting in ___

a data store with more predictable query performance because of lack of variability in queries

6

Two characteristics of Google's BigTable

hierarchical range-based partitioning scheme

strict consistency

7

Two characteristics of Amazon's Dynamo

Maps keys to application-specific blobs of data

Loose consistency makes the partitioning model resilient to failure

8

Considerations regarding NoSQL systems (SPACTSDD)

Scalability
Partitioning
Analytical workloads
Consistency
Transactional semantics
Single-server performance
Data and query model
Durability

9

The simplest form of a NoSQL store is a ___

key-value store

10

key-value store

each key is mapped to a value containing arbitrary data

11

store popularized by Redis

key-data structure store

12

key-data structure store

assigns each value a type (e.g., integer, string, list, set, sorted set)

13

store common to CouchDB, MongoDB, Riak

key-document store

14

key-document store

map a key to some document that contains structured information in a JSON or JSON-like format

15

key-document stores grant a lot of freedom in document modeling, however ___

application-based query logic can become exceedingly complex

16

store common to HBase, Cassandra

BigTable column family store

17

column family store

* complex key identifies a row containing data stored in one or more Column Families
* each row can contain multiple columns within a CF
* values within each column are timestamped

18

store common to HyperGraphDB, Neo4J

graph store

19

exception to key-only lookup: MongoDB

allows indexing of data based on any number of properties and has a relatively high-level language for specifying which data to retrieve

20

exception to key-only lookup: BigTable-based systems

support scanners to iterate over a column family and select particular items by a filter on a column

21

exception to key-only lookup: CouchDB

allows creation of different views of the data and running MapReduce tasks across the table to facilitate more complex lookups and updates

22

ACID

Atomicity
Consistency
Isolation
Durability

23

Most NoSQL systems choose performance over ___

full ACID guarantees

24

Redis is an exception to NoSQL's no-transaction trend, in that ___

it provides a MULTI command to combine multiple operations atomically and a WATCH command to allow isolation

25

benefits of schema-free storage

supports less structured data requirements and requires less performance expense when modifying schemas on-the-fly

26

after a few iterations of a project relying on sloppy-schema NoSQL systems, ___

data and schema versioning is usually present in application-level code

27

single-server durability

ensures that any data modification will survive a server restart or power loss

28

the OS may not immediately write data to an on-disk file, instead ___

buffering the write to group several writes together in a single operation

29

typical hard drives can perform ___ random accesses (seeks) per second

100 - 200

30

typical hard drives are limited to ___ of sequential writes

30 - 100 MB/s

31

ensuring efficient single-server durability means ___

limiting the number of random writes the system incurs and increasing the number of sequential writes per hard drive
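
A rough worked example of why this matters, using the seek and bandwidth figures from the two cards above. The 1 KB record size is an assumption chosen only for illustration, and the function names are hypothetical.

```python
# Figures taken from the cards above; RECORD_BYTES is an assumed record size.
SEEKS_PER_SEC = (100, 200)      # random accesses per second
SEQ_MB_PER_SEC = (30, 100)      # sequential write bandwidth in MB/s
RECORD_BYTES = 1024             # hypothetical size of one record


def random_write_mb_per_sec(seeks_per_sec: int, record_bytes: int) -> float:
    """Throughput in MB/s when every record write costs one seek."""
    return seeks_per_sec * record_bytes / 1_000_000


def sequential_records_per_sec(mb_per_sec: int, record_bytes: int) -> float:
    """Records per second when records are appended back to back."""
    return mb_per_sec * 1_000_000 / record_bytes


slow_random = random_write_mb_per_sec(SEEKS_PER_SEC[0], RECORD_BYTES)
slow_sequential = sequential_records_per_sec(SEQ_MB_PER_SEC[0], RECORD_BYTES)
```

Under these assumptions, even the slowest sequential case writes far more records per second than the 100 to 200 that a seek-per-record pattern allows.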

32

Ideally, one wants to minimize ___ and maximize ___, all while ___

the number of writes between fsync calls

the number of those writes that are sequential

never telling the user their data has been written until the write has been fsynced

33

Techniques for improving performance of single-server durability guarantees

Control fsync frequency
Increase sequential writes by logging
Increase throughput by grouping writes

34

Memcached offers ___ in exchange for ___

no on-disk durability

extremely fast in-memory operations

35

Redis offers several options for ___

when to call fsync

36

To reduce random writes, some systems ___

append update operations to a sequentially written file (a log)

37

log-structured merge tree / log-structured hash table

combining logs and lookup data structures into one
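
A minimal sketch of the log-structured hash table idea, assuming a single append-only file and an in-memory dictionary as the lookup structure. The class name `LogHashTable` and its methods are hypothetical and not taken from any particular system.

```python
import os
import tempfile


class LogHashTable:
    """Append-only log plus an in-memory index of key -> (offset, length)."""

    def __init__(self, path: str):
        self.index = {}
        self.log = open(path, "a+b")  # append mode: every write lands at the end

    def put(self, key: str, value: bytes) -> None:
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)          # sequential write, no in-place update
        self.log.flush()
        os.fsync(self.log.fileno())    # make the append durable
        self.index[key] = (offset, len(value))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

    def close(self) -> None:
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "data.log")
store = LogHashTable(path)
store.put("a", b"one")
store.put("a", b"two")     # the old value stays in the log as garbage
store.put("b", b"three")
```

The overwritten value `b"one"` still occupies space in the file, which is why such designs need periodic log compaction.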

38

Techniques such as log-structured merge trees / hash tables and modified B+ trees ___

result in improved write throughput, but require a periodic log compaction

39

group commit

grouping multiple concurrent updates within a short window into a single fsync call

40

benefit of group commit

increase in throughput, as multiple log appends can happen in a single fsync

41

drawback of group commit

higher latency per update, as users must wait on several concurrent updates for acknowledgement of their own update
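
The group commit trade-off can be sketched as follows. This is a simplification: it batches by count rather than by a short time window, and `GroupCommitLog` with its `batch_size` parameter is a hypothetical name introduced only for illustration.

```python
import os
import tempfile


class GroupCommitLog:
    """Acknowledge updates only after a shared fsync covers them."""

    def __init__(self, path: str, batch_size: int = 3):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.batch_size = batch_size
        self.pending = []      # written to the log but not yet fsynced
        self.fsync_calls = 0

    def append(self, record: bytes) -> list:
        """Write a record; return the records this call made durable."""
        os.write(self.fd, record + b"\n")
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []              # caller must keep waiting for an acknowledgement

    def flush(self) -> list:
        os.fsync(self.fd)      # one fsync covers the whole batch
        self.fsync_calls += 1
        acked, self.pending = self.pending, []
        return acked


log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "wal.log"), batch_size=3)
acks = [log.append(b"update-%d" % i) for i in range(6)]
```

Six updates cost two fsync calls here, but the first update in each batch is not acknowledged until the third arrives, which is the latency cost the card describes.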

42

Multi-server durability varies between systems as either ___ or ___

traditional primary-replica structure

replication where multiple servers store copies of the data

43

scaling up

adding more RAM and disks to handle load on one machine

44

scaling out

replicating data and spreading requests across multiple machines

45

the ideal horizontal scalability goal is

linear scalability

46

linear scalability

doubling the number of machines in your storage system doubles the query capacity of the system

47

sharding

the act of splitting your read and write workloads across multiple machines to scale out your storage system

48

sharding your data means that no one machine ___ but also is unable to ___

has to handle the write workload on the entire dataset

answer queries about the entire dataset

49

sharding adds ___

system complexity

50

two ways to scale without sharding

read replicas

caching

51

read replica structure

make copies of the data on multiple machines, while write requests go to a primary node

52

Generally, the less stringent the demands for freshness of content, the more you can ___

use read replicas to improve read-only query performance

53

___ and ___ allow you to scale up your read-heavy workloads

read replicas

caching

54

To add memory to Memcached's cache pool:

just add another Memcached host

55

sharding through coordinators: Lounge and BigCouch

a coordinator distributes requests to individual CouchDB instances based on the key of the requested doc

56

sharding through coordinators: Gizzard

takes standalone data stores and arranges them in trees of any depth to partition keys by key range

57

NoSQL systems built around Dynamo's consistent hashing technique

Voldemort
Riak
Cassandra

58

consistent hashing

a kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots
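
A minimal sketch of a consistent hash ring, assuming MD5 as the hash function and 100 virtual points per node. Both choices, and the names `HashRing`, `add`, and `lookup`, are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self.ring = []                      # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (_hash(key),))
        return self.ring[idx % len(self.ring)][1]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to the new node; keys never shuffle between the existing nodes, which is what keeps remapping to roughly K/n keys.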

59

range partitioning differs from consistent hashing in that ___

two keys that are next to each other in the keys' sort order are likely to appear in the same partition
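
A small sketch of range-based routing, assuming a fixed list of split points. The split points, server names, and the function `server_for` are hypothetical.

```python
import bisect

# Hypothetical split points: partition i holds keys in [SPLITS[i-1], SPLITS[i]).
SPLITS = ["g", "n", "t"]
SERVERS = ["server-0", "server-1", "server-2", "server-3"]


def server_for(key: str) -> str:
    """Route a key to the server that owns its range."""
    return SERVERS[bisect.bisect_right(SPLITS, key)]
```

Keys that sort next to each other, such as "apple" and "apricot", route to the same server, so a range scan touches few machines.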

60

range partitioning allows active management of load by ___

having a load manager that can reduce the size of a range on an overloaded server

61

tablets in BigTable

stores a range of row keys and values within a column family, maintaining all necessary logs and data structures to answer queries

62

as BigTable's tablets change in size, ___

two small tablets may merge or a big tablet may split in two

63

a primary server in BigTable manages ___

tablet size, load, and availability

64

to recognize and handle machine failures, the BigTable paper describes the use of Chubby, which is ___

a distributed locking system for managing server membership and liveness

65

ZooKeeper is used in several Hadoop-based projects to ___

manage secondary leader servers and tablet server reassignment

66

BigTable employs a hierarchical approach to range-partitioning by ___

maintaining tablet assignment in a metadata table, which is also sharded into tablets

67

HBase uses BigTable's hierarchical approach to range-partitioning by ___

using HDFS to handle data storage, replication, and consistency, leaving the rest to servers

68

MongoDB handles range-partitioning by ___

using config nodes to specify key ranges, staying in sync with a two-phase commit protocol

69

Cassandra allows fast range scans over data by ___

preserving order in its partitioning, mapping data to the server directly managing its key range

70

Gizzard's routing servers ___

form routing hierarchies of any depth, assigning ranges of keys to servers below them in the hierarchy

71

range partitioning is the obvious choice when ___

one will frequently be performing range scans over the keys of the data, avoiding random node jumps over the network

72

range partitioning requires the up-front cost of ___

maintaining routing and configuration nodes

73

when executed well, range partitioning data can be load-balanced ___

in small chunks which can be re-assigned in high-load situations

74

In practice, maintaining replicas is hard, and the following will happen:

replicas will crash and get out of sync
replicas will crash and never come back
networks will partition two sets of replicas
messages between machines will get delayed or lost

75

two major approaches to data consistency in NoSQL ecosystem

strong consistency
eventual consistency

76

systems that promote strong consistency ___

ensure that the replicas of a data item will always be able to come to consensus on the value of a key

77

with N replicas of each item, R replicas contacted per read, and W replicas contacted per write, the minimum setting that ensures strong consistency while allowing temporary replica disagreements is

R + W = N + 1
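
The reasoning behind this condition is that any read set must share at least one replica with any write set. A brute-force check of that overlap, with the hypothetical helper `quorums_always_overlap`, might look like this:

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible read set of size r meets every write set of size w."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

For N = 3, the choice R = 2, W = 2 always overlaps, while R = 1, W = 2 can read from the one replica the write missed.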

78

in HDFS, a write cannot succeed until ___, while a read ___

it has been replicated to all N servers (W=N)

will be satisfied by a single replica (R=1)

79

Dynamo-based systems use a type of versioning called

vector clocks
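
A minimal sketch of how vector clocks order versions, assuming a clock is a dictionary from node name to counter. The functions `increment` and `compare` are hypothetical names, not the API of any listed system.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to clock b."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # neither descends from the other: a conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


v1 = increment({}, "A")       # written through node A
v2 = increment(v1, "B")       # updated through node B
v3 = increment(v1, "C")       # updated independently through node C
```

Here `v2` descends from `v1`, but `v2` and `v3` are concurrent, which is the case a store must resolve or hand back to the client.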

80

Voldemort handles conflicts by ___

returning multiple copies of the key to the requesting client application

81

Cassandra resolves conflicts by ___

using the most recently timestamped version of the data

82

Voldemort's and Cassandra's approaches to conflict resolution are both present in ___

Riak

83

CouchDB provides a hybrid of Voldemort's and Cassandra's conflict resolution:

it identifies a conflict and allows users to query for conflicted keys for manual repair, but deterministically picks a version to return to users until conflicts are repaired

84

read repair is handled in Dynamo-based systems by ___

repairing out-of-sync replicas of the data in the background while returning the non-conflicting data to the requestor

85

hinted handoff

assigning a node to temporarily take over an unavailable node's write workload, forwarding all those writes when the node is available again
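
A toy sketch of hinted handoff, assuming in-memory nodes and a single stand-in. The `Node` class and the `write` and `hand_off` functions are hypothetical and leave out networking, replication, and timeouts.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []   # (intended node name, key, value) held for others


def write(key, value, target: Node, stand_in: Node) -> None:
    """Write to the target, or leave a hint with the stand-in if it is down."""
    if target.up:
        target.data[key] = value
    else:
        stand_in.hints.append((target.name, key, value))


def hand_off(stand_in: Node, recovered: Node) -> None:
    """Forward hinted writes once the intended node is available again."""
    remaining = []
    for name, key, value in stand_in.hints:
        if name == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((name, key, value))
    stand_in.hints = remaining
```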

86

Cassandra and Riak synchronize replicas with one another using ___

Merkle trees
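
A minimal sketch of how a Merkle tree lets two replicas find differing items without comparing every value. It assumes both replicas hold the same sorted key list and uses SHA-256; `build_tree` and `differing_leaves` are hypothetical names.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(items):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(f"{key}={value}".encode()) for key, value in items]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]   # pad odd levels with the last hash
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Descend from the root, following only subtrees whose hashes differ."""
    top = len(a) - 1
    suspects = [0] if a[top][0] != b[top][0] else []
    for depth in range(top - 1, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if child < len(a[depth]) and a[depth][child] != b[depth][child]
        ]
    return suspects


replica_a = [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]
replica_b = [("k1", "a"), ("k2", "b"), ("k3", "stale"), ("k4", "d")]
out_of_sync = differing_leaves(build_tree(replica_a), build_tree(replica_b))
```

If the root hashes match, the replicas agree and nothing else is exchanged; otherwise only the mismatching branches are followed down to the leaves.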

87

gossip

periodically (~1s) a node will communicate with a random node to exchange knowledge on other nodes' health
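
A small simulation of gossip, assuming each node keeps a table of heartbeat counters for the nodes it has heard about and merges tables with one random peer per round. The heartbeat representation, the fixed random seed, and the function names are all assumptions for illustration.

```python
import random


def gossip_round(views: dict, rng: random.Random) -> None:
    """Each node exchanges its heartbeat table with one random peer."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = dict(views[peer])
        for name, beat in views[node].items():
            merged[name] = max(beat, merged.get(name, 0))  # keep freshest info
        views[node] = dict(merged)
        views[peer] = dict(merged)


def rounds_until_converged(names, seed: int = 0, max_rounds: int = 50):
    """Count rounds until every node has heard about every other node."""
    rng = random.Random(seed)
    views = {name: {name: 1} for name in names}   # each node knows only itself
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(len(view) == len(names) for view in views.values()):
            return round_number
    return None


rounds = rounds_until_converged(["a", "b", "c", "d", "e"])
```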