Data intensive apps Flashcards
(33 cards)
Avro
Binary encoding format; started as a subproject of Hadoop.
Not self-describing: the reader must know the exact schema used by the writer in order to decode the data.
Very compact encoding, since field names and type tags are omitted from the encoded data.
Handles schema evolution gracefully: reads and writes with old and new schemas are reconciled by comparing the writer's schema with the reader's schema.
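For comparison with the Thrift and Protocol Buffers cards below, the same example Person record written as an Avro schema (Avro schemas are defined in JSON):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}

Note there are no tag numbers: the encoding is just the values in schema order, which is why the reader needs the writer's exact schema.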
MessagePack
Binary JSON-like format: schemaless and self-describing.
Not as compact as Thrift or Protocol Buffers, since field names must be included in the encoded data.
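A minimal sketch using the msgpack Python package, showing that field names travel with the data (hence the size overhead compared to schema-based formats):

import msgpack  # pip install msgpack

record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]}
encoded = msgpack.packb(record)      # compact bytes, but field names are still included
decoded = msgpack.unpackb(encoded)
assert decoded == record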
Thrift
Binary encoding format for data, developed at Facebook. Similar to Protocol Buffers.
Not self-describing: a hard schema is required for both reads and writes.
Schemas are defined in the Thrift interface definition language (IDL), like this:

struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
Protobufs
Binary message format, similar to thrift
Not self-describing: the schema must be known in order to read.
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}
Apache arrow
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication…
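A minimal sketch using the pyarrow Python library, building a columnar in-memory table (column names are illustrative):

import pyarrow as pa

table = pa.table({"userName": ["Martin", "Ada"], "favoriteNumber": [1337, 42]})
print(table.column("favoriteNumber"))  # values stored contiguously, column by column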
Apache pig
A high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.[1] Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.[2] Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby, or Groovy[3] and then call directly from the language.
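The canonical word-count example in Pig Latin (file paths are illustrative):

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'output';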
ASN.1
Super old
Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
It is encoded as DER (Distinguished Encoding Rules), e.g., in the X.509 certificates used for SSL/TLS.
Binary schema-based formats over JSON, XML, CSV
They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
Keeping a database of schemas allows you to check forward and backward compatibility of schema changes before anything is deployed.
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
Three common ways data flows between processes
Via databases (see “Dataflow Through Databases”)
Via service calls (see “Dataflow Through Services: REST and RPC”)
Via asynchronous message passing (see “Message-Passing Dataflow”)
Web service
When HTTP is used as the underlying protocol for talking to the service, it is called a web service.
This is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:
A client application running on a user’s device (e.g., a native app on a mobile device, or JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.
One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware.)
One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations' backend systems. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.
Finagle
Thrift-based RPC framework from Twitter that uses futures (also known as promises) for asynchronous calls.
GRPC
RPC framework from Google that uses Protocol Buffers as its interface definition language and wire encoding.
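A hypothetical service definition in the Protocol Buffers IDL, reusing the Person message from the Protobufs card (the service and request names are illustrative):

service PersonService {
  rpc GetPerson (PersonRequest) returns (Person);
}

message PersonRequest {
  required string user_name = 1;
}

gRPC generates client stubs and server skeletons from such a definition, so a remote call looks like a local method call.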
Asynchronous message-passing systems
Somewhere between RPC and databases: like RPC, a client's message is delivered to another process with low latency; like databases, the message is not sent via a direct network connection but goes through an intermediary called a message broker (or message queue).
Using a message broker has several advantages compared to direct RPC:
It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
It allows one message to be sent to several recipients.
It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
Message brokers
In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular.
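A minimal sketch of broker-decoupled messaging using RabbitMQ through the pika Python client (queue name and host are illustrative):

import pika

# Producer: publishes without knowing who (or whether anyone) will consume.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="events", durable=True)  # broker buffers if consumers are down
ch.basic_publish(exchange="", routing_key="events", body=b"user signed up")
conn.close()

# Consumer: runs as a separate process; unacknowledged messages are redelivered.
def handle(ch, method, properties, body):
    print(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

conn2 = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch2 = conn2.channel()
ch2.queue_declare(queue="events", durable=True)
ch2.basic_consume(queue="events", on_message_callback=handle)
ch2.start_consuming()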
shared-disk architecture
Uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, connected via a fast network. This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach.
shared-nothing architectures [3] (sometimes called horizontal scaling or scaling out)
Each machine running the database software is called a node. Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.
Replication Versus Partitioning
There are two common ways data is distributed across multiple nodes:
Replication: keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. (Discussed in Chapter 5.)
Partitioning: splitting a big database into smaller subsets called partitions, so that different partitions can be assigned to different nodes (also known as sharding). (Discussed in Chapter 6.)
leader-based replication (also known as active/passive or master–slave replication)
1. One of the replicas is designated the leader (also known as master or primary). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
2. The other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys). Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader.
3. When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client's point of view).
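A toy sketch (hypothetical classes, with in-memory dicts standing in for storage) of the leader/follower write path described above:

class Follower:
    def __init__(self):
        self.store = {}

    def apply(self, entry):
        key, value = entry
        self.store[key] = value  # apply writes in the order the leader sent them

class Leader:
    def __init__(self, followers):
        self.store = {}
        self.log = []            # replication log / change stream
        self.followers = followers

    def write(self, key, value):
        self.store[key] = value          # 1. write to local storage first
        self.log.append((key, value))
        for f in self.followers:         # 2. stream the change to every follower
            f.apply((key, value))

    def read(self, key):
        return self.store.get(key)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Martin")
assert followers[0].store["user:1"] == "Martin"  # reads can go to any replica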
Statement-based replication
In statement-based replication, the leader logs every write request (statement) that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT, UPDATE, or DELETE statement is forwarded to followers. Problems:
Any statement that calls a nondeterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica (see the SQL sketch after this list).
If statements use an autoincrementing column, or if they depend on the existing data in the database, they must be executed in exactly the same order on each replica, or else they may have a different effect.
Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.
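A SQL illustration of the NOW() problem (table name hypothetical). One workaround is for the leader to replace the nondeterministic call with the concrete value it evaluated before logging the statement:

-- As executed on the leader:
INSERT INTO events (id, created_at) VALUES (1, NOW());

-- If shipped verbatim, each replica evaluates NOW() at a different time.
-- Safer to replicate with the value the leader actually used:
INSERT INTO events (id, created_at) VALUES (1, '2024-05-01 12:00:00');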
Write-ahead log (WAL) shipping
In write-ahead log shipping, the log is an append-only sequence of bytes containing all writes to the database. The exact same log can be used to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
This method of replication is used in PostgreSQL and Oracle. Its main disadvantage is that the log describes the data at a very low level (which bytes were changed in which disk blocks), coupling replication closely to the storage engine.
Logical (row-based) log replication
A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:
For an inserted row, the log contains the new values of all columns.
For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.
For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).
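An illustrative (hypothetical) logical log entry for an updated row, identified by primary key and carrying the new column values:

{"table": "person", "op": "update", "key": {"id": 42}, "new_values": {"favorite_number": 7}}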
Trigger-based replication
A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system. The trigger has the opportunity to log this change into a separate table, from which it can be read by an external process. That external process can then apply any necessary application logic and replicate the data change to another system.
Databus for Oracle [20] and Bucardo for Postgres [21] work like this, for example.
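A minimal PostgreSQL sketch of the pattern (table, function, and trigger names are hypothetical): the trigger copies each change into a side table that an external replication process can poll:

CREATE TABLE changelog (seq serial PRIMARY KEY, tbl text, op text, row_id int);

CREATE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO changelog (tbl, op, row_id) VALUES (TG_TABLE_NAME, TG_OP, NEW.id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+ syntax (older versions use EXECUTE PROCEDURE)
CREATE TRIGGER person_changes
AFTER INSERT OR UPDATE ON person
FOR EACH ROW EXECUTE FUNCTION log_change();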
read-after-write consistency,
This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users' updates may not be visible until some later time.
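One common technique is to route reads of data the user may have modified to the leader, and everything else to followers. A toy sketch reusing the hypothetical Leader/Follower classes from the leader-based replication card:

import random

def read_profile(profile_id, requester_id, leader, followers):
    # Read-your-writes: the user's own profile may have just been changed by
    # them, so serve it from the leader; other profiles can come from any follower.
    if profile_id == requester_id:
        return leader.read(f"user:{profile_id}")
    return random.choice(followers).read(f"user:{profile_id}")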
Conflict-free replicated datatypes (CRDTs)
A family of data structures for sets, maps (dictionaries), ordered lists, counters, etc. that can be concurrently edited by multiple users, and which automatically resolve conflicts in sensible ways. Some CRDTs have been implemented in Riak 2.0.
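A minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter): each replica increments only its own slot, and merging takes the per-replica maximum, which is commutative, associative, and idempotent, so replicas converge regardless of merge order:

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # element-wise max resolves concurrent edits deterministically
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("a"), GCounter("b")
a.increment(); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3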