Data intensive apps Flashcards
(33 cards)
Avro
Binary encoding format; started as a subproject of Hadoop.
Not self-describing: the reader must know the exact schema used by the writer in order to decode the data.
Very compact encoding, since field names and type tags are omitted from the encoded data.
Handles schema evolution gracefully: reads and writes with old and new schemas are reconciled by comparing the writer's schema with the reader's schema.
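For comparison with the Thrift and Protocol Buffers cards below, the same example Person record written as an Avro schema (Avro schemas are defined in JSON):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}

Note there are no tag numbers: the encoding is just the values in schema order, which is why the reader needs the writer's exact schema.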
MessagePack
Binary JSON-like format: schemaless and self-describing.
Not as compact as Thrift or Protocol Buffers, since field names must be included in the encoded data.
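A minimal sketch using the msgpack Python package, showing that field names travel with the data (hence the size overhead compared to schema-based formats):

import msgpack  # pip install msgpack

record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]}
encoded = msgpack.packb(record)      # compact bytes, but field names are still included
decoded = msgpack.unpackb(encoded)
assert decoded == record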
Thrift
Binary encoding format for data, developed at Facebook. Similar to Protocol Buffers.
Not self-describing: a hard schema is required for both reads and writes.
Schemas are defined in the Thrift interface definition language (IDL), like this:

struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
Protobufs
Binary message format, similar to thrift
Not self-describing: the schema must be known in order to read.
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}
Apache arrow
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication…
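A minimal sketch using the pyarrow Python library, building a columnar in-memory table (column names are illustrative):

import pyarrow as pa

table = pa.table({"userName": ["Martin", "Ada"], "favoriteNumber": [1337, 42]})
print(table.column("favoriteNumber"))  # values stored contiguously, column by column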
Apache pig
A high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.[1] Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.[2] Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby, or Groovy[3] and then call directly from the language.
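The canonical word-count example in Pig Latin (file paths are illustrative):

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'output';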
ASN.1
Super old
Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
It is encoded as DER (Distinguished Encoding Rules), e.g., in the X.509 certificates used for SSL/TLS.
Binary schema-based formats over JSON, XML, CSV
They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
Keeping a database of schemas allows you to check forward and backward compatibility of schema changes before anything is deployed.
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
Three common ways data flows between processes
Via databases (see “Dataflow Through Databases”)
Via service calls (see “Dataflow Through Services: REST and RPC”)
Via asynchronous message passing (see “Message-Passing Dataflow”)
Web service
When HTTP is used as the underlying protocol for talking to the service, it is called a web service.
This is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:
A client application running on a user’s device (e.g., a native app on a mobile device, or JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.
One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware.)
One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations' backend systems. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.
Finagle
Thrift-based RPC framework from Twitter that uses futures (also known as promises) for asynchronous calls.
GRPC
RPC framework from Google that uses Protocol Buffers as its interface definition language and wire encoding.
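A hypothetical service definition in the Protocol Buffers IDL, reusing the Person message from the Protobufs card (the service and request names are illustrative):

service PersonService {
  rpc GetPerson (PersonRequest) returns (Person);
}

message PersonRequest {
  required string user_name = 1;
}

gRPC generates client stubs and server skeletons from such a definition, so a remote call looks like a local method call.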
Asynchronous message-passing systems
Somewhere between RPC and databases: like RPC, a client's message is delivered to another process with low latency; like databases, the message is not sent via a direct network connection but goes through an intermediary called a message broker (or message queue).
Using a message broker has several advantages compared to direct RPC:
It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
It allows one message to be sent to several recipients.
It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
Message brokers
In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular.
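A minimal sketch of broker-decoupled messaging using RabbitMQ through the pika Python client (queue name and host are illustrative):

import pika

# Producer: publishes without knowing who (or whether anyone) will consume.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="events", durable=True)  # broker buffers if consumers are down
ch.basic_publish(exchange="", routing_key="events", body=b"user signed up")
conn.close()

# Consumer: runs as a separate process; unacknowledged messages are redelivered.
def handle(ch, method, properties, body):
    print(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

conn2 = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch2 = conn2.channel()
ch2.queue_declare(queue="events", durable=True)
ch2.basic_consume(queue="events", on_message_callback=handle)
ch2.start_consuming()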
shared-disk architecture
Uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, connected via a fast network. This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach.
shared-nothing architectures [3] (sometimes called horizontal scaling or scaling out)
Each machine running the database software is called a node. Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.
Replication Versus Partitioning
There are two common ways data is distributed across multiple nodes:
Replication: keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. (Discussed in Chapter 5.)
Partitioning: splitting a big database into smaller subsets called partitions, so that different partitions can be assigned to different nodes (also known as sharding). (Discussed in Chapter 6.)
leader-based replication (also known as active/passive or master–slave replication)
1. One of the replicas is designated the leader (also known as master or primary). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
2. The other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys). Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader.
3. When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client's point of view).
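A toy sketch (hypothetical classes, with in-memory dicts standing in for storage) of the leader/follower write path described above:

class Follower:
    def __init__(self):
        self.store = {}

    def apply(self, entry):
        key, value = entry
        self.store[key] = value  # apply writes in the order the leader sent them

class Leader:
    def __init__(self, followers):
        self.store = {}
        self.log = []            # replication log / change stream
        self.followers = followers

    def write(self, key, value):
        self.store[key] = value          # 1. write to local storage first
        self.log.append((key, value))
        for f in self.followers:         # 2. stream the change to every follower
            f.apply((key, value))

    def read(self, key):
        return self.store.get(key)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Martin")
assert followers[0].store["user:1"] == "Martin"  # reads can go to any replica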
Statement-based replication
In statement-based replication, the leader logs every write request (statement) that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT, UPDATE, or DELETE statement is forwarded to followers. Problems:
Any statement that calls a nondeterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica (see the SQL sketch after this list).
If statements use an autoincrementing column, or if they depend on the existing data in the database, they must be executed in exactly the same order on each replica, or else they may have a different effect.
Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.
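A SQL illustration of the NOW() problem (table name hypothetical). One workaround is for the leader to replace the nondeterministic call with the concrete value it evaluated before logging the statement:

-- As executed on the leader:
INSERT INTO events (id, created_at) VALUES (1, NOW());

-- If shipped verbatim, each replica evaluates NOW() at a different time.
-- Safer to replicate with the value the leader actually used:
INSERT INTO events (id, created_at) VALUES (1, '2024-05-01 12:00:00');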
Write-ahead log (WAL) shipping
In write-ahead log shipping, the log is an append-only sequence of bytes containing all writes to the database. The exact same log can be used to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
This method of replication is used in PostgreSQL and Oracle. Its main disadvantage is that the log describes the data at a very low level (which bytes were changed in which disk blocks), coupling replication closely to the storage engine.
Logical (row-based) log replication
A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:
For an inserted row, the log contains the new values of all columns.
For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.
For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).
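An illustrative (hypothetical) logical log entry for an updated row, identified by primary key and carrying the new column values:

{"table": "person", "op": "update", "key": {"id": 42}, "new_values": {"favorite_number": 7}}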
Trigger-based replication
A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system. The trigger has the opportunity to log this change into a separate table, from which it can be read by an external process. That external process can then apply any necessary application logic and replicate the data change to another system.
Databus for Oracle [20] and Bucardo for Postgres [21] work like this, for example.
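A minimal PostgreSQL sketch of the pattern (table, function, and trigger names are hypothetical): the trigger copies each change into a side table that an external replication process can poll:

CREATE TABLE changelog (seq serial PRIMARY KEY, tbl text, op text, row_id int);

CREATE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO changelog (tbl, op, row_id) VALUES (TG_TABLE_NAME, TG_OP, NEW.id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+ syntax (older versions use EXECUTE PROCEDURE)
CREATE TRIGGER person_changes
AFTER INSERT OR UPDATE ON person
FOR EACH ROW EXECUTE FUNCTION log_change();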
read-after-write consistency,
This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users' updates may not be visible until some later time.
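One common technique is to route reads of data the user may have modified to the leader, and everything else to followers. A toy sketch reusing the hypothetical Leader/Follower classes from the leader-based replication card:

import random

def read_profile(profile_id, requester_id, leader, followers):
    # Read-your-writes: the user's own profile may have just been changed by
    # them, so serve it from the leader; other profiles can come from any follower.
    if profile_id == requester_id:
        return leader.read(f"user:{profile_id}")
    return random.choice(followers).read(f"user:{profile_id}")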
Conflict-free replicated datatypes (CRDTs)
A family of data structures for sets, maps (dictionaries), ordered lists, counters, etc. that can be concurrently edited by multiple users, and which automatically resolve conflicts in sensible ways. Some CRDTs have been implemented in Riak 2.0.
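A minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter): each replica increments only its own slot, and merging takes the per-replica maximum, which is commutative, associative, and idempotent, so replicas converge regardless of merge order:

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # element-wise max resolves concurrent edits deterministically
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("a"), GCounter("b")
a.increment(); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3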