Highload Application Flashcards by Vlad Hilko

What is Kafka?

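Kafka is an open-source distributed event streaming platform for publishing, storing, reading and processing streams of records.

Loosely, it plays a role similar to Redis used as a message broker, but messages are persisted to disk in a replicated, append-only log, so durability is closer to that of a database.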

What is Memcached?

Memcached is an open source, high-performance, distributed memory caching system intended to speed up dynamic web applications by reducing the database load. It is a key-value dictionary of strings, objects, etc., stored in the memory, resulting from database calls, API calls, or page rendering.

( Tools for caching )

What is ElasticSearch?

Elasticsearch is a real-time distributed and open source full-text search and analytics engine.

What is Solr?

Solr is a scalable, ready-to-deploy search and storage engine, built on Apache Lucene, that is optimized for searching large volumes of text-centric data.

What is Reliability?

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error)

What is Maintainability?

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively

What kinds of faults can break reliability?

Hardware faults (a failed disk or database server, a power outage, etc.)
Software faults (infinite recursion, cascading failures)
Human error (accidentally deleting something important)

What examples of load parameters (used to describe scalability) do you know?

Number of requests to the web server per second
Number of database reads/writes per second
Number of simultaneously active users in a chat

What is Hadoop?

Hadoop is an open-source software framework for storing and processing very large amounts of data of any kind across clusters of commodity machines.

What is MapReduce?

MapReduce is a module in the Apache Hadoop open source ecosystem. We use MapReduce to write scalable applications that can do parallel processing to process a large amount of data on a large cluster of commodity hardware servers.
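
The map, shuffle and reduce steps can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not the Hadoop API; the function names are illustrative, and a real cluster would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one (word, 1) pair for every word in the document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values of one key into a single result.
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data node"])
```

Because each map call only sees one document and each reduce call only sees one key, both phases can be spread across many servers.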

What is a rolling upgrade?

A rolling upgrade is an upgrade of a software version performed without noticeable downtime or other disruption of service. (Servers sit behind a load balancer and are upgraded one at a time, so the remaining servers keep handling traffic.)

What is Shared-nothing architecture?

Shared-nothing architecture (SNA) is a distributed computing architecture that consists of multiple separate nodes that don't share resources. Each node is independent and self-sufficient, with its own memory, storage, and input/output interfaces. The data set or workload is split into smaller parts that are distributed across the nodes.

What is replication?

Replication is the continuous copying of data changes from one database (publisher) to another database (subscriber).

What is database table partitioning (also called sharding)?

Partitioning is the database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.

What replication strategies do you know?

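single-leader (one leader node accepts writes and sends the changes to the other nodes)
multi-leader (several leader nodes accept writes and send the changes to each other and to the other nodes)
leaderless (the client sends each write to several nodes directly)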

What are the differences between synchronous, asynchronous and semi-synchronous replication?

Synchronous replication waits until all child nodes have confirmed the update before reporting success.
Asynchronous replication reports success without waiting for the child nodes.
Semi-synchronous replication is synchronous with one child node and asynchronous with the others.

How do you add one more child node without downtime or data loss?

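Take a consistent snapshot of the leader database and note its position in the leader's replication log
Copy the snapshot to the new child node
Request from the leader all changes made since that log position
Apply these changes to the child node until it has caught up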

What replication data sending strategies do you know?

Statement-based replication (SBR)
Write-Ahead Logging (WAL)/ Streaming Replication
Logical replication
Trigger replication

What is Statement-based replication (SBR) ? What are pros/cons?

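The binary log stores the SQL statements used to change the database on the master server. The slave reads this log and re-executes the statements to produce a copy of the master database.

Pros
- The log is compact, because one statement can describe changes to many rows

Problems
- Nondeterministic functions inside a statement (for example RAND() or NOW()) can produce different values on each replica
- Auto-incremented columns require statements to run in exactly the same order on every replica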
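
The problem with random functions can be shown with a small simulation. It is only a sketch: the dictionaries stand in for databases, `apply_statement` stands in for a hypothetical `UPDATE ... SET token = RAND()`, and the two differently seeded generators model the fact that each server evaluates the function on its own.

```python
import random

def apply_statement(db, rng):
    # Stand-in for re-executing a statement that calls a random function.
    db["token"] = rng.random()

def apply_row_change(db, row):
    # Stand-in for shipping the resulting row values instead of the statement.
    db.update(row)

leader, statement_replica, row_replica = {}, {}, {}

apply_statement(leader, random.Random(1))
# The replica re-executes the same statement with its own generator state.
apply_statement(statement_replica, random.Random(2))
# Shipping the values the leader actually wrote keeps the copy identical.
apply_row_change(row_replica, dict(leader))
```

After this runs, `statement_replica` holds a different token than `leader`, while `row_replica` matches it.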

What is Write-Ahead Logging (WAL) replication/ Streaming Replication?

WAL stands for Write-Ahead Logging: every change is appended to a log, in order of occurrence, before it is applied to the database files. In WAL (streaming) replication the leader sends this low-level log to the replicas, which replay it to rebuild the same data.

What is Logical replication?

Logical replication is a method of replicating data objects and their changes, based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication.

What is trigger replication?

Trigger-based replication uses database triggers to record data changes so that application code can process and replicate them. It's useful when you replicate to a different kind of database or need custom logic.

What is replication lag?

Replication lag is the delay between the moment a transaction or operation is applied on the primary/master node and the moment it is applied on the standby/slave node. (During this time the main and child nodes hold different data.)

What is read-after-write consistency?

Read-after-write consistency is the ability to view changes (read data) right after making those changes (write data). For example, if you have a user profile and you change your bio on the profile, you should see the updated bio if you refresh the page. There should be no delay during which the old bio shows up.

How to handle replication lag to get read-after-write consistency?

- Read data that the user can edit (for example the user's own profile) from the primary node
- Read everything from the primary node during the first minute after an update
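
The second rule can be sketched as a small read router. The `ReadRouter` class, its method names and the 60-second constant are illustrative assumptions; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
READ_FROM_PRIMARY_WINDOW = 60.0  # seconds; the "first minute" from the card

class ReadRouter:
    def __init__(self):
        self.last_write_at = {}  # user_id -> time of that user's last write

    def record_write(self, user_id, now):
        self.last_write_at[user_id] = now

    def choose_node(self, user_id, now):
        last = self.last_write_at.get(user_id)
        # Recent writers read from the primary so they see their own change.
        if last is not None and now - last < READ_FROM_PRIMARY_WINDOW:
            return "primary"
        return "replica"

router = ReadRouter()
router.record_write("alice", now=100.0)
node_soon = router.choose_node("alice", now=130.0)   # within the window
node_later = router.choose_node("alice", now=161.0)  # window has passed
node_other = router.choose_node("bob", now=130.0)    # never wrote anything
```

Only users who wrote recently add load to the primary; everyone else keeps reading from replicas.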

What is Monotonic Reads?

Monotonic read consistency guarantees that after a process reads a value of data item x at time t, it will never see an older value of that data item. (A violation can happen if the user reads from different replicas.)
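
One common way to get monotonic reads is to make each user always read from the same replica, for example by hashing the user ID. The sketch below assumes a hypothetical list of replica names; `hashlib` is used because Python's built-in `hash()` for strings changes between processes.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical names

def replica_for(user_id):
    # A stable hash sends the same user to the same replica on every request.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

first = replica_for("alice")
second = replica_for("alice")
```

If the chosen replica fails, the user has to be rerouted, and the guarantee can be briefly lost during that switch.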

What consistency guarantees address replication lag problems?

- Read-after-write consistency
- Monotonic reads
- Consistent prefix reads

What are consistent prefix reads?

The consistent prefix reads guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

How to resolve conflicts in multi-leader replication? (When the same data is changed on different leader nodes)

- Avoid conflicts (route all writes for a given record through the same leader)
- Last write wins (compare write timestamps and keep the latest)
- Give each replica a number (the write from the replica with the highest number has priority)
- Save the conflicting versions in a separate data structure and resolve them later with the user
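
Last write wins can be sketched in a few lines. The tuple layout `(timestamp, replica_number, value)` is an assumption made for this example; the replica number is used only to break timestamp ties, which combines the second and third strategies above.

```python
def last_write_wins(versions):
    # versions: list of (timestamp, replica_number, value) tuples.
    # The latest timestamp wins; equal timestamps fall back to the replica number.
    return max(versions, key=lambda v: (v[0], v[1]))[2]

conflict = [
    (1700000000.5, 1, "bio written on leader 1"),
    (1700000001.2, 2, "bio written on leader 2"),
]
winner = last_write_wins(conflict)
tie_winner = last_write_wins([(5, 1, "x"), (5, 3, "y")])
```

The losing write is silently discarded, so this strategy is simple but can lose data, and it depends on the clocks of the leaders being reasonably synchronized.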

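What replication topologies for multi-leader nodes do you know?

- Ring (circular) topology (each node forwards writes to the next node in the ring)
- Star topology (one central node forwards writes to all the others)
- All-to-all topology (every node sends its writes to every other node)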

How to fix outdated values in leaderless replication?

- Read repair: when a user reads data, compare the versions returned by the nodes and update the outdated values
- Anti-entropy: a background process that finds and updates outdated values
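
The first approach, repairing outdated values at read time, can be sketched as follows. Replicas are modelled as plain dictionaries mapping a key to a `(version, value)` pair; that layout and the function name are assumptions for illustration.

```python
def read_with_repair(replicas, key):
    # Ask every replica; a missing key counts as version 0.
    responses = [replica.get(key, (0, None)) for replica in replicas]
    newest = max(responses, key=lambda response: response[0])
    # Write the newest version back to every replica that returned an older one.
    for replica, response in zip(replicas, responses):
        if response[0] < newest[0]:
            replica[key] = newest
    return newest[1]

replicas = [{"x": (2, "new")}, {"x": (1, "old")}, {}]
value = read_with_repair(replicas, "x")
```

Values that are rarely read never get repaired this way, which is why a background process is used alongside it.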

What partitioning strategies for distributing data do you know?

- RANGE partitioning (rows are assigned to partitions based on column values falling within a given range)
- HASH partitioning (the key is passed through a hash function, and the resulting hash, which looks random but is deterministic, determines which node stores the row)
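
Both strategies can be sketched as functions that map a key to a partition number. The boundaries, the partition count and the use of the first letter of the key are illustrative assumptions.

```python
import bisect
import hashlib

# Range boundaries: partition 0 holds keys before "h", partition 1 holds
# "h" up to (not including) "p", partition 2 holds everything from "p" on.
RANGE_BOUNDARIES = ["h", "p"]

def range_partition(key):
    return bisect.bisect_right(RANGE_BOUNDARIES, key[0].lower())

def hash_partition(key, num_partitions=4):
    # The hash looks random but is deterministic: the same key always
    # lands on the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

range_results = [range_partition(k) for k in ("apple", "hello", "zebra")]
hash_result = hash_partition("apple")
```

Range partitioning keeps neighbouring keys together, which helps range scans; hash partitioning spreads keys evenly but loses that ordering.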

What is a partitioning hot spot?

A partition with disproportionately high load is known as a hot spot. Such partitioning is less effective and leads to uneven load distribution among nodes.

What is document-based partitioning for a secondary index?

Document-based partitioning (also called a local index): each node builds an index over its own data only, so a search by the indexed field has to query all nodes and combine the results. (For example, cars partitioned by id and searched by color.)

What is term-based partitioning for a secondary index?

Term-based partitioning (also called a global index): the index covers data from all nodes and is itself split across nodes by the indexed value. (For example, colors from a to j on one node and the rest on another.)
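
The difference between the local and the global index can be sketched with the car example. Everything here is a toy model: two partitions, cars placed by `id % 2`, and colors split at the letter "j" as in the card.

```python
from collections import defaultdict

cars = {1: "red", 2: "red", 3: "blue", 4: "green"}  # car id -> color

def data_partition(car_id, num_partitions=2):
    return car_id % num_partitions

# Local (document-partitioned) index: each partition indexes only its own cars.
local_index = [defaultdict(set), defaultdict(set)]
for car_id, color in cars.items():
    local_index[data_partition(car_id)][color].add(car_id)

def find_local(color):
    # Scatter/gather: every partition must be asked.
    result = set()
    for index in local_index:
        result |= index.get(color, set())
    return result

# Global (term-partitioned) index: colors a-j on partition 0, the rest on 1.
def term_partition(color):
    return 0 if color[0] <= "j" else 1

global_index = [defaultdict(set), defaultdict(set)]
for car_id, color in cars.items():
    global_index[term_partition(color)][color].add(car_id)

def find_global(color):
    # Only the one partition that owns this color is asked.
    return set(global_index[term_partition(color)].get(color, set()))

red_local = find_local("red")
red_global = find_global("red")
```

Reads are cheaper with the global index, but a single write may have to update index entries on a different node than the one storing the car.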

What is data rebalancing?

When a new node joins the cluster, some of the partitions are relocated to the new node so that the data remains distributed equally in the cluster. This process is called data rebalancing. ( moving old data to a new node to keep balance)

What rebalancing methods do you know?

- Fixed number of partitions
- Dynamic partitioning

What is Fixed Number of Partitions rebalancing?

With a fixed number of partitions (say 100, many more than there are nodes), each node holds several partitions. When a node is added, it takes over a few partitions from every existing node until partitions are evenly distributed again. (Whole partitions move between nodes; the assignment of keys to partitions does not change.)
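
The way a new node takes partitions from the existing ones can be sketched as follows. The `add_node` function and the node names are hypothetical, and the example uses 12 partitions instead of 100 to keep the numbers small.

```python
def add_node(assignment, new_node):
    # assignment: dict mapping partition id -> name of the node that owns it.
    nodes = sorted(set(assignment.values())) + [new_node]
    target = len(assignment) // len(nodes)  # fair share per node
    counts = {node: 0 for node in nodes}
    for owner in assignment.values():
        counts[owner] += 1

    result = dict(assignment)
    for partition in sorted(result):
        if counts[new_node] >= target:
            break
        owner = result[partition]
        # Take a partition only from nodes that hold more than their share.
        if counts[owner] > target:
            result[partition] = new_node
            counts[owner] -= 1
            counts[new_node] += 1
    return result

before = {p: f"node-{p % 3}" for p in range(12)}  # 3 nodes, 4 partitions each
after = add_node(before, "node-3")
moved = [p for p in before if before[p] != after[p]]
```

Only three of the twelve partitions change owner, and no key is ever reassigned to a different partition.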

What is Dynamic Partitioning rebalancing?

For data with key-range partitioning, a fixed number of partitions would be very troublesome. With dynamic partitioning, when a partition grows beyond a configured size it is split into two, and one of the halves can be moved to another node.
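
The split step can be sketched with a partition modelled as a sorted list of keys. The threshold of 4 keys is a made-up value; real systems configure the limit as a size in bytes.

```python
MAX_KEYS = 4  # hypothetical threshold for this example

def split_if_needed(partition):
    # partition: sorted list of keys covering one contiguous key range.
    if len(partition) <= MAX_KEYS:
        return [partition]
    middle = len(partition) // 2
    # Both halves still cover contiguous key ranges.
    return [partition[:middle], partition[middle:]]

parts = split_if_needed(["a", "c", "f", "k", "q", "x"])
unchanged = split_if_needed(["a", "b"])
```

The number of partitions therefore grows with the amount of data instead of being fixed in advance.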

What is Two-phase commit transaction?

Two-phase commit (2PC) is a protocol for atomic commit in distributed systems: it ensures that a transaction spanning several nodes either commits on all of them or aborts on all of them.

What is Two-phase commit coordinator?

The coordinator asks every participant to prepare and collects their votes. After it has received a reply from every participant, it decides whether to commit or abort the transaction. Two rules govern the coordinator's global decision:
- If even one participant votes to abort the transaction, the coordinator has to reach a global abort decision.
- If all the participants vote to commit the transaction, the coordinator has to reach a global commit decision.
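
The coordinator's decision rule can be sketched as follows. The `Participant` class is a hypothetical in-memory stand-in; the sketch leaves out timeouts, durable logging of the decision and crash recovery, which a real implementation needs.

```python
class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and promise to commit later) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # A single "no" vote forces a global abort.
    decision = "commit" if all(votes) else "abort"
    # Phase 2: send the same decision to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


all_yes = [Participant(True), Participant(True)]
one_no = [Participant(True), Participant(False)]
result_yes = two_phase_commit(all_yes)
result_no = two_phase_commit(one_no)
```

In the second run even the participant that voted yes ends up aborted, which is what makes the outcome atomic across nodes.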

What is XA Transaction?

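XA (eXtended Architecture) is an X/Open standard interface for implementing two-phase commit across different databases and message brokers.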

Highload Application Flashcards

(42 cards)