Week 2: Data Collection Flashcards
Batch Mode
It’s an analysis mode where results are updated infrequently (after days or months).
Real-time Mode
It’s an analysis mode where results are updated frequently (after seconds).
Interactive Mode
It’s an analysis mode where results are updated on demand as answers to queries.
Hadoop/MapReduce
It’s a framework for distributed data processing, where a job is split into parallel map tasks and reduce tasks. It operates in batch mode.
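As an illustration (not from the lecture notes), the classic word-count job below sketches the map and reduce structure of a Hadoop MapReduce programme; the input and output paths are supplied as command-line arguments.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map task: emit (word, 1) for every word in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, one);
      }
    }
  }
  // Reduce task: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}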
Pig
It’s a high-level language for writing MapReduce programmes. It operates in batch mode.
Spark
It’s a cluster computing framework with various data analytics components. It operates in batch mode.
Solr
It’s a scalable framework for searching data. It operates in batch mode.
Spark Streaming Component
It’s an extension of the core Spark API used for stream processing. It operates in real-time mode.
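A minimal, illustrative sketch (not from the lecture notes): it assumes text lines arrive on a local TCP socket on port 9999 and counts the words seen in each one-second micro-batch.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWordCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
    // Process the stream as one-second micro-batches.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    // Hypothetical source: text lines arriving on a local TCP socket.
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    // Count occurrences of each word within the current batch and print them.
    words.countByValue().print();
    jssc.start();
    jssc.awaitTermination();
  }
}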
Storm
It’s used for stream processing. It operates in real-time mode.
Hive
It’s a data warehousing framework built on top of HDFS (the Hadoop Distributed File System), and it uses a SQL-like language (HiveQL).
Spark SQL Component
It’s a component of Apache Spark and allows for SQL-like queries within Spark programmes.
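A minimal sketch of the idea (the file path, column names, and query are hypothetical): a file is loaded as a DataFrame, registered as a temporary view, and queried with SQL inside a Spark programme.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("SparkSqlExample").getOrCreate();
    // Hypothetical input: JSON records with "sensor" and "temp" fields.
    Dataset<Row> readings = spark.read().json("hdfs:///data/readings.json");
    readings.createOrReplaceTempView("readings");
    // Run a SQL-like query against the view from within the programme.
    Dataset<Row> averages = spark.sql("SELECT sensor, AVG(temp) AS avg_temp FROM readings GROUP BY sensor");
    averages.show();
    spark.stop();
  }
}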
Publish-subscribe Messaging
It’s a type of data access connector. Examples include Apache Kafka and Amazon Kinesis. Publishers send messages to topics, which are managed by an intermediary broker; subscribers subscribe to those topics, and the broker routes messages from publishers to subscribers.
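A sketch of the publisher side using the Kafka Java client (the broker address, topic name, and message are hypothetical):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorPublisher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // address of the broker (assumed)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish one message to the "sensor-readings" topic; the broker
      // delivers it to every subscriber of that topic.
      producer.send(new ProducerRecord<>("sensor-readings", "sensor-1", "22.5"));
    }
  }
}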
Source-sink Connectors
It’s a type of data access connector. Apache Flume is an example. Source connectors import data from another system (e.g. a relational database) into a centralised data store (e.g. a distributed file system), while sink connectors export the data to another system, such as HDFS.
Database Connectors
It’s a type of data access connector. Apache Sqoop is an example. It imports data from relational DBMSs into big data storage and analytics frameworks.
Messaging Queues
It’s a type of data access connector. Examples include RabbitMQ, ZeroMQ, and Amazon SQS. Producers push data into queues and consumers pull data from the queues; producers and consumers don’t need to be aware of each other.
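A minimal producer-side sketch with the RabbitMQ Java client (the host, queue name, and message are hypothetical); a consumer elsewhere would pull from the same queue without knowing anything about this producer.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class TaskProducer {
  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost"); // broker host (assumed)
    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {
      // Declare the queue (idempotent) and push one message into it.
      channel.queueDeclare("collection_tasks", false, false, false, null);
      channel.basicPublish("", "collection_tasks", null, "collect-batch-42".getBytes());
    }
  }
}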
Custom Connectors
It’s a type of data access connector. Examples include connectors that gather data from social networks, NoSQL databases, or IoT devices. They’re built based on the data sources and the data collection requirements.
Apache Sqoop Imports
It imports data from an RDBMS into HDFS. The steps are as follows:
- The table to be imported is examined, mainly by looking at its metadata.
- Java code is generated: a class for the table, a method for each attribute, and methods to interact with JDBC.
- Sqoop connects to the Hadoop cluster and submits a MapReduce job, which transfers the data from the DBMS to HDFS in parallel.
Apache Sqoop Exports
It exports data from HDFS back to an RDBMS. Here are the steps:
- A strategy for the target table is chosen by examining its metadata.
- Java code is generated to parse records from the text files and to build INSERT statements.
- The Java code is used in a submitted MapReduce job that exports the data.
- For efficiency, “m” mappers write the data in parallel, and a single INSERT statement may transfer multiple rows.
Sqoop Import Command Template
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName
JDBC
It’s a set of classes and interfaces written in Java that allows Java programmes to send SQL statements to the database.
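A small sketch of the idea, reusing the hypothetical connection details from the Sqoop template above (the table and column names are also illustrative):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcExample {
  public static void main(String[] args) throws Exception {
    // Open a connection, send an SQL statement, and read the result set.
    String url = "jdbc:mysql://mysql.example.com/sqoop";
    try (Connection conn = DriverManager.getConnection(url, "Username", "Password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM tableName")) {
      while (rs.next()) {
        System.out.println(rs.getInt("id") + " " + rs.getString("name"));
      }
    }
  }
}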
Parallelism
This is when the same task is split across multiple workers running in parallel, reducing the time needed for completion. In the case of Sqoop, more mappers can transfer the data in parallel, but this increases the number of concurrent queries sent to the DBMS. The parameter “m” specifies the number of mappers to run in parallel, with each mapper reading only its own slice of the rows in the dataset.
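For example (connection details as in the hypothetical import template above; the mapper count is illustrative), the same import run with eight parallel mappers:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
-m 8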
Sqoop Imports Involving Updates
In this case, we use the “--incremental append” option, which imports only the new data and doesn’t update existing rows.
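A sketch of such an import (the check column “id” and the last imported value are illustrative): Sqoop appends only rows whose check-column value is greater than the last value recorded.
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
--incremental append \
--check-column id \
--last-value 1000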
Sqoop Export Command Template
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--export-dir Directory
Apache Flume
It’s a source-sink connector for collecting, aggregating, and moving data. Apache Flume gathers data from different sources and moves it into a centralised data store, which is either a distributed file system or a NoSQL database. Compared to ad-hoc alternatives, Apache Flume is reliable, scalable, and has high performance. It’s also manageable, customisable, and has low-cost installation, operation, and maintenance.
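A Flume agent is configured as a set of sources, channels, and sinks in a properties file. A minimal illustrative configuration (the agent name, port, and HDFS path are hypothetical) that reads events from a local TCP port and sinks them into HDFS:
# Name the components of one agent.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# Source: listen for text events on a local TCP port.
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1
# Channel: buffer events in memory between source and sink.
agent1.channels.ch1.type = memory
# Sink: write the events into HDFS (the centralised data store).
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1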