Week 2: Data Collection Flashcards
Batch Mode
It’s an analysis mode where results are updated infrequently (after days or months).
Real-time Mode
It’s an analysis mode where results are updated frequently (after seconds).
Interactive Mode
It’s an analysis mode where results are updated on demand as answers to queries.
Hadoop/MapReduce
It’s a framework for distributed data processing, where a job is split into parallel map tasks and reduce tasks. It operates in batch mode.
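As an illustration (not from the lecture notes), the classic word-count job below sketches the map and reduce structure of a Hadoop MapReduce programme; the input and output paths are supplied as command-line arguments.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map task: emit (word, 1) for every word in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, one);
      }
    }
  }
  // Reduce task: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}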
Pig
It’s a high-level language for writing MapReduce programmes. It operates in batch mode.
Spark
It’s a cluster computing framework with various data analytics components. It operates in batch mode.
Solr
It’s a scalable framework for searching data. It operates in batch mode.
Spark Streaming Component
It’s an extension of the core Spark API used for stream processing. It operates in real-time mode.
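A minimal, illustrative sketch (not from the lecture notes): it assumes text lines arrive on a local TCP socket on port 9999 and counts the words seen in each one-second micro-batch.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWordCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
    // Process the stream as one-second micro-batches.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    // Hypothetical source: text lines arriving on a local TCP socket.
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    // Count occurrences of each word within the current batch and print them.
    words.countByValue().print();
    jssc.start();
    jssc.awaitTermination();
  }
}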
Storm
It’s used for stream processing. It operates in real-time mode.
Hive
It’s a data warehousing framework built on top of HDFS (the Hadoop Distributed File System), and it uses a SQL-like language (HiveQL).
Spark SQL Component
It’s a component of Apache Spark and allows for SQL-like queries within Spark programmes.
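A minimal sketch of the idea (the file path, column names, and query are hypothetical): a file is loaded as a DataFrame, registered as a temporary view, and queried with SQL inside a Spark programme.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("SparkSqlExample").getOrCreate();
    // Hypothetical input: JSON records with "sensor" and "temp" fields.
    Dataset<Row> readings = spark.read().json("hdfs:///data/readings.json");
    readings.createOrReplaceTempView("readings");
    // Run a SQL-like query against the view from within the programme.
    Dataset<Row> averages = spark.sql("SELECT sensor, AVG(temp) AS avg_temp FROM readings GROUP BY sensor");
    averages.show();
    spark.stop();
  }
}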
Publish-subscribe Messaging
It’s a type of data access connector. Examples include Apache Kafka and Amazon Kinesis. Publishers send messages to topics, which are managed by an intermediary broker; subscribers subscribe to those topics, and the broker routes messages from publishers to subscribers.
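A sketch of the publisher side using the Kafka Java client (the broker address, topic name, and message are hypothetical):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorPublisher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // address of the broker (assumed)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish one message to the "sensor-readings" topic; the broker
      // delivers it to every subscriber of that topic.
      producer.send(new ProducerRecord<>("sensor-readings", "sensor-1", "22.5"));
    }
  }
}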
Source-sink Connectors
It’s a type of data access connector. Apache Flume is an example. Source connectors import data from another system (e.g. a relational database) into a centralised data store (e.g. a distributed file system), while sink connectors export the data to another system, such as HDFS.
Database Connectors
It’s a type of data access connector. Apache Sqoop is an example. It imports data from relational DBMSs into big data storage and analytics frameworks.
Messaging Queues
It’s a type of data access connector. Examples include RabbitMQ, ZeroMQ, and Amazon SQS. Producers push data into queues and consumers pull data from the queues; producers and consumers don’t need to be aware of each other.
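A minimal producer-side sketch with the RabbitMQ Java client (the host, queue name, and message are hypothetical); a consumer elsewhere would pull from the same queue without knowing anything about this producer.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class TaskProducer {
  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost"); // broker host (assumed)
    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {
      // Declare the queue (idempotent) and push one message into it.
      channel.queueDeclare("collection_tasks", false, false, false, null);
      channel.basicPublish("", "collection_tasks", null, "collect-batch-42".getBytes());
    }
  }
}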
Custom Connectors
It’s a type of data access connector. Examples include connectors that gather data from social networks, NoSQL databases, or IoT devices. They’re built based on the data sources and the data collection requirements.
Apache Sqoop Imports
It imports data from an RDBMS into HDFS. The steps are as follows:
- The table to be imported is examined, mainly by looking at its metadata.
- Java code is generated: a class for the table, a method for each attribute, and methods to interact with JDBC.
- Sqoop connects to the Hadoop cluster and submits a MapReduce job, which transfers the data from the DBMS to HDFS in parallel.
Apache Sqoop Exports
It exports data from HDFS back to an RDBMS. Here are the steps:
- A strategy for the target table is chosen by examining its metadata.
- Java code is generated to parse records from the text files and to build INSERT statements.
- The Java code is used in a submitted MapReduce job that exports the data.
- For efficiency, “m” mappers write the data in parallel, and a single INSERT statement may transfer multiple rows.
Sqoop Import Command Template
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName
JDBC
It’s a set of classes and interfaces written in Java that allows Java programmes to send SQL statements to the database.
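A small sketch of the idea, reusing the hypothetical connection details from the Sqoop template above (the table and column names are also illustrative):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcExample {
  public static void main(String[] args) throws Exception {
    // Open a connection, send an SQL statement, and read the result set.
    String url = "jdbc:mysql://mysql.example.com/sqoop";
    try (Connection conn = DriverManager.getConnection(url, "Username", "Password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM tableName")) {
      while (rs.next()) {
        System.out.println(rs.getInt("id") + " " + rs.getString("name"));
      }
    }
  }
}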
Parallelism
This is when the same task is split across multiple workers running in parallel, reducing the time needed for completion. In the case of Sqoop, more mappers can transfer the data in parallel, but this increases the number of concurrent queries sent to the DBMS. The parameter “m” specifies the number of mappers to run in parallel, with each mapper reading only its own slice of the rows in the dataset.
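For example (connection details as in the hypothetical import template above; the mapper count is illustrative), the same import run with eight parallel mappers:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
-m 8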
Sqoop Imports Involving Updates
In this case, we use the “--incremental append” option, which imports only the new data and doesn’t update existing rows.
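A sketch of such an import (the check column “id” and the last imported value are illustrative): Sqoop appends only rows whose check-column value is greater than the last value recorded.
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
--incremental append \
--check-column id \
--last-value 1000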
Sqoop Export Command Template
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--export-dir Directory
Apache Flume
It’s a source-sink connector for collecting, aggregating, and moving data. Apache Flume gathers data from different sources and moves it into a centralised data store, which is either a distributed file system or a NoSQL database. Compared to ad-hoc alternatives, Apache Flume is reliable, scalable, and has high performance. It’s also manageable, customisable, and has low-cost installation, operation, and maintenance.
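A Flume agent is configured as a set of sources, channels, and sinks in a properties file. A minimal illustrative configuration (the agent name, port, and HDFS path are hypothetical) that reads events from a local TCP port and sinks them into HDFS:
# Name the components of one agent.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# Source: listen for text events on a local TCP port.
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1
# Channel: buffer events in memory between source and sink.
agent1.channels.ch1.type = memory
# Sink: write the events into HDFS (the centralised data store).
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1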