Bigdata Lineage Flashcards

Question 1

Q

What is Hadoop?

Answer

A

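HDFS (distributed storage)
YARN (resource management and scheduling)
MapReduce (batch processing)
Hadoop Common (shared libraries and utilities)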

Question 2

Q

What is HBase?

Answer

A

Database on top of Hadoop (HDFS). Random, real-time read/write access. Stores very large tables (billions of rows by millions of columns) on commodity hardware. Bigtable-like capabilities.

Question 3

Q

Hadoop history

Answer

A

2006
Yahoo

Question 4

Q

HBase history

Answer

A

2007
Hadoop subproject

Question 5

Q

What is Dynamo?

Answer

A

Dynamo is an internal key-value store built at Amazon to handle its scale. It is known mainly through the paper Amazon published about it in 2007.

Question 6

Q

Where does the term NoSQL come from?

Answer

A

The Dynamo paper helped launch the "NoSQL" movement.

Question 7

Q

Cassandra history

Answer

A

Open-sourced in 2008. Origin: Facebook, where two engineers did much of the development; one of them had previously worked on Dynamo at Amazon. DataStax provides commercial support.

Question 8

Q

What is Cassandra?

Answer

A

Scalable, distributed wide-column database (often described as column-oriented)

Question 9

Q

What is a column-oriented database?

Answer

A

Stores data column by column rather than row by row. Excels at handling time series and aggregations such as GROUP BY.
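
As a rough illustration of why a columnar layout suits GROUP BY-style aggregation, the sketch below pivots a few rows into one list per column and aggregates by scanning only the columns the query needs. The sensor data, the column names and the `avg_temp_by_sensor` helper are all hypothetical; real column stores add compression, encoding and on-disk layout that this toy ignores.

```python
from collections import defaultdict

# Hypothetical sensor readings; names and values are illustrative only.
rows = [
    {"sensor": "a", "ts": 1, "temp": 20.0},
    {"sensor": "b", "ts": 1, "temp": 31.0},
    {"sensor": "a", "ts": 2, "temp": 22.0},
    {"sensor": "b", "ts": 2, "temp": 29.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_temp_by_sensor(cols):
    """Rough equivalent of: SELECT sensor, AVG(temp) ... GROUP BY sensor."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # Only the two columns the query needs are scanned; "ts" is never read.
    for sensor, temp in zip(cols["sensor"], cols["temp"]):
        totals[sensor] += temp
        counts[sensor] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in totals}


result = avg_temp_by_sensor(columns)
print(result)
```

In a row-oriented layout the same query would have to read every full row, including the unused `ts` field.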

Question 10

Q

What is Apache Pig?

Answer

A

High-level scripting language to generate MapReduce jobs

Question 11

Q

Apache Pig history

Answer

A

Created at Yahoo. Effectively dormant since 2017 (latest release: 0.17)

Question 12

Q

Why is Pig dead?

Answer

A

Preference for SQL:
- Hive
- Spark SQL
Performance and ecosystem: Spark

Question 13

Q

Apache Hive history

Answer

A

Created at Facebook.
Tends to be replaced by solutions designed for cloud storage, such as Iceberg, Hudi and Delta Lake.

Question 14

Q

What is Apache Hive?

Answer

A

Querying tool, SQL-like interface (HiveQL). Creates MapReduce jobs

Question 15

Q

Relation between Pig and Hive

Answer

A

Both provide a high-level way to create MapReduce jobs. Pig uses its own language (Pig Latin), while Hive uses a SQL-like language (HiveQL).
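
To make "create MapReduce jobs" concrete, here is a minimal sketch of the map, shuffle and reduce phases that a query shaped like `SELECT page, COUNT(*) FROM views GROUP BY page` roughly corresponds to. It is plain Python, not the Hadoop API and not what Hive or Pig actually generate; the `views` table, the records and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records, e.g. page views as (user, page) pairs.
records = [("ann", "/home"), ("bob", "/home"), ("ann", "/cart"), ("ann", "/home")]


def map_phase(record):
    """Emit (key, 1) for each record; the key is the GROUP BY column."""
    _user, page = record
    yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values of one key; here COUNT(*) becomes a sum of ones."""
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Pig Latin and HiveQL spare the user from writing these phases by hand: the user states the grouping and the aggregate, and the tool plans the jobs.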

Question 16

Q

Lakehouse architecture solutions

Answer

A

Hive, Iceberg, Hudi, Delta Lake

Question 17

Q

Next-generation solutions compared with Hive

Answer

A

Iceberg, Hudi and Delta Lake.
They offer ACID transactions, schema evolution and data versioning, and are designed for cloud storage.
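
The idea of data versioning can be pictured with a toy table in which every commit publishes a new immutable snapshot and older snapshots stay readable. This is only a sketch under simplifying assumptions: `VersionedTable` is a hypothetical class, it keeps rows in memory, and the real table formats track metadata and data files on storage instead.

```python
class VersionedTable:
    """Toy table: every commit creates a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, new_rows):
        """Publish a new snapshot in one step; readers never see a partial write."""
        current = self._snapshots[-1]
        self._snapshots.append(current + tuple(new_rows))
        return len(self._snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an older one by id."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])


table = VersionedTable()
v1 = table.commit([{"id": 1}])
v2 = table.commit([{"id": 2, "country": "FR"}])  # later rows may carry a new column
print(table.read(v1))  # the table as of the first commit
print(table.read())    # the latest version
```

Because snapshots are never modified in place, a reader holding `v1` keeps a consistent view while later commits arrive.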

Question 18

Q

What came after Hadoop?

Answer

A

What slowed Hadoop down:
- Cloud storage alternatives to HDFS
- New processing frameworks such as Spark
- Docker, Kubernetes (K8s)
- Hadoop Ozone

But Hadoop is still widely used.

Question 19

Q

What is Apache Impala?

Answer

A

SQL engine for data stored in HDFS
Similar in spirit to Spark SQL, but earlier (2013, Cloudera)
Processes data in memory, in contrast to Hive
Largely superseded by Spark

Question 20

Q

Impala vs Hive vs Spark

Answer

A

Impala: massively parallel processing (MPP) in memory, like Spark
Hive: not in memory
Spark: not coupled to HDFS; general purpose

(20 cards)