Bigdata Lineage Flashcards
(20 cards)
What is Hadoop
- HDFS
- YARN
- MapReduce
- Hadoop Common
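To make the MapReduce component listed above concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word count. It is plain Python and not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

In a real cluster the map and reduce calls would run in parallel on many machines, with the input read from HDFS; this sketch only shows the shape of the computation.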
What is HBase
Database on top of Hadoop. Random read/write access. Stores large tables (1M columns, 1B rows) atop commodity hardware. Bigtable-like capabilities
Hadoop history
2006
Yahoo
HBase history
2007
Hadoop subproject
What is Dynamo
Dynamo is a key-value store described in a 2007 paper from Amazon. The paper describes the internal database Amazon built to handle its scale.
Where does the term NoSQL come from
- Dynamo paper helped launch the ‘NoSQL’ movement
Cassandra history
2008. Origin: Facebook; DataStax is the commercial vendor. Two Facebook engineers, coming from Amazon, did much of the development.
What is Cassandra
Column-oriented, scalable database
What is a column oriented database
Excels at handling time series and "GROUP BY" aggregations
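A minimal sketch of why a column layout suits "GROUP BY" style aggregations: once records are pivoted into one list per column, an aggregation only scans the columns it needs. This is plain Python, not any database's API; `to_columns`, `group_by_sum` and the sample sensor data are hypothetical, and all rows are assumed to share the same keys.

```python
from collections import defaultdict

# Row-oriented layout: all fields of a record are stored together.
ROWS = [
    {"sensor": "s1", "ts": 1, "temp": 20.0},
    {"sensor": "s2", "ts": 1, "temp": 18.5},
    {"sensor": "s1", "ts": 2, "temp": 22.0},
    {"sensor": "s2", "ts": 2, "temp": 19.5},
]


def to_columns(rows):
    """Pivot row records into one list per column (column-oriented layout)."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def group_by_sum(columns, key_col, value_col):
    """GROUP BY key_col, SUM(value_col): only these two columns are read."""
    totals = defaultdict(float)
    for key, value in zip(columns[key_col], columns[value_col]):
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    cols = to_columns(ROWS)
    print(group_by_sum(cols, "sensor", "temp"))
```

Here the `ts` column is never touched by the aggregation, which is the property that pays off when tables are very wide.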
What is Apache Pig
High-level scripting language to generate MapReduce jobs
Apache Pig history
- Yahoo. Effectively dead since 2017 (latest release: 0.17)
Why is Pig dead?
Preference for SQL:
- Hive
- Spark SQL
Performance and ecosystem: Spark
Apache Hive history
- Facebook.
Tends to be replaced by solutions designed for cloud storage, such as Iceberg, Hudi and Delta Lake
What is Apache Hive
Querying tool, SQL-like interface (HiveQL). Creates MapReduce jobs
Relation between Pig and Hive
Both provide a high-level way to create MapReduce jobs. Pig uses its own scripting language (Pig Latin), while Hive uses a SQL-like language (HiveQL)
Lakehouse architecture solutions
Hive, Iceberg, Hudi, Delta Lake
Next generation solutions compared with Hive
Iceberg, Hudi and Delta Lake
They offer ACID transactions, schema evolution and data versioning. Designed for cloud storage
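The three features named above can be illustrated with a toy in-memory table in which every commit produces a new immutable snapshot. This is only a sketch of the idea and not how Iceberg, Hudi or Delta Lake are implemented (real table formats track metadata rather than copying data); the class `VersionedTable` and its methods are hypothetical.

```python
class VersionedTable:
    """Toy table where every commit creates a new immutable snapshot."""

    def __init__(self, columns):
        # Snapshot 0: the declared schema and no rows.
        self._snapshots = [(tuple(columns), ())]

    @property
    def version(self):
        """Index of the latest snapshot."""
        return len(self._snapshots) - 1

    def append(self, rows):
        """Atomic append: all rows are validated before anything is committed."""
        columns, data = self._snapshots[-1]
        new_rows = []
        for row in rows:
            if set(row) != set(columns):
                raise ValueError(f"row {row} does not match schema {columns}")
            new_rows.append(dict(row))
        self._snapshots.append((columns, data + tuple(new_rows)))

    def add_column(self, name, default=None):
        """Schema evolution: add a column; older snapshots stay untouched."""
        columns, data = self._snapshots[-1]
        new_data = tuple({**row, name: default} for row in data)
        self._snapshots.append((columns + (name,), new_data))

    def read(self, version=None):
        """Read the latest snapshot, or an older one (data versioning)."""
        index = self.version if version is None else version
        _, data = self._snapshots[index]
        return [dict(row) for row in data]


if __name__ == "__main__":
    table = VersionedTable(["id"])
    table.append([{"id": 1}])
    table.add_column("name", default="unknown")
    print(table.read())           # latest snapshot
    print(table.read(version=1))  # snapshot before the schema change
```

A failed `append` raises before any snapshot is added, so readers never see a partially written batch, which is the transactional behaviour the card refers to, in miniature.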
What came after Hadoop
What slowed down Hadoop:
Cloud storage alternatives to HDFS
New processing frameworks such as Spark
Containers and orchestration: Docker, Kubernetes (K8s)
Hadoop Ozone
But Hadoop is still widely used
What is Apache Impala
SQL engine for data stored in HDFS
Like Spark, but before Spark: 2013, Cloudera
Processes data in memory, in contrast to Hive
Largely replaced by Spark
Impala vs Hive vs Spark
Impala: massively parallel processing in memory, just like Spark
Hive: not in memory (creates MapReduce jobs)
Spark: not coupled with HDFS; general purpose