SQL on Hadoop and Spark Flashcards
(32 cards)
SQL-on-Hadoop
Tools that combine the familiar SQL interface with the scalability and flexibility of big data processing frameworks
Why is SQL special?
Users can leverage existing SQL knowledge to interact with large datasets without learning new paradigms like MapReduce or Spark
Batch SQL
SQL-like queries translated into MapReduce or Spark jobs
(load datasets into memory, then query)
Batch SQL query tool examples (2)
Apache Hive, Spark SQL
Interactive SQL
Tools that enable low-latency, interactive querying (supporting traditional BI and analytics)
Interactive SQL query tool examples (2)
Apache Impala, Apache Drill
Operational SQL
Tools supporting small, more frequent writes and queries with fast response times (OLTP)
Examples of OLTP workloads
insert, update, delete operations (small, more frequent queries)
Operational SQL query tool examples (2)
NoSQL, HBase
Apache Hive
Provides a data warehouse-like abstraction over Hadoop, enabling SQL-like queries with HiveQL
What specifically does HiveQL do?
Translates SQL-like queries into MapReduce or Spark jobs
Key features of Apache Hive (3) (SMO)
- Schema on read (unlike a traditional data warehouse, which enforces schema on write)
- Metastore kept in an RDBMS (on a single node)
- Organizes data into units
What are the 4 data units in Hive?
1. Databases (a folder in HDFS)
2. Tables (a set of files in the database's HDFS folder): records sharing the same schema
Optional:
3. Partitions (table records split by column value): faster queries; best for low-cardinality columns
4. Buckets (table records grouped into a fixed number of files): best for high-cardinality columns
(See the DDL sketch below.)
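A minimal DDL sketch, assuming a Hive-enabled Spark session; the database, table, and column names are invented for illustration. It shows how the four units map onto HiveQL statements:

```python
from pyspark.sql import SparkSession

# Hive support gives Spark a Hive metastore and HiveQL-style DDL
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Database -> a folder in HDFS
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")

# 2. Table      -> a set of files in that folder, all sharing one schema
# 3. Partitions -> subfolders keyed on a low-cardinality column (country)
# 4. Buckets    -> a fixed number of files hashed on a high-cardinality column (user_id)
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id BIGINT,
        user_id  BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (country STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS PARQUET
""")
```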
AWS version of Apache Hive
Amazon Athena, which allows SQL querying over data stored in S3
Spark SQL
Allows users to run SQL queries on data stored in Spark structures
Spark SQL DataFrame and Dataset
Essentially RDDs with a schema attached, supporting relational (and procedural) processing
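A minimal PySpark sketch (the rows and column names are invented) showing a DataFrame as rows plus a schema, queryable both relationally through SQL and procedurally through the DataFrame API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame is conceptually an RDD of rows with a schema attached
df = spark.createDataFrame(
    [(1, "Alice", "Sales"), (2, "Bob", "Engineering")],
    schema=["id", "name", "department"],
)

# Relational: register the DataFrame as a view and query it with SQL
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS n FROM employees GROUP BY department").show()

# Procedural: the same aggregation through the DataFrame API
df.groupBy("department").count().show()
```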
Query optimization
When you use SQL with Spark, your SQL queries go through several optimization steps before being executed
Catalyst Optimizer
Generates optimized execution plans for SQL-like queries, restructuring the query plan to make it more efficient
3 Catalyst optimizations (see the sketch below):
- Predicate pushdown (filters/WHERE clauses)
- Column pruning
- JVM code generation (generates bytecode to reduce the overhead of running queries)
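A small sketch (made-up data) of how to inspect what Catalyst produced: `explain(True)` prints the parsed, analyzed, optimized logical, and physical plans for a query.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", "Sales", 100.0), (2, "Bob", "Engineering", 250.0)],
    schema=["id", "name", "department", "amount"],
)

# Catalyst rewrites the query before execution; explain(True) prints
# the parsed, analyzed, optimized logical, and physical plans.
df.filter(col("department") == "Sales").select("name", "amount").explain(True)
```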
Pro and Con of using SQL over Spark
Pro: language simplicity (it is easier to optimize a SQL query than a user-defined function)
Con: the structure imposes limits; RDDs enable arbitrary computation through user-defined functions
Logical vs Physical plans for optimizing queries
The logical plan describes the computations; the physical plan specifies which algorithms are used to carry them out
Constant Folding (Logical Optimization Rules)
Evaluates constant expressions (no variables, e.g. 2 + 3) at query compile time instead of at runtime
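A tiny sketch (made-up data) of constant folding: writing the threshold as `lit(2) + lit(3)` puts the addition into the query plan, and Catalyst folds it to a single literal during planning rather than evaluating it per row.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 100.0), (2, 4.0)], schema=["id", "amount"])

# In the optimized logical plan the predicate shows the folded literal 5
# rather than the expression 2 + 3: Catalyst evaluated it once at planning time.
df.filter(col("amount") > lit(2) + lit(3)).explain(True)
```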
Predicate Pushdown (Logical Optimization Rules)
Pushes filter conditions (e.g. WHERE department = 'Sales') as close to the data source as possible, so Spark reads only the rows that satisfy the filter rather than the full dataset (see the sketch after the Column Pruning card)
Column Pruning (Logical Optimization Rules)
Reads only the columns the query actually needs, rather than every column in the table
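A combined sketch of predicate pushdown and column pruning against a columnar source; `/tmp/employees.parquet` is a hypothetical path used only for illustration. The physical plan's Parquet scan lists the pushed filter under `PushedFilters` and reads only the selected columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical Parquet dataset with department, name, and amount columns
df = spark.read.parquet("/tmp/employees.parquet")

# Predicate pushdown: the department filter is pushed into the Parquet scan.
# Column pruning: only the name and amount columns are read from disk.
df.filter(col("department") == "Sales").select("name", "amount").explain()
```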