SQL on Hadoop and Spark Flashcards
(32 cards)
SQL-on-Hadoop
Tools that combine the familiar SQL interface with the scalability and flexibility of big data processing frameworks
Why is SQL special?
Users can leverage existing SQL knowledge to interact with large datasets without learning new paradigms like MapReduce or Spark
Batch SQL
SQL-like queries translated into MapReduce or Spark jobs
(load datasets into memory, then query)
Batch SQL query tool examples (2)
Apache Hive, Spark SQL
Interactive SQL
Tools that enable low-latency, interactive querying (supporting traditional BI and analytics)
Interactive SQL query tool examples (2)
Apache Impala, Apache Drill
Operational SQL
Tools supporting small, more frequent writes and queries with fast response times (OLTP)
Examples of OLTP workloads
insert, update, delete operations (small, more frequent queries)
Operational SQL query tool examples (2)
NoSQL, HBase
Apache Hive
Provides a data warehouse-like abstraction over Hadoop, enabling SQL-like queries with HiveQL
What specifically does HiveQL do?
Translates SQL-like queries into MapReduce or Spark jobs
Key features of Apache Hive (3) (SMO)
- Schema on read (unlike a traditional data warehouse, which enforces schema on write)
- Metastore kept in an RDBMS (on a single node)
- Organizes data into units
What are the 4 data units in Hive?
1. Databases (a folder in HDFS)
2. Tables (a set of files in the database's HDFS folder): records sharing the same schema
Optional:
3. Partitions (table records split by column value): faster queries; best for low-cardinality columns
4. Buckets (table records grouped into a fixed number of files): best for high-cardinality columns
(See the DDL sketch below.)
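A minimal DDL sketch, assuming a Hive-enabled Spark session; the database, table, and column names are invented for illustration. It shows how the four units map onto HiveQL statements:

```python
from pyspark.sql import SparkSession

# Hive support gives Spark a Hive metastore and HiveQL-style DDL
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Database -> a folder in HDFS
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")

# 2. Table      -> a set of files in that folder, all sharing one schema
# 3. Partitions -> subfolders keyed on a low-cardinality column (country)
# 4. Buckets    -> a fixed number of files hashed on a high-cardinality column (user_id)
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id BIGINT,
        user_id  BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (country STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS PARQUET
""")
```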
AWS version of Apache Hive
Amazon Athena, which allows SQL querying over data stored in S3
Spark SQL
Allows users to run SQL queries on data stored in Spark structures
Spark SQL DataFrame and Dataset
Essentially RDDs with a schema attached, supporting relational (and procedural) processing
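A minimal PySpark sketch (the rows and column names are invented) showing a DataFrame as rows plus a schema, queryable both relationally through SQL and procedurally through the DataFrame API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame is conceptually an RDD of rows with a schema attached
df = spark.createDataFrame(
    [(1, "Alice", "Sales"), (2, "Bob", "Engineering")],
    schema=["id", "name", "department"],
)

# Relational: register the DataFrame as a view and query it with SQL
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS n FROM employees GROUP BY department").show()

# Procedural: the same aggregation through the DataFrame API
df.groupBy("department").count().show()
```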
Query optimization
When you use SQL with Spark, your SQL queries go through several optimization steps before being executed
Catalyst Optimizer
Generates optimized execution plans for SQL-like queries, restructuring the query plan to make it more efficient
3 Catalyst optimizations (see the sketch below):
- Predicate pushdown (filters/WHERE clauses)
- Column pruning
- JVM code generation (generates bytecode to reduce the overhead of running queries)
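A small sketch (made-up data) of how to inspect what Catalyst produced: `explain(True)` prints the parsed, analyzed, optimized logical, and physical plans for a query.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", "Sales", 100.0), (2, "Bob", "Engineering", 250.0)],
    schema=["id", "name", "department", "amount"],
)

# Catalyst rewrites the query before execution; explain(True) prints
# the parsed, analyzed, optimized logical, and physical plans.
df.filter(col("department") == "Sales").select("name", "amount").explain(True)
```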
Pro and Con of using SQL over Spark
Pro: language simplicity (it is easier to optimize a SQL query than a user-defined function)
Con: the structure imposes limits; RDDs enable arbitrary computation through user-defined functions
Logical vs Physical plans for optimizing queries
The logical plan describes the computations; the physical plan specifies which algorithms are used to carry them out
Constant Folding (Logical Optimization Rules)
Evaluates constant expressions (no variables, e.g. 2 + 3) at query compile time instead of at runtime
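A tiny sketch (made-up data) of constant folding: writing the threshold as `lit(2) + lit(3)` puts the addition into the query plan, and Catalyst folds it to a single literal during planning rather than evaluating it per row.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 100.0), (2, 4.0)], schema=["id", "amount"])

# In the optimized logical plan the predicate shows the folded literal 5
# rather than the expression 2 + 3: Catalyst evaluated it once at planning time.
df.filter(col("amount") > lit(2) + lit(3)).explain(True)
```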
Predicate Pushdown (Logical Optimization Rules)
Pushes filter conditions (e.g. WHERE department = 'Sales') as close to the data source as possible, so Spark reads only the rows that satisfy the filter rather than the full dataset (see the sketch after the Column Pruning card)
Column Pruning (Logical Optimization Rules)
Reads only the columns the query actually needs, rather than every column in the table
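A combined sketch of predicate pushdown and column pruning against a columnar source; `/tmp/employees.parquet` is a hypothetical path used only for illustration. The physical plan's Parquet scan lists the pushed filter under `PushedFilters` and reads only the selected columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical Parquet dataset with department, name, and amount columns
df = spark.read.parquet("/tmp/employees.parquet")

# Predicate pushdown: the department filter is pushed into the Parquet scan.
# Column pruning: only the name and amount columns are read from disk.
df.filter(col("department") == "Sales").select("name", "amount").explain()
```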