Spark and Databricks Flashcards
(78 cards)
Can you explain the design schemas relevant to data modeling?
There are three data modeling design schemas: Star, Snowflake, and Galaxy.
The star schema contains multiple dimension tables connected to a single fact table in the center.
The snowflake schema is an extension of the star schema: the dimension tables around the fact table are further normalised into additional layers, giving a snowflake-like shape.
The galaxy schema contains two or more fact tables that share dimension tables between them.
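A minimal sketch of a star schema query in Spark SQL, assuming hypothetical fact_sales, dim_date, and dim_product tables (none of these names come from the card): the fact table in the center is joined out to its dimension tables.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

# hypothetical tables: one central fact table, two dimension tables
fact_sales = spark.table("fact_sales")       # measures plus foreign keys
dim_date = spark.table("dim_date")
dim_product = spark.table("dim_product")

report = (fact_sales
          .join(dim_date, "date_key")         # join out to each dimension
          .join(dim_product, "product_key")
          .groupBy("year", "category")
          .sum("revenue"))
report.show()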
Why do data systems require a disaster recovery plan?
Disaster recovery planning involves backing up files and media in (near) real time. The backup storage is used to restore files in the event of a cyber-attack or equipment failure. Security protocols are put in place to monitor, trace, and restrict both incoming and outgoing traffic.
What is data orchestration, and what tools can you use to perform it?
Data orchestration is an automated process for accessing raw data from multiple sources, performing data cleaning, transformation, and modeling techniques, and serving it for analytical tasks. The most popular tools are Apache Airflow, Prefect, Dagster, and AWS Glue.
What issues does Apache Airflow resolve?
Apache Airflow lets you manage and schedule pipelines for analytical workflows, data warehouse management, and data transformation and modeling under one roof.
You can monitor execution logs in one place, and callbacks can be used to send failure alerts to Slack and Discord. Finally, it is easy to use, provides a helpful user interface and robust integrations, and is free to use.
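A minimal sketch of an Airflow pipeline, assuming a placeholder DAG id, schedule, and task (none of which come from the card): work is declared as a DAG of tasks that Airflow schedules and monitors.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_clean():
    # placeholder for the real extraction/cleaning logic
    pass

with DAG(dag_id="example_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    PythonOperator(task_id="extract_and_clean", python_callable=extract_and_clean)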
What are the various modes in Hadoop?
Hadoop mainly works on 3 modes:
Standalone Mode: used for debugging; it does not use HDFS, but reads input and writes output on the local file system.
Pseudo-distributed Mode: a single-node cluster where the NameNode and DataNode run on the same machine. It is mainly used for testing.
Fully-Distributed Mode: the production-ready mode, where data is distributed across multiple nodes and separate nodes run the master and worker daemons.
What are the three V’s of big data?
Volume (of data)
Velocity (how fast it’s coming in)
Variety (diversity of structure and content)
Additional V’s:
Veracity (accuracy, trustworthiness)
Value
Validity
Visualisation
Variability
Vulnerability
Visibility
Volatility
What is the definition of big data?
Depends on the situation, but typically any of:
- > 100TB
- Requires parallel processing
- Too large for operational databases
- Requires big data technology (even if it’s ‘small’ data)
What is data gravity?
Large amounts of data accumulating on a single (cloud) platform tend to attract more data, applications, and services:
- More data in one place means more value
- But the data becomes harder and more expensive to move
What is Map Reduce?
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS).
- Split a single large dataset into multiple smaller datasets
- Each smaller dataset is sent to a node in the compute cluster (a mapper)
- Each mapper converts its data to key-value pairs, processes them, and writes them to a series of output files
- Data is collated by key: all values for a given key go to the same file; different keys can share a file, but a single key is never split across files
- These files are sent to other nodes in the cluster (reducers)
- Reducers aggregate the series of values for each key into a single value
- The reducer outputs are combined into a single output for the job
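The classic word count illustrates the same pattern in Spark's RDD API (the input and output paths are placeholders): flatMap/map act as the mapper stage, and reduceByKey shuffles by key and aggregates.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/input.txt")   # placeholder path

counts = (lines
          .flatMap(lambda line: line.split())    # map: emit individual words
          .map(lambda word: (word, 1))           # map: key-value pairs
          .reduceByKey(lambda a, b: a + b))      # reduce: sum the values per key

counts.saveAsTextFile("hdfs:///data/output")     # placeholder path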
What is Massively Parallel Processing?
Massively parallel processing uses a large number of processors (or separate computers) to perform a set of coordinated computations simultaneously.
GPUs are a massively parallel architecture with tens of thousands of threads.
In an MPP data warehouse:
- The user submits a single SQL query to the cluster's master node
- The master node breaks the query into sub-queries, which are sent to each worker node
- Worker nodes execute their sub-queries in parallel (all sharing the same data and storage)
- Worker results are sent back to the master node and combined into a single result, which is returned to the user
What is the difference between ETL and ELT pipelines?
ETL = Extract, Transform, Load
- Traditional warehousing approach
- Transform data in memory (in the pipeline) before loading it into the destination
ELT = Extract, Load, Transform
- Move data to destination first
- More efficient processing at destination
- More resilient (separation of data moving and processing)
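A hedged PySpark sketch of the difference (paths and table names are invented): with ETL the transform happens in the pipeline before loading; with ELT the raw data is loaded first and transformed using the destination's own compute.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("etl-vs-elt-sketch").getOrCreate()

# ETL: extract, transform in the pipeline, then load the result
raw = spark.read.csv("s3://bucket/raw/orders.csv", header=True)   # extract
clean = raw.dropDuplicates().filter("amount > 0")                 # transform
clean.write.saveAsTable("warehouse.orders")                       # load

# ELT: load the raw data first, then transform inside the destination
raw.write.saveAsTable("warehouse.orders_raw")                     # load
spark.sql("""
    CREATE TABLE warehouse.orders_clean AS
    SELECT DISTINCT * FROM warehouse.orders_raw WHERE amount > 0
""")                                                              # transform at the destination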
What is Data Virtualisation?
Combine and transform data from multiple sources without physically moving or modifying it (leave the data where it is)
- Good when too many data sources for ETL/ELT to be sustainable
- Good when data movement too expensive
- Good for highly regulated data
- Federated querying (multiple data sources) is possible: connectivity to multiple backends
What is Spark SQL?
Allows developers to write declarative code in Spark jobs
- Abstracts out distributed nature
- Is to Spark what HIVE is to Hadoop; but MUCH faster than HIVE and easier to unit test
- Creates dataframes as containers for resulting data: same structures used for Spark Streaming and Spark ML (can mix and match jobs)
Compatible with multiple data sources: HIVE, JSON, CSV, Parquet etc
Additional optimisations:
- Predicate pushdown
- Column pruning
- Uniform API
- Code generation (performance gains, esp. for Python)
- Can hop in and out of RDDs and SQL as needed
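A minimal sketch with made-up data: register a DataFrame as a temporary view, query it declaratively with SQL, and get back another DataFrame that works with the rest of the API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

adults = spark.sql("SELECT name FROM people WHERE age > 30")   # declarative SQL
adults.show()

# the equivalent DataFrame call; both return DataFrames, so they can be mixed freely
df.filter(df.age > 30).select("name").show()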
What is predicate pushdown?
Parts of SQL queries that filter data are called ‘predicates’
A predicate pushdown filters the data in the database query itself, reducing the number of entries retrieved from the database and improving query performance. By default, the Spark Dataset API will automatically push down valid WHERE clauses to the database.
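A quick way to see this in practice, assuming a hypothetical Parquet dataset and an existing SparkSession named spark: apply a filter and inspect the physical plan, where pushed predicates appear under PushedFilters (exact output varies by version and source).

df = spark.read.parquet("/data/events")      # hypothetical Parquet dataset
filtered = df.filter(df.country == "NZ")     # a predicate Spark can push down

# the scan node of the plan lists the pushed predicate, e.g. PushedFilters: [EqualTo(country,NZ)]
filtered.explain()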
What is column pruning in Spark SQL?
The analyser determines whether only a subset of columns is required for the output and drops the unnecessary columns from the scan
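Continuing the same hypothetical example: selecting only the columns you need lets the optimiser prune the rest, which shows up as a narrower ReadSchema on the scan.

# only `name` is read from storage; the remaining columns are pruned from the Parquet scan
spark.read.parquet("/data/events").select("name").explain()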
What is Apache Parquet?
The data lake format of choice
- Stores data in columns
- Efficient for querying
- Enables compression
- Easy partitioning
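A hedged sketch (paths and column names are invented, and an existing DataFrame df and SparkSession spark are assumed) of writing partitioned, compressed Parquet and reading back only what is needed:

# write columnar, compressed files, partitioned by a column for easy pruning
df.write.partitionBy("year").option("compression", "snappy").parquet("/lake/events")

# reads touch only the selected column and the matching partition
spark.read.parquet("/lake/events").where("year = 2024").select("user_id").show()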
What is PrestoDB?
MPP SQL on anything and data virtualisation engine (no storage engine)
- Displacing HIVE
- Increasingly popular for data lakes
- Functions like a data warehouse, but without storage
- Connects to multiple back end data sources
- Blurs lines between data lakes and warehouses
What is Apache Kafka?
Event streaming engine
- Uses a message queue paradigm to model streaming data through 'topics'
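For example, Spark Structured Streaming can consume a Kafka topic directly (the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath):

# read a Kafka topic as a streaming DataFrame; assumes an existing SparkSession `spark`
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "events")                      # placeholder topic
          .load())

# Kafka records arrive as binary key/value columns
query = (stream.selectExpr("CAST(value AS STRING)")
         .writeStream.format("console")
         .start())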
What is cluster computing?
A collection of servers (nodes) that are federated and can operate together
- One Driver node and multiple Worker nodes
- Apps talk to Driver, which controls Workers
- Workers parallelise the work (horizontal scaling)
- Designed for failure - redundancy and fault tolerance
What are Containers?
‘deployment packages’ or ‘lightweight virtual machines’
In contrast to virtual machines, which are images of entire computers, containers include only the software and dependencies needed by a specific application (no full guest OS, etc.)
- Much faster than VMs
- Can deploy groups to orchestrate together
- Portable between cloud/on prem etc
What are container orchestration (cluster manager) options for Spark?
Cluster manager: oversees multiple processes
- Spark Standalone: built in manager
- YARN: Hadoop manager
- Mesos: Comparable to YARN but more flexible
- Kubernetes: native support added in Spark 2.3
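The manager is usually selected via the master URL when building the session or submitting the job; a minimal sketch with placeholder hosts and ports:

from pyspark.sql import SparkSession

# the master URL picks the cluster manager: "local[*]" (single machine),
# "spark://host:7077" (standalone), "yarn", or "k8s://https://host:6443" (Kubernetes)
spark = (SparkSession.builder
         .appName("cluster-manager-sketch")
         .master("spark://host:7077")        # placeholder standalone master
         .getOrCreate())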
What are the key benefits of Spark vs Hadoop?
- Increased efficiency (fewer machines for the same results as Hadoop)
- Much faster
- Less code (generalised abstractions)
- Caches data in memory
- Abstracts away distributed nature (can write code ignoring this)
- Interactive (can play with data on the fly)
- Fault tolerance
- Unifies big data needs in one engine (an answer to the sprawl of MapReduce-era tools)
What is Databricks' relationship to Spark?
- Founded by Spark creators
- Maintain Spark repo and ecosystem
What languages can you use for Spark?
Spark is written in Scala, and this is its native language.
Java and Python can also be used.
The Python API mirrors the Scala API most closely.