Architecture Flashcards
A DataFrame is immutable, True/False
True
How are changes tracked on DataFrames
The initial state is immutable and stays on each node. Changes are recorded as transformations (the lineage), and those instructions are shared with each node
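Sketch for the card above: a plain-Python toy model (not Spark's API) of an immutable frame that tracks changes as lineage. The class `ToyFrame` and its methods are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: source rows plus recorded steps."""
    source: Tuple[int, ...]                          # never modified
    lineage: Tuple[Tuple[str, Callable], ...] = ()   # recorded transformations

    def filter(self, name: str, pred: Callable[[int], bool]) -> "ToyFrame":
        # Return a NEW frame; the original keeps its own (shorter) lineage.
        return ToyFrame(self.source, self.lineage + ((name, pred),))

    def collect(self) -> list:
        # Action: replay the lineage against the untouched source rows.
        rows = list(self.source)
        for _, pred in self.lineage:
            rows = [r for r in rows if pred(r)]
        return rows


base = ToyFrame(source=(1, 2, 3, 4, 5, 6))
evens = base.filter("is_even", lambda r: r % 2 == 0)
big_evens = evens.filter("gt_2", lambda r: r > 2)

print([name for name, _ in big_evens.lineage])  # ['is_even', 'gt_2']
print(big_evens.collect())                      # [4, 6]
print(base.collect())                           # [1, 2, 3, 4, 5, 6]
```

The base frame is unchanged after both filters; only the list of recorded steps grows.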
How can you see the lineage of a DataFrame
.explain("formatted")
What triggers execution of the transformations on a DataFrame
An action
A transformation where each input partition contributes to only one output partition is called what
Narrow Transformation or Narrow Dependency
Between the parsed logical plan and the analyzed logical plan, which one uses the catalog
The analyzed logical plan
How many CPU cores work on each partition
1 (each partition is processed by one task, and a task uses one core by default)
T/F Cluster Manager is a component of a Spark App
False
Where is the driver in deploy-mode cluster
On a worker node inside the cluster. The Cluster Manager launches it there and is responsible for maintaining the driver and executor processes
Where is the driver in deploy-mode client
On the client machine that submitted the application, outside the cluster. The executors still run inside the cluster
Is there a performance difference between writing SQL Queries or DataFrame Code
No. Both compile to the same underlying plan
What kind of programming model is Spark
Functional - the same inputs always lead to the same outputs as long as the transformations applied to the data stay constant
When you perform a shuffle, Spark outputs how many partitions
200 by default (set by spark.sql.shuffle.partitions)
What is schema inference
Spark reads a small part of the data and takes its best guess at what the schema of the DataFrame should be
What port does the Spark UI run on
4040 by default, on the driver node
What type of transformation is aggregation
Wide
What type of transformation is filter
Narrow
What are the 3 kinds of actions
- View data in the console
- Collect data to native objects in the respective language
- Write to output data sources
.count() is an example of what
An action
What is predicate pushdown
Spark automatically pushes a filter down toward the data source so that only the matching rows are read
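Sketch for the card above: a plain-Python toy model (not Spark's API) of what pushdown saves. `ToySource`, `scan`, and `pushed_filter` are hypothetical names; the point is only that the filter runs inside the source, so fewer rows leave it.

```python
class ToySource:
    """Hypothetical data source that can apply a filter while scanning."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_returned = 0   # how many rows left the source

    def scan(self, pushed_filter=None):
        for row in self._rows:
            if pushed_filter is None or pushed_filter(row):
                self.rows_returned += 1
                yield row


rows = [{"id": i, "country": "US" if i % 4 == 0 else "FR"} for i in range(8)]


def is_us(row):
    return row["country"] == "US"


# Without pushdown: the source returns everything, the engine filters afterwards.
plain = ToySource(rows)
without_pushdown = [r for r in plain.scan() if is_us(r)]

# With pushdown: the filter travels to the source, so fewer rows leave it.
pushed = ToySource(rows)
with_pushdown = list(pushed.scan(pushed_filter=is_us))

print(plain.rows_returned, pushed.rows_returned)  # 8 2
```

Both variants produce the same result rows; only the amount of data handed over by the source differs.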
What is lazy evaluation
Spark will wait until the very last moment to execute the graph of computation instructions
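Sketch for the card above: a plain-Python toy model (not Spark's API) of lazy evaluation. `LazyPlan` and `traced_double` are hypothetical names; the `calls` list records when real work happens.

```python
calls = []


def traced_double(x):
    calls.append(x)   # record every time real work happens
    return x * 2


class LazyPlan:
    """Hypothetical plan object: transformations are recorded, not run."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: only records the step, does no work.
        return LazyPlan(self._data, self._steps + (fn,))

    def collect(self):
        # Action: runs every recorded step now.
        out = list(self._data)
        for fn in self._steps:
            out = [fn(x) for x in out]
        return out


plan = LazyPlan([1, 2, 3]).map(traced_double)
work_before_action = len(calls)   # nothing has executed yet
result = plan.collect()
work_after_action = len(calls)

print(work_before_action, result, work_after_action)  # 0 [2, 4, 6] 3
```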
Shuffles exchange data between partitions and then…
Write the results to disk
What is pipelining
With narrow transformations, consecutive operations such as multiple filters are chained and all performed in memory
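Sketch for the card above: a plain-Python toy model (not Spark's API) of pipelining, using generators. The helper names `read_partition` and `keep` are hypothetical. Each row flows through both filters before the next row is read, and no intermediate list is built between the filters.

```python
trace = []


def read_partition(rows):
    for r in rows:
        trace.append(f"read {r}")
        yield r


def keep(name, pred, rows):
    # A narrow step: consumes rows one at a time and passes matches along.
    for r in rows:
        trace.append(f"{name} {r}")
        if pred(r):
            yield r


partition = [1, 2, 3, 4]
stage = keep("gt_1", lambda r: r > 1,
             keep("is_even", lambda r: r % 2 == 0,
                  read_partition(partition)))
result = list(stage)

print(result)      # [2, 4]
print(trace[:5])   # ['read 1', 'is_even 1', 'read 2', 'is_even 2', 'gt_1 2']
```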
A wide dependency is
Input partitions contributing to many output partitions
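Sketch contrasting narrow and wide dependencies: a plain-Python toy model (not Spark's API). The functions `narrow_filter` and `wide_sum_by_key` and the byte-sum partitioner are assumptions made up for illustration.

```python
from collections import defaultdict

# Two input partitions of (key, value) rows.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]


def narrow_filter(parts, pred):
    # Narrow: each output partition depends on exactly one input partition.
    return [[row for row in part if pred(row)] for part in parts]


def wide_sum_by_key(parts, num_out):
    # Wide: every input partition may send rows to every output partition.
    out = [defaultdict(int) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_out   # deterministic toy partitioner
            out[target][key] += value
    return [dict(d) for d in out]


filtered = narrow_filter(partitions, lambda row: row[1] > 2)
summed = wide_sum_by_key(partitions, num_out=2)

print(filtered)  # [[('a', 3)], [('b', 4), ('c', 5), ('a', 6)]]
print(summed)    # [{'b': 6}, {'a': 10, 'c': 5}]
```

The filter keeps the partition layout intact, while the aggregation has to move rows for the same key from both input partitions into one output partition.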