Architecture Flashcards

1
Q

A dataframe is immutable, True/False

A

True

2
Q

How are changes tracked on dataframes

A

The initial state is immutable and kept on each node. Changes are recorded as transformations, which are shared with each node

3
Q

How can you see the lineage of a data frame

A

.explain("formatted")

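For example, a minimal PySpark sketch (`df` stands in for any DataFrame):

```
# Print a nicely formatted view of the query plan
df.explain("formatted")
```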
4
Q

What triggers the execution of transformations on a DataFrame

A

An action

5
Q

A transformation where one partition results in one output partition is called what

A

Narrow Transformation or Narrow Dependency

6
Q

Between the parsed logical plan and the analyzed logical plan, which uses the catalog

A

analyzed

7
Q

How many CPU cores per partition

A

1

8
Q

T/F Cluster Manager is a component of a Spark App

A

False

9
Q

Where is the driver in deploy-mode cluster

A

On a node inside the cluster. The Cluster Manager is responsible for maintaining the cluster and executor nodes

10
Q

Where is the driver in deploy-mode client

A

On a node not in the cluster

11
Q

Is there a performance difference between writing SQL queries and DataFrame code

A

No; both compile to the same underlying plan

12
Q

What kind of programming model is Spark

A

Functional - the same inputs lead to the same outputs; transformations are deterministic

13
Q

When you perform a shuffle, Spark outputs how many partitions

A

200 (the default value of spark.sql.shuffle.partitions)

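That default comes from spark.sql.shuffle.partitions; a minimal PySpark sketch of inspecting and tuning it:

```
spark.conf.get("spark.sql.shuffle.partitions")        # '200' by default
spark.conf.set("spark.sql.shuffle.partitions", "64")  # e.g. for a smaller cluster
```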
14
Q

What is schema inference

A

Having Spark take its best guess at what the schema of our DataFrame should be

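A minimal PySpark sketch (the CSV path and header option are assumptions for illustration):

```
# Spark samples the file and guesses each column's type
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/example.csv"))  # hypothetical path
```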
15
Q

What port does the Spark UI run on

A

4040

16
Q

What type of transformation is aggregation

A

wide

17
Q

What type of transformation is filter

A

Narrow

18
Q

What are the 3 kinds of actions

A
  • View data in the console
  • Collect data to native objects in the respective language
  • Write to output data sources
19
Q

.count() is an example of a what

A

an action

20
Q

What is predicate pushdown

A

Spark automatically pushing filters down to the data source, so less data is read

21
Q

What is lazy evaluation

A

Spark will wait till the very last moment to execute the graph of computation instructions

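A minimal PySpark sketch (`df` and its `count` column are assumptions):

```
filtered = df.where("count < 2")  # transformation: only builds the plan
filtered.count()                  # action: now Spark actually runs the job
```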
22
Q

Shuffles will perform filters and then…

A

Write to disk

23
Q

What is pipelining

A

With narrow transformations, chains of operations such as filters are performed together in memory, rather than writing intermediate results

24
Q

A wide dependency is

A

Input partitions contributing to many output partitions

25
What is a narrow dependency
Each input partition will contribute to only one output partition
26
What are the 2 types of transformations
Narrow dependencies and wide dependencies
27
Spark will not act on transformations until
an action is called
28
Core data structures are mutable or immutable
immutable
29
With DataFrames, you have to manipulate partitions manually, T/F
False
30
If you have one partition and many executors, what parallelism do you have
1
31
What is a partition
A collection of rows that sit on one physical machine
32
To allow every executor to perform in parallel, Spark breaks the data into
Partitions
33
What is a DataFrame
A structured API that represents a table of data with rows and columns
34
How many spark sessions can you have across a Spark App
1
35
You control your Spark App through a driver process called
The SparkSession
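A minimal PySpark sketch of obtaining that driver process (the app name is an assumption):

```
from pyspark.sql import SparkSession

# One SparkSession controls the whole application
spark = SparkSession.builder.appName("MyApp").getOrCreate()
```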
36
What are Spark's language APIs
Scala, Java, R, Python, SQL
37
What is the point of the cluster manager
Keep track of resources available
38
What is local mode
Driver and Executor live on the same machine
39
What are the 3 core cluster managers
Spark's standalone manager, YARN, and Mesos
40
The driver process is responsible for what 3 things
Maintaining info about the Spark App; responding to the user's program and input; analyzing, distributing, and scheduling work across executors
41
Which process runs your main() function
driver
42
A spark app consists of what two processes
Driver | Executor
43
Executors are responsible for what two things
Executing code assigned to it | Reporting state of the computation back to the driver node
44
At which stage do the first set of optimizations take place?
Logical Optimization
45
When using DataFrame.persist() data on disk is always serialized. T/F
True
46
Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions. Which property needs to be enabled to achieve this?
spark.sql.adaptive.skewJoin.enabled
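A minimal PySpark sketch; note that AQE itself must be enabled as well:

```
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```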
47
The goal of Dynamic Partition Pruning (DPP) is to allow you to read only as much data as you need. Which property needs to be set in order to use this functionality?
spark.sql.optimizer.dynamicPartitionPruning.enabled
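A minimal PySpark sketch (in Spark 3.x this property already defaults to true):

```
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```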
48
The DataFrame class does not have an uncache() operation T/F
True
49
What are worker nodes
Worker nodes are the nodes of a cluster that perform computations
50
For text files, we can only write a single column of a DataFrame, T/F
True
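A minimal PySpark sketch (the `value` column and output path are assumptions):

```
# text() accepts exactly one string column
df.select("value").write.text("/tmp/out_text")
```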
51
How do you specify a left outer join
left_outer
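A minimal PySpark sketch (`df`, `other`, and the key `id` are assumptions):

```
joined = df.join(other, on="id", how="left_outer")
```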
52
A job is
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
53
What is the relationship between an executor and a worker
An executor is a Java Virtual Machine (JVM) running on a worker node.
54
How are global temp views addressed
spark.read.table("global_temp.<view name>")
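A minimal PySpark sketch (the view name is an assumption):

```
df.createOrReplaceGlobalTempView("my_view")
spark.read.table("global_temp.my_view").show()
```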
55
When is a data frame writer treated as a global external/unmanaged table
Spark manages the metadata, while you control the data location. As soon as you add the 'path' option to the DataFrameWriter, the table is treated as a global external/unmanaged table. When you drop the table, only the metadata gets dropped. A global unmanaged/external table is available across all clusters.
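A minimal PySpark sketch (the table name and path are assumptions):

```
# Adding the 'path' option makes this an external/unmanaged table
(df.write
   .option("path", "/tmp/events_data")
   .saveAsTable("events"))
```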
56
What are the possible strategies to decrease garbage collection time?
Persist objects in serialized form; create fewer objects; increase Java heap space size
57
Which property is used to scale up and down dynamically based on an application's current number of pending tasks in a Spark cluster?
Dynamic Allocation
58
If spark is running in client mode, where is the driver located
on the client machine that submitted the application
59
What causes a stage boundary
a shuffle
60
What function will avoid a shuffle if the new partitions are known to be fewer than the existing partitions
.coalesce(lesser number)
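A minimal PySpark sketch:

```
# Merges partitions on the same workers; no shuffle, decrease only
df_small = df.coalesce(4)
```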
61
When will a broadcast join be forced
By default spark.sql.autoBroadcastJoinThreshold = 10MB; tables below this threshold are broadcast automatically, and anything above it will not be broadcast unless you force it with a broadcast hint.
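To force one regardless of the threshold, use an explicit hint; a minimal PySpark sketch (`big`, `small`, and `id` are assumptions):

```
from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling both sides
joined = big.join(broadcast(small), "id")
```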
62
What command can we use to get the number of partitions of a DataFrame named df?
df.rdd.getNumPartitions()
63
Layout the Catalyst Optimizer steps
SQL Query / DataFrame → Unresolved Logical Plan → (analysis, using the Catalog) → Logical Plan → (logical optimization) → Optimized Logical Plan → (physical planning) → Physical Plans → (cost model) → Selected Physical Plan → (code generation) → RDDs
64
What is dynamic allocation?
If you are running multiple Spark Applications on the same cluster, Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload.
65
What is required to turn on dynamic allocation
Set spark.dynamicAllocation.enabled to true, set up an external shuffle service on each worker node in the same cluster, and set spark.shuffle.service.enabled to true in your application
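A minimal PySpark sketch of setting those properties at session startup:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())
```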
66
What is the purpose of the external shuffle service
The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them.
67
What is the default file format for output
Parquet
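A minimal PySpark sketch (the output path is an assumption):

```
# No format specified, so Spark writes Parquet
df.write.save("/tmp/out")
```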
68
Is .25 an acceptable input for a fraction
no
69
What does adaptive query execution (AQE) allow you to do?
AQE attempts to do the following at runtime: 1. Reduce the number of reducers in the shuffle stage by decreasing the number of shuffle partitions. 2. Optimize the physical execution plan of the query, for example by converting a SortMergeJoin into a BroadcastHashJoin where appropriate. 3. Handle data skew during a join.
70
What can be done with Spark catalyst optimizer
1. Dynamically convert physical plans to RDDs. 2. Dynamically reorganize query orders. 3. Dynamically select physical plans based on cost.
71
What is an equivalent code block to: df.filter(col("count") < 2)
df.where("count < 2")
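A minimal PySpark sketch showing both forms (`df` is an assumption):

```
from pyspark.sql.functions import col

a = df.filter(col("count") < 2)
b = df.where("count < 2")  # equivalent; both compile to the same plan
```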
72
What is the purpose of a cluster manager
The cluster manager allocates resources to Spark Applications and maintains the executor processes in client mode
73
What is the idea behind dynamic partition pruning in Spark
skip over data you do not need in the results of the query
74
Will spark's garbage collector clean up persisted objects
Yes, but in least-recently-used (LRU) order
75
The Dataset API is not available in Python T/F
True
76
A viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster
increase the values for spark.default.parallelism and spark.sql.shuffle.partitions
77
Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of what
Dynamic Partition Pruning
78
What is a Stage
A stage represents a group of tasks that can be executed together to perform the same operation on multiple executors in parallel. A stage is a combination of transformations which does not cause any shuffling of data across nodes. Spark starts a new stage when shuffling is required in the job.
79
How many executors is a task sent to
1
80
What is a task
Each task is a combination of blocks of data and a set of transformations that will run on a single executor.
81
What is a possibility if the number of partitions is too small
If the number is too small it will reduce concurrency and possibly cause data skewing.
82
If there are too many partitions...
the overhead of scheduling tasks can outweigh the actual task execution time.
83
What is coalesce
Collapses partitions on the same worker to avoid shuffling.
84
What are some examples of transformations
select, sum, groupBy, orderBy, filter, limit
85
What are examples of an action
show, count, collect, save
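A minimal PySpark sketch tying the two cards together (column names are assumptions):

```
(df.select("name", "count")   # transformations: all lazy
   .filter("count > 10")
   .orderBy("count")
   .limit(5)
   .show())                   # action: triggers execution
```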
86
Coalesce cannot be used to increase the number of partitions T/F
True
87
Is printSchema considered an action
No
88
Is first considered an action
Yes
89
When choosing a storage level, what denotes serialized
SER
90
A driver
Runs your main() function; assigns work to be done in parallel; maintains information about the Spark Application
91
What happens at a stage boundary in spark
data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage
92
Is foreach() an action
yes
93
is limit() considered an action
no
94
In cluster mode the driver will be put onto a worker node, T/F
True