General Flashcards

Question 1

Q

How do you cache a DataFrame?

Answer

A

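.persist() or .cache(), followed by an action such as .count()
Caching is lazy, so an action is needed to fill the cache. It doesn't have to be count, but it has to be an action that touches every single record.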
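
A minimal sketch of this card, assuming a local SparkSession. The object name, helper name, and sample data are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheSketch {
  // Marks df for caching, then runs an action so the cache is actually filled.
  def cacheAndMaterialize(df: DataFrame): Long = {
    df.cache()  // lazy: only registers the DataFrame for caching
    df.count()  // action that scans every row, so every partition is cached
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id") // hypothetical sample data
    println(s"rows cached: ${cacheAndMaterialize(df)}")

    spark.stop()
  }
}
```

An action such as first() or take(1) would not be a good substitute here, since it may read only part of the data.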

Question 2

Q

How can you select where your cache is stored?

Answer

A

.persist(storageLevel), for example .persist(StorageLevel.DISK_ONLY)

Question 3

Q

What is the default storage level for persist and cache?

Answer

A

MEMORY_AND_DISK

Question 4

Q

How do you uncache data?

Answer

A

.unpersist()
Unlike caching, this takes effect right away, so no follow-up action such as count is needed.

Question 5

Q

How can you determine the storage level of your DataFrame?

Answer

A

.storageLevel
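
The persist, storageLevel, and unpersist cards can be sketched together as follows, assuming a local SparkSession. The object and helper names are hypothetical, and DISK_ONLY is just one example level.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Persists df at the given level, fills the cache, and reports the level in use.
  def persistAt(df: DataFrame, level: StorageLevel): StorageLevel = {
    df.persist(level) // choose where the cached data lives
    df.count()        // action to materialize the cache
    df.storageLevel   // inspect the level Spark recorded for this DataFrame
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("persist-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id") // hypothetical sample data
    println(persistAt(df, StorageLevel.DISK_ONLY))

    df.unpersist() // drops the cached data
    println(df.storageLevel == StorageLevel.NONE)

    spark.stop()
  }
}
```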

Question 6

Q

How do you register a function as a DataFrame function?

Answer

A

val function_udf = udf(stringConcat(_:ParamType…):ReturnType)

Question 7

Q

How do you register a function as a SQL function?

Answer

A

spark.udf.register("new_function_name", function)
The second argument is the function itself (or an existing UDF), not just its signature.
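
A minimal sketch of both registration styles, assuming a local SparkSession. The two-argument stringConcat, the column names, and the view name are hypothetical stand-ins.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  // Hypothetical plain Scala function to expose to Spark.
  def stringConcat(a: String, b: String): String = a + b

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("udf-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "b"), ("c", "d")).toDF("s1", "s2") // hypothetical data

    // DataFrame function: wrap the Scala function with udf(...).
    val stringConcatUdf = udf(stringConcat _)
    df.select(stringConcatUdf(col("s1"), col("s2")).as("joined")).show()

    // SQL function: register it under a name that SQL queries can call.
    spark.udf.register("string_concat", stringConcat _)
    df.createOrReplaceTempView("pairs")
    spark.sql("SELECT string_concat(s1, s2) AS joined FROM pairs").show()

    spark.stop()
  }
}
```

A UDF created with udf(...) is only usable from the DataFrame API; calling it by name inside spark.sql requires the spark.udf.register step.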

Question 8

Q

How can you create a table from a DataFrame?

Answer

A

.write.saveAsTable("table_name")

Question 9

Q

How can you set the number of partitions for a shuffle?

Answer

A

spark.conf.set("spark.sql.shuffle.partitions", 50)

Question 10

Q

How do you get the number of partitions in a given DataFrame?

Answer

A

.rdd.getNumPartitions

Question 11

Q

How do you repartition a DataFrame?

Answer

A

.repartition(2)

Question 12

Q

How can you reduce the number of partitions without a full shuffle?

Answer

A

.coalesce(2)

Question 13

Q

Which causes a shuffle: repartition or coalesce?

Answer

A

repartition (it always performs a full shuffle; coalesce merges existing partitions instead)
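
The partition cards above can be sketched as follows, assuming a local SparkSession. The object and helper names and the partition counts 8 and 2 are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionSketch {
  // Returns the partition counts after a repartition and a following coalesce.
  def partitionCounts(df: DataFrame): (Int, Int) = {
    val repartitioned = df.repartition(8)      // full shuffle into 8 partitions
    val coalesced = repartitioned.coalesce(2)  // merges partitions, no full shuffle
    (repartitioned.rdd.getNumPartitions, coalesced.rdd.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("partition-sketch")
      .master("local[*]")
      .getOrCreate()

    // Partition count used by shuffles that joins and aggregations trigger.
    spark.conf.set("spark.sql.shuffle.partitions", 50)

    val df = spark.range(0, 1000).toDF("id") // hypothetical sample data
    println(s"initial partitions: ${df.rdd.getNumPartitions}")
    println(s"after repartition and coalesce: ${partitionCounts(df)}")

    spark.stop()
  }
}
```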

Question 14

Q

How do you enable Adaptive Query Execution (AQE)?

Answer

A

spark.conf.set("spark.sql.adaptive.enabled", true)

Question 15

Q

What are the elements of the Apache Spark execution hierarchy?

Answer

A

Jobs, stages, and tasks

Question 16

Q

True or false: Adaptive Query Execution re-optimizes the query plan in the middle of query execution based on accurate runtime statistics.

Answer

A

True

Question 17

Q

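True or false: With AQE, logical optimization and physical planning are removed.

Answer

A

False. Logical optimization and physical planning still take place; AQE adds re-optimization at runtime on top of them.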

Question 18

Q

What does spark.sql.autoBroadcastJoinThreshold do?

Answer

A

Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Setting it to -1 disables broadcasting.

Question 19

Q

How do you turn off dynamic partition coalescing?

Answer

A

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)
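
The AQE and broadcast settings from the last few cards can be applied together, as in this sketch. It assumes a local SparkSession; the object and helper names are hypothetical, and the 10 MB threshold is only an illustrative value.

```scala
import org.apache.spark.sql.SparkSession

object AqeConfigSketch {
  // Applies the settings discussed in the cards to an existing session.
  def configure(spark: SparkSession): Unit = {
    // Turn on Adaptive Query Execution.
    spark.conf.set("spark.sql.adaptive.enabled", true)
    // Turn off AQE's dynamic coalescing of shuffle partitions.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)
    // Tables up to this many bytes may be broadcast in a join (illustrative value).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("aqe-config-sketch")
      .master("local[*]")
      .getOrCreate()

    configure(spark)
    println(spark.conf.get("spark.sql.adaptive.enabled"))
    println(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))

    spark.stop()
  }
}
```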

Question 20

Q

What allows you to control how complex types are printed in schemas?

Answer

A

.printSchema(level), for example .printSchema(1) to print only the top level of nested fields

Question 21

Q

How do you enable schema inference when reading data?

Answer

A

.option("inferSchema", true)
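
A minimal sketch of schema inference on a CSV read, assuming a local SparkSession. The temporary file, its contents, and the helper name are hypothetical.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

object InferSchemaSketch {
  // Reads a CSV with a header row and lets Spark infer the column types.
  def readWithInferredSchema(spark: SparkSession, path: Path): DataFrame =
    spark.read
      .option("header", true)
      .option("inferSchema", true) // costs an extra pass over the data
      .csv(path.toString)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("infer-schema-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample file written just for this example.
    val path = Files.createTempFile("people", ".csv")
    Files.write(path, "name,age\nAda,36\nLin,41\n".getBytes("UTF-8"))

    val df = readWithInferredSchema(spark, path)
    df.printSchema() // age is inferred as a numeric type rather than string

    spark.stop()
  }
}
```

Without the inferSchema option, every CSV column is read as a string.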

Question 22

Q

How do you make a DataFrame into a table or a view?

Answer

A

.createOrReplaceTempView("view_name")
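
A minimal sketch of registering a temporary view and querying it with SQL, assuming a local SparkSession. The object name, helper name, view name, and sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TempViewSketch {
  // Registers df as a temporary view, then counts its rows through SQL.
  def countViaSql(df: DataFrame, viewName: String): Long = {
    df.createOrReplaceTempView(viewName) // visible only within this session
    df.sparkSession
      .sql(s"SELECT COUNT(*) AS n FROM $viewName")
      .first()
      .getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the cards).
    val spark = SparkSession.builder()
      .appName("temp-view-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label") // hypothetical data
    println(countViaSql(df, "labels"))

    spark.stop()
  }
}
```

A temporary view stores no data and disappears with the session, whereas .write.saveAsTable writes a table that persists.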