Spark Flashcards
What is Spark?
An open-source cluster computing framework
What does Spark do?
- It automates distribution of data and computations on a cluster of computers
- Provides a fault-tolerant abstraction over distributed datasets
- Based on Functional Programming!!
- Provides list-like RDDs and table-like Datasets
What do RDDs do?
They provide a list-like collection of items distributed over a cluster of machines, which can be used much like a Scala collection.
What are some attributes of RDDs?
- Immutable
- Mostly reside in memory
- Are distributed transparently
- Work like Scala’s List
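A minimal sketch of the two cards above, assuming an existing SparkContext named sc (e.g. from spark.sparkContext) and made-up data:

  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5)) // distribute a local Scala collection over the cluster
  val doubled = numbers.map(_ * 2)                 // used much like a Scala List, but computed in a distributed way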
What are the types of operations on RDDs?
- Transformations: Applying a function that returns a new RDD; they are lazy
- Action: Request a computation as a result; these are eager
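A small sketch of the lazy/eager distinction, assuming a SparkContext named sc and trivial made-up data:

  val lines   = sc.parallelize(Seq("one line", "another line"))
  val lengths = lines.map(_.length)    // transformation: lazy, nothing is computed yet
  val total   = lengths.reduce(_ + _)  // action: eager, triggers the actual distributed computation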
What does RDD stand for?
Resilient Distributed Dataset
What are the common transformations?
map, flatMap, filter
What are the common actions?
collect, take, reduce, fold, aggregate
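A hedged example with made-up data combining the transformations and actions listed above (again assuming a SparkContext named sc):

  val lines = sc.parallelize(Seq("spark is fast", "rdds are lazy"))
  val words = lines.flatMap(_.split(" "))  // flatMap: one line becomes many words
  val long  = words.filter(_.length > 3)   // filter: keep only words longer than 3 characters
  val upper = long.map(_.toUpperCase)      // map: transform each element
  val some  = upper.take(2)                // action: first 2 elements as a local Array
  val all   = upper.collect()              // action: the whole RDD as a local Array
  val count = upper.aggregate(0)((acc, _) => acc + 1, _ + _) // action: count the elements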
What is the main aspect of Pair RDDs?
They contain key-value pairs and take the form RDD[(K, V)], so items can be grouped and looked up by key. Key-based operations such as joins, groupByKey and reduceByKey are defined on Pair RDDs.
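A sketch with hypothetical data showing typical Pair RDD operations (assuming a SparkContext named sc):

  val sales  = sc.parallelize(Seq(("alice", 10), ("bob", 5), ("alice", 3)))  // RDD[(String, Int)]
  val totals = sales.reduceByKey(_ + _)                                      // sum the values per key
  val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Oslo")))
  val joined = totals.join(cities)                                           // RDD[(String, (Int, String))]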
What are the 3 Spark partitioning schemes?
- Default partitioning: Split into equally sized partitions without knowing underlying data properties
- Range partitioning: (Pair) Takes into account the order of keys to split the dataset, requires the keys to be naturally ordered
- Hash partitioning: (Pair) Calculates a hash over each key and takes it modulo the number of partitions to determine the item's partition
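A sketch of applying the two explicit partitioners to a hypothetical pair RDD (assuming a SparkContext named sc):

  import org.apache.spark.{HashPartitioner, RangePartitioner}

  val pairs  = sc.parallelize(Seq((1, "a"), (5, "b"), (9, "c")))
  val hashed = pairs.partitionBy(new HashPartitioner(4))          // partition = hash(key) modulo 4
  val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))  // partitions follow sampled key ranges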
What is the difference between narrow and wide partition dependencies?
- Narrow dependencies: Each partition of the source RDD is used by at most one partition of the target RDD
- Wide dependencies: Multiple partitions in the target RDD depend on a single partition in the source RDD; this requires a shuffle
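A short illustration with a hypothetical pair RDD (assuming a SparkContext named sc):

  val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val narrow = pairs.mapValues(_ * 10)  // narrow: each output partition reads exactly one input partition
  val wide   = pairs.groupByKey()       // wide: output partitions read from many input partitions (shuffle)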
How does persistence work in the Spark framework (i.e. Java objects / serialized / file system)?
- Java objects: each item in the RDD is an allocated object on the JVM heap
- Serialized data: Special memory-efficient format. CPU-intensive, but faster to send across the network
- On the file system: in case the RDD is too big to fit in memory, it can be mapped to the file system, usually HDFS
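A sketch of the corresponding storage levels, assuming a SparkContext named sc and trivial made-up data:

  import org.apache.spark.storage.StorageLevel

  val asObjects  = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)      // plain Java objects in memory
  val serialized = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_SER)  // serialized: compact, more CPU work
  val onDisk     = sc.parallelize(1 to 1000).persist(StorageLevel.DISK_ONLY)        // stored on disk instead of in memory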
What are the two main dataset abstractions used by Spark SQL?
- Datasets: A collection of strongly typed objects
- DataFrames: Essentially a Dataset[Row], where a Row is an untyped array of fields
How can Dataframes and Datasets be created?
- From RDDs containing tuples: rdd.toDF("name", "id", "address")
- From RDDs with known complex types, e.g. RDD[Person]: rdd.toDF()
- From RDDs with manual schema definition
- By reading semi-structured data files: spark.read.json("json.json")
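A hedged sketch of these creation paths, assuming an existing SparkSession named spark (with sc = spark.sparkContext), a hypothetical Person case class and a hypothetical json.json file:

  import spark.implicits._

  case class Person(name: String, id: Int, address: String)

  val tupleDF  = sc.parallelize(Seq(("Ann", 1, "Oslo"))).toDF("name", "id", "address")
  val personDF = sc.parallelize(Seq(Person("Ann", 1, "Oslo"))).toDF()
  val jsonDF   = spark.read.json("json.json")   // schema is inferred from the semi-structured input
  val personDS = personDF.as[Person]            // strongly typed Dataset[Person]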
How can you access a column in a DataFrame?
df("column_name")
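For example, with a hypothetical DataFrame df that has an age column:

  df.select(df("age")).show()
  df.filter(df("age") > 21).show()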