Spark Flashcards

1
Q

What is Spark?

A

An open-source cluster-computing framework

2
Q

What does Spark do?

A
  • It automates distribution of data and computations on a cluster of computers
  • Provides a fault-tolerant abstraction over distributed datasets
  • Is based on functional programming
  • Provides list-like RDDs and table-like Datasets
3
Q

What do RDDs do?

A

They represent a dataset distributed over a cluster of machines that can be used like a Scala collection.

4
Q

What are some attributes of RDDs?

A
  • Immutable
  • Mostly reside in memory
  • Are distributed transparently
  • Work like Scala’s List
5
Q

What are the types of operations on RDDs?

A
  • Transformations: Apply a function and return a new RDD; they are lazy
  • Actions: Request a computed result; they are eager
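The lazy/eager split can be sketched locally with a Scala view standing in for an RDD (a sketch assuming plain Scala collections, no Spark cluster; `LazyVsEager` is an illustrative name, not a Spark API):

```scala
// A local sketch of lazy transformations vs eager actions,
// using a Scala view as a stand-in for an RDD.
object LazyVsEager {
  // Builds a lazy pipeline, then counts how many elements were evaluated
  // before and after forcing it.
  def run(): (Int, List[Int]) = {
    var evaluated = 0
    val nums = List(1, 2, 3, 4)
    // Like a transformation: describes the computation, runs nothing yet.
    val doubled = nums.view.map { n => evaluated += 1; n * 2 }
    val before = evaluated // still 0: the map has not executed
    // Like an action: forces the whole pipeline to run.
    val result = doubled.toList
    (before, result)
  }
}
```

Forcing the view (the "action") is the only point where work happens, just as an RDD pipeline runs only when an action such as `collect` is called.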
6
Q

What does RDD stand for?

A

Resilient Distributed Dataset

7
Q

What are the common transformations?

A

map, flatMap, filter
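Since the RDD API mirrors Scala's collections, the shape of each transformation can be shown on a plain List (an illustrative stand-in; on an RDD the same calls return new RDDs):

```scala
// The three common transformations, demonstrated on a local List.
object Transformations {
  val lines = List("to be", "or not")

  def mapped: List[Int] = lines.map(_.length)              // one output per input
  def flatMapped: List[String] = lines.flatMap(_.split(" ")) // zero or more outputs per input
  def filtered: List[String] = lines.filter(_.contains("not")) // keep matching items
}
```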

8
Q

What are the common actions?

A

collect, take, reduce, fold, aggregate
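The same actions exist on local Scala collections, so their shapes can be sketched without a cluster (a stand-in: on an RDD these are eager and trigger a distributed job; `aggregate`, which builds a result of a different type from its seqOp/combOp pair, is emulated here with `foldLeft`):

```scala
// Common actions, demonstrated on a local List.
object Actions {
  val nums = List(1, 2, 3, 4)

  def taken: List[Int] = nums.take(2)       // first n elements
  def reduced: Int     = nums.reduce(_ + _) // combine with a binary op
  def folded: Int      = nums.fold(0)(_ + _) // like reduce, with a zero element
  // aggregate(zero)(seqOp, combOp) can produce a result of a different type,
  // e.g. (sum, count); emulated locally with foldLeft.
  def aggregated: (Int, Int) =
    nums.foldLeft((0, 0)) { case ((sum, count), n) => (sum + n, count + 1) }
}
```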

9
Q

What is the main aspect of Pair RDDs?

A

They take the form RDD[(K, V)], i.e. an RDD of key/value pairs. Key-based operations such as joins are defined on Pair RDDs.
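A key-based operation like an inner join can be sketched on a local Seq[(K, V)] standing in for an RDD[(K, V)] (`PairOps` is an illustrative name; Spark's `join` has the same shape but runs distributed):

```scala
// A Pair-RDD-style inner join, sketched on local sequences of pairs.
object PairOps {
  // Pairs up values whose keys match, as RDD[(K, V)].join(RDD[(K, W)]) does.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v)  <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))
}
```

For example, joining `Seq(1 -> "a", 2 -> "b")` with `Seq(2 -> "x", 3 -> "y")` keeps only key 2, yielding `(2, ("b", "x"))`.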

10
Q

What are the 3 Spark partitioning schemes?

A
  • Default partitioning: Splits the data into equally sized partitions without considering any properties of the underlying data
  • Range partitioning: (Pair RDDs) Takes the order of keys into account when splitting the dataset; requires the keys to have a natural ordering
  • Hash partitioning: (Pair RDDs) Computes a hash of each key and takes it modulo the number of partitions to determine the item’s partition
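The hash scheme can be sketched in a few lines: hash the key, take it modulo the number of partitions, and adjust to stay non-negative (this mirrors the scheme Spark's HashPartitioner uses, not its exact source):

```scala
// A sketch of hash partitioning: same key always lands in the same
// partition, and the result is always in [0, numPartitions).
object HashPartitioning {
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // Scala's % can be negative
  }
}
```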
11
Q

What is the difference between narrow and wide partition dependencies?

A
  • Narrow dependencies: Each partition of the source RDD is used by at most one partition of the target RDD
  • Wide dependencies: Multiple partitions of the target RDD depend on a single partition of the source RDD; these require a shuffle
12
Q

How does persistence work in the Spark framework (i.e. Java objects / serialized data / file system)?

A
  • Java objects: Each item in the RDD is an allocated Java object
  • Serialized data: A special memory-efficient format; CPU-intensive to encode and decode, but faster to send across the network
  • On the file system: In case the RDD is too big to fit in memory, it can be mapped to the file system, usually HDFS
13
Q

What are the two main dataset abstractions used by Spark SQL?

A
  • Datasets: Collections of strongly typed objects
  • DataFrames: Essentially Dataset[Row], where a Row is a generic, array-like record of fields
14
Q

How can Dataframes and Datasets be created?

A
  • From RDDs containing tuples: rdd.toDF("name", "id", "address")
  • From RDDs of a known complex type, e.g. RDD[Person]: rdd.toDF()
  • From RDDs with a manually defined schema
  • By reading semi-structured data files: spark.read.json("json.json")
15
Q

How can you access a column in a DataFrame?

A

df("column_name")

16
Q

What are some common operations on DataFrames?

A
  • Projection: df.select("project_name")
  • Selection: df.filter(condition)
  • Joins: left.join(right, condition, "full_outer")
  • Aggregations: Can only be applied after grouping, e.g. df.groupBy(...).agg(...)