Spark Flashcards
What is Spark?
An open-source cluster computing framework
What does Spark do?
- It automates distribution of data and computations on a cluster of computers
- Provides a fault-tolerant abstraction over distributed datasets
- Based on Functional Programming!!
- Provides list-like RDDs and table-like Datasets
What do RDDs do?
They provide a list-like collection of items distributed over a cluster of machines, which can be used much like a Scala collection.
What are some attributes of RDDs?
- Immutable
- Mostly reside in memory
- Are distributed transparently
- Work like Scala’s List
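A minimal sketch of the two cards above, assuming an existing SparkContext named sc (e.g. from spark.sparkContext) and made-up data:

  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5)) // distribute a local Scala collection over the cluster
  val doubled = numbers.map(_ * 2)                 // used much like a Scala List, but computed in a distributed way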
What are the types of operations on RDDs?
- Transformations: Applying a function that returns a new RDD; they are lazy
- Action: Request a computation as a result; these are eager
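A small sketch of the lazy/eager distinction, assuming a SparkContext named sc and trivial made-up data:

  val lines   = sc.parallelize(Seq("one line", "another line"))
  val lengths = lines.map(_.length)    // transformation: lazy, nothing is computed yet
  val total   = lengths.reduce(_ + _)  // action: eager, triggers the actual distributed computation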
What does RDD stand for?
Resilient Distributed Dataset
What are the common transformations?
map, flatMap, filter
What are the common actions?
collect, take, reduce, fold, aggregate
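A hedged example with made-up data combining the transformations and actions listed above (again assuming a SparkContext named sc):

  val lines = sc.parallelize(Seq("spark is fast", "rdds are lazy"))
  val words = lines.flatMap(_.split(" "))  // flatMap: one line becomes many words
  val long  = words.filter(_.length > 3)   // filter: keep only words longer than 3 characters
  val upper = long.map(_.toUpperCase)      // map: transform each element
  val some  = upper.take(2)                // action: first 2 elements as a local Array
  val all   = upper.collect()              // action: the whole RDD as a local Array
  val count = upper.aggregate(0)((acc, _) => acc + 1, _ + _) // action: count the elements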
What is the main aspect of Pair RDDs?
They contain key-value pairs and take the form RDD[(K, V)], so items can be grouped and looked up by key. Key-based operations such as joins, groupByKey and reduceByKey are defined on Pair RDDs.
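A sketch with hypothetical data showing typical Pair RDD operations (assuming a SparkContext named sc):

  val sales  = sc.parallelize(Seq(("alice", 10), ("bob", 5), ("alice", 3)))  // RDD[(String, Int)]
  val totals = sales.reduceByKey(_ + _)                                      // sum the values per key
  val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Oslo")))
  val joined = totals.join(cities)                                           // RDD[(String, (Int, String))]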
What are the 3 Spark partitioning schemes?
- Default partitioning: Split into equally sized partitions without knowing underlying data properties
- Range partitioning: (Pair) Takes into account the order of keys to split the dataset, requires the keys to be naturally ordered
- Hash partitioning: (Pair) Calculates a hash over each key and takes it modulo the number of partitions to determine the item's partition
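A sketch of applying the two explicit partitioners to a hypothetical pair RDD (assuming a SparkContext named sc):

  import org.apache.spark.{HashPartitioner, RangePartitioner}

  val pairs  = sc.parallelize(Seq((1, "a"), (5, "b"), (9, "c")))
  val hashed = pairs.partitionBy(new HashPartitioner(4))          // partition = hash(key) modulo 4
  val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))  // partitions follow sampled key ranges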
What is the difference between narrow and wide partition dependencies?
- Narrow dependencies: Each partition of the source RDD is used by at most one partition of the target RDD
- Wide dependencies: Multiple partitions in the target RDD depend on a single partition in the source RDD; this requires a shuffle
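A short illustration with a hypothetical pair RDD (assuming a SparkContext named sc):

  val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val narrow = pairs.mapValues(_ * 10)  // narrow: each output partition reads exactly one input partition
  val wide   = pairs.groupByKey()       // wide: output partitions read from many input partitions (shuffle)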
How does persistence work in the Spark framework (i.e. Java objects / serialized / file system)?
- Java objects: each item in the RDD is an allocated object on the JVM heap
- Serialized data: Special memory-efficient format. CPU-intensive, but faster to send across the network
- On the file system: in case the RDD is too big to fit in memory, it can be mapped to the file system, usually HDFS
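A sketch of the corresponding storage levels, assuming a SparkContext named sc and trivial made-up data:

  import org.apache.spark.storage.StorageLevel

  val asObjects  = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)      // plain Java objects in memory
  val serialized = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_SER)  // serialized: compact, more CPU work
  val onDisk     = sc.parallelize(1 to 1000).persist(StorageLevel.DISK_ONLY)        // stored on disk instead of in memory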
What are the two main dataset abstractions used by Spark SQL?
- Datasets: A collection of strongly typed objects
- DataFrames: Essentially a Dataset[Row], where a Row is an untyped array of fields
How can Dataframes and Datasets be created?
- From RDDs containing tuples: rdd.toDF("name", "id", "address")
- From RDDs with known complex types, e.g. RDD[Person]: rdd.toDF()
- From RDDs with manual schema definition
- By reading semi-structured data files: spark.read.json("json.json")
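A hedged sketch of these creation paths, assuming an existing SparkSession named spark (with sc = spark.sparkContext), a hypothetical Person case class and a hypothetical json.json file:

  import spark.implicits._

  case class Person(name: String, id: Int, address: String)

  val tupleDF  = sc.parallelize(Seq(("Ann", 1, "Oslo"))).toDF("name", "id", "address")
  val personDF = sc.parallelize(Seq(Person("Ann", 1, "Oslo"))).toDF()
  val jsonDF   = spark.read.json("json.json")   // schema is inferred from the semi-structured input
  val personDS = personDF.as[Person]            // strongly typed Dataset[Person]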
How can you access a column in a DataFrame?
df("column_name")
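For example, with a hypothetical DataFrame df that has an age column:

  df.select(df("age")).show()
  df.filter(df("age") > 21).show()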