Spark Flashcards

1
Q

What are the limitations of vanilla (basic) Spark?

A
  • RDDs are schema-less. This is inefficient (think of repeatedly parsing raw text files) and expensive in space (e.g. storing numbers as strings instead of real integers).
  • Important optimizations are not supported: to access a single column, we have to parse each record just to find its position.
  • Queries cannot refer to attributes by name, only to parts of strings, so implementing SQL capabilities on top of plain RDDs is quite cumbersome.
2
Q

Provide 4 features of the Spark SQL data model.

A
  • It uses a nested data model: data is organized hierarchically, so data elements can contain other data elements as their children.
  • Supports all primitive SQL types (boolean, integer, string, ...).
  • Supports complex and user-defined types (structs, arrays, maps, unions).
  • Complex data types are first-class citizens, amenable to optimizations. They are not just BLOBs (binary large objects), as they would be in relational databases.
3
Q

Explain how UDFs can be created and used in Spark SQL.

A

(assuming you registered a table called ‘registeredTable’)

First we register the function:
~~~
sqlContext.registerFunction("FunctionName", lambda input1: (some operation on input1))
~~~

Then we use it in the SQL query:
~~~
sqlContext.sql("SELECT col_1, FunctionName(col) AS alias FROM registeredTable")
~~~

4
Q

Briefly describe how Spark SQL code is transformed into regular spark code and how it is optimized by the Catalyst Optimizer.

A

Spark SQL code is converted into an Abstract Syntax Tree(AST) by the Spark SQL parser. The AST is then transformed into a logical plan, which represents a high-level description of the query’s data flow. The logical plan undergoes optimization by the Catalyst optimizer to improve query performance.

The optimized logical plan is then transformed into a physical plan, which is a detailed, executable representation of the query that is specific to Spark’s execution engine. Finally, the physical plan is translated into regular Spark code that can be executed as a series of RDD transformations and actions, leveraging Spark’s core APIs and infrastructure.
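A quick way to observe these stages is to ask Spark to print the plans. A minimal sketch, assuming the Spark 1.x DataFrame API, a SQLContext named sqlContext, and a registered table called people (both names are illustrative):

~~~
// explain(true) prints the parsed, analyzed and optimized logical plans
// as well as the physical plan chosen by Catalyst.
DataFrame df = sqlContext.sql("SELECT name FROM people WHERE age > 21");
df.explain(true);
~~~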

5
Q

Provide the three most important optimization principles of Spark SQL.

A
  • Rule-based logic: Catalyst's optimization logic is expressed as a set of rewrite rules.
  • A high-level language to express the user's requirements: write less code, while the optimizer does the hard work.
  • Lazy evaluation: collect ALL operations and optimize them as a whole. Works with both the SQL and the method-based (DataFrame) API.
6
Q

Explain the input and output of the Catalyst optimizer, as well as the steps it performs.

A

Input: an Abstract Syntax Tree (AST) created from a DataFrame or SQL query.
Performs:
1. Analysis to resolve references (attribute names are checked against the Catalog, types are verified and determined, named attributes are mapped to the input). Results in a resolved logical plan.
2. Logical optimization (push down predicates, prune projections, simplify logical expressions). Results in an optimized logical plan.
3. Physical planning (choose the actual implementation of each operator, combine operators and projections into a single map operation). Results in the optimized physical plan.
4. Code generation.

Output: Java bytecode

7
Q

What is the Abstract Syntax Tree?

A

An AST represents a user query as a tree of nodes of different types, such as Attributes, Literals and Operators. For example, the expression x + (1 + 2) becomes Add(Attribute(x), Add(Literal(1), Literal(2))): the 'Add' operator node has two TreeNode children, which may themselves be literals, attributes or other operators, and the operator is applied to the values of its children.

8
Q

How can we make UDFs optimizable for Catalyst?

A

By implementing User Defined Aggregate Functions (UDAFs). These implement four key methods:
- Initialize: sets up the aggregation buffer (the counters kept at each mapper)
- Update: processes each record and updates those counters
- Merge: takes the partial results from the mappers and combines them
- Evaluate: takes the final merged buffer and computes the end result.

We can use this to distribute the computation across the network.
The merge logic needs to be associative!
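A minimal Java sketch of such a UDAF, assuming the Spark 1.6/2.x UserDefinedAggregateFunction API (the class name LongSum and the column names are illustrative, not from the lecture):

~~~
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class LongSum extends UserDefinedAggregateFunction {
    // Schema of the input column the UDAF is called on.
    @Override public StructType inputSchema() {
        return new StructType().add("value", DataTypes.LongType);
    }
    // Schema of the intermediate buffer kept at each mapper.
    @Override public StructType bufferSchema() {
        return new StructType().add("sum", DataTypes.LongType);
    }
    // Type of the final result.
    @Override public DataType dataType() { return DataTypes.LongType; }
    // Same input always produces the same output.
    @Override public boolean deterministic() { return true; }

    // Initialize: set up the per-partition counter.
    @Override public void initialize(MutableAggregationBuffer buffer) { buffer.update(0, 0L); }

    // Update: fold one input record into the local buffer.
    @Override public void update(MutableAggregationBuffer buffer, Row input) {
        if (!input.isNullAt(0)) buffer.update(0, buffer.getLong(0) + input.getLong(0));
    }

    // Merge: combine partial results from different partitions (must be associative).
    @Override public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        buffer1.update(0, buffer1.getLong(0) + buffer2.getLong(0));
    }

    // Evaluate: produce the final value from the merged buffer.
    @Override public Object evaluate(Row buffer) { return buffer.getLong(0); }
}
~~~

It could then be registered with something like sqlContext.udf().register("longSum", new LongSum()) and used inside SQL queries like a built-in aggregate.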

9
Q

What is the difference between an action and a transformation?

A

Transformation:
* Creates a new RDD from an existing one; examples include map, filter, and reduceByKey.
* Transformations are lazily evaluated.
* They can be chained to build a series of operations on the data.
* Since transformations are only executed when an action is called, Spark can optimize the execution plan as a whole.

Action:
* Actions are operations that return a value to the driver program or write data to an external storage system. Examples include count, collect, and reduce.
* Actions trigger the execution of the pending transformations (see the sketch below).
* Actions return a value rather than an RDD, so once an action has been executed you can no longer chain additional transformations onto its result.
* Actions are eagerly evaluated, meaning they are executed as soon as they are called.
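A small sketch of the difference, assuming a JavaSparkContext named sc (illustrative):

~~~
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// Transformations: nothing is computed yet, Spark only records the lineage.
JavaRDD<Integer> doubledEvens = numbers.filter(n -> n % 2 == 0)
                                       .map(n -> n * 2);

// Action: triggers execution of the whole chain and returns a value to the driver.
long howMany = doubledEvens.count();
~~~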

10
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* map

A

Type: Transformation
Input: An RDD with elements of any type.
Operation: Applies a given function to each element in the input RDD.
Output: A new RDD with the same number of elements as the input RDD, where each element is the result of applying the function to the corresponding element in the input RDD.
Characteristics: The output RDD has a one-to-one relationship with the input RDD. The function provided to map should take one input element and produce one output element.

Lambda example:
~~~
rdd.map(element -> element.someFunction());
~~~
or
~~~
rdd.map(element -> someFunction(element));
~~~
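A concrete instance, assuming an RDD of strings called words (illustrative):

~~~
// One output element per input element: each word is mapped to its length.
JavaRDD<Integer> lengths = words.map(word -> word.length());
~~~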

11
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* mapToPair

A

Type: Transformation
Input: An RDD with elements of any type.
Operation: Applies a given function to each element in the input RDD, with the function returning a key-value pair.
Output: A new Pair RDD with key-value pairs obtained by applying the function to each element in the input RDD.
Characteristics: The output RDD has a one-to-one relationship with the input RDD. The function provided to mapToPair should take one input element and produce one key-value pair.

Lambda example:
~~~
rdd.mapToPair(element -> new Tuple2<>(element.getKey(), element.getValue()));
~~~
(Apply a function to the element to generate a key-value pair.)
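A concrete instance, assuming an RDD of strings called words (illustrative):

~~~
// Emit (word, 1) for every word -- the classic first step of word count.
JavaPairRDD<String, Integer> ones = words.mapToPair(word -> new Tuple2<>(word, 1));
~~~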

12
Q

Explain the following Spark transformation, including input and output characteristics:
* flatMap

A

Type: Transformation
Input: An RDD with elements of any type.
Operation: Applies a given function to each element in the input RDD. The function should return an iterator of zero or more elements.
Output: A new RDD formed by flattening the output iterators from the function applied to each input element. The output RDD may have a different number of elements than the input RDD.
Characteristics: The output RDD has a one-to-many relationship with the input RDD. The function provided to flatMap should take one input element and produce a sequence or iterable of output elements.

Lambda Example:
~~~
rdd.flatMap(element -> element.someFunctionReturningIterable().iterator());
~~~
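A concrete instance, assuming an RDD of text lines called lines and the Spark 2.x Java API (where the function returns an Iterator):

~~~
// One input line can produce many output words; the results are flattened into one RDD.
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
~~~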

13
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* filter

A

Type: Transformation
Input: An RDD with elements of any type.
Operation: Applies a given predicate function to each element in the input RDD. The predicate function should return a boolean value.
Output: A new RDD containing only the elements from the input RDD that satisfy the predicate function.
Characteristics: The output RDD is a subset of the input RDD. The function provided to filter should take one input element and return a boolean value indicating whether the element should be included in the output RDD.

Lambda Example:
~~~
rdd.filter(element -> element.someCondition());
~~~
or
~~~
rdd.filter(element -> someCondition(element));
~~~
(We just want to check some condition on the element)

14
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* reduceByKey

A

Type: Transformation
Input: A Pair RDD with key-value pairs.
Operation: Groups the input RDD’s elements by key, and then applies a given reduction function to the values of each group.
Output: A new Pair RDD with key-value pairs, where each key has a single value that is the result of reducing all the values associated with that key using the reduction function.
Characteristics: The output RDD has a one-to-one relationship with the distinct keys in the input RDD. The function provided to reduceByKey should take two input values and return a single output value. The function should be commutative and associative to ensure correct results.

Lambda Example:
~~~
pairRdd.reduceByKey((value1, value2) -> value1.someFunction(value2));
~~~
or
~~~
pairRdd.reduceByKey((value1, value2) -> someFunction(value1, value2));
~~~
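A concrete instance, assuming a JavaPairRDD<String, Integer> called ones that holds (word, 1) pairs (illustrative):

~~~
// Sum the values per key; addition is commutative and associative, so this is safe.
JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);
~~~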

15
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* count

A

Type: Action
Input: An RDD with elements of any type.
Operation: Calculates the number of elements in the input RDD.
Output: A long representing the number of elements in the RDD.
Characteristics: count is a simple action that provides an easy way to determine the size of an RDD. It does not modify the data in the RDD or produce a new RDD.

16
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* collect

A

Type: Action
Input: An RDD with elements of any type.
Operation: Retrieves all elements of the input RDD and returns them as an array.
Output: An array containing all elements of the RDD.
Characteristics: collect is an action that can be used to retrieve the entire contents of an RDD. However, be cautious when using this action, as it can cause out-of-memory errors if the RDD is too large to fit in the memory of the driver program. It is often used for testing and debugging purposes when working with small RDDs.

17
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* reduce

A

Type: Action
Input: An RDD with elements of any type.
Operation: Aggregates the elements of the input RDD using a specified reduction function. The function should accept two arguments and return a single value. It should also be commutative and associative to ensure correct results.
Output: A single value resulting from the aggregation of all elements in the RDD using the reduction function.
Characteristics: reduce is an action that allows you to aggregate data in an RDD according to a specified function. It requires a function that can be applied cumulatively to the elements of the RDD to reduce them to a single value.

Java example:
~~~
final var result = rdd.reduce((input1, input2) -> someReduceFunction(input1, input2));
~~~
e.g.
~~~
int sum = rdd.reduce((num1, num2) -> num1 + num2);
~~~

18
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* join

A

Type: Transformation
Input: Two Pair RDDs (key-value pairs) with matching key types, say RDD1 and RDD2.
Operation: Performs an inner join between RDD1 and RDD2 based on their keys. The join function matches the keys from both RDDs and combines the corresponding values in a tuple.
Output: A new Pair RDD with key-value pairs, where each key is present in both input RDDs, and the value is a tuple containing the corresponding values from RDD1 and RDD2.
Characteristics: The join transformation is used to combine two Pair RDDs based on their keys. The resulting RDD contains only those keys that are present in both input RDDs. The values from both RDDs are combined in a tuple. Note that if a key has multiple occurrences in either RDD, the join operation will produce all possible combinations of values for that key.

Java example:
~~~
pairRdd1.join(pairRdd2);
~~~
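A concrete instance with two illustrative pair RDDs, ages and cities, both keyed by user name:

~~~
// ages:   ("alice", 30), ("bob", 25)
// cities: ("alice", "Delft"), ("carol", "Leiden")
// joined: ("alice", (30, "Delft"))  -- only keys present in BOTH inputs survive the inner join
JavaPairRDD<String, Tuple2<Integer, String>> joined = ages.join(cities);
~~~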

19
Q

Explain the following Spark function, including input and output characteristics and whether it is an action or transformation:
* cartesian

A

Type: Transformation
Input: Two RDDs with elements of any type, say RDD1 and RDD2.
Operation: Computes the Cartesian product of RDD1 and RDD2, which is the set of all possible pairs of elements where one element is taken from RDD1 and the other is taken from RDD2.
Output: A new RDD containing pairs of elements from RDD1 and RDD2 (PairRDD). The number of elements in the output RDD is equal to the product of the number of elements in RDD1 and the number of elements in RDD2.
Characteristics: The cartesian transformation is used to compute the Cartesian product of two RDDs. The resulting RDD contains pairs of elements from the input RDDs, with each element of RDD1 paired with each element of RDD2. This operation can produce a large output RDD, especially if the input RDDs have many elements, so use it with caution.

Java example:
~~~
rdd1.cartesian(rdd2);
~~~

20
Q

Select all the transformations that do NOT require internal shuffling in Spark:
a) map
b) filter
c) flatMap
d) join

A

A, B, C

21
Q

Which of the following transformations does NOT require internal shuffling in Spark?

a) join
b) map
c) reduceByKey
d) groupByKey

A

B

22
Q

What type of functions can we execute with reduceByKey?

A

Only commutative and associative functions:
- An operation is said to be associative if the order in which the operation is performed does not affect the result, as long as the sequence of operands remains the same.
- An operation is said to be commutative if the order of the operands does not affect the result.

e.g. directly calculating the median of a set of numbers is not possible, as the median is not commutative and associative.
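The usual workaround for a non-associative measure such as the mean is to reduce over something that is commutative and associative, here (sum, count) pairs, and finish the computation afterwards. A sketch, with an illustrative JavaPairRDD<String, Integer> called values:

~~~
// Reduce over (sum, count) pairs, which can be combined associatively.
JavaPairRDD<String, Tuple2<Integer, Integer>> sumAndCount = values
    .mapToPair(t -> new Tuple2<>(t._1(), new Tuple2<>(t._2(), 1)))
    .reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

// Only now compute the mean, outside of reduceByKey.
JavaPairRDD<String, Double> mean = sumAndCount.mapValues(t -> (double) t._1() / t._2());
~~~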

23
Q

How does Spark cope with failing nodes during a transformation step (e.g. a single node fails)?

A

First try: find, on another node, a replica of the RDD partition we wanted to transform, and retry there the partial work the failing node was supposed to do. Only the partial work is retried, not the full transformation on the RDD.

If no suitable replica can be found, the lineage information (chains of transformation) is used to reconstruct the missing replica.

24
Q

Briefly describe how Spark SQL code is transformed into regular spark code and how it is optimized by the Catalyst Optimizer.

A

Spark SQL code is converted into an Abstract Syntax Tree (AST) by the Spark SQL parser. The AST is then transformed into a logical plan, which represents a high-level description of the query’s data flow. The logical plan undergoes optimization by the Catalyst optimizer to improve query performance.

The optimized logical plan is then transformed into a physical plan, which is a detailed, executable representation of the query that is specific to Spark’s execution engine. Finally, the physical plan is translated into regular Spark code that can be executed as a series of RDD transformations and actions, leveraging Spark’s core APIs and infrastructure. This process allows Spark SQL to seamlessly integrate with the rest of the Spark ecosystem, providing a unified platform for data processing and analysis.

25
Q

How can we cache our dataset to memory only? Name two methods

A

Using:
~~~
rdd.persist(StorageLevel.MEMORY_ONLY)
~~~
or
~~~
rdd.cache()
~~~

26
Q

What happens if you don’t cache your dataset after transformations?

A

Each time we run a query on the uncached dataset, all prior transformations are computed again. When it is persisted, we can query it many times without recomputing those transformations, until we call .unpersist() on the dataset.

27
Q

When does caching make a difference? Write a minimal example.

A

Let’s take the word count example. Imagine we compute the word counts:
~~~
JavaRDD<String> textFile = ...; // Load your text file here
// Compute the word counts
JavaPairRDD<String, Integer> counts = textFile
.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b);
~~~

And now we want to calculate the maximum word count and minimum word count:

~~~
Tuple2<String, Integer> maxWordCountPair = counts.reduce((count1, count2) -> count1._2() > count2._2() ? count1 : count2);
int maxWordCount = maxWordCountPair._2();

Tuple2<String, Integer> minWordCountPair = counts.reduce((count1, count2) -> count1._2() < count2._2() ? count1 : count2);
int minWordCount = minWordCountPair._2();
~~~

Now if we don’t cache the counts RDD we need to calculate the full counts RDD again for both the minimum and maximum query.
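A minimal fix, under the same example: persist counts once, right before the two reduce actions.

~~~
// Keep the word counts in memory so both the max and min queries reuse them.
counts.cache();   // shorthand for counts.persist(StorageLevel.MEMORY_ONLY)
~~~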

28
Q

What happens if you persist to RAM but never unpersist?

A

The cached data is never cleared, so memory usage keeps growing. Spark is left with less RAM to work with efficiently, and eventually the memory can fill up entirely and crash your program.

29
Q

How can we optimize Spark for joining two datasets of wildly different sizes?

A

We can use 'salting': add a suffix (the salt) to the join keys so that a hot key is spread over several partitions/machines. Concretely, we append a random salt to the keys of the large dataset, replicate each row of the smaller dataset once per salt value, perform the join on the salted keys, and finally strip the salt and re-aggregate.
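A rough Java sketch of the idea, assuming the Spark 2.x Java API; the RDD names (largeRdd, smallRdd), the number of salt buckets and the key format are illustrative assumptions:

~~~
// needs: java.util.*, scala.Tuple2, org.apache.spark.api.java.JavaPairRDD
int saltBuckets = 10;                  // how many ways each hot key is spread out
Random rnd = new Random();

// Large, skewed side: append a random salt to every key.
JavaPairRDD<String, Integer> saltedLarge = largeRdd.mapToPair(
    t -> new Tuple2<>(t._1() + "#" + rnd.nextInt(saltBuckets), t._2()));

// Small side: replicate every row once per salt value so each salted key finds its partner.
JavaPairRDD<String, String> saltedSmall = smallRdd.flatMapToPair(t -> {
    List<Tuple2<String, String>> copies = new ArrayList<>();
    for (int salt = 0; salt < saltBuckets; salt++) {
        copies.add(new Tuple2<>(t._1() + "#" + salt, t._2()));
    }
    return copies.iterator();
});

// Join on the salted keys, then strip the salt to recover the original key.
JavaPairRDD<String, Tuple2<Integer, String>> joined = saltedLarge.join(saltedSmall)
    .mapToPair(t -> new Tuple2<>(t._1().substring(0, t._1().lastIndexOf('#')), t._2()));
~~~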

30
Q

How can a small number of partitions harm the performance of your Spark program?

A

Large partition size: If the number of partitions is too low, each partition will have to handle more data, leading to larger partition sizes. This can cause increased memory usage on the nodes, potentially leading to memory issues or garbage collection overhead.

Idle workers: When the number of partitions is smaller than the number of workers (or cores), some workers will remain idle during the processing, resulting in inefficient resource utilization.

Imbalanced workload: If the number of partitions is not much greater than the number of workers, workload distribution may become imbalanced. Some workers may process larger partitions, causing them to take longer to complete their tasks. This can create bottlenecks in the processing, as faster workers have to wait for slower workers to finish before moving on to the next stage.

31
Q

How can a very large number of partitions harm the performance of your Spark program?

A

Overhead: Creating an excessive number of partitions can lead to increased overhead in managing and redistributing these partitions across the cluster. Each partition comes with metadata and task scheduling overhead, which can add up when there are many small partitions.

Underutilization of resources: With a large number of small partitions, each task may not have enough work to do, leading to underutilization of CPU and memory resources. This can result in decreased parallelism, as tasks finish quickly and spend more time waiting for new tasks to be scheduled, rather than actively processing data.

Network congestion: If there are many small partitions and shuffling is required between stages, network congestion can occur due to the increased amount of data being exchanged between nodes. This can slow down the overall processing time and impact the performance of the Spark application.
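The number of partitions can be tuned explicitly; a small sketch (the target partition counts are illustrative):

~~~
// Increase the number of partitions to expose more parallelism (this triggers a shuffle).
JavaRDD<String> morePartitions = rdd.repartition(200);

// Decrease the number of partitions cheaply, avoiding a full shuffle.
JavaRDD<String> fewerPartitions = rdd.coalesce(8);
~~~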