Sample Exam Flashcards

1
Q

Explain the difference between an RDD object and a regular variable (say, an Array) in Spark.

A

An RDD (Resilient Distributed Dataset) is a distributed data structure: its elements are split into partitions that live on the worker nodes of a cluster and are processed in parallel. RDDs are immutable, so a transformation never changes an RDD but returns a new one, and they are evaluated lazily, so nothing is computed until an action is called. Fault tolerance comes from lineage: Spark records the transformations that produced an RDD and recomputes lost partitions from them. A regular variable such as an Array is an ordinary, mutable, local object that lives entirely in the memory of one JVM (typically the driver) and is evaluated eagerly. RDDs are Spark's low-level core API, while regular variables are general programming-language constructs.
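
The contrast can be illustrated with a toy model in plain Python (no Spark involved). `ToyRDD` is a hypothetical class invented for this sketch, not Spark's API; it only mimics three properties mentioned above: partitioning, immutability, and lazy evaluation through a recorded lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned, immutable, lazily evaluated."""

    def __init__(self, partitions, transforms=()):
        # Tuples cannot be modified in place, which models immutability.
        self._partitions = tuple(tuple(p) for p in partitions)
        # The "lineage": functions that still have to be applied.
        self._transforms = tuple(transforms)

    def map(self, f):
        # A transformation returns a NEW ToyRDD and computes nothing yet.
        return ToyRDD(self._partitions, self._transforms + (f,))

    def collect(self):
        # An action: apply the lineage to each partition, then gather results.
        out = []
        for part in self._partitions:
            for x in part:
                for f in self._transforms:
                    x = f(x)
                out.append(x)
        return out


# A regular list is local and mutable: it is changed in place.
local = [1, 2, 3, 4]
local[0] = 100

# The toy RDD is never changed; map() yields a second, independent object.
base = ToyRDD([[1, 2], [3, 4]])      # two "partitions"
doubled = base.map(lambda x: x * 2)  # nothing is computed until collect()
```

Because the lineage is kept alongside the source partitions, any partition of `doubled` could be recomputed from `base` at any time, which is the idea behind lineage-based recovery.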

2
Q

At which levels in the memory hierarchy can an RDD object be persisted in Spark?

A

RDD objects in Spark can be persisted at multiple levels in the memory hierarchy, including disk, memory, and off-heap storage.

1- Memory: RDDs can be stored in memory, either as deserialized Java objects or in serialized form, depending on the storage level chosen. This allows fast access when an RDD is reused.

2- Disk: RDDs can also be stored on disk when the memory is not sufficient to hold the entire dataset. This allows Spark to spill data to disk when needed and retrieve it back when required.

3- Off-heap: Spark provides the option to store RDDs off-heap, meaning the data is kept in serialized form outside the Java heap. This reduces garbage-collection pressure on the JVM; off-heap memory has to be enabled in the Spark configuration for it to work.

4- External storage systems: RDDs can also be written to external storage such as the Hadoop Distributed File System (HDFS), Amazon S3, or another compatible distributed file system. This is done through checkpointing or save actions rather than through a persist() storage level, and it makes the data durable and accessible to other Spark applications or clusters.

5- Custom storage systems: Specialized or proprietary storage systems can be integrated in the same way, by writing the data out to them; they are likewise not persist() storage levels.
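
As a study aid, the common storage levels can be written down as flag combinations. The sketch below is plain Python, not Spark code; the `Level` class and `tiers` helper are hypothetical, and the flag values are my reading of the constants in Spark's Scala `StorageLevel` object, so they should be checked against the documentation for the Spark version in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


# Assumed to mirror Spark's Scala StorageLevel constants (verify per version).
LEVELS = {
    "MEMORY_ONLY":         Level(False, True, False, True),
    "MEMORY_ONLY_SER":     Level(False, True, False, False),
    "MEMORY_AND_DISK":     Level(True, True, False, True),
    "MEMORY_AND_DISK_SER": Level(True, True, False, False),
    "DISK_ONLY":           Level(True, False, False, False),
    "OFF_HEAP":            Level(True, True, True, False),
    "MEMORY_ONLY_2":       Level(False, True, False, True, replication=2),
}


def tiers(name):
    """Return the tiers of the memory hierarchy a storage level may use."""
    lvl = LEVELS[name]
    result = []
    if lvl.use_memory:
        result.append("off-heap memory" if lvl.use_off_heap else "on-heap memory")
    if lvl.use_disk:
        result.append("disk")
    return result
```

For example, `tiers("MEMORY_AND_DISK")` lists on-heap memory first and disk second, matching the spill-to-disk behaviour described in item 2.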

3
Q

What is the difference between range- and hash-partitioning? Please provide two simple partitioning functions (in their mathematical form) that distribute the following set of keys according to these two partitioning schemes (assuming we have n = 4 worker nodes).
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

Illustrate the two partitioning strategies by applying your two partitioning functions to the above set of example keys and indicate to which worker node each key is distributed.

A

-
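
One possible answer, offered as a sketch rather than the official solution: hash partitioning assigns a key to a worker by hashing it modulo n, which scatters neighbouring keys across workers, whereas range partitioning assigns contiguous key intervals to workers, which keeps sorted keys together. For n = 4 and keys 1 to 12, two simple functions are h(k) = k mod 4 and r(k) = floor((k - 1) / 3). The Python names below (`hash_partition`, `range_partition`, `assign`) are chosen for this illustration, and workers are numbered 0 to 3.

```python
from collections import defaultdict

KEYS = list(range(1, 13))  # {1, ..., 12}
N = 4                      # number of worker nodes


def hash_partition(k, n=N):
    # h(k) = k mod n, using the identity hash for integer keys
    return k % n


def range_partition(k, n=N, lo=1, hi=12):
    # r(k) = floor((k - lo) / width), with equal-width ranges of 3 keys here
    width = (hi - lo + 1) // n
    return min((k - lo) // width, n - 1)


def assign(keys, fn):
    """Group keys by the worker index that fn maps them to."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[fn(k)].append(k)
    return {w: buckets[w] for w in sorted(buckets)}


hashed = assign(KEYS, hash_partition)
# {0: [4, 8, 12], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
ranged = assign(KEYS, range_partition)
# {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9], 3: [10, 11, 12]}
```

Both schemes put three keys on each worker for this input; the difference is that range partitioning preserves key order across workers, while hash partitioning does not.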

4
Q

Briefly explain how the arithmetic mean (i.e., the “average”) of a set of input values that is
distributed as an RDD object can be computed correctly. You may explain the solution by
describing how either the
* aggregate(seed)(sequenceOperator, combineOperator)
or the
* combineByKey(createCombiner, mergeValue, mergeCombiner)

functions in Scala could be defined to implement such a function. Either a verbal description
or pseudo code is fine as illustration. No actual Java/Scala code is required to solve this
problem.

A
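
A possible answer, sketched under stated assumptions: the mean cannot be computed by averaging per-partition averages, because partitions may hold different numbers of values. Instead, each partition accumulates a (sum, count) pair, the pairs are added across partitions, and the division happens once at the end. With aggregate, the seed is (0, 0), the sequence operator adds one value to a pair, and the combine operator adds two pairs. The code below is plain Python that simulates this on a list of partitions; the `aggregate` helper is a stand-in written for this sketch, not Spark's implementation.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Simulate aggregate: fold each partition, then merge the partial results."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


def seq_op(acc, x):
    # Add one input value to a (sum, count) accumulator.
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # Merge two (sum, count) accumulators from different partitions.
    return (a[0] + b[0], a[1] + b[1])


# Example data, split unevenly over three "partitions".
partitions = [[1, 2, 3], [4, 5], [6]]

total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count  # 21 / 6 = 3.5

# The incorrect shortcut: averaging the per-partition averages.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
```

Here `mean` is 3.5, while `naive` is about 4.17, which shows why the count has to travel with the sum. The same pair-based idea applies to combineByKey: createCombiner turns a value v into (v, 1), mergeValue adds a value to a pair, and mergeCombiner adds two pairs, giving one (sum, count) per key.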