Exam Questions Flashcards
(45 cards)
Take a GFS computing cluster with 300 chunk servers, each with 10 TB of free disk and 10 GB of available RAM, a chunk size of 100 MB, and the standard replication factor of 3. About 10 KB of metadata needs to be stored per chunk.
You want to store a very big file on this cluster. What is the maximum file size that is possible to store on this cluster, and why?
The disks can hold 1 PB (= 1000 TB = 10^15 B) of file data (3 PB of raw disk divided by the replication factor of 3), i.e., 10^7 unique chunks.
However, 10 GB of RAM on the master only holds metadata for 10^6 unique chunks, which limits the file size to 100 TB (about 3% of the raw disk capacity).
You need 10 KB of metadata per unique chunk, not per replica (replicas share the same metadata entry, apart from e.g. location and version).
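A minimal worked calculation of the two limits (a sketch; it assumes the master has the same 10 GB of RAM as the chunk servers, as the answer above does):

    # All figures come from the exercise statement above.
    servers, disk_per_server = 300, 10e12          # 10 TB of free disk each
    chunk_size, replication = 100e6, 3             # 100 MB chunks, 3 replicas
    master_ram, meta_per_chunk = 10e9, 10e3        # 10 GB RAM, 10 KB per chunk

    disk_limit = servers * disk_per_server / replication     # 1e15 B = 1 PB
    ram_limit = (master_ram / meta_per_chunk) * chunk_size   # 1e14 B = 100 TB

    print(min(disk_limit, ram_limit))  # 1e14 -> the RAM limit wins: 100 TB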
GFS: What metadata is kept by the master node?
Three major types of metadata: the file and chunk namespaces, the mapping from files to chunks and the locations of each chunk’s replicas.
GFS: In what type of memory is metadata stored on the master node, and why is this advantageous?
In RAM; the file and chunk namespaces and the file-to-chunk mappings are also persisted in an operation log on the master's local disk and replicated on a few remote machines. Keeping the metadata in memory makes master operations fast and allows efficient periodic scans of the full state (for garbage collection, re-replication, and rebalancing).
If two or more applications want to overwrite the same data chunk (starting at the same offset) concurrently, what does GFS guarantee about the resulting chunk data?
The data will be consistent (all replicas are identical) but undefined (typically it consists of mingled fragments from multiple mutations).
GFS: A client can append to existing files. True or false, and explain why.
True: GFS supports record append, which lets many clients append to the same file concurrently, and each record is appended atomically (at least once).
GFS: Client applications need to know the chunk index in order to read or write in a file. True or false, and explain why.
True. The client translates a byte offset within the file into a chunk index (offset divided by the fixed chunk size); this is not to be confused with the chunk handle, a globally unique identifier that is retrieved from the master server.
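A minimal sketch of that translation in Python, assuming the 100 MB chunk size from the exercise above (real GFS uses 64 MB chunks):

    CHUNK_SIZE = 100 * 1000 * 1000  # 100 MB, as in the exercise above

    def chunk_index(byte_offset):
        """Index of the chunk that contains the given byte offset in the file."""
        return byte_offset // CHUNK_SIZE

    # The client sends (file name, chunk index) to the master and gets back
    # the chunk handle and the locations of its replicas.
    print(chunk_index(250 * 1000 * 1000))  # -> 2, i.e., the third chunk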
GFS: The master keeps a record of all chunk locations at all times. True or false, and explain why.
Mostly true, but the chunk locations are in fact requested from the chunk servers at startup (or when a new chunk server joins the cluster) and are kept only in RAM, not persisted by the master.
List the names of the frameworks for distributed storage of big data that you know from the course.
GFS and Bigtable, or their Apache alternatives (HDFS and HBase).
For each such framework, what aspects of the system are indeed centralized?
GFS: the master's store of the namespace, file-to-chunk mapping, and chunk metadata; the first two steps of every read/write operation (the client asks the master for chunk handles and replica locations); and chunk management (migration, leases, garbage collection).
Bigtable: the store of the root tablet's location in Chubby; the master's tablet assignment (clients read tablet data from tablet servers, not from the master); clients cache tablet locations, so there is little master communication; only global tablet management is centralized (as in GFS).
For each such framework, what (if any) measures are taken when the “master” machine fails, to achieve fault tolerance?
For GFS, shadow masters are kept, and the master's operation log and checkpoints are replicated on several machines.
For Bigtable, the critical state is stored in Chubby, which is itself replicated (5 Chubby replicas); a new master can take over by acquiring the master lock in Chubby.
Spark: Is an RDD partition big data?
No. A partition normally fits in the memory of a single executor; it can exceptionally become big if there are very many records per key. Initially, 1 partition = 1 chunk (HDFS block) read from disk.
List two ways in which the Spark map operation is different from the MapReduce map task.
Spark map works on a batch (it inputs one batch, i.e., a set of tuples, and outputs one batch);
it does not necessarily read its input from disk, nor save its output to buffers;
it takes a function f as a closure (see the sketch below).
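A minimal PySpark sketch of these points (the application name and data are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="map-closure-example")  # hypothetical app name

    threshold = 10                      # a driver-side variable
    nums = sc.parallelize(range(100))   # an in-memory batch of records

    # map takes a function as a closure: `threshold` is captured on the driver
    # and shipped to the executors together with the lambda.
    flagged = nums.map(lambda x: (x, x > threshold))

    # No disk I/O is forced: the input came from memory, and the result stays
    # distributed until an action (take/collect) pulls some of it back.
    print(flagged.take(5))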
What is a task in Spark vocabulary?
A computation on one partition, run locally inside an executor (one task per partition per stage).
For a hypothetical Spark job executed on a cluster with the options --master yarn --deploy-mode cluster, estimate the number of computing cores that this job will occupy.
It depends on the number of free cores on the cluster and on the number and size of the input files.
These options do not enforce a maximum number of executors; dynamic allocation is used.
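For contrast, a minimal PySpark sketch of pinning the resources explicitly through standard configuration keys instead of relying on dynamic allocation (the numbers and app name are illustrative):

    from pyspark import SparkConf, SparkContext

    # 10 executors x 4 cores each -> the job would occupy about 40 cores
    # (plus one core for the driver when running in cluster deploy mode).
    conf = (SparkConf()
            .setAppName("core-estimate-example")      # hypothetical app name
            .set("spark.executor.instances", "10")
            .set("spark.executor.cores", "4"))

    sc = SparkContext(conf=conf)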
Why is a Spark join operation expensive?
It imposes a partitioning on both RDDs, and will likely need to repartition (shuffle) at least one of the inputs.
If a Spark join is really needed, what can the programmer do to make this operation less expensive?
Pre-partition both RDDs with the same partitioning function (and the same number of partitions), and persist them; the join can then be computed without shuffling the inputs again (see the sketch below).
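A minimal PySpark sketch of this idea (RDD contents, partition count, and app name are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="prepartitioned-join-example")  # hypothetical name

    users = sc.parallelize([(1, "alice"), (2, "bob")])
    visits = sc.parallelize([(1, "page_a"), (2, "page_b"), (1, "page_c")])

    # Pre-partition both RDDs with the same (hash) partitioning and the same
    # number of partitions, and cache them; the join can then match keys that
    # are already co-located instead of shuffling both inputs.
    users_p = users.partitionBy(8).cache()
    visits_p = visits.partitionBy(8).cache()

    print(users_p.join(visits_p).collect())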
Storm & Spark Streaming: What is the most important difference between the basic data types processed on the cluster machines in these frameworks?
Spark Streaming processes (micro-)batches of records, whereas Storm processes individual tuples.
Storm & Spark Streaming: How is intermediate data stored in these two frameworks?
In both frameworks, intermediate data is stored locally in memory (though other configurations are possible).
Storm & Spark Streaming: How do these processing frameworks differ in terms of the processing speed (throughput and latency), and why is that?
Spark Streaming has higher latency but higher throughput, because it processes records in batches; Storm processes tuples one at a time, giving lower latency but lower throughput.
Storm & Spark Streaming: How do these processing frameworks differ in terms of the reliability guarantees they provide? Can you say that one of these frameworks has better reliability than the other?
Storm does not provide exactly-once guarantees (only at-least-once or at-most-once), whereas Spark Streaming does, so Spark Streaming offers the stronger reliability guarantee.
MapReduce vs. Spark: A company has a computing cluster with HDFS. This cluster has a total of 1 PB of free disk space, and 1 TB of free memory. The company hasn’t installed a big-data processing framework in this cluster yet. The company asks you to develop certain types of big-data processing software, and also to state which big-data processing framework should be installed on the computing cluster to best support this future software.
A distributed sort of 100 GB of input data from the company's weekly logs. The company will want to run this code overnight, and pick up the results every morning.
Speed is not critical, and the input (100 GB) fits easily in the cluster RAM (1 TB); either framework will suffice.
MapReduce vs. Spark: A company has a computing cluster with HDFS. This cluster has a total of 1 PB of free disk space, and 1 TB of free memory. The company hasn’t installed a big-data processing framework in this cluster yet. The company asks you to develop certain types of big-data processing software, and also to state which big-data processing framework should be installed on the computing cluster to best support this future software.
A word count over 10 TB of input data from their Web crawl. The company will want to run this code overnight, and pick up the results every morning.
Since the input data (10 TB) is larger than the cluster RAM (1 TB), MapReduce is preferable because it writes the processed data to disk.
Spark stores intermediate data in RAM by default but can be configured to persist to disk as well; in that case Spark would also suffice (see the sketch below).
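A minimal PySpark word-count sketch with the result persisted to disk (the HDFS paths and app name are hypothetical):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="wordcount-example")   # hypothetical app name

    lines = sc.textFile("hdfs:///data/webcrawl/*")   # hypothetical input path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Store the counts on disk instead of RAM, since the data set is much
    # larger than the cluster memory (only useful if counts is reused; shown
    # here to illustrate the disk storage level).
    counts.persist(StorageLevel.DISK_ONLY)
    counts.saveAsTextFile("hdfs:///out/wordcount")   # hypothetical output path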
MapReduce vs. Spark: A company has a computing cluster with HDFS. This cluster has a total of 1 PB of free disk space, and 1 TB of free memory. The company hasn’t installed a big-data processing framework in this cluster yet. The company asks you to develop certain types of big-data processing software, and also to state which big-data processing framework should be installed on the computing cluster to best support this future software.
Some in-depth statistics (aggregation and regression) over 10GB of the latest log data. This software will run every 2 hours, and should run as fast as possible.
Spark is preferable because speed is required and the input data (10 GB) can be loaded into the cluster RAM.
Furthermore, its machine-learning library (MLlib) makes Spark a more plug-and-play solution for statistical analysis (see the sketch below).
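A minimal sketch of such an analysis with Spark SQL and MLlib (the log path, schema, and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("log-stats-example").getOrCreate()

    # Hypothetical logs with numeric columns "status", "requests", "errors",
    # and "latency".
    logs = spark.read.json("hdfs:///data/logs/latest/*.json")

    # Aggregation: average latency per status code.
    logs.groupBy("status").avg("latency").show()

    # Regression: predict latency from request volume and error count.
    features = VectorAssembler(inputCols=["requests", "errors"],
                               outputCol="features").transform(logs)
    model = LinearRegression(featuresCol="features",
                             labelCol="latency").fit(features)
    print(model.coefficients)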
What does a wide dependency mean in Spark?
One partition of the parent RDD has multiple partitions of the child RDD depending on it, so the data must be shuffled (e.g., for groupByKey, or a join of RDDs that are not co-partitioned); see the sketch below.
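A minimal PySpark sketch contrasting a narrow and a wide dependency (data and app name are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="wide-dependency-example")  # hypothetical name

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], numSlices=4)

    # mapValues is a narrow dependency: each output partition depends on
    # exactly one input partition, so no shuffle is needed.
    doubled = pairs.mapValues(lambda v: v * 2)

    # groupByKey is a wide dependency: every output partition may need records
    # from every input partition, so Spark inserts a shuffle.
    grouped = doubled.groupByKey()
    print(grouped.mapValues(list).collect())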