Exam Questions Flashcards
(45 cards)
Take a GFS computing cluster with 300 chunk servers, each with 10 TB of free disk and 10 GB of available RAM, a chunk size of 100 MB, and the standard replication factor of 3. About 10 KB of metadata needs to be stored per chunk.
You want to store a very big file on this cluster. What is the maximum file size that is possible to store on this cluster, and why?
The disks can hold 1 PB (= 1000 TB = 10^15 B) of file data (3 PB of raw disk divided by the replication factor of 3), i.e., 10^7 unique chunks.
However, 10 GB of RAM on the master only holds metadata for 10^6 unique chunks, which limits the file size to 100 TB (about 3% of the raw disk capacity).
You need 10 KB of metadata per unique chunk, not per replica (replicas share the same metadata entry, apart from e.g. location and version).
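A minimal worked calculation of the two limits (a sketch; it assumes the master has the same 10 GB of RAM as the chunk servers, as the answer above does):

    # All figures come from the exercise statement above.
    servers, disk_per_server = 300, 10e12          # 10 TB of free disk each
    chunk_size, replication = 100e6, 3             # 100 MB chunks, 3 replicas
    master_ram, meta_per_chunk = 10e9, 10e3        # 10 GB RAM, 10 KB per chunk

    disk_limit = servers * disk_per_server / replication     # 1e15 B = 1 PB
    ram_limit = (master_ram / meta_per_chunk) * chunk_size   # 1e14 B = 100 TB

    print(min(disk_limit, ram_limit))  # 1e14 -> the RAM limit wins: 100 TB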
GFS: What metadata is kept by the master node?
Three major types of metadata: the file and chunk namespaces, the mapping from files to chunks and the locations of each chunk’s replicas.
GFS: In what type of memory is metadata stored on the master node, and why is this advantageous?
In RAM; the file and chunk namespaces and the file-to-chunk mappings are also persisted in an operation log on the master's local disk and replicated on a few remote machines. Keeping the metadata in memory makes master operations fast and allows efficient periodic scans of the full state (for garbage collection, re-replication, and rebalancing).
If two or more applications want to overwrite the same data chunk (starting at the same offset) concurrently, what does GFS guarantee about the resulting chunk data?
The data will be consistent (all replicas are identical) but undefined (typically it consists of mingled fragments from multiple mutations).
GFS: A client can append to existing files. True or false, and explain why.
True: GFS supports record append, which lets many clients append to the same file concurrently, and each record is appended atomically (at least once).
GFS: Client applications need to know the chunk index in order to read or write in a file. True or false, and explain why.
True. The client translates a byte offset within the file into a chunk index (offset divided by the fixed chunk size); this is not to be confused with the chunk handle, a globally unique identifier that is retrieved from the master server.
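A minimal sketch of that translation in Python, assuming the 100 MB chunk size from the exercise above (real GFS uses 64 MB chunks):

    CHUNK_SIZE = 100 * 1000 * 1000  # 100 MB, as in the exercise above

    def chunk_index(byte_offset):
        """Index of the chunk that contains the given byte offset in the file."""
        return byte_offset // CHUNK_SIZE

    # The client sends (file name, chunk index) to the master and gets back
    # the chunk handle and the locations of its replicas.
    print(chunk_index(250 * 1000 * 1000))  # -> 2, i.e., the third chunk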
GFS: The master keeps a record of all chunk locations at all times. True or false, and explain why.
Mostly true, but the chunk locations are in fact requested from the chunk servers at startup (or when a new chunk server joins the cluster) and are kept only in RAM, not persisted by the master.
List the names of the frameworks for distributed storage of big data that you know from the course.
GFS and Bigtable, or their Apache alternatives (HDFS and HBase).
For each such framework, what aspects of the system are indeed centralized?
GFS: the master's store of the namespace, file-to-chunk mapping, and chunk metadata; the first two steps of every read/write operation (the client asks the master for chunk handles and replica locations); and chunk management (migration, leases, garbage collection).
Bigtable: the store of the root tablet's location in Chubby; the master's tablet assignment (clients read tablet data from tablet servers, not from the master); clients cache tablet locations, so there is little master communication; only global tablet management is centralized (as in GFS).
For each such framework, what (if any) measures are taken when the “master” machine fails, to achieve fault tolerance?
For GFS, shadow masters are kept, and the master's operation log and checkpoints are replicated on several machines.
For Bigtable, the critical state is stored in Chubby, which is itself replicated (5 Chubby replicas); a new master can take over by acquiring the master lock in Chubby.
Spark: Is an RDD partition big data?
No. A partition normally fits in the memory of a single executor; it can exceptionally become big if there are very many records per key. Initially, 1 partition = 1 chunk (HDFS block) read from disk.
List two ways in which the Spark map operation is different from the MapReduce map task.
Spark map works on a batch (it inputs one batch, i.e., a set of tuples, and outputs one batch);
it does not necessarily read its input from disk, nor save its output to buffers;
it takes a function f as a closure (see the sketch below).
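A minimal PySpark sketch of these points (the application name and data are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="map-closure-example")  # hypothetical app name

    threshold = 10                      # a driver-side variable
    nums = sc.parallelize(range(100))   # an in-memory batch of records

    # map takes a function as a closure: `threshold` is captured on the driver
    # and shipped to the executors together with the lambda.
    flagged = nums.map(lambda x: (x, x > threshold))

    # No disk I/O is forced: the input came from memory, and the result stays
    # distributed until an action (take/collect) pulls some of it back.
    print(flagged.take(5))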
What is a task in Spark vocabulary?
A computation on one partition, run locally inside an executor (one task per partition per stage).
For a hypothetical Spark job executed on a cluster with the options --master yarn --deploy-mode cluster, estimate the number of computing cores that this job will occupy.
It depends on the number of free cores on the cluster and on the number and size of the input files.
These options do not enforce a maximum number of executors; dynamic allocation is used.
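For contrast, a minimal PySpark sketch of pinning the resources explicitly through standard configuration keys instead of relying on dynamic allocation (the numbers and app name are illustrative):

    from pyspark import SparkConf, SparkContext

    # 10 executors x 4 cores each -> the job would occupy about 40 cores
    # (plus one core for the driver when running in cluster deploy mode).
    conf = (SparkConf()
            .setAppName("core-estimate-example")      # hypothetical app name
            .set("spark.executor.instances", "10")
            .set("spark.executor.cores", "4"))

    sc = SparkContext(conf=conf)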
Why is a Spark join operation expensive?
It imposes a partitioning on both RDDs, and will likely need to repartition (shuffle) at least one of the inputs.
If a Spark join is really needed, what can the programmer do to make this operation less expensive?
Pre-partition both RDDs with the same partitioning function (and the same number of partitions), and persist them; the join can then be computed without shuffling the inputs again (see the sketch below).
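A minimal PySpark sketch of this idea (RDD contents, partition count, and app name are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="prepartitioned-join-example")  # hypothetical name

    users = sc.parallelize([(1, "alice"), (2, "bob")])
    visits = sc.parallelize([(1, "page_a"), (2, "page_b"), (1, "page_c")])

    # Pre-partition both RDDs with the same (hash) partitioning and the same
    # number of partitions, and cache them; the join can then match keys that
    # are already co-located instead of shuffling both inputs.
    users_p = users.partitionBy(8).cache()
    visits_p = visits.partitionBy(8).cache()

    print(users_p.join(visits_p).collect())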
Storm & Spark Streaming: What is the most important difference between the basic data types processed on the cluster machines in these frameworks?
Spark Streaming processes (micro-)batches of records, whereas Storm processes individual tuples.
Storm & Spark Streaming: How is intermediate data stored in these two frameworks?
In both frameworks, intermediate data is stored locally in memory (though other configurations are possible).
Storm & Spark Streaming: How do these processing frameworks differ in terms of the processing speed (throughput and latency), and why is that?
Spark Streaming has higher latency but higher throughput, because it processes records in batches; Storm processes tuples one at a time, giving lower latency but lower throughput.
Storm & Spark Streaming: How do these processing frameworks differ in terms of the reliability guarantees they provide? Can you say that one of these frameworks has better reliability than the other?
Storm does not provide exactly-once guarantees (only at-least-once or at-most-once), whereas Spark Streaming does, so Spark Streaming offers the stronger reliability guarantee.
MapReduce vs. Spark: A company has a computing cluster with HDFS. This cluster has a total of 1 PB of free disk space, and 1 TB of free memory. The company hasn’t installed a big-data processing framework in this cluster yet. The company asks you to develop certain types of big-data processing software, and also to state which big-data processing framework should be installed on the computing cluster to best support this future software.
A distributed sort of 100 GB of input data from the company's weekly logs. The company will want to run this code overnight, and pick up the results every morning.
Speed is not critical, and the input (100 GB) fits easily in the cluster RAM (1 TB); either framework will suffice.
MapReduce vs. Spark: A company has a computing cluster with HDFS. This cluster has a total of 1 PB of free disk space, and 1 TB of free memory. The company hasn’t installed a big-data processing framework in this cluster yet. The company asks you to develop certain types of big-data processing software, and also to state which big-data processing framework should be installed on the computing cluster to best support this future software.
A word count over 10 TB of input data from their Web crawl. The company will want to run this code overnight, and pick up the results every morning.
Since the input data (10 TB) is larger than the cluster RAM (1 TB), MapReduce is preferable because it writes the processed data to disk.
Spark stores intermediate data in RAM by default but can be configured to persist to disk as well; in that case Spark would also suffice (see the sketch below).
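A minimal PySpark word-count sketch with the result persisted to disk (the HDFS paths and app name are hypothetical):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="wordcount-example")   # hypothetical app name

    lines = sc.textFile("hdfs:///data/webcrawl/*")   # hypothetical input path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Store the counts on disk instead of RAM, since the data set is much
    # larger than the cluster memory (only useful if counts is reused; shown
    # here to illustrate the disk storage level).
    counts.persist(StorageLevel.DISK_ONLY)
    counts.saveAsTextFile("hdfs:///out/wordcount")   # hypothetical output path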
MapReduce vs. Spark: A company has a computing cluster with HDFS. This cluster has a total of 1 PB of free disk space, and 1 TB of free memory. The company hasn’t installed a big-data processing framework in this cluster yet. The company asks you to develop certain types of big-data processing software, and also to state which big-data processing framework should be installed on the computing cluster to best support this future software.
Some in-depth statistics (aggregation and regression) over 10GB of the latest log data. This software will run every 2 hours, and should run as fast as possible.
Spark is preferable because speed is required and the input data (10 GB) can be loaded into the cluster RAM.
Furthermore, its machine-learning library (MLlib) makes Spark a more plug-and-play solution for statistical analysis (see the sketch below).
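A minimal sketch of such an analysis with Spark SQL and MLlib (the log path, schema, and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("log-stats-example").getOrCreate()

    # Hypothetical logs with numeric columns "status", "requests", "errors",
    # and "latency".
    logs = spark.read.json("hdfs:///data/logs/latest/*.json")

    # Aggregation: average latency per status code.
    logs.groupBy("status").avg("latency").show()

    # Regression: predict latency from request volume and error count.
    features = VectorAssembler(inputCols=["requests", "errors"],
                               outputCol="features").transform(logs)
    model = LinearRegression(featuresCol="features",
                             labelCol="latency").fit(features)
    print(model.coefficients)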
What does a wide dependency mean in Spark?
One partition of the parent RDD has multiple partitions of the child RDD depending on it, so the data must be shuffled (e.g., for groupByKey, or a join of RDDs that are not co-partitioned); see the sketch below.
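A minimal PySpark sketch contrasting a narrow and a wide dependency (data and app name are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="wide-dependency-example")  # hypothetical name

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], numSlices=4)

    # mapValues is a narrow dependency: each output partition depends on
    # exactly one input partition, so no shuffle is needed.
    doubled = pairs.mapValues(lambda v: v * 2)

    # groupByKey is a wide dependency: every output partition may need records
    # from every input partition, so Spark inserts a shuffle.
    grouped = doubled.groupByKey()
    print(grouped.mapValues(list).collect())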