3. Advanced MapReduce Programming Flashcards

1
Q

What is the purpose of using Chain Mapper in MapReduce?
A) To chain multiple Reducers in a single Reduce task
B) To apply multiple mapping operations in sequence within a single Map task
C) To join two datasets in the Map phase
D) To distribute data evenly across mappers

A

B) To apply multiple mapping operations in sequence within a single Map task
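
To illustrate the idea, here is a minimal plain-Python sketch of chaining mappers inside one map task. It is a simulation, not Hadoop's Java API: the three mapper functions and the `run_chain` helper are hypothetical names made up for this example. Each mapper consumes the key-value pairs emitted by the previous one, and no shuffle happens between them.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

KV = Tuple[str, str]
Mapper = Callable[[str, str], Iterable[KV]]


def lowercase_mapper(key: str, value: str) -> Iterator[KV]:
    # First link: normalize the text of each input line.
    yield key, value.lower()


def tokenize_mapper(key: str, value: str) -> Iterator[KV]:
    # Second link: emit one (word, "1") pair per token.
    for word in value.split():
        yield word, "1"


def drop_short_mapper(key: str, value: str) -> Iterator[KV]:
    # Third link: keep only words longer than two characters.
    if len(key) > 2:
        yield key, value


def _apply(mapper: Mapper, stream: Iterable[KV]) -> Iterator[KV]:
    # Feed every pair of the incoming stream through one mapper.
    for key, value in stream:
        yield from mapper(key, value)


def run_chain(records: Iterable[KV], mappers: List[Mapper]) -> Iterator[KV]:
    # The output of each mapper becomes the input of the next one,
    # all within the same (simulated) map task.
    stream: Iterable[KV] = iter(records)
    for mapper in mappers:
        stream = _apply(mapper, stream)
    return iter(stream)


chain = [lowercase_mapper, tokenize_mapper, drop_short_mapper]
result = list(run_chain([("0", "The cat sat on a Mat")], chain))
```

In this sketch the order of the list is the order of execution, so moving `drop_short_mapper` before `tokenize_mapper` would filter on the line key rather than on individual words.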

2
Q

What does the Distributed Cache in Hadoop provide?
A) A way to store intermediate MapReduce results
B) A mechanism to share large datasets across all nodes in the cluster
C) An efficient way to make small, read-only files available to all tasks in a job
D) A distributed file system for storing large files across multiple nodes

A

C) An efficient way to make small, read-only files available to all tasks in a job
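
The behaviour described in this answer can be approximated with a small plain-Python simulation. It does not call Hadoop: the helpers `localize`, `load_lookup`, and `run_task`, the file name `countries.tsv`, and the per-task directories are all assumptions made for illustration. For simplicity the sketch copies the file once per task, whereas a real cluster would typically localize it per node.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def localize(cache_file: Path, task_dir: Path) -> Path:
    # Simulate shipping the cached file to a task's local working directory.
    task_dir.mkdir(parents=True, exist_ok=True)
    local_copy = task_dir / cache_file.name
    shutil.copy(cache_file, local_copy)
    return local_copy


def load_lookup(path: Path) -> Dict[str, str]:
    # Parse a small tab-separated file into an in-memory, read-only table.
    lookup = {}
    for line in path.read_text().splitlines():
        code, name = line.split("\t")
        lookup[code] = name
    return lookup


def run_task(task_dir: Path, cache_file: Path, codes: List[str]) -> List[Tuple[str, str]]:
    # Every task reads its own local copy; none of them modifies the file.
    lookup = load_lookup(localize(cache_file, task_dir))
    return [(code, lookup.get(code, "UNKNOWN")) for code in codes]


def demo() -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache_file = root / "countries.tsv"
        cache_file.write_text("US\tUnited States\nFR\tFrance\n")
        first = run_task(root / "task_0", cache_file, ["US", "FR"])
        second = run_task(root / "task_1", cache_file, ["FR", "DE"])
        return first, second
```

Calling `demo()` runs two simulated tasks against the same small file; a code missing from the table (here `DE`) falls back to `"UNKNOWN"`.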

3
Q

In a Map-side join, what is a requirement for one of the datasets?
A) It must be larger than the other dataset
B) It must be stored in HDFS
C) It must be small enough to fit into memory
D) It must be sorted on the join key

A

C) It must be small enough to fit into memory

4
Q

What is a key difference between Map-side joins and Reduce-side joins in MapReduce?
A) Map-side joins can only be used with text data, while Reduce-side joins can be used with any data type
B) Map-side joins are more flexible and can handle larger datasets
C) Map-side joins perform the join in the Mapper, while Reduce-side joins perform the join in the Reducer
D) Reduce-side joins require one of the datasets to fit into memory

A

C) Map-side joins perform the join in the Mapper, while Reduce-side joins perform the join in the Reducer

5
Q

Which of the following is NOT an advantage of Map-side joins?
A) They avoid the need for shuffling and reducing
B) They are more efficient when one of the datasets is small
C) They can handle datasets of any size
D) They reduce the amount of data transferred to the Reduce stage

A

C) They can handle datasets of any size

6
Q

What is the role of the Reducer in a Reduce-side join?
A) To load one of the datasets into memory for the join
B) To shuffle and sort the data before the join
C) To perform the join operation on the data grouped by the join key
D) To distribute the joined data across the cluster

A

C) To perform the join operation on the data grouped by the join key
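
A reduce-side join can be sketched in plain Python as follows. This is a simulation under stated assumptions, not Hadoop code: the `"U"`/`"O"` source tags, the sample field layout, and the function names are hypothetical. Each mapper tags its records with their source, a simulated shuffle groups values by join key, and the reducer pairs records from the two sources for each key.

```python
from collections import defaultdict
from itertools import product
from typing import Any, Iterator, List, Tuple


def map_users(record: Tuple[str, str]) -> Iterator[Tuple[str, Tuple[str, Any]]]:
    # Tag user records so the reducer can tell the two sources apart.
    user_id, name = record
    yield user_id, ("U", name)


def map_orders(record: Tuple[str, int]) -> Iterator[Tuple[str, Tuple[str, Any]]]:
    # Tag order records with a different marker.
    user_id, amount = record
    yield user_id, ("O", amount)


def shuffle(pairs: List[Tuple[str, Tuple[str, Any]]]) -> List[Tuple[str, list]]:
    # Simulated shuffle and sort: group values by join key, sorted by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_join(key: str, values: list) -> Iterator[Tuple[str, str, int]]:
    # The join itself: combine every user row with every order row for this key.
    names = [v for tag, v in values if tag == "U"]
    amounts = [v for tag, v in values if tag == "O"]
    for name, amount in product(names, amounts):
        yield key, name, amount


def run_job(users: list, orders: list) -> List[Tuple[str, str, int]]:
    pairs = []
    for record in users:
        pairs.extend(map_users(record))
    for record in orders:
        pairs.extend(map_orders(record))
    joined = []
    for key, values in shuffle(pairs):
        joined.extend(reduce_join(key, values))
    return joined


users = [("u1", "Ana"), ("u2", "Ben")]
orders = [("u1", 30), ("u3", 5), ("u2", 12), ("u1", 7)]
joined = run_job(users, orders)
```

Because the reducer only emits a row when both sides are present, this sketch behaves as an inner join: the order for `u3` has no matching user and is dropped.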

7
Q

Which of the following is a use case for the Distributed Cache in Hadoop?
A) Storing temporary data during MapReduce execution
B) Distributing large input files to mappers
C) Sharing a small lookup table with all mappers and reducers
D) Caching intermediate results between MapReduce jobs

A

C) Sharing a small lookup table with all mappers and reducers

8
Q

What is the main advantage of using Chain Mapper in a MapReduce job?
A) It reduces the amount of data transferred over the network
B) It allows for parallel execution of multiple mappers
C) It enables sequential execution of multiple mapping operations within a single map task
D) It automatically balances the load between mappers and reducers

A

C) It enables sequential execution of multiple mapping operations within a single map task

9
Q

In a Map-side join, the dataset that fits into memory is typically loaded during which phase of the MapReduce job?
A) Map phase
B) Reduce phase
C) Setup phase of the Mapper
D) Cleanup phase of the Reducer

A

C) Setup phase of the Mapper
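
The setup-then-map pattern can be sketched in plain Python. This is a simulation, not Hadoop's Mapper class: `JoinMapper`, `run_map_task`, and the sample records are hypothetical. The small dataset is loaded into a dictionary once in `setup`, and each call to `map` then joins one record of the large dataset with a dictionary lookup, so no reduce phase is needed.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class JoinMapper:
    def __init__(self, small_dataset: Iterable[Tuple[str, str]]) -> None:
        # Stand-in for the small file that would be made available to the task.
        self._small_dataset = small_dataset
        self.users: Dict[str, str] = {}

    def setup(self) -> None:
        # Runs once per task, before any record is mapped:
        # load the small dataset into memory, keyed by the join key.
        self.users = {user_id: name for user_id, name in self._small_dataset}

    def map(self, record: Tuple[str, int]) -> Iterator[Tuple[str, Tuple[str, int]]]:
        # Runs once per record of the large dataset: join by dictionary lookup.
        user_id, amount = record
        name = self.users.get(user_id)
        if name is not None:  # inner join: skip records with no match
            yield user_id, (name, amount)


def run_map_task(mapper: JoinMapper, split: Iterable[Tuple[str, int]]) -> List[Tuple[str, Tuple[str, int]]]:
    # Simulated task lifecycle: setup once, then map every record in the split.
    mapper.setup()
    output = []
    for record in split:
        output.extend(mapper.map(record))
    return output


small = [("u1", "Ana"), ("u2", "Ben")]
large_split = [("u1", 30), ("u3", 5), ("u2", 12), ("u1", 7)]
joined = run_map_task(JoinMapper(small), large_split)
```

Loading the table in `setup` rather than in `map` matters because `map` runs once per input record, so rebuilding the dictionary there would repeat the same work for every record.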

10
Q

Which of the following statements is true about Reduce-side joins?
A) They are always faster than Map-side joins
B) They require both datasets to fit into memory
C) They are suitable for joining large datasets
D) They perform the join operation in the Mapper

A

C) They are suitable for joining large datasets

11
Q

When using Chain Mapper, the output key-value pairs of one mapper are passed as input to the next mapper in the chain.
A) True
B) False

A

A) True

12
Q

The Distributed Cache in Hadoop is used to:
A) Cache results from previous MapReduce jobs
B) Store intermediate data between map and reduce tasks
C) Distribute small, read-only files to all nodes in the cluster
D) Replicate input data across multiple nodes for fault tolerance

A

C) Distribute small, read-only files to all nodes in the cluster

13
Q

Which of the following is NOT a characteristic of Map-side joins?
A) Requires one dataset to be small enough to fit into memory
B) Involves shuffling and sorting data based on the join key
C) Can be more efficient than Reduce-side joins for certain datasets
D) Is performed entirely within the Map phase

A

B) Involves shuffling and sorting data based on the join key

14
Q

In a Reduce-side join, the join operation is performed:
A) Before the map phase
B) During the map phase
C) During the shuffle and sort phase
D) During the reduce phase

A

D) During the reduce phase

15
Q

In a Map-side join, the smaller dataset is:
A) Discarded
B) Loaded into memory
C) Stored in HDFS
D) Processed by reducers

A

B) Loaded into memory

16
Q

Which of the following best describes the Chain Mapper in Hadoop?
A) A sequence of reducers linked together
B) A series of mappers executed in parallel
C) A series of mappers executed sequentially within a single map task
D) A mechanism to chain map and reduce tasks in a single job

A

C) A series of mappers executed sequentially within a single map task

17
Q

The Distributed Cache in Hadoop is used to:
A) Store intermediate results of a MapReduce job
B) Cache frequently accessed data in memory
C) Distribute small, read-only files to all nodes in a cluster
D) Improve the performance of the NameNode

A

C) Distribute small, read-only files to all nodes in a cluster

18
Q

Which of the following statements is true about Reduce-side joins?
A) They are performed entirely within the Map phase
B) They require both datasets to be small enough to fit into memory
C) They involve shuffling and sorting data based on the join key
D) They are more efficient than Map-side joins for small datasets

A

C) They involve shuffling and sorting data based on the join key