Apache Hadoop Flashcards

1
Q

What is Hadoop?

A

an open-source software framework for distributed storage and processing of large data sets, using a clustered network of machines

2
Q

What are the key components of Apache Hadoop? (3)

A
  1. HDFS
  2. YARN
  3. MapReduce
3
Q

What is a node?

A

a physical or virtual machine that is part of a Hadoop cluster

4
Q

What is a daemon?

A

a background process

5
Q

State the daemons related to YARN (computing) (3)

A
  1. NodeManager daemon
  2. ResourceManager daemon
  3. JobHistoryServer daemon
6
Q

State the daemons related to HDFS (storage) (3)

A
  1. NameNode daemon
  2. DataNode daemon
  3. SecondaryNameNode daemon
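
A quick way to check which daemons are running on a node is the JDK's jps tool; illustrative output from a single-node cluster (process IDs will differ):

  $ jps
  2881 NameNode
  3014 DataNode
  3190 SecondaryNameNode
  3355 ResourceManager
  3478 NodeManager
  3550 JobHistoryServer
  3642 Jps
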
7
Q

Describe the characteristics of the leader in leader-follower architecture (4)

A
  1. Aware of the follower nodes
  2. Receives external requests
  3. Decides which nodes execute what and when
  4. Communicates with follower nodes
8
Q

Describe the characteristics of the follower in leader-follower architecture (2)

A
  1. Acts as a worker node
  2. Executes tasks that leader tells it to
9
Q

Which two nodes operate in a leader-follower architecture?

A

Leader: NameNode
Follower(s): DataNode

10
Q

What is HDFS?

A

Shared distributed storage among the nodes of the Hadoop cluster, tailored to MapReduce jobs
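
HDFS is accessed through a filesystem-like shell; a minimal sketch of common commands (the paths are hypothetical):

  $ hdfs dfs -mkdir -p /user/alice/input        # create a directory in HDFS
  $ hdfs dfs -put data.txt /user/alice/input    # copy a local file into HDFS
  $ hdfs dfs -ls /user/alice/input              # list its contents
  $ hdfs dfs -cat /user/alice/input/data.txt    # read the file back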

11
Q

Where do daemons run?

A

On nodes

12
Q

What is the HDFS responsible for storing?

A

Input and output of MapReduce jobs

13
Q

How is data stored within the HDFS?

A

In blocks
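
You can see how a stored file breaks down into blocks with fsck (the path is hypothetical):

  $ hdfs fsck /user/alice/input/data.txt -files -blocks -locations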

14
Q

What is the default block size?

A

128MB

15
Q

How is the minimum parallelisation unit determined?

A

by the HDFS block size, e.g., each mapper works on a single block
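
For example, a 1GB (1024MB) input file stored in 128MB blocks splits into 1024 / 128 = 8 blocks, so up to 8 mappers can process it in parallel.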

16
Q

Why is 128MB the ideal block size?

A

it balances parallelisation opportunity (favours smaller blocks) with data processing throughput (favours larger blocks)

17
Q

How does a file that is smaller than the block size occupy the block?

A

It occupies only as much disk space as the actual size of the file, not the entire 128MB

18
Q

What is the purpose of the NameNode?

A

to manage the filesystem namespace: the filesystem tree and the metadata for all files and directories in the tree

19
Q

What is the purpose of the DataNodes?

A

to store and retrieve blocks when instructed, and to cache blocks which are frequently accessed

20
Q

Which node does the DataNode report to?

A

the NameNode

21
Q

Where is the data for the filesystem tree and the related metadata stored?

A

persistently on the local disk in the form of two files: the namespace image and the edit log
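
On disk these live under the directory configured by dfs.namenode.name.dir; an illustrative listing (the directory path and transaction IDs are hypothetical):

  $ ls /data/dfs/name/current/
  fsimage_0000000000000000042
  edits_0000000000000000001-0000000000000000042
  edits_inprogress_0000000000000000043
  VERSION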

22
Q

What does the NameNode know about the files in the HDFS?

A

which DataNodes hold the blocks for a given file and where those blocks are located (this mapping is not stored persistently; it is rebuilt from DataNode block reports)

23
Q

How many DataNodes are there per cluster?

A

at least one

24
Q

How many NameNodes are there per cluster?

A

only one

25
Q

What is the purpose of the HDFS SecondaryNameNode?

A

to keep a checkpoint copy of the NameNode's index table (it communicates periodically with the NameNode, merging the edit log into the namespace image)

26
Q

What information does the NameNode keep relating to the blocks?

A

an index table with all the locations of each block

27
Q

What would happen if the machine running the NameNode was obliterated?

A

all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes

28
Q

How many SecondaryNameNodes are there per cluster?

A

only one

29
Q

What is meant by the “move computation to data” principle with HDFS?

A

blocks are stored on particular machines, and map tasks are scheduled to run locally on the machines that hold their input blocks, so data does not have to be shipped across the network to the computation

30
Q

Which feature of HDFS achieves the “move computation to data” principle?

A

Block replication

31
Q

Why are blocks replicated over the cluster?

A

for fault-tolerance purposes, spreading replicas among different physical locations to improve reliability

32
Q

What is the default number of replicas for each block?

A

3

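Both the replication factor and the block size can be tuned per cluster in hdfs-site.xml; a minimal sketch showing the default values:

  <!-- hdfs-site.xml -->
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>             <!-- replicas kept of each block -->
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>     <!-- 128MB in bytes -->
    </property>
  </configuration>
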
33
Q

What is YARN?

A

Hadoop’s cluster resource management system

34
Q

What is the relationship between a job and a task?

A

a job usually consists of multiple tasks

35
Q

What are the Hadoop computation tasks? (3)

A
  1. Resource management
  2. Job allocation
  3. Job execution/monitoring

36
Q

How is the number of map and reduce tasks estimated? (2)

A

Based on:
  1. the input dataset
  2. the job definition (defined by the user)

37
Q

How can you calculate the number of mappers needed?

A

input size / split size (the split size defaults to the HDFS block size)

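For example, a 10GB (10240MB) input with the default 128MB split size yields 10240 / 128 = 80 map tasks.
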
38
Q

What are the different schedulers available in YARN? (3)

A
  1. FIFO Scheduler
  2. Capacity Scheduler
  3. Fair Scheduler

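The scheduler is chosen in yarn-site.xml via the yarn.resourcemanager.scheduler.class property; a minimal sketch selecting the Fair Scheduler:

  <!-- yarn-site.xml -->
  <configuration>
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
  </configuration>
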
39
Q

Why is Hadoop not efficient with I/O? (2)

A
  1. data must be loaded from and written to HDFS
  2. shuffle and sort incur long latency and produce large network traffic