Apache Hadoop Flashcards
What is Hadoop?
open-source software framework for distributed storage and processing of large data sets, using a clustered network of machines
What are the key components of Apache Hadoop? (3)
- HDFS
- YARN
- MapReduce
What is a node?
a physical or virtual machine that is part of a Hadoop cluster
What is a daemon?
a background process
State the daemons related to YARN (computing) (3)
- NodeManager daemon
- ResourceManager daemon
- JobHistoryServer daemon (strictly a MapReduce daemon, usually run alongside YARN)
State the daemons related to HDFS (storage) (3)
- NameNode daemon
- DataNode daemon
- SecondaryNameNode daemon
Describe the characteristics of the leader in leader-follower architecture (4)
- Aware of the follower nodes
- Receives external requests
- Decides which nodes execute what and when
- Communicates with follower nodes
Describe the characteristics of the follower in leader-follower architecture (2)
- Acts as a worker node
- Executes tasks that leader tells it to
Which two nodes operate in a leader-follower architecture?
Leader: NameNode
Follower(s): DataNode
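A minimal Python sketch of the leader-follower cards above. The Leader and Follower classes and the round-robin assignment are illustrative assumptions, not Hadoop APIs and not how the NameNode actually chooses DataNodes.

```python
from itertools import cycle


class Follower:
    """Worker node: only executes what the leader hands it."""

    def __init__(self, name):
        self.name = name
        self.completed = []

    def execute(self, task):
        self.completed.append(task)
        return f"{self.name} ran {task}"


class Leader:
    """Knows its followers, receives requests, decides who runs what."""

    def __init__(self, followers):
        self.followers = list(followers)
        # Round-robin assignment is an assumption made for this sketch.
        self._next_follower = cycle(self.followers)

    def submit(self, task):
        # External requests arrive at the leader, never at a follower.
        follower = next(self._next_follower)
        return follower.execute(task)


followers = [Follower("worker-1"), Follower("worker-2")]
leader = Leader(followers)
results = [leader.submit(task) for task in ["task-a", "task-b", "task-c"]]
print(results)
```

The followers never talk to the client directly here; all coordination goes through the leader, which mirrors the four leader characteristics listed above.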
What is HDFS?
Distributed storage shared among the nodes of the Hadoop cluster, tailored to MapReduce jobs
Where do daemons run?
On nodes
What is the HDFS responsible for storing?
Input and output of MapReduce jobs
How is data stored within the HDFS?
In blocks
What is the default block size?
128MB
How is the minimum parallelisation unit determined?
by the HDFS block size: by default each mapper works on one block, so a file cannot be split into more parallel map tasks than it has blocks
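A rough Python sketch of the arithmetic behind this card, assuming the default 128MB block size and one map task per block. The function name num_blocks is hypothetical, and the real number of mappers depends on how the job's input format computes splits.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB  # default HDFS block size from the card above


def num_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file is cut into (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


one_gib = 1024 * MIB
print(num_blocks(one_gib))    # 8 blocks, so up to 8 mappers in parallel
print(num_blocks(300 * MIB))  # 3 blocks: two full ones and a partial one
```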
Why is 128MB a good default block size?
it balances parallelisation opportunity (which favours smaller blocks) against data processing throughput (which favours larger blocks)
How does a file that is smaller than block size occupy the block?
It occupies only as much disk space as the file itself, not the full 128MB
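As an illustration, the following Python sketch lists the length of each block of a file, assuming a 128MB block size and ignoring replication. The helper block_lengths is hypothetical and not part of any Hadoop API.

```python
MIB = 1024 * 1024


def block_lengths(file_size_bytes, block_size=128 * MIB):
    """Lengths in bytes of the blocks that make up a file."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The last block is only as large as the data left over.
        lengths.append(remainder)
    return lengths


small = block_lengths(1 * MIB)
large = block_lengths(300 * MIB)
print([n // MIB for n in small])  # [1]
print([n // MIB for n in large])  # [128, 128, 44]
```

In both cases the block lengths sum to the file size, so a 1MB file uses 1MB of disk per replica rather than 128MB.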
What is the purpose of the NameNode?
to manage the filesystem namespace: it maintains the filesystem tree and the metadata for all files and directories in the tree
What is the purpose of the DataNodes?
to store and retrieve blocks when instructed, and to cache blocks which are frequently accessed
Which node does the DataNode report to?
the NameNode
Where is the data for the filesystem tree and the related metadata stored?
persistently on the NameNode's local disk in the form of two files: the namespace image and the edit log
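One way to picture how the two files fit together is a snapshot plus a list of changes made since the snapshot. The Python sketch below is a toy model under that assumption: the ("create", path) and ("delete", path) tuples and the replay function are hypothetical and do not reflect the real on-disk formats.

```python
def replay(namespace_image, edit_log):
    """Rebuild the current namespace from a snapshot plus logged edits."""
    namespace = set(namespace_image)  # start from the saved snapshot
    for operation, path in edit_log:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace


image = {"/", "/data", "/data/a.txt"}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
current = replay(image, edits)
print(sorted(current))  # ['/', '/data', '/data/b.txt']
```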
What does the NameNode know about the files in the HDFS?
Which DataNodes hold the blocks of a given file. These block locations are not stored persistently; they are rebuilt from the DataNodes' reports when the system starts
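A small Python sketch of how an in-memory block location map could be rebuilt from DataNode reports. The DataNode names, block IDs, file path, and the build_block_map and locate helpers are all made up for illustration; the real NameNode data structures are more involved.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Map each block ID to the set of DataNodes that reported holding it."""
    locations = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            locations[block_id].add(datanode)
    return dict(locations)


def locate(path, file_blocks, block_map):
    """For each block of a file, list the DataNodes that hold a copy."""
    return [sorted(block_map[block_id]) for block_id in file_blocks[path]]


# Hypothetical reports sent by three DataNodes after startup.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_2", "blk_3"],
    "dn3": ["blk_1", "blk_3"],
}
# The file-to-block mapping is namespace metadata held by the NameNode.
file_blocks = {"/data/a.txt": ["blk_1", "blk_2"]}

block_map = build_block_map(reports)
print(locate("/data/a.txt", file_blocks, block_map))
# [['dn1', 'dn3'], ['dn1', 'dn2']]
```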
How many DataNodes are there per cluster?
at least one
How many NameNodes are there per cluster?
only one active NameNode (in a basic setup without high availability or federation)