Batch - Yarn and MapReduce Flashcards
(31 cards)
YARN (Yet Another Resource Negotiator)
Resource management system designed to handle distributed computing
YARN APIs
Request and work with cluster resources (not made by user code, but by framework itself!)
Fundamental idea of YARN
Split functionalities of resource management and job scheduling.
What makes Yarn scheduler a “pure scheduler”?
Doesn’t monitor application/job status
Doesn’t restart application/job on failure
Application
Single job or DAG of jobs
Applications Manager job
Accept job submissions, negotiate container for executing AMP, provide service for restarting if AMP fails
FIFO scheduler
No configuration necessary, bad for clusters
Capacity Scheduler
Fixed amount of capacity to each job
Fair Scheduler
Balances available resources between running jobs
Resource Manager (RM) (def) (2)
Ultimate authority allocating containers,
1. Accept job submissions from
client
2. Set up ApplicationsMaster (w/ initial container)
Node Manager (NM)
A per-machine agent monitoring resource usage of containers and reports it to RM
ApplicationsMaster (2)
Manage job lifecycle and request containers from RM
Upon request from client, RM finds a NM to launch ______ in a ___________.
Application Master Process; container
Container
Slice of computing resources, reports job status to AMP
Data Locality (YARN)
Ensuring tasks are run as close to the data as possible
4 Levels of Data Locality
- Node-level
- Rack-level
- Data Center-level
- Inter-data center
MapReduce
A programming model that allows developers to write programs that can process large amounts of data in parallel across a cluster
Map Phase
Dataset is partitioned into smaller chunks (input splits) and processed in parallel, turning data into key-value pairs
Sort & Shuffle Phase
Data is sorted by key and shuffled (moved) in groups to reducers.
Reduce Phase
Intermediate data is aggregated by reduce function
Combiner
In some cases, a mini-reduce function used as an optimizer between Map and Reduce phases so less data is transferred.
Combiner function must be _______________ and ______________
Associative (a + b) + c = a + (b + c) - grouping
Commutative a + b = b + c - order
Map or Reduce task failure
Rescheduled on another node
Map or Reduce node failure
All tasks on node rescheduled on another node
worst case: restart entire job