Batch - YARN and MapReduce Flashcards

(31 cards)

1
Q

YARN (Yet Another Resource Negotiator)

A

Resource management system designed to handle distributed computing

2
Q

YARN APIs

A

Request and work with cluster resources (calls are made by the framework itself, not by user code!)

3
Q

Fundamental idea of YARN

A

Split the functionalities of resource management and job scheduling/monitoring into separate daemons.

4
Q

What makes the YARN scheduler a “pure scheduler”?

A

Doesn’t monitor application/job status
Doesn’t restart application/job on failure

5
Q

Application

A

Single job or DAG of jobs

6
Q

Applications Manager job

A

Accepts job submissions, negotiates the first container for executing the ApplicationMaster Process (AMP), and provides the service for restarting the AMP on failure

7
Q

FIFO scheduler

A

No configuration necessary; jobs run in submission order, so a long job blocks everything behind it. A poor fit for shared clusters.

8
Q

Capacity Scheduler

A

A fixed share of cluster capacity is reserved for each queue of jobs, so every organization is guaranteed its minimum capacity

9
Q

Fair Scheduler

A

Balances available resources between running jobs

10
Q

Resource Manager (RM) (def) (2)

A

Ultimate authority for allocating containers:
1. Accepts job submissions from the client
2. Sets up the ApplicationMaster (with an initial container)

11
Q

Node Manager (NM)

A

A per-machine agent that monitors the resource usage of containers and reports it to the RM

12
Q

ApplicationMaster (2)

A

Manage job lifecycle and request containers from RM

13
Q

Upon request from a client, the RM finds an NM to launch ______ in a ___________.

A

Application Master Process; container

14
Q

Container

A

A slice of computing resources on a node; reports job status to the AMP

15
Q

Data Locality (YARN)

A

Ensuring tasks are run as close to the data as possible

16
Q

4 Levels of Data Locality

A
  1. Node-level
  2. Rack-level
  3. Data Center-level
  4. Inter-data center
17
Q

MapReduce

A

A programming model that allows developers to write programs that can process large amounts of data in parallel across a cluster

18
Q

Map Phase

A

Dataset is partitioned into smaller chunks (input splits) and processed in parallel, turning data into key-value pairs

19
Q

Sort & Shuffle Phase

A

Data is sorted by key and shuffled (moved) in groups to reducers.

20
Q

Reduce Phase

A

Intermediate data is aggregated by the reduce function

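The three phases above can be sketched with word count, the canonical MapReduce example. This is a single-process illustration, not Hadoop code; the function names are ours:

```python
from collections import defaultdict

def map_phase(split):
    """Map: turn each record into (key, value) pairs."""
    for line in split:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Sort & shuffle: sort by key and group all values per key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

split = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(split)))
# counts == {"brown": 1, "dog": 1, "fox": 1, "lazy": 1, "quick": 1, "the": 2}
```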
21
Q

Combiner

A

In some cases, a mini-reduce function used as an optimizer between Map and Reduce phases so less data is transferred.
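A sketch of the idea, assuming the word-count setting: running a mini-reduce on each mapper's local output shrinks what crosses the network in the shuffle.

```python
from collections import Counter

# Raw mapper output for one split: one (word, 1) pair per occurrence.
mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]

# Combiner: a local mini-reduce over this mapper's own pairs.
combined = list(Counter(word for word, _ in mapper_output).items())

# 4 pairs shrink to 2 before the shuffle: [("the", 3), ("fox", 1)]
```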

22
Q

Combiner function must be _______________ and ______________

A

Associative: (a + b) + c = a + (b + c) (grouping doesn't matter)

Commutative: a + b = b + a (order doesn't matter)
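A sketch of why these properties matter: a sum is safe to pre-aggregate per mapper, while a mean is not (illustrative Python, not Hadoop API):

```python
# Values for one key, split across two mappers.
mapper1 = [1, 2, 3]
mapper2 = [4, 5]

# Sum is associative and commutative, so combining locally
# before the shuffle gives the same result as one global sum.
combined = sum([sum(mapper1), sum(mapper2)])   # 15
direct = sum(mapper1 + mapper2)                # 15
assert combined == direct

# Mean is NOT safe to combine this way: a mean of per-mapper
# means weights the mappers, not the individual values.
def mean(xs):
    return sum(xs) / len(xs)

naive = mean([mean(mapper1), mean(mapper2)])   # 3.25
true_mean = mean(mapper1 + mapper2)            # 3.0
assert naive != true_mean
```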

23
Q

Map or Reduce task failure

A

Rescheduled on another node

24
Q

Map or Reduce node failure

A

All tasks on the node are rescheduled on another node;
worst case: restart the entire job

25
Q

5 Steps of MapReduce process

A

1. Input divided into fixed-size splits
2. User-defined map function runs for each record in a split
3. Key-value pairs sorted by key and stored on disk
4. Sent to reducers, which combine the values for a given key
5. Results written onto the DFS

26
Q

One global reduce task solution

A

Originally, Hadoop had one reduce task for an entire job regardless of data size

27
Q

One reduce task per CPU solution

A

Number of reduce tasks based on the number of CPU cores, but can cause an imbalance because some keys have more data than others

28
Q

Many reduce tasks per CPU solution

A

More reduce tasks than CPU cores, each task handling fewer keys for a more balanced workload, at the cost of more overhead

29
Q

Rule of thumb for picking number of Reduce Tasks

A

Each reduce task should run for about five minutes and produce at least one HDFS block

30
Q

Partition function

A

Optional MapReduce function that determines how intermediate data is distributed to reduce tasks: identical keys go to the same reducer, while aiming for a balanced workload.

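A minimal sketch of the idea. Hadoop's default HashPartitioner hashes the key modulo the number of reduce tasks; the Python below, using CRC32 as a stand-in deterministic hash, is only illustrative:

```python
import zlib

def default_partition(key, num_reducers):
    # Deterministic hash of the key, modulo the reducer count,
    # so the same key always lands on the same reducer.
    return zlib.crc32(key.encode()) % num_reducers

# Identical keys map to the same reducer...
assert default_partition("apple", 4) == default_partition("apple", 4)

# ...and different keys spread across the available reducers.
assignments = {w: default_partition(w, 4) for w in ["the", "fox", "lazy", "dog"]}
```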
31
Q

How does a job get submitted in MapReduce?

A

The client launches the job in a Java Virtual Machine (JVM), which then contacts the YARN RM