Batch System Flashcards

1
Q

What is YARN, and what are its main daemons?

A

Yet Another Resource Negotiator: it is Apache Hadoop's cluster resource management system.

It has two main daemons:

1) The ResourceManager (RM): the ultimate global authority that controls, manages, and allocates resources among all applications.

2) The NodeManager (NM): a per-node worker daemon. It is responsible for containers, monitors the node's resources, and sends regular reports to the RM.
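Since the RM is the one component that hears from every NM, you can see the whole cluster through it. A minimal sketch using the ResourceManager's REST API (the host is a placeholder; 8088 is the common default port for the RM web endpoint):

```python
import requests

# Placeholder host; point this at your cluster's ResourceManager.
RM_URL = "http://resourcemanager.example.com:8088"

# The RM, as the global authority, tracks every NodeManager in the cluster.
resp = requests.get(f"{RM_URL}/ws/v1/cluster/nodes")
resp.raise_for_status()

for node in resp.json()["nodes"]["node"]:
    # Each entry is one NM reporting its state and resources to the RM.
    print(node.get("id"), node.get("state"))
```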

2
Q

What is data locality?

A

Instead of moving data to the code, the computation (the code) is moved to the node where the data resides. This saves network transfer time and effort.

3
Q

What is YARN's working process?

A

1) The client submits the application (sends the request).
2) The RM finds a suitable NM and allocates a container.
3) The ApplicationMaster (AM, the project manager) launches in that container and registers with the RM.
4) The AM requests further containers and executes the application's tasks in them.
5) The AM deregisters from the RM when it is done and shuts down, allowing its containers to be repurposed.
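A minimal sketch mapping these steps onto a Spark-on-YARN run (assumes a configured Hadoop/YARN client environment; the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Step 1: the client submits the application; master="yarn" sends the
# request to the ResourceManager.
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-flow-demo")
         .getOrCreate())

# Steps 2-4 happen inside YARN: the RM allocates a container on a suitable
# NM, the ApplicationMaster launches there and registers with the RM, and
# further containers execute the actual tasks, e.g. this sum.
print(spark.sparkContext.parallelize(range(100)).sum())

# Step 5: stopping the application makes the AM deregister and shut down,
# freeing its containers for reuse.
spark.stop()
```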

4
Q

What is the scheduler, and what are the different types?

A

The scheduler is the ResourceManager component that decides how cluster resources are allocated among applications. YARN offers three types:

1) FIFO (first in, first out): applications are served strictly in submission order.
2) Capacity: the cluster is divided into queues with guaranteed capacities, so each queue keeps its reserved share.
3) Fair: resources are rebalanced dynamically so that running applications receive, on average, an equal share over time.
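With the Capacity or Fair scheduler, an application chooses its queue at submission time. A minimal sketch, assuming Spark on YARN and a hypothetical queue named analytics (`spark.yarn.queue` is the standard Spark setting for this):

```python
from pyspark.sql import SparkSession

# "analytics" is a hypothetical queue; real queues are defined by the
# cluster admin in the scheduler configuration.
spark = (SparkSession.builder
         .master("yarn")
         .appName("queue-demo")
         .config("spark.yarn.queue", "analytics")
         .getOrCreate())
```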

5
Q

What is Hadoop MapReduce, and how does it cope with node failures?

A

Hadoop MapReduce is an open-source implementation of the MapReduce programming model.

1) Failure of a worker (a task) is detected and managed by the AM, which reassigns the failed task to another worker.

2) Failure of the AM itself is detected and managed by the RM, which restarts it; the new AM uses the job history to recover the state of the tasks.
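This recovery works because map and reduce tasks are deterministic and read their input from HDFS, so a task lost to a failure can simply be rerun elsewhere. As a concrete sketch of the model itself, here is a word count written for Hadoop Streaming (the file name and invocation are illustrative):

```python
#!/usr/bin/env python3
# wordcount.py: run as mapper with "python3 wordcount.py map" and as
# reducer with "python3 wordcount.py reduce" via the Hadoop Streaming jar.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts the mapper output by key, so all counts for one word
    # arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```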

6
Q

What are the limitations of MapReduce (why did we move to Spark)?

A

Performance
1. **In-Memory Computation:** Spark processes data in memory rather than relying on disk storage for intermediate results, making it faster than MapReduce.
2. **Optimized Execution Engine:** Spark uses a Directed Acyclic Graph (DAG) execution engine that optimizes task execution, eliminating the synchronization barriers present in MapReduce.
3. **Iterative Processing:** Spark excels at iterative algorithms, such as those used in machine learning, because intermediate data is cached in memory, avoiding the repeated disk I/O that MapReduce requires.

Usability
1. **Simpler Programming Model:** Spark provides high-level APIs in multiple languages (Python, Scala, Java), making it easier to write and manage code than the verbose, complex Java-based programming MapReduce requires.
2. **Interactive Mode:** Spark supports interactive data processing, allowing developers to test and debug code quickly, which is absent in MapReduce.

Versatility
1. **Real-Time Processing:** Spark supports both batch and real-time data processing through components like Spark Streaming, whereas MapReduce is limited to batch processing.
2. **Unified Framework:** Spark can handle diverse workloads (batch processing, streaming, machine learning, and graph analytics) on a single platform. In contrast, MapReduce often requires integration with other systems for such tasks.
3. **Scalability:** Spark scales efficiently from small datasets to large clusters while maintaining high performance.

Limitations of MapReduce
1. **High Latency:** MapReduce has significant latency due to its reliance on disk storage for intermediate results and its sequential task execution model.
2. **Complexity:** Writing MapReduce programs involves more manual coding and offers less flexibility than Spark's intuitive APIs.
3. **Inefficiency for Small Jobs:** The overhead of setting up and managing a Hadoop cluster makes MapReduce less suitable for smaller datasets or quick tasks.
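The iterative-processing point is the easiest to see in code. A minimal sketch, assuming PySpark: the working set is cached in memory once, so each iteration skips the disk round-trip that a chain of MapReduce jobs would pay:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
nums = spark.sparkContext.parallelize(range(1, 1_000_001)).cache()

total = 0
for _ in range(10):
    # Each pass reuses the cached partitions instead of re-reading from
    # disk; a MapReduce chain would write to and read from HDFS per step.
    total += nums.sum()
print(total)
spark.stop()
```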

7
Q

What is Spark, and what are its key characteristics?

A

A framework for processing huge amounts of data.

1) RDD (Resilient Distributed Dataset):
data items are automatically distributed across the cluster, and data is automatically rebuilt on failure. RDDs are immutable, lazily evaluated (work happens only when an action requests it), can be cached temporarily in memory, and have a known element type.

All operations are defined on RDDs; there is no longer a strict separation between map and reduce.

2) A rich set of APIs is provided; based on the type of operation, the system automatically defines the tasks.

3) Better use of RAM: RDDs can be cached temporarily, and results are saved to disk only when the tasks are finished.
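A minimal RDD sketch in PySpark showing these properties: transformations are lazy, caching is explicit, and lost partitions would be rebuilt from the recorded lineage rather than from replicas:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

rdd = sc.parallelize(["a", "b", "a", "c"])      # data distributed over partitions
pairs = rdd.map(lambda w: (w, 1))               # lazy transformation: no work yet
counts = pairs.reduceByKey(lambda x, y: x + y)  # still lazy
counts.cache()                                  # mark for temporary in-memory caching

# The action finally triggers execution; on failure, partitions are
# recomputed from the lineage of transformations above.
print(counts.collect())
sc.stop()
```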

8
Q

What is a DAG?

A

Directed acyclic graph: Spark's logical execution plan takes the form of a DAG. When you write transformations, nothing runs immediately; Spark just builds the DAG, which describes how to do the work step by step.

Only when an action is triggered does Spark use the DAG to execute the work efficiently, optimizing performance.
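A small sketch of this laziness in PySpark: the recorded plan (the lineage that backs the DAG) can be inspected before any job runs:

```python
from pyspark import SparkContext

sc = SparkContext(appName="dag-demo")

rdd = (sc.parallelize(range(100))
         .filter(lambda x: x % 2 == 0)  # extends the plan, executes nothing
         .map(lambda x: x * x))         # still just building the plan

print(rdd.toDebugString())  # show the recorded lineage behind the DAG
print(rdd.count())          # the action: only now does Spark run the plan
sc.stop()
```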

9
Q

What is the architecture of Spark, and how does it work?

A

It is a master-worker architecture with a central coordinator (the driver) and many distributed workers (the executors).

Each Spark application has exactly one driver.

When the driver receives the user's request (the application), it creates the SparkContext, builds the DAG, and converts it into tasks.

Once the DAG is optimized, the driver converts the logical plan (the DAG) into a physical plan, splitting the work based on the data's partitioning and the operations. Each stage corresponds to a set of tasks that can be executed in parallel.

The driver talks to the cluster manager (the RM in YARN) to request resources.

The cluster manager allocates appropriate executor processes on various worker nodes.

The driver schedules and distributes the tasks to the allocated executors based on resource availability and data locality. Each task is a small unit of work that processes a slice of the overall dataset; this is where Spark's ability to process data in parallel significantly speeds up the computation.

Executors receive their tasks and begin processing, performing the computation on their respective partitions of the data.

As tasks complete, executors send their output back to the driver. The driver aggregates these intermediate results, performs any final processing (such as reducing or sorting), and produces the final result of the application.

After all tasks are executed and the results are aggregated, the driver concludes the job. The final output is either returned to the client that submitted the program or saved to a specified storage system. The entire process leverages parallel processing, fault tolerance, and optimized data flow to handle large-scale data processing efficiently.
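A minimal sketch of the driver/executor split in PySpark: the driver (this process) owns the SparkContext and schedules one task per partition on the executors, then aggregates their partial results:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arch-demo").getOrCreate()
sc = spark.sparkContext

# 8 partitions -> 8 tasks that can run in parallel across the executors.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Executors compute partial sums on their own partitions; the driver
# combines the partial results into the final answer.
print(rdd.sum())
spark.stop()
```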

10
Q

What is Apache Hive?

A

Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It is like a librarian who helps organize the library (Hadoop).
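A minimal sketch of querying a Hive-managed table through Spark SQL (the table name library.books is hypothetical; assumes a Hive metastore is configured for this Spark installation):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects Spark SQL to the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hive maps SQL onto files in distributed storage; the query reads like
# ordinary SQL even though the data lives in HDFS.
spark.sql(
    "SELECT title, COUNT(*) AS copies FROM library.books GROUP BY title"
).show()
spark.stop()
```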
