Big Data Refresher Flashcards
What is Spark?
Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers.
What is Hadoop?
Hadoop is an open-source framework that utilizes a network of clustered computers to store and process large datasets.
What is Hive?
Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL, on top of Apache Hadoop.
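A minimal PySpark sketch of querying Hive-managed tables with SQL, assuming Spark was built with Hive support and a Hive metastore is configured; the table name sales is purely illustrative.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read table definitions from the Hive metastore
spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Plain SQL over data that lives in distributed storage (e.g. HDFS)
spark.sql("SHOW TABLES").show()
spark.sql("SELECT COUNT(*) FROM sales").show()  # 'sales' is a hypothetical table
```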
What are the core components of Spark?
Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR.
What are the core components of Hadoop?
HDFS, YARN, and MapReduce.
What is HDFS?
HDFS stands for Hadoop Distributed File System, and it is the storage component of Hadoop. It is responsible for storing large datasets of structured and unstructured data across various nodes. It consists of two core components: the NameNode and the DataNode. The NameNode is the primary or master node and holds the metadata about the data. The DataNodes are where the actual data is stored; they serve read and write requests and replicate the data blocks.
What is YARN?
YARN stands for Yet Another Resource Negotiator. It is the resource management component of Hadoop. YARN consists of three components: the ResourceManager, the NodeManager, and the ApplicationMaster. The ResourceManager allocates resources to all the applications in the system; the NodeManager launches and manages containers and monitors their resource usage (CPU, memory, and disk); the ApplicationMaster works as an interface between the ResourceManager and the NodeManagers, negotiating resources on behalf of its application.
What is MapReduce?
MapReduce is the processing component of Hadoop. MapReduce makes use of two functions, map() and reduce(). Map() performs filtering and sorting of the input data and organizes it into groups, producing intermediate key-value pairs that are later processed by reduce().
Reduce() performs the summarization by aggregating the mapped data. In short, reduce() takes the output generated by map() as input and combines those tuples into a smaller set of tuples.
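To make the map/shuffle/reduce flow concrete, here is a minimal, self-contained Python sketch of a word count. It only imitates the MapReduce model in a single process; the function names and the sample input are illustrative, not part of the Hadoop API.

```python
from itertools import groupby

# map(): emit a (word, 1) pair for every word in every input line
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# shuffle/sort: group intermediate pairs by key, as the framework would
def shuffle(pairs):
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        yield key, [count for _, count in group]

# reduce(): aggregate the grouped values into a single count per word
def reduce_phase(grouped):
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data tools", "big data frameworks"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
# {'big': 2, 'data': 2, 'frameworks': 1, 'tools': 1}
```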
What are the characteristics of HDFS?
Fault tolerant - the Hadoop framework divides data into blocks and creates multiple copies of each block on different machines in the cluster, so losing one machine does not lose the data.
Scalable - whenever requirements grow, you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scaling.
Highly available - if a node fails, a user can still access their data from the other nodes, because duplicate copies of the blocks are present elsewhere in the HDFS cluster.
How is Apache Spark different from MapReduce?
- Spark processes data both in real time and in batches, whereas MapReduce only does batch processing.
- Spark can be up to 100 times faster than MapReduce for in-memory workloads.
- Spark keeps intermediate data in RAM, whereas MapReduce writes intermediate results to disk (see the caching sketch below).
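A minimal PySpark sketch of the in-memory point, assuming a local SparkSession: cache() marks the DataFrame to be kept in memory after the first action, so repeated computations avoid recomputing or re-reading from disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(1_000_000)   # a simple DataFrame of numbers
df.cache()                    # keep it in memory after the first use

df.count()   # first action: computes the data and populates the cache
df.count()   # second action: served from RAM instead of recomputing
```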
How does Spark run its applications with the help of its architecture?
Spark applications run as independent sets of processes coordinated by the SparkSession object in the driver program. The cluster manager (for example YARN or Spark's standalone manager) allocates resources on the worker nodes, and the driver sends tasks to the executors, one task per partition. Iterative algorithms apply operations repeatedly to the data, so they benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition of the dataset. Finally, the results are sent back to the driver application or can be saved to disk.
What is an RDD?
An RDD (Resilient Distributed Dataset) is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel.
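A minimal PySpark sketch, assuming a local SparkSession; parallelize() distributes a Python list across partitions as an RDD, which can then be transformed in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across 4 partitions as an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], numSlices=4)

print(rdd.getNumPartitions())              # 4
print(rdd.map(lambda x: x * x).collect())  # squares, computed in parallel
```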
What is lazy evaluation in Spark?
When Spark operates on a dataset, it only remembers the instructions. For example, when a transformation is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action, which lets Spark optimize the overall data processing workflow.
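A minimal PySpark sketch of the idea, assuming a local SparkSession; the filter transformation only records lineage, and nothing runs until count() (an action) forces execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# Transformation: nothing is computed yet, Spark just records the lineage
evens = rdd.filter(lambda x: x % 2 == 0)

# Action: this is the point where the whole pipeline actually runs
print(evens.count())  # 5
```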
What is a Parquet file and what are its advantages?
Parquet is a columnar storage file format that is used to efficiently store large datasets. Some of its advantages are that it lets you fetch only the specific columns you need, consumes less space, follows type-specific encoding, and reduces I/O because unneeded columns are not read.
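A minimal PySpark sketch, assuming a local SparkSession and a writable local path (/tmp/people.parquet is just an example); selecting a single column from a Parquet file only reads that column's data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)], ["name", "age"]
)

# Write in the columnar Parquet format (the path is illustrative)
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Reading back only the 'age' column avoids scanning 'name' entirely
spark.read.parquet("/tmp/people.parquet").select("age").show()
```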
What is Shuffling in Spark?
Shuffling is the process of redistributing data across partitions so that records with the same key end up in the same partition. It moves data between executors over the network, which makes it a relatively expensive operation.
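A minimal PySpark sketch, assuming a local SparkSession; reduceByKey is a wide transformation, so executing it forces a shuffle that brings equal keys into the same partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(
    [("spark", 1), ("hadoop", 1), ("spark", 1), ("hive", 1)], numSlices=4
)

# reduceByKey redistributes the data so that identical keys meet --
# that redistribution is the shuffle step
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('spark', 2), ('hadoop', 1), ('hive', 1)]
```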
What is the use of coalesce in Spark?
Spark's coalesce method reduces the number of partitions in a DataFrame (or RDD). Unlike repartition, it avoids a full shuffle by merging existing partitions.
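A minimal PySpark sketch, assuming a local SparkSession.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

df = spark.range(100).repartition(8)   # start with 8 partitions
print(df.rdd.getNumPartitions())       # 8

smaller = df.coalesce(2)               # merge down to 2 partitions without a full shuffle
print(smaller.rdd.getNumPartitions())  # 2
```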
What are the various functionalities supported by Spark Core?
Spark Core is the engine for parallel and distributed processing of large datasets. Some of the functionalities include scheduling and monitoring jobs, memory management, fault recovery, and task dispatching.
How do you convert an RDD into a DataFrame?
Use the toDF() method on the RDD.
Use SparkSession.createDataFrame(). Both approaches are sketched below.
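A minimal PySpark sketch of both approaches, assuming a local SparkSession; the column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("alice", 34), ("bob", 45)])

# Option 1: toDF() on an RDD of tuples, supplying column names
df1 = rdd.toDF(["name", "age"])

# Option 2: createDataFrame() on the SparkSession
df2 = spark.createDataFrame(rdd, ["name", "age"])

df1.show()
df2.printSchema()
```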
What are transformations and actions?
Transformations are operations that are performed on an RDD to create a new RDD containing the results (e.g. map, filter, join, union).
Actions are operations that return a value to the driver after running a computation on an RDD (e.g. min, max, count, collect).
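A minimal PySpark sketch, assuming a local SparkSession; the transformations build new RDDs lazily, and the actions trigger the computation and return values to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-action").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: produce new RDDs, nothing runs yet
doubled = rdd.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 4)

# Actions: run the computation and return results to the driver
print(big.count())    # 3
print(big.collect())  # [6, 8, 10]
```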
What is a broadcast variable?
Broadcast variables are read-only shared variables that are cached on and available to all nodes in the cluster. Using broadcast variables can improve performance by reducing the network traffic and data serialization required to execute a Spark application: because the variable is cached on every node, we do not need to send the data to each node every time it is used.
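A minimal PySpark sketch, assuming a local SparkSession; the lookup dictionary is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Ship a small lookup table to every executor once, instead of with every task
country_names = sc.broadcast({"us": "United States", "in": "India"})

codes = sc.parallelize(["us", "in", "us"])
print(codes.map(lambda c: country_names.value[c]).collect())
# ['United States', 'India', 'United States']
```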
What are accumulators?
Spark accumulators are shared variables that are only "added" to through an associative and commutative operation and are used to perform counter or sum operations.
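A minimal PySpark sketch, assuming a local SparkSession; the accumulator counts records as a side effect of an action running on the executors, and its value is read back on the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)

def check(value):
    # Executors can only add to the accumulator, never read it
    if value < 0:
        bad_records.add(1)

sc.parallelize([1, -2, 3, -4, 5]).foreach(check)

# Only the driver reads the accumulated value
print(bad_records.value)  # 2
```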
What are some of the features of apache spark?
High processing speed, in-memory computation, fault-tolerance, stream processing in real-time, multiple language support.
What is client mode?
Client mode is when the Spark driver runs on the machine from which the Spark job is submitted. The main disadvantage of this mode is that if that machine fails, the entire job fails. This mode is not preferred in production environments.
What is cluster mode?
Cluster mode is when the Spark driver does not run on the machine from which the job was submitted; instead, the driver is launched inside the cluster as a sub-process of the ApplicationMaster. This mode relies on a dedicated cluster manager to allocate the resources required for the job to run.