SAS and Hadoop Flashcards
What is a Cluster of Computers?
* A cluster of computers is a group of computers connected by a local area network.
* Each computer is referred to as a node in the cluster.
* The nodes work together as one system.
What is a node?
A computer in a cluster of computers. The nodes communicate with each other via the network and function as one unit.
What is Hadoop?
* An open-source software project supported by Apache
* A framework for distributed processing of large data sets
* Designed to run on computer clusters
* Made up of a NameNode and DataNodes
What are the 2 primary components of a Hadoop cluster?
A traditional Hadoop cluster consists of the NameNode, perhaps a backup NameNode, and many DataNodes.
What is meant by the Hadoop ecosystem and its 3 foundational components?
The ecosystem refers to the software components that make up the Hadoop framework. Each component has a unique function in the Hadoop ecosystem.
The 3 components, or modules, that serve as a foundation for Hadoop are HDFS, YARN, and MapReduce.
What are some key features of Hadoop?
- Open-source.
- Simple-to-use distributed file storage system.
- Supports highly parallel processing, which makes it well suited to analyzing huge volumes of data.
- Scales out to handle massive amounts of data: the cluster is easily extended by adding more storage nodes.
- Designed to work on low-cost hardware, so the cost of entry is fairly low.
- Data is replicated across multiple hosts/nodes to make it fault-tolerant.
- SAS has integration points that make using Hadoop familiar to existing SAS customers: procedures, LIBNAME statements, and Data Integration Studio transforms.
What are some commercial distributions of Hadoop?
Cloudera
IBM BigInsights
Hortonworks
AWS EMR (Elastic MapReduce)
MapR
Microsoft Azure HDInsight
What is the Hadoop User Experience (HUE)?
It is an open-source application for browsing, querying, and visualizing data in Hadoop. Its browser-based interface enables you to perform a variety of tasks in Hadoop.
What is HDFS?
One of the 3 core modules, the Hadoop Distributed File System is a virtual file system that distributes files across the computers in the Hadoop cluster.
What is YARN?
One of the 3 core modules, Yet Another Resource Negotiator is a framework for job scheduling and cluster resource management.
What is MapReduce?
One of the 3 core modules, it is a YARN-based system for automating parallel processing of distributed data.
What does a NameNode do?
The NameNode stores the file system metadata: which blocks make up each file and which DataNodes hold those blocks. It does not hold the physical data itself.
What is a block in HDFS?
A block is the unit in which HDFS stores data. Files are split into large blocks (128 MB by default in recent Hadoop versions), and the blocks are distributed across the DataNodes.
What does a DataNode do?
DataNodes are the components that store the blocks of data in HDFS. Data is replicated in HDFS in order to support fault tolerance: by default, each block is stored on three DataNodes (a replication factor of 3). If any DataNode goes down, the remaining copies are available for use.
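You can inspect the blocks and replicas of a file with fsck (the file path here is just an example):

    hdfs fsck /user/student/cust.txt -files -blocks -locations

The output lists each block of the file, its replication factor, and the DataNodes that hold the copies.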
What is the starting syntax for HDFS commands in Linux?
hdfs dfs followed by the command (e.g., hdfs dfs -ls). The commands are entered in lowercase and are case-sensitive.
What does hdfs dfs -ls do?
hdfs dfs -ls lists the contents of an HDFS directory. When you list the contents of a directory in HDFS, by default, the “home” directory of the current user is listed, such as /user/student.
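For example (directory names are illustrative):

    hdfs dfs -ls            # lists the current user's home directory, e.g., /user/student
    hdfs dfs -ls /user      # lists an explicit HDFS path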
What does hdfs dfs -mkdir do?
Creates a directory within HDFS. A relative path is created under the current user's HDFS home directory.
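For example (directory names are illustrative):

    hdfs dfs -mkdir sales_data           # relative path: creates /user/student/sales_data
    hdfs dfs -mkdir /user/student/tmp    # absolute path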
What does hdfs dfs -copyFromLocal "/data/cust.txt" "/user/std" do?
Copies local, non-distributed data into HDFS. In this example, the local file /data/cust.txt is copied into the HDFS directory /user/std, where it is split into blocks and distributed across the DataNodes.
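You can verify the copy afterward (same example paths):

    hdfs dfs -ls /user/std    # cust.txt should now appear in the listing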
Describe the 3 steps of the MapReduce process
- Map - makes the initial read of the blocks of data in HDFS and performs initial row-level operations, such as filtering rows or computing new columns within rows
- Shuffle and Sort - orders the rows and groups related rows together
- Reduce - performs the final calculations, such as computing summary statistics within groups, and writes the final results to files in HDFS
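As a minimal sketch of the flow, Hadoop Streaming lets you plug ordinary shell commands in as the map and reduce steps; here is a word count, assuming a standard install (the streaming jar path and the input/output directories are assumptions):

    # map: split each line into one word per line; the framework then
    # shuffles and sorts so identical words arrive at the reducer together;
    # reduce: count each group of identical, adjacent words
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/student/cust.txt \
        -output /user/student/wc_out \
        -mapper 'tr -s " " "\n"' \
        -reducer 'uniq -c'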
What is Pig?
Pig is a platform for data analysis whose language, Pig Latin, follows a stepwise, procedural programming style. Pig programs can be submitted to Hadoop, where they are converted to MapReduce programs so that processing of the data can still occur in parallel.
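A minimal sketch, run from the Linux shell (the input file, its field layout, and the paths are assumptions):

    # run a short Pig Latin program inline; Pig converts it to MapReduce jobs
    pig -e "cust = LOAD '/user/student/cust.txt' AS (name:chararray, state:chararray);
            nc = FILTER cust BY state == 'NC';
            STORE nc INTO '/user/student/nc_cust';"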
What is Hive?
Hive is a data warehouse framework built on Hadoop for querying and managing large data sets stored in HDFS. An SQL-like language called HiveQL is used to query the data, and most HiveQL queries are compiled into MapReduce programs.
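A minimal sketch, run from the Linux shell (the cust table is an assumption; the same query could be submitted from HUE):

    # HiveQL looks like SQL; most queries compile to MapReduce jobs
    hive -e 'SELECT state, count(*) FROM cust GROUP BY state;'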
What is the hadoop fs command?
The hadoop fs command can be used to interact with HDFS, along with the other file systems that Hadoop supports, such as the local file system, WebHDFS, and Amazon S3.
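For example (the S3 bucket is hypothetical and assumes the s3a connector is configured):

    hadoop fs -ls /user/student            # HDFS
    hadoop fs -ls s3a://my-bucket/data     # Amazon S3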
What does hdfs dfs -put do?
hdfs dfs -put copies a local file to an HDFS location.
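For example (paths are illustrative):

    hdfs dfs -put /data/orders.txt /user/student/orders.txt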
What does hdfs dfs -get do?
hdfs dfs -get copies an HDFS file to a local location.
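For example (paths are illustrative):

    hdfs dfs -get /user/student/orders.txt /tmp/orders.txt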