Hadoop Flashcards

(43 cards)

1
Q

Hadoop Application consists of
-Hadoop computing _
-Distributed _
-Hadoop _ _ _
-Hadoop _ _

A

Hadoop Computing Architecture
Distributed Approach
Hadoop Distributed File System
Hadoop File Operations

2
Q

Current state of our world
-Data is exploding with _
-Social _
-Video _
-Photo _
-Wea_
-Internet of _

A

Data is exploding with rapid generation of data
social media
video streams
photo libraries
weather
Internet of things (IoT)

3
Q

Value of Data
Which of these companies are data companies? Should companies track the value of data on the balance sheet?
More data beats ____
AI cannot run without ___

A

Google, Facebook, Amazon, Apple
More data beats better algorithms
AI cannot run without data

4
Q

Traditional Data processing
-Traditionally, computation was ___ with __ amounts of data
-Earlier approaches increased __ with ___

A

-Traditionally, computation was processor bound with small amounts of data
-Earlier approaches simply increased hardware with faster processors

5
Q

Hadoop Computing
-Hadoop introduced a _ _ of bringing the program to the _ rather than the _ to the program
-Distributed data storage on ____
-Run applications where the ___

A

-Hadoop introduced a radical approach of bringing the program to the data rather than the data to the program
-Distributed data storage on multiple server nodes
-Run applications where the data resides

6
Q

Hadoop Program
-Foundation of _
-Reliable and _
-Open source free + ____
-Primarily focused on ___
-Architected to not move _ around
-Uses __ with processing where the data is stored

A

Foundation of HDP
Reliable and scalable
Open source free + cost to support
Primarily focused on data storage
Architected to not move data around
Uses “data locality” with processing where the data is stored

7
Q

Characteristics of Hadoop
-___ to storing and executing large data files
-HDFS file system has default redundancy of _
-Default block size is __
-Batch _
-Not very useful for _
-Read centric architecture for _

A

Distributed approach to storing and executing large data files
HDFS file system has default redundancy of 3
Default block size is 128 MB
Batch processing
Not very useful for OLTP
Read centric architecture for OLAP

8
Q

Hadoop capabilities
-Handles _, _, and _ data
-Schema on-_
-Scales linearly with more disks providing a ____ increase in storage capacity
-Scales _
-Hadoop is ___, avoiding __ as much as possible
-Example of normalized vs. de-normalized

A

Handles structured, semi-structured, and unstructured data
Schema on-read
Scales linearly with more disks providing almost a 1-to-1 increase in storage capacity
Scales horizontally
Hadoop is de-normalized, avoiding joins as much as possible
Example of normalized vs. de-normalized
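
Study aid: "schema on-read" can be pictured with a small Python sketch. The sample rows and the (date, item, quantity) layout are made up for illustration and do not come from the cards; the point is only that raw text is stored untouched and a structure is imposed when the data is read.

```python
import csv
import io

# Raw text is stored as-is; no schema is enforced when it is written.
# (Hypothetical sample data.)
raw = "2020-01-05,apples,3\n2020-01-06,pears,not_a_number\n"

def read_with_schema(text):
    """Apply a (date, item, quantity:int) schema at read time, skipping rows that do not fit."""
    rows = []
    for date, item, qty in csv.reader(io.StringIO(text)):
        try:
            rows.append({"date": date, "item": item, "quantity": int(qty)})
        except ValueError:
            continue  # a malformed row only surfaces at read time, not at write time
    return rows

print(read_with_schema(raw))  # [{'date': '2020-01-05', 'item': 'apples', 'quantity': 3}]
```

A schema on-write system would have rejected the second row when it was loaded; here it is accepted at write time and filtered out by the reader.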

9
Q

MapReduce
___ the universal processing approach
__ updates all of the data by writing it to a new file every time
MapReduce is not good for updating _______
Approach fits write _, read many ___
Analyzing historical weather records for the last sales year

A

MapReduce is the universal processing approach
MapReduce updates all of the data by writing it to a new file every time
MapReduce is not good for updating only some of the data
Approach fits write once, read many times scenarios
Example: analyzing historical weather records for the last sales year
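
Study aid: the map, group, and reduce steps can be sketched in plain Python. This is not the Hadoop MapReduce API; the function names and the (year, temperature) readings are hypothetical, chosen to echo the weather example on the card.

```python
from collections import defaultdict

# Hypothetical input: (year, temperature) readings, made up for illustration.
records = [(2019, 31), (2020, 28), (2019, 35), (2020, 33), (2019, 29)]

def map_phase(record):
    """Emit a (key, value) pair for one input record."""
    year, temperature = record
    return (year, temperature)

def shuffle(pairs):
    """Group all values that share a key, as happens between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse the values for one key into a single result (here, the maximum)."""
    return (key, max(values))

mapped = [map_phase(r) for r in records]
grouped = shuffle(mapped)
# The result is a brand-new collection; the input records are never modified.
max_by_year = dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))
print(max_by_year)  # {2019: 35, 2020: 33}
```

Because the output is always written fresh from the full input, changing a single reading means rerunning the whole job, which is why the card says MapReduce is not good for updating only some of the data.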

10
Q

Hadoop Application System
___ utilities supporting other Hadoop modules
___ distributed file system with high-throughput
___ framework for job scheduling and cluster resource management
___ parallel processing of large data sets

A

Hadoop Common
HDFS
YARN
MapReduce

11
Q

Relational Database Systems
-Relational database management system _____
-Highly structured with ___
-Normalized using joins to ____
-Seek time is improving more slowly than ____
-Predominantly scales __ with hardware
-Excels at write updates to only some of the data like an _______

A

Relational Database Management System (Oracle, DB2, Sybase, SQL Server)
Highly structured with schema on-write
Normalized using joins to reconstruct a dataset
Seek time is improving more slowly than transfer rate (bandwidth)
Predominantly scales vertically with hardware
Excels at write updates to only some of the data like an address in a CRM system

12
Q

Traditional RDBMS vs MapReduce

Data size
Access
Updates
Transactions
Structure
Integrity
Scaling

A

Attribute: Traditional RDBMS | MapReduce
Data size: gigabytes | petabytes
Access: interactive and batch | batch
Updates: read and write many times | write once, read many times
Transactions: ACID | none
Structure: schema on write | schema on read
Integrity: high | low
Scaling: nonlinear | linear

13
Q

Data storage in Hadoop
-Storage size is increasing, __
-Read speed is not increasing as fast as _
-How do you speed up read times?
-Disk failures are managed with multiple copies of __
-MapReduce re-assembles the data into a ___

A

Storage size is increasing, lowering the price
Read speed is not increasing as fast as size
How do you speed up read times? Read from multiple distributed disks at the same time
Disk failures are managed with multiple copies of each record
MapReduce re-assembles the data into a file
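
Study aid: a rough back-of-the-envelope calculation shows why reading from many disks at once helps. The 1 TB data size and 100 MB/s transfer speed are assumed figures for illustration, not values from the cards.

```python
# Assumed figures for illustration only (not from the cards).
DATA_MB = 1_000_000          # about 1 TB of data, expressed in MB
TRANSFER_MB_PER_S = 100      # assumed sustained read speed of one disk

def read_minutes(data_mb, disks):
    """Minutes to read data_mb when it is spread evenly over `disks` disks read in parallel."""
    seconds = data_mb / (TRANSFER_MB_PER_S * disks)
    return seconds / 60

one_disk = read_minutes(DATA_MB, disks=1)
hundred_disks = read_minutes(DATA_MB, disks=100)
print(f"1 disk: {one_disk:.1f} min, 100 disks: {hundred_disks:.1f} min")
# 1 disk: 166.7 min, 100 disks: 1.7 min
```

The sketch ignores seek time, network transfer, and coordination overhead, so real speedups would be smaller than the ideal 100x shown here.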

14
Q

HDFS File System
-___ files across a network of computers, each with its own storage
-It is a ___ using data locality
-More complex than a ___
-Complexity is abstracted _ from user
-Hadoop users do not need to ___

A

Distributes files across a network of computers, each with its own storage
It is a distributed file system using data locality
More complex than a regular file system
Complexity is abstracted away from user
Hadoop users do not need to choose drives or server nodes

15
Q

Design of HDFS

A

Very large files (100 megabytes, 100 gigabytes, 100 terabytes, petabytes)
Streaming data access: write once, read many times

16
Q

File layers in HDFS
-HDFS is a file system written in _
-Sits on top of a ____
-Provides ___ storage for massive amounts of data

A

HDFS is a file system written in Java
Sits on top of a native Linux file system
Provides redundant storage for massive amounts of data

17
Q

File storage in HDFS
-HDFS performs best with a small number of ___
-Millions of large files versus billions of ___
-Files in HDFS are _ as we cannot modify an existing file
-Optimized for large files with data ___

A

HDFS performs best with a small number of large files
Millions of large files versus billions of small ones
Files in HDFS are write once, as we cannot modify an existing file
Optimized for large files with data processed in large chunks

18
Q

HDFS Limitations
-Response times contain __
-File system metadata is held in ___ (not good for lots of ___)
-Writes are typically from a single _____
-Record locking is formally ____

A

Response times contain latency
File system metadata is held in memory
Not good for lots of small files
Writes are typically from a single writer appending to a file
Record locking is formally not supported

19
Q

Storing Blocks on Data Nodes
Data files are split into ___ which are distributed at load time
Each block is replicated on ___
File can be larger than any ____

A

Data files are split into 128 MB blocks which are distributed at load time
Each block is replicated on multiple data nodes
File can be larger than any single storage disk
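
Study aid: the block arithmetic can be checked with a short Python sketch. The 128 MB block size and replication factor of 3 are the defaults stated in the cards; the 300 MB and 1 MB file sizes are made-up examples, and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # default block size from the cards
REPLICATION = 3       # default redundancy from the cards

def block_layout(file_size_mb):
    """Return (number of blocks, size of last block in MB, raw MB stored across the cluster).

    Assumes file_size_mb is a positive number.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    last_block_mb = file_size_mb - (blocks - 1) * BLOCK_SIZE_MB
    raw_mb = file_size_mb * REPLICATION
    return blocks, last_block_mb, raw_mb

print(block_layout(300))  # (3, 44, 900)
print(block_layout(1))    # (1, 1, 3)
```

The second call illustrates that a small file occupies one block entry but only as much disk space as its actual size, multiplied by the replication factor.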

20
Q

Hadoop Cluster

A

Cluster is a group of computers working together
Node is an individual server blade in the cluster
Daemon is a program running on a node

21
Q

Hadoop Cluster Components

A

Three main components of a cluster
Components work together to provide distributed data processing

22
Q

NameNode
-NameNode stores __
-Manages the filesystem __
-Knows where every block is stored for ___
-Composed of 2 files:
1.___
2.___

A

NameNode stores metadata
Manages the filesystem namespace
Knows where every block is stored for every file
Composed of 2 files:
1. Namespace image
2. Edit log

23
Q

Characteristics of NameNode
-Is the single point of ___
-Does not ___
-Reads and loads ___ information in memory
-____ memory requirements
-Users do not interact with ___

A

Is the single point of failure
Does not store data
Reads and loads block/file information in memory
High RAM memory requirements
Users do not interact with nodes

24
Q

Secondary NameNode
-NameNode daemon must be _____
-HDFS is set up for ____ with active and standby NameNodes
-Periodically merges the namespace ____

A

NameNode daemon must be running at all times
HDFS is set up for high availability with active and standby NameNodes
Periodically merges the namespace image and the edit log
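
Study aid: merging the namespace image with the edit log can be pictured as replaying a list of changes over a snapshot. This is a toy model, not how HDFS stores these files on disk; the paths, block names, and operation names are hypothetical.

```python
# Toy model: the namespace image is a snapshot, the edit log lists changes made since.
# (Hypothetical paths and block names.)
namespace_image = {"/data/a.csv": ["B1", "B2"]}
edit_log = [
    ("create", "/data/b.csv", ["B3"]),
    ("delete", "/data/a.csv", None),
]

def checkpoint(image, edits):
    """Replay the edit log over a copy of the image and return (new image, empty log)."""
    merged = dict(image)
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged, []

namespace_image, edit_log = checkpoint(namespace_image, edit_log)
print(namespace_image)  # {'/data/b.csv': ['B3']}
```

Merging periodically keeps the edit log short, so a restarting NameNode has fewer changes to replay before it is ready.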

25
Q

DataNodes
-___ and ___ blocks when they are told to do so
-__ to NameNode the list of blocks they are storing

A

Store and retrieve blocks when they are told to do so
Report to NameNode the list of blocks they are storing

26
Q

HDFS File Operations
-HDFS is a separate file system running on the big data cluster that lets you view and _____ directories and files
-Necessary to specify the file system using ___
-Many of the Hadoop file commands use the same command as ____

A

HDFS is a separate file system running on the big data cluster that lets you view and manage your HDFS directories and files
Necessary to specify the file system using hadoop fs
Many of the Hadoop file commands use the same command as Linux with a - (dash) preceding it

27
Q

Hadoop File Storage
-File 031512 split into blocks ___
-File 042313 split into ___

A

File 031512 split into blocks B1, B2 & B3
File 042313 split into blocks B4 & B5

28
Q

Hadoop Block Identification
-Client asks NameNode for contents of ___
-NameNode responds it is found in blocks ___

A

Client asks NameNode for contents of 042313
NameNode responds it is found in blocks B4 & B5

29
Q

Hadoop File Retrieval
-Client begins retrieval attempts from ____
-Client begins retrieval attempts from ___
-Data flows directly from the ______

A

Client begins retrieval attempts from Nodes A, B & E
Client begins retrieval attempts from Nodes C, E & D
Data flows directly from the DataNode to the client

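
Study aid: the lookup and retrieval steps in the last three cards can be sketched as a toy in-memory model. This is not real HDFS client code. The file and block names come from the cards; the mapping of B4 to Nodes A, B & E and B5 to Nodes C, E & D is an assumption read off the retrieval card, and locations for B1 to B3 are not given, so they are left out.

```python
# Toy in-memory model of the metadata a NameNode keeps (not real HDFS code).
file_to_blocks = {"031512": ["B1", "B2", "B3"], "042313": ["B4", "B5"]}
# Assumed replica locations; the cards do not say where B1-B3 live, so they are omitted.
block_to_nodes = {"B4": ["A", "B", "E"], "B5": ["C", "E", "D"]}

def locate(filename):
    """NameNode step: return each block of the file with the nodes holding a replica."""
    return [(block, block_to_nodes[block]) for block in file_to_blocks[filename]]

def read_file(filename, failed_nodes=()):
    """Client step: for each block, try replicas in order and use the first live node."""
    plan = []
    for block, nodes in locate(filename):
        live = [n for n in nodes if n not in failed_nodes]
        if not live:
            raise IOError(f"no live replica for block {block}")
        plan.append((block, live[0]))
    return plan

print(read_file("042313"))                      # [('B4', 'A'), ('B5', 'C')]
print(read_file("042313", failed_nodes={"A"}))  # [('B4', 'B'), ('B5', 'C')]
```

The NameNode only answers the "where" question; the block contents themselves never pass through it.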
30
Q

Storing Files with HDFS
-Storage of files larger than an ___
-Blocks are large to reduce the amount of ____
-File metadata is stored in ____
-If a block replica is corrupt, ___
-1 MB file size only consumes _____

A

Storage of files larger than an entire hard drive
Blocks are large to reduce the amount of seek time
File metadata is stored in another system location
If a block replica is corrupt, another replica is selected
1 MB file size only consumes 1 MB of block space

31
Q

File Writes, File Reads

A

Look at the picture

32
Q

DataNode Replication
-____ placed on the same node as the client
-____ placed on a different rack chosen at random
-____ is placed on the same rack as the ___, on a ___, selected at ___

A

First replica is placed on the same node as the client
Second replica is placed on a different rack chosen at random
Third replica is placed on the same rack as the second, on a different node, selected at random

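
Study aid: the three placement rules on this card can be sketched in Python. The rack and node names are hypothetical, and the sketch assumes every rack has at least two nodes and the client runs on a cluster node; real HDFS placement also weighs node load and free space, which is ignored here.

```python
import random

# Hypothetical cluster layout: rack name -> nodes on that rack.
cluster = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8", "n9"],
}

def rack_of(node):
    """Return the name of the rack that contains the given node."""
    return next(rack for rack, nodes in cluster.items() if node in nodes)

def place_replicas(client_node, rng=random):
    """Pick three nodes following the first/second/third replica rules on the card."""
    first = client_node                                    # same node as the client
    other_racks = [r for r in cluster if r != rack_of(first)]
    second_rack = rng.choice(other_racks)                  # a different rack, at random
    second = rng.choice(cluster[second_rack])
    # Same rack as the second replica, but a different node, at random.
    third = rng.choice([n for n in cluster[second_rack] if n != second])
    return [first, second, third]

replicas = place_replicas("n2", random.Random(0))
print(replicas)
```

With this layout a block survives the loss of any single node and of any single rack, while two of the three copies share a rack to limit cross-rack traffic.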
33
Q

List the Home Directory
-_____ lists the contents of the user home directory
-_____ lists the contents of the Hadoop root directory
-The __ begins at the Hadoop root
-____ spells out the entire directory structure

A

hadoop fs -ls lists the contents of the user home directory
hadoop fs -ls / lists the contents of the Hadoop root directory
The / begins at the Hadoop root
-ls -R spells out the entire directory structure

34
Q

Making Directories
-_____ makes a new directory in Hadoop
-The new directory is called ____
-The _ begins at the home directory

A

hadoop fs -mkdir makes a new directory in Hadoop
The new directory is called jdb101000
The / begins at the home directory

35
Q

Put a File
-___ copies a file from the Linux file system to the Hadoop file system
-_____ is the syntax
-____ denotes the Linux source
-Destination is the ____ directory in Hadoop

A

-put copies a file from the Linux file system to the Hadoop file system
hadoop fs -put is the syntax
~/foodsales.csv denotes the Linux source
Destination is the /jdb101000 directory in Hadoop

36
Q

Get a File
-____
-____ is the syntax
-____ denotes the Linux destination
-Source is the ___ file in the Hadoop directory

A

-get copies a file from the Hadoop file system to the Linux file system
hadoop fs -get is the syntax
~/outbound.csv denotes the Linux destination
Source is the foodsales.csv file in the Hadoop directory

37
Q

Copy a Directory
-___ copies one directory and creates another
-_____ is the syntax
-_____ is the source directory
-___ is the new destination directory

A

-cp copies one directory and creates another
hadoop fs -cp is the syntax
/jdb101000 is the source directory
/production is the new destination directory

38
Q

Copy a File
-___ copies one file and creates another
-____ is the syntax
-____ is the directory for both files
-___ is being copied to ____

A

-cp copies one file and creates another
hadoop fs -cp is the syntax
/jdb101000 is the directory for both files
foodsales.csv is being copied to newfile.csv

39
Q

Reviewing a File
-____ displays the contents of a file
-____ can be edited
-____ will exit the command

A

-cat displays the contents of a file
File can be edited
Control D will exit the command

40
Q

Moving
-____ moves a file or directory to a new location
-____ is the syntax
-_____ is the source directory
-That will be moved into ___ directory

A

-mv moves a file or directory to a new location
hadoop fs -mv is the syntax
/production is the source directory
That will be moved into the /jdb101000 directory

41
Q

Removing
-___ removes files and directories
-_____ is the syntax
-___ option is required to remove a directory
-The ____ directory will be removed

A

-rm removes files and directories
hadoop fs -rm is the syntax
-r option is required to remove a directory
The /production directory will be removed

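
Study aid: the file-command cards above can be collected into full command lines. The Python sketch below only builds and prints the strings; it does not run Hadoop. The flags and file names come from the cards, but the complete paths (for example /jdb101000/foodsales.csv as the source for -get) are assumptions pieced together from them, and the entries are a lookup table rather than a script to run in order.

```python
def hadoop_fs(*args):
    """Build one `hadoop fs` command line as a string (nothing is executed here)."""
    return " ".join(["hadoop", "fs", *args])

# Card title -> command line. Full paths are assumptions assembled from the cards.
commands = {
    "List the Home Directory": hadoop_fs("-ls"),
    "List the root directory": hadoop_fs("-ls", "/"),
    "List recursively": hadoop_fs("-ls", "-R", "/"),
    "Making Directories": hadoop_fs("-mkdir", "/jdb101000"),
    "Put a File": hadoop_fs("-put", "~/foodsales.csv", "/jdb101000"),
    "Get a File": hadoop_fs("-get", "/jdb101000/foodsales.csv", "~/outbound.csv"),
    "Copy a Directory": hadoop_fs("-cp", "/jdb101000", "/production"),
    "Copy a File": hadoop_fs("-cp", "/jdb101000/foodsales.csv", "/jdb101000/newfile.csv"),
    "Reviewing a File": hadoop_fs("-cat", "/jdb101000/foodsales.csv"),
    "Moving": hadoop_fs("-mv", "/production", "/jdb101000"),
    "Removing": hadoop_fs("-rm", "-r", "/production"),
}

for title, command in commands.items():
    print(f"{title}: {command}")
```

Note that each option uses a plain ASCII hyphen; a typographic dash copied from slides will not be recognized by the shell.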
42
Q

HDFS Recommendations
-___ is a repository for your ___
-Best practices:
-Define a ____
-Include ____ for staging data
-Examples:
____ data and configuration belonging to a single user
____ work in progress in Extract/Transform/Load stage
____ temporary generated data shared between users
____ data sets that are processed and available

A

HDFS is a repository for your big data files
Best practices:
Define a standard directory structure
Include separate locations for staging data
Examples:
/user - data and configuration belonging to a single user
/etl - work in progress in Extract/Transform/Load stage
/tmp - temporary generated data shared between users
/data - data sets that are processed and available

43
Q

Summary

A

Hadoop is a reliable distributed architecture for computing
HDFS is the storage layer file system for Hadoop
HDFS assigns three redundant file blocks to separate nodes and distributes them across a cluster
Uses a system of NameNodes and DataNodes organized with daemons
HDFS is accessed using file commands similar to those in Linux