Big Data - Lec 2 (Hadoop) Flashcards

1
Q

What are the 2 components that Hadoop is built on?

A

HDFS and MapReduce

2
Q

t/f

Hadoop is a centralized file system. All data is stored in one location, making it easier to gather information.

A

False - Hadoop is a distributed file system. Its purpose is to promote sharing of dispersed files.

3
Q
  1. t/f

A distributed file system is one where users are able to see everything at their end (back end and front end are the same).

  1. Which of the following does a DFS not provide?

a. distributed storage
b. fault tolerance
c. low throughput data access

A
  1. False - users don't see the distribution at their end
  2. c - a DFS provides HIGH throughput data access
4
Q

t/f - hadoop infrastructure

  1. Hadoop is powerful but not scalable. It can only handle a few compute nodes and terabytes of information.
  2. Hadoop is very cost-effective since it uses low-cost commodity hardware.
  3. Hadoop is efficient since it distributes the data and processes it individually, but never at the same time.
A
  1. F - it is scalable; it can handle thousands of compute nodes and petabytes of info
  2. T
  3. F - yes, the data is distributed, but it's efficient because the data is processed in parallel
5
Q

t/f

  1. Hadoop is an open source system
  2. Hadoop provides analysis of only structured data
  3. compute nodes in Hadoop can offer local computation and storage
  4. Apache Spark is a software library that allows us to do distributed processing of small datasets across computer 'clusters'

A
  1. T
  2. F - both structured and unstructured
  3. T
  4. F
    - Apache HADOOP is the library
    - data sets are huge (big data)
6
Q

which is true about hadoop?

  1. The Hadoop framework includes master and slave nodes for HDFS and MapReduce
  2. the master node for HDFS is the job tracker, and for MapReduce it's the name node
  3. the slave node for HDFS is the task tracker, and for MapReduce it's the data node
  4. Hadoop is made up of only one component in its platform, which is Hive
A
  1. T
  2. F - master node:
    - HDFS = name node
    - MapReduce = job tracker
  3. F - slave node:
    - HDFS = data node
    - MapReduce = task tracker
  4. F - it has many components (Hive, HBase, Spark, etc.)

7
Q

which of the following is not correct about hadoop stack?

  1. Pig is a component of Hadoop used for system management and maintenance. It helps to grab or kill applications.
  2. Hive is a data warehouse-like tool that is built on top of MapReduce.
  3. Hadoop's main framework revolves around HDFS and MapReduce.
  4. ZooKeeper allows you to automate distributed data flow using a special language and do parallel computation.
A
  1. F - system management is done by ZooKeeper
  2. F - it's built on top of HDFS
  3. T
  4. F - Pig allows you to do this using a language called Pig Latin
8
Q

T/F

  1. HDFS is responsible for distributed processing across computer clusters
  2. to transfer data, you can use Sqoop
  3. to process streaming data (that is very fast), you use HBase
  4. Hive is used to summarize data and to run SQL-like queries
  5. to deal with large tables in a distributed database, use Apache Spark as it can handle large datasets very quickly
A
  1. F - MapReduce
  2. T
  3. F - use Spark
  4. T
  5. F - for large tables, use HBase
9
Q

Which of the following is not true about what Hadoop is used for:

a. helps us deliver mashup services (GPS data, clickstream data)

b. helps give context for data (friends networks, social graphs)

c. creates a central data warehouse for our data for easy access

d. keeps apps running through edit logs, query logs

e. aggregates data-exhaust (messages, posts, blogs, etc.)

A

c- not one of the uses

10
Q

Describe Hadoop's server roles

A

Hadoop has

  1. distributed data processing = MapReduce

master = job tracker
slave = task tracker

  2. distributed data storage = HDFS

master = name node
slave = data node

masters - manage and track all interactions
slaves - actually do the job

11
Q

which of the following is not an assumption / goal of HDFS?

  1. assumes there will always be a second copy and thus minimizes replications to avoid redundancy
  2. hard failure is always possible
  3. we are always going to be dealing with large data sets, and HDFS's goal is to handle this
  4. HDFS wants to provide streaming data access, which is best for batch processing
  5. follows a simple coherency model - write once, read many (best for transaction data)
  6. portability across different hardware and software platforms
  6. portability across different hardware and software platforms.
A
  1. F - this is not an assumption or a goal of hdfs
12
Q

________ refers to the overall performance of the system

a. optimization
b. throughput access

A

b

13
Q

A distributed file system that provides high throughput access

a. hadoop
b. hdfs

A

b - hdfs

14
Q

What is not true about HDFS?

a. it breaks data into small blocks ( 128 MB = default)

b. blocks are stored on the name node

c. blocks are replicated to other nodes to accomplish fault tolerance

d. Data node keeps track of where all the blocks are kept through metadata and indexing each block

A

b - false because blocks are stored on the data nodes

d - also false; the name node (not the data node) keeps track of where all the blocks are kept through metadata
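To make the block math concrete, here is a minimal Python sketch of how a file's size maps to HDFS-style blocks. The `split_into_blocks` helper is hypothetical (not a Hadoop API), assuming the 128 MB default mentioned above:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_count, last_block_size) for a file of the given size."""
    if file_size_bytes == 0:
        return 0, 0
    count = math.ceil(file_size_bytes / block_size)
    # The last block holds whatever is left over; it can be smaller than 128 MB.
    last = file_size_bytes - (count - 1) * block_size
    return count, last

# A 300 MB file -> 3 blocks: 128 MB, 128 MB, and a 44 MB remainder.
print(split_into_blocks(300 * 1024 * 1024))
```

Each of those blocks is then replicated to other data nodes for fault tolerance, per answer c above.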

15
Q

Describe the rack awareness script

A

The rack awareness script is responsible for keeping track of which racks have which nodes and whether they are working.

16
Q

_______ is responsible for holding the status of the data nodes inside a rack. It points it out if there are faulty data nodes and helps racks talk to each other.

A

switch

17
Q

A collection / stack of different nodes is called a:

a. node shelf
b. rack
c. holding block

A

b - rack
18
Q

Name the HDFS daemons

A
  • name node
  • data node
  • secondary name node
19
Q

Which is false about the name node?

a. this is what the clients talk to when they want to locate a file or add/copy/move/delete something

b. A successful request means name node returns list of relevant hadoop clusters where the data live

c. it keeps track of the metadata which is all files and blocks in the system

d. it stores all the high-importance files as those require extra storage

A

b. F - the list is of the data nodes where the data lives

d. F - the name node doesn't do any storing

20
Q

T/F

  1. the job tracker sends heartbeats to the name node to let it know the job is going well
  2. every 5th heartbeat is a block report
  3. the name node builds metadata from block reports
  4. if the name node is down, HDFS can still run, but only for a short while
  5. heartbeats say if the DN is alive, and block reports say what blocks the DN has
A
  1. F - the DN sends heartbeats to the NN
  2. F - every 10th heartbeat
  3. T
  4. F - no NN = no HDFS
  5. T
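The heartbeat/block-report rhythm above can be sketched in Python. The `heartbeats` generator and message tuples are illustrative, not a Hadoop API:

```python
def heartbeats(block_ids, count):
    """Yield `count` DN -> NN messages; every 10th is a block report."""
    for i in range(1, count + 1):
        if i % 10 == 0:
            # Block report: tells the NN which blocks this DN holds,
            # letting the NN build its metadata.
            yield ("block_report", list(block_ids))
        else:
            # Plain heartbeat: just says "this DN is alive".
            yield ("heartbeat", None)

msgs = list(heartbeats(["blk_1", "blk_2"], 20))
reports = [m for m in msgs if m[0] == "block_report"]
print(len(reports))  # 2 block reports among 20 heartbeats
```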
21
Q

What is not true about fsimage?

a. fsimage is a snapshot of the system

b. fsimage snapshot contains: DN, NN, block reports, rack awareness scripts

c. refreshes every 24 hours

d. snapshot is taken after NN has started

A

d - F

fsimage is a snapshot taken right when the NN starts, not after

22
Q

T/F

  1. fsimage happens when the NN starts, and edit logs are for after the NN has started
  2. edit logs keep track of changes made to files
A
  1. T
  2. T
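The fsimage/edit-log relationship can be sketched as follows. The dict-based namespace and `checkpoint` function are illustrative stand-ins for the real on-disk formats (this is also roughly what the secondary name node does when it merges the two):

```python
def checkpoint(fsimage, edit_log):
    """Replay edit-log entries on top of an fsimage snapshot."""
    merged = dict(fsimage)  # start from the snapshot taken at NN startup
    for op, path, value in edit_log:
        if op == "add":
            merged[path] = value       # file created after startup
        elif op == "delete":
            merged.pop(path, None)     # file removed after startup
    return merged

fsimage = {"/data/file1.txt": ["blk_1"]}            # snapshot at NN start
edit_log = [("add", "/data/file2.txt", ["blk_2"]),  # changes made afterwards
            ("delete", "/data/file1.txt", None)]
print(checkpoint(fsimage, edit_log))  # {'/data/file2.txt': ['blk_2']}
```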

23
Q

t/f

  1. the DN processes a job in the Hadoop file system
  2. a functional file system can have 0 to many DNs
  3. data across DNs is replicated in parallel, meaning at the same time; this is called a data pipeline
A
  1. F - the DN stores data
  2. F - to be functional, you need many DNs
  3. F - it's replicated sequentially, in a pipeline manner
24
Q

Which is not true about the data node?

a. once location is provided by the NN, clients can talk to DN directly

b. DN connects to NN at startup but can receive requests directly from the client.

c. DN undergoes data replication to improve fault tolerance.

A
  a. T
  b. F - only the NN can give requests to the DN
  c. T
25
Q

Data replication is necessary for all of the following except:

a. fault tolerance
b. hardware/software failure
c. improving the speed of a result once an inquiry is made
d. network partition

A

c

26
Q

T/F

  1. the replication factor is always preset by the platform provider
  2. data is replicated through a pipeline
  3. the ideal replication factor is 3 (default), where 2 copies stay on one rack and one is put on a different rack; this helps to protect the data
A
  1. F - it can be decided by the user
  2. T
  3. T
27
Q

Explain the process of the replication pipeline

A
  1. the client wants to write file.txt and makes a request to the name node
  2. the file is broken into blocks
  3. the name node looks at the rack awareness script
  4. it tells the client to write to DNs x, y, and z
  5. the block is copied to DN x in rack 1
  6. this is copied to DN y in rack 2
  7. this is copied to DN z, also in rack 2
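The steps above can be sketched as a sequential pipeline in Python. The DN names and rack layout follow the card; `pipeline_write` is a hypothetical helper, not a real HDFS client:

```python
def pipeline_write(block, targets):
    """Replicate `block` along the DN pipeline, returning events in order."""
    events = []
    for i, (dn, rack) in enumerate(targets):
        # Each DN stores its copy, then forwards to the next DN in line
        # (sequential, pipeline-style - not all copies at once).
        events.append(f"{dn}@{rack} stores {block}")
        if i + 1 < len(targets):
            events.append(f"{dn} forwards to {targets[i + 1][0]}")
    return events

# NN told the client to write to DNs x, y, z (x on rack 1; y and z on rack 2,
# matching the default "2 copies on one rack, 1 on another" policy).
targets = [("x", "rack1"), ("y", "rack2"), ("z", "rack2")]
events = pipeline_write("file.txt-blk_1", targets)
print(events)
```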
28
Q

What's not true about rack awareness?

a. helps you to not lose data if a whole rack fails
b. assumption that in-rack communication has higher bandwidth with lower latency
c. fails to keep bulky flows in-rack

A

c - it does keep bulky flows in-rack

30
Q

Which is not true about DN failure?

a. DN failure happens when there is no heartbeat going to the NN

b. when this happens, the NN finds the blocks it contains through the client, makes copies using metadata, and unregisters the dead/failed node

c. it can occur due to data corruption

d. to maintain data integrity, you should apply checksum checking

A

b. F - the NN finds the blocks through its metadata, not through the client
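Checksum checking (answer d above) can be sketched like this. Real HDFS uses per-chunk CRC checksums; this illustration just uses Python's `zlib.crc32` to show the idea of detecting a corrupted block:

```python
import zlib

def with_checksum(data: bytes):
    """Store a block together with its checksum, computed at write time."""
    return data, zlib.crc32(data)

def verify(data: bytes, checksum: int) -> bool:
    """Recompute the checksum on read and compare; False means corruption."""
    return zlib.crc32(data) == checksum

block, crc = with_checksum(b"hello hdfs")
print(verify(block, crc))          # True: intact copy
print(verify(b"hellX hdfs", crc))  # False: corrupted copy is detected
```

If a copy fails verification, the system can discard it and re-replicate from a good copy, which is the data-integrity point the card is making.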

31
Q

Re-replication. Which is not a step?

  1. the NN consults metadata and finds the affected data
  2. the NN consults the rack awareness script
  3. the NN tells the job tracker to replicate the needed DN
A

3 - F - the DN can be told to replicate directly

32
Q

Which is false about the secondary name node?

a. it can replace the NN when it fails

b. main job is to perform periodic checkpoints

c. it can download NN fsimages and edit logs, join them and re-upload them to NN

A

a - false; it cannot replace the NN (its main job is periodic checkpoints)

33
Q

Which is false about the name node?

a. NN failure shouldn’t affect hdfs as long the system has enough space

b. If NameNode can be restored, secondary can reestablish the most current metadata snapshot

c. in the instance that a machine is lost, try your best to retrieve the old NN

d. to make cluster believe its the original NN - use Domain Name System

A
  a. F - no NN = no HDFS
  b. T
  c. F - just make a new one
  d. T
34
Q

T/F

  1. the 2nd part of Hadoop is MapReduce: this is responsible for data transformation
  2. data is split into chunks and processed one at a time
  3. the output of a map task in MR = the input of a reduce task
  4. MR is not responsible for:
    a. scheduling tasks
    b. monitoring tasks
    c. redoing failed tasks
    d. sending reports for failed tasks to HDFS
A
  1. F - it's responsible for data processing
  2. F - processed in parallel
  3. T
  4. d - it doesn't do that (a-c are MR's responsibilities)
35
Q

Which is false about MR architecture

a. input is where file is broken into chunks (depends on user or default) called new-splits

b. in shuffle/sort, data records that are similar are sent to the reducer while the others are discarded

c. in map task, records are outputted in the form of key-value pairs based on the task given (COUNT, SUM, AVG)

d. reducers job is to aggregate - means 3 similar records are counted as a 3 instead of separate 1’s

e. this process of MR is acyclic

A

a. F - chunks are called INPUT splits

b. F - in shuffle/sort, similar data is grouped together; ALL data gets sent to the reducer

36
Q

what are the 4 phases of the MR process?

A

INPUT

  1. splitting
  2. mapping
  3. shuffling
  4. reducing

OUTPUT
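The 4 phases can be sketched as plain Python functions on a word-count job. The function names are illustrative; a real MR job runs these phases in parallel across task trackers rather than in one process:

```python
from collections import defaultdict

def split_phase(text):
    """Input splitting: break the input into chunks (one per line here)."""
    return text.split("\n")

def map_phase(splits):
    """Mapping: emit key-value pairs, here (word, 1) for each word."""
    return [(word, 1) for line in splits for word in line.split()]

def shuffle_phase(pairs):
    """Shuffling: group values for similar keys together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducing: aggregate each group (three 1's become a 3, etc.)."""
    return {key: sum(values) for key, values in groups.items()}

text = "big data\nbig hadoop"
counts = reduce_phase(shuffle_phase(map_phase(split_phase(text))))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

The output of the map phase feeds the shuffle, and the shuffle's groups feed the reduce, matching card 34's point that a map task's output is a reduce task's input.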

37
Q

An MR request is sent to which master node?

A

job tracker

38
Q

Name the 2 MR daemons

A

job tracker and task tracker

39
Q

Which is not true about the job tracker?

a. accepts MR jobs from client applications

b. talks to the DN to find the location of the data it needs to work on

c. locates an available task tracker through a slotting system and gives it the job

d. if the task tracker is unavailable, the JT will do the job itself

A

b. F - the JT talks to the NN about the data location

d. F - the JT will not do the job itself

40
Q

Which is not true about the task tracker?

a. configured with a set of slots = how many tasks it can take

b. accepts map and reduce operations from the JT; shuffle is not in its scope

c. the TT updates the JT about a job's status

d. the TT doesn't need to send heartbeats to the JT since it's already sending job status updates

A

b. F - the TT accepts all 3 (map, reduce, and shuffle)

d. F - the TT does send heartbeats; they say 1. if it's available and 2. how many available slots it has

41
Q

t/f

  1. All master nodes and slave nodes contain both MR and HDFS components
  2. each master node has 2 components: resource manager (MR) and HDFS
  3. DN stores metadata
A
  1. T
  2. T
  3. F - NN = metadata, DN = actual file data