Lesson 6: Distributed Storage Flashcards

1
Q

What is Moore’s Law?

A

The number of transistors in an integrated circuit doubles about every two years

2
Q

Why was there a switch from faster execution to parallel execution?

A

A paradigm shift forced by physical limits:
the speed of light, atomic boundaries, and limited 3D layering cap how much faster a single processor can get

3
Q

What is Hadoop?

A
  • A framework for reliable, scalable distributed computing and data storage
  • A flexible, highly available architecture for large-scale computation and data processing on a network of commodity hardware
  • An Apache top-level project
  • Open source
4
Q

What do we need to write for MapReduce?

A
  • Mapper: application code (see the word-count sketch below)
  • Partitioner: sends data to the correct Reducer machine
  • Sort: groups input from different mappers by key
  • Reducer: application code
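
A minimal word-count sketch of the two application-code pieces, in Hadoop Streaming style (the word-count task and file names are illustrative assumptions; the partitioner and sort are supplied by the framework):

    # mapper.py -- reads lines from stdin, emits one <word, 1> pair per word
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- input arrives sorted by key, so equal words are adjacent
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")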
5
Q

How does the Hadoop Distributed File System (HDFS) store large files?

A
  • We have a file
  • HDFS splits it into blocks
  • HDFS keeps 3 copies of each block
  • HDFS distributes these blocks across the DataNodes
  • The NameNode tracks which blocks live on which DataNodes
  • Sometimes a DataNode dies → not a problem
  • The NameNode tells other DataNodes to copy the affected blocks, restoring 3× replication (see the arithmetic sketch below)
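
A worked example of the storage arithmetic in Python (the 1 GiB file size is an assumption; 128 MiB blocks and 3× replication are HDFS defaults):

    import math

    file_size   = 1 * 1024**3     # a 1 GiB file (assumed)
    block_size  = 128 * 1024**2   # HDFS default block size, 128 MiB
    replication = 3               # HDFS default replication factor

    blocks = math.ceil(file_size / block_size)  # the file becomes 8 blocks
    copies = blocks * replication               # 24 block replicas across DataNodes
    print(blocks, copies)                       # 8 24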
6
Q

What are the components of the Hadoop Architecture? (overview)

A
  • MapReduce Framework: implements the MapReduce paradigm
  • Cluster: the host machines (nodes)
  • HDFS Federation: provides logical distributed storage
  • YARN Infrastructure: assigns resources
7
Q

What is the infrastructure of YARN?

A

 Resource Manager (1 per cluster): assigns cluster resources to applications
 Node Manager (many per cluster): monitors its node
 App Master (1 per application): manages one application (e.g., a MapReduce job)
 Container (1 per task): runs one task (map, reduce, …)

8
Q

What are the shortcomings of MapReduce?

A
  • Forces your data processing into MAP and REDUCE → other workflows are missing
  • Based on “acyclic data flow” from disk to disk (HDFS)
    → not efficient for iterative tasks
  • Only suited to batch processing → no interactivity, no streaming data
9
Q

What should be counted when calculating algorithmic complexity?

A
  • page faults
  • cache misses
  • memory accesses
  • disk accesses (swap space)
10
Q

What is Google MapReduce?

A

A framework for processing LOADS of data
-> framework’s job: fault tolerance, scaling & coordination
-> programmer’s job: write the program in MapReduce form

11
Q

What are the 2 main components of Hadoop?

A

HDFS - big data storage
MapReduce - big data processing

12
Q

How can you tell that the Hadoop Architecture is inspired by LISP (list processing)?

A

Functional programming:
* Immutable data
* Pure functions (no side effects): map, reduce
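
The LISP lineage is easiest to see in ordinary functional code; a minimal Python illustration (the word-count flavour is an assumption):

    from functools import reduce

    words = ["to", "be", "or", "not", "to", "be"]

    # map: a pure function applied to each element, no side effects
    pairs = list(map(lambda w: (w, 1), words))

    # reduce: fold the mapped values into one aggregate; nothing is mutated
    total = reduce(lambda acc, kv: acc + kv[1], pairs, 0)
    print(total)  # 6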

13
Q

What is the difference between a job and a task tracker in the Hadoop Architecture?

A

Job Tracker: in charge of managing the resources of the cluster
-> first point of contact when a client submits a job
-> one per cluster

Task Tracker: does the actual processing
-> mostly attached to one or more specific data nodes

14
Q

What are the 3 functions in Google MapReduce? (2 primary, one optional)

A
  • Map (primary)
  • Reduce (primary)
  • Shuffle (optional)
15
Q

How does the Map function of Google’s MapReduce work?

A

Maps each <key, value> pair of the input list onto 0, 1, or more pairs of type <key2, value2> in the output list
-> mapping to 0 output elements = filtering
-> mapping to more than 1 output element = distribution
(see the sketch below)
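
A Python sketch of the 0-1-many behaviour (the stop-word filter and the input line are illustrative assumptions):

    STOP_WORDS = {"the", "a"}

    def map_fn(key, value):
        # emit 0, 1, or more <key2, value2> pairs per input <key, value>
        for word in value.split():        # one line -> many pairs: distribution
            if word not in STOP_WORDS:    # dropped words -> 0 pairs: filtering
                yield (word, 1)

    print(list(map_fn(1, "the cat saw a cat")))
    # [('cat', 1), ('saw', 1), ('cat', 1)]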

16
Q

How does the Reduce function of Google’s MapReduce work?

A

[summarizing]
Combines the <key, value> pairs of the input list into a single aggregate output value
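
A matching Python sketch (summing per-key counts is an assumed example of an aggregate):

    def reduce_fn(key, values):
        # combine all values observed for one key into a single output value
        return (key, sum(values))

    print(reduce_fn("cat", [1, 1]))  # ('cat', 2)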

17
Q

What does the Shuffle function do in Google’s MapReduce?

A

[consolidating relevant records]
* Helps the pipeline
* Channels each partial result to the most appropriate reduce node

18
Q

What is YARN short for?

A

Yet Another Resource Negotiator

19
Q

Explain the Hadoop ecosystem.

A

Hadoop provides strong functionality on its own, but its real power emerges when it is combined with other technologies.
(e.g., Pig, Hive, Kafka)

20
Q

What is Apache Spark?

A
  • works on top of Hadoop/HDFS
  • supports many workflows beyond map and reduce
  • in-memory caching of data
  • in-memory data sharing
  • supports data analysis, machine learning, graphs, …
  • allows development in multiple languages
  • can read/write a range of data formats
21
Q

What are RDDs?

A

Resilient Distributed Datasets
-> immutable distributed collections of objects
-> fault tolerant
-> used in every Spark component

22
Q

How do you create new RDDs?

A

By using transformations
-> from storage
-> from other RDDs
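
A PySpark sketch of both routes (the local master and the HDFS path are assumptions):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    # from storage
    lines = sc.textFile("hdfs:///data/input.txt")

    # from a driver-side Python collection
    nums = sc.parallelize([1, 2, 3, 4])

    # from other RDDs: each transformation returns a new, immutable RDD
    words   = lines.flatMap(lambda l: l.split())
    doubled = nums.map(lambda x: x * 2)
    print(doubled.collect())  # [2, 4, 6, 8]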

23
Q

What are DataFrames?

A

A way to organize data into named columns.
Similar to a table in a relational database
-> immutable once constructed
-> enable distributed computations

24
Q

How can you construct DataFrames?

A
  • read from file(s)
  • transform an existing DataFrame
  • parallelize a Python collection (list)
  • apply transformations and actions (see the sketch below)
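
A PySpark sketch of these construction routes (the file path, column names, and sample rows are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    # read from file(s)
    df = spark.read.csv("hdfs:///data/people.csv", header=True, inferSchema=True)

    # parallelize a Python collection (list) into a DataFrame
    people = spark.createDataFrame([("Ada", 36), ("Linus", 54)], ["name", "age"])

    # transform an existing DataFrame (each transformation yields a new one) ...
    adults = people.filter(people.age > 18).select("name")

    # ... and apply an action to trigger the computation
    adults.show()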
25
Q

Compare RDDs with DataFrames.

A
  • RDDs provide a low level interface into Spark
  • DataFrames have a schema
  • DataFrames are cached and optimized by Spark
  • DataFrames are built on top of the RDDs and the core Spark API
  • DataFrames are highly optimized and are faster
26
Q

How and why would we use directed acyclic graphs?

A

How:
Nodes are RDDs, arrows are transformations
Why:
* track dependencies
* the program structure is easy to reason about, for both humans and computers
* improved performance via sequential access to data & predictive processing
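
Spark records this graph as RDD lineage, which can be inspected directly (a sketch reusing the sc from the RDD example above):

    # every transformation adds a node and an arrow to the DAG
    rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)

    # print the recorded chain of parent RDDs; recomputing a lost
    # partition just replays these arrows
    print(rdd.toDebugString().decode())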

27
Q

What is the difference between a narrow and a wide transformation?

A

Narrow: everything can be found in the same partition
-> the elements required to compute a single partition live in a single partition of the parent RDD (e.g., map)

Wide: we need to fetch data from multiple partitions
-> the elements required to compute a single partition may live in many partitions of the parent RDD (e.g., groupByKey)
(see the sketch below)
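
A PySpark sketch contrasting the two (again reusing sc; the data is made up):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

    # narrow: each output partition depends on exactly one parent partition
    scaled = pairs.mapValues(lambda v: v * 10)

    # wide: groupByKey must shuffle matching keys in from all partitions
    grouped = pairs.groupByKey().mapValues(list)
    print(sorted(grouped.collect()))  # [('a', [1, 3]), ('b', [2])]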

28
Q

When should you not use Spark?

A
  • for many simple use cases, Apache MapReduce and Hive might be a more appropriate choice
  • Spark was not designed as a multi-user environment
  • Spark users need to know whether the memory they have is sufficient for a dataset
  • adding more users adds complications, since they must coordinate memory usage to run code