Spark Flashcards by Valari Graham

Two common classes of analytics apps

Iterative algorithms (machine learning, graph)
Interactive data mining

Enhance programmability

Integrate into Scala programming language
Allow interactive use from Scala interpreter

What part of the machine does Hadoop access?

Disk storage (hard drive), not main memory

What memory does Spark use?

Uses RAM for faster processing

What is Scala?

A programming language that runs on the Java Virtual Machine (JVM) and interoperates with Java

What did MapReduce simplify?

“big data” analysis on large, unreliable clusters

What did users want more of as MapReduce became more popular?

- More complex, multi-stage applications
- More interactive ad-hoc queries

How is data shared across jobs in MapReduce?

Only through stable storage, which is slow

What do complex apps and interactive queries both need that MapReduce lacks?

efficient primitives for data sharing

Resilient Distributed Datasets (RDD)

Restricted form of distributed shared memory
Efficient fault recovery using lineage (DAG)

Immutable

Once data has been loaded into an RDD, the contents of that RDD cannot be changed

How do you modify an RDD?

An RDD cannot be modified in place. A new RDD can only be built through coarse-grained deterministic transformations (map, filter, join…) of existing data

Directed Acyclic Graph (DAG)

an arrangement of vertices and directed edges that contains no cycles
- vertices: represent RDDs
- edges: represent the operations applied to the RDDs

RDD Recovery

Using the lineage graph, the data can be restored
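The idea of restoring data from lineage can be sketched in plain Python. This is a toy model only, not Spark's API: the class name `ToyRDD` and its attributes are hypothetical, and real Spark tracks lineage per partition across a cluster.

```python
# Toy model of lineage-based recovery (plain Python, NOT Spark's API).
# ToyRDD and all its attributes are hypothetical names for illustration.

class ToyRDD:
    """Minimal stand-in for an RDD: it remembers how it was built."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source  # base data, set only on the root dataset
        self.parent = parent  # dataset this one was derived from
        self.func = func      # transformation applied to the parent's rows
        self.cached = None    # materialized rows; may be lost

    def map(self, f):
        # Returns a NEW dataset; the current one is left untouched.
        return ToyRDD(parent=self, func=lambda rows: [f(x) for x in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, func=lambda rows: [x for x in rows if pred(x)])

    def compute(self):
        # Rebuild the rows from the lineage whenever they are missing.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.func(self.parent.compute())
        return self.cached


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
evens_squared.cached = None          # simulate losing the in-memory data
recovered = evens_squared.compute()  # replayed from the recorded lineage
```

Because each transformation is deterministic, replaying the recorded steps yields the same rows as before the loss.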

SparkContext

represents the connection to a Spark cluster (main driver)
- used to create RDDs, accumulators, and broadcast variables on that cluster

Cluster Manager

allocates resources to the worker nodes as needed and coordinates the nodes accordingly

RDD Types

parallelized collections

What are the 3 types of operations programmers can perform on the RDD?

Transformations, Actions, Persistence

Transformations

create a new dataset from an existing one
- lazy in nature and are only executed when some action is performed
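Lazy evaluation can be illustrated with Python's built-in `map`, which is itself lazy. This is only an analogy under that assumption, not Spark code; the helper `traced_double` and the list `calls` are hypothetical names used to observe when work actually happens.

```python
# Analogy for lazy transformations using Python's lazy built-in map().
# traced_double and calls are hypothetical helpers, not part of Spark.

calls = []  # records every element the function is actually applied to


def traced_double(x):
    calls.append(x)
    return 2 * x


data = [1, 2, 3]

lazy = map(traced_double, data)    # "transformation": nothing has run yet
calls_before_action = list(calls)  # snapshot taken before any action

result = list(lazy)                # "action": forces the computation
calls_after_action = list(calls)
```

The snapshot taken before the action is empty, which shows that declaring the transformation did no work on its own.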

Actions

return a value to the driver program, or export data to a storage system, after performing a computation on the dataset

Persistence

Caches datasets in memory for future operations, with the option to store them on disk, in RAM, or a mix of both (the storage level)
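Why caching helps can be shown with a small plain-Python sketch. It is not Spark code: `expensive_dataset` and `compute_count` are hypothetical names, and keeping the list in a variable stands in for keeping a dataset in memory.

```python
# Toy illustration of why persistence helps (plain Python, NOT Spark's API).
# expensive_dataset and compute_count are hypothetical names.

compute_count = 0  # how many times the dataset has been recomputed


def expensive_dataset():
    global compute_count
    compute_count += 1
    return [x * x for x in range(5)]


# Without persistence: every "action" recomputes the dataset from scratch.
total = sum(expensive_dataset())
largest = max(expensive_dataset())
runs_without_cache = compute_count

# With persistence: compute once, keep the result, reuse it for each action.
compute_count = 0
cached = expensive_dataset()
total_cached = sum(cached)
largest_cached = max(cached)
runs_with_cache = compute_count
```

Both approaches give the same answers; the cached version simply does the expensive work once instead of once per action.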

Example transformation functions

map(func)
filter(func)
distinct()

Example action functions

count()
reduce(func)
collect()
take(n)

Example persistence functions

persist()
cache()

map(func)

return a new distributed dataset formed by passing each element of the source through a function func

flatMap(func)

first applies the map function to each element, then flattens the results, so each input element can produce zero or more output elements
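The difference between map and flatMap can be sketched with ordinary Python lists. This is an analogy rather than Spark code, and the sample `lines` are made up for illustration.

```python
# map vs. flatMap, imitated with plain Python lists (NOT Spark's API).
from itertools import chain

lines = ["to be", "or not"]  # hypothetical sample input

# map-like: one output element per input element (here, a list of words).
mapped = [line.split() for line in lines]

# flatMap-like: apply the same function, then flatten one level.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
```

Here `mapped` keeps one nested list per input line, while `flat_mapped` is a single flat list of words.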

Which framework is better for larger data?

Hadoop, when the data is too large to fit in memory

Log mining

load error messages from a log into memory, then interactively search for various patterns
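The log mining pattern can be sketched in plain Python, with a list standing in for a dataset held in memory. The log lines and search terms below are invented for illustration, and none of this is Spark's API.

```python
# Log mining pattern in plain Python (NOT Spark's API).
# The log lines and search terms are hypothetical sample data.

log_lines = [
    "INFO start",
    "ERROR mysql timeout",
    "ERROR php fatal",
    "WARN disk",
    "ERROR mysql refused",
]

# Step 1: load only the error messages and keep them in memory.
errors = [line for line in log_lines if line.startswith("ERROR")]

# Step 2: run several ad-hoc searches against the in-memory errors.
mysql_count = sum("mysql" in e for e in errors)
php_count = sum("php" in e for e in errors)
```

The point of the pattern is that the filtered errors are loaded once and then reused by every later query.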

reduce(func)

aggregate the elements of the dataset using a function func
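A rough picture of a distributed reduce, using Python's `functools.reduce` on hypothetical partitions. In Spark the function should be commutative and associative so the result does not depend on how the data is split; the partition layout below is an assumption made for illustration.

```python
# Sketch of reduce over partitions in plain Python (NOT Spark's API).
from functools import reduce
from operator import add

# Hypothetical split of one dataset into three partitions.
partitions = [[1, 2], [3, 4], [5]]

# Each partition is reduced independently (in Spark, in parallel)...
partials = [reduce(add, part) for part in partitions]

# ...then the partial results are combined with the same function.
total = reduce(add, partials)
```

Because addition is commutative and associative, regrouping the elements into different partitions would give the same `total`.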

What three options does Spark provide for persisting RDDs?

(1) in-memory storage as deserialized Java objects: fastest, because the JVM can access each RDD element natively
(2) in-memory storage as serialized data: a more memory-efficient representation for when space is limited, at the cost of lower performance
(3) on-disk storage: for RDDs that are too large to keep in memory but costly to recompute

Fault recovery

RDDs maintain lineage information that can be used to reconstruct lost partitions

What are the benefits of the RDD Model?

- Consistency is easy due to immutability
- Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
- Locality-aware scheduling of tasks on partitions
- Despite being restricted, model seems applicable to a broad variety of applications

Spark Flashcards

(32 cards)