Spark Flashcards
(32 cards)
Two common classes of analytics apps
- Iterative algorithms (machine learning, graph)
- Interactive data mining
Enhance programmability
- Integrate into the Scala programming language
- Allow interactive use from the Scala interpreter
What part of the machine does Hadoop access?
Disk (hard drive), i.e. secondary storage rather than main memory
What memory does Spark use?
Uses RAM for faster processing
What is Scala?
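A programming language that runs on the Java Virtual Machine (JVM) and interoperates with Java; it combines object-oriented and functional programming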
What did MapReduce simplify?
“big data” analysis on large, unreliable clusters
What did users want more of as MapReduce became more popular?
- More complex, multi-stage applications
- More interactive ad-hoc queries
How is data shared across jobs in MapReduce?
Through stable storage, which is slow
What do complex apps and interactive queries both need that MapReduce lacks?
efficient primitives for data sharing
Resilient Distributed Datasets (RDD)
- Restricted form of distributed shared memory
- Efficient fault recovery using lineage (DAG)
Immutable
Once data has been loaded into an RDD (in RAM), it cannot be changed in place within that RDD
How do you modify an RDD?
You cannot change it in place; a new RDD can only be built through coarse-grained, deterministic transformations (map, filter, join…)
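The two cards above can be illustrated with a toy model in plain Python. This is only a sketch, not the Spark API: `ToyRDD` is a hypothetical class invented for illustration, and it only borrows the method names `map` and `filter`. It shows that each transformation returns a new dataset and leaves the original untouched.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable tuple of records."""
    records: Tuple[int, ...]

    def map(self, func: Callable[[int], int]) -> "ToyRDD":
        # Coarse-grained: the same function is applied to every record.
        return ToyRDD(tuple(func(x) for x in self.records))

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        # Returns a new ToyRDD; self.records is never modified.
        return ToyRDD(tuple(x for x in self.records if pred(x)))


numbers = ToyRDD((1, 2, 3, 4, 5))
evens = numbers.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Trying to overwrite the data of an existing ToyRDD fails,
# because the dataclass is frozen.
try:
    numbers.records = (0,)
    mutated = True
except AttributeError:
    mutated = False
```

After this runs, `numbers` still holds its original five records, while `evens` and `doubled` are separate datasets derived from it.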
Directed Acyclic Graph (DAG)
a directed graph of vertices and edges that contains no cycles
- vertices: represent RDDs
- edges: represent the operations applied to the RDDs
RDD Recovery
Lost data can be recomputed by replaying the transformations recorded in the lineage graph
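A minimal sketch of lineage-based recovery, written as a toy model in plain Python rather than with Spark itself. `LineageNode`, `lose`, and `times_computed` are hypothetical names made up for this example. Each node remembers its parent and the operation that produced it, so a lost in-memory copy can be rebuilt by recomputing from the parent. Real Spark recomputes lost partitions rather than whole datasets; this sketch ignores partitioning.

```python
from typing import Callable, List, Optional


class LineageNode:
    """Hypothetical lineage-graph node: a dataset plus how it was derived."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["LineageNode"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source      # stable input, only set on the root node
        self.parent = parent      # edge back to the dataset this one came from
        self.op = op              # operation applied to the parent's data
        self.data: Optional[List[int]] = None  # in-memory copy, may be lost
        self.times_computed = 0

    def compute(self) -> List[int]:
        if self.data is None:
            if self.parent is None:
                self.data = list(self.source)
            else:
                # Walk the lineage: rebuild the parent first if needed.
                self.data = self.op(self.parent.compute())
            self.times_computed += 1
        return self.data

    def lose(self) -> None:
        # Simulate losing the in-memory copy (e.g. a failed worker).
        self.data = None


base = LineageNode(source=[1, 2, 3, 4, 5, 6])
evens = LineageNode(parent=base, op=lambda xs: [x for x in xs if x % 2 == 0])
squares = LineageNode(parent=evens, op=lambda xs: [x * x for x in xs])

first = squares.compute()

# Lose two derived datasets, then ask for the result again.
squares.lose()
evens.lose()
recovered = squares.compute()
```

Only the lost nodes are recomputed: `base` is still in memory, so it is computed once, while `evens` and `squares` are each computed twice.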
SparkContext
the main entry point in the driver program; represents the connection to a Spark cluster
- used to create RDDs, accumulators and broadcast variables on that cluster
Cluster Manager
allocates resources to the worker nodes as needed and coordinates the nodes accordingly
RDD Types
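- parallelized collections (created from an existing collection in the driver program)
- external datasets (loaded from a storage system such as HDFS)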
What are the 3 types of operations programmers can perform on the RDD?
Transformations, Actions, Persistence
Transformations
create a new dataset from an existing one
- lazy in nature: they are only executed when an action is performed
Actions
return a value to the driver program, or export data to a storage system, after running a computation on the dataset
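The interplay between lazy transformations and actions can be sketched with a toy model in plain Python. This is an illustration under stated assumptions, not Spark code: `LazyDataset` and `traced_square` are hypothetical names, and only the method names `map`, `filter`, `collect`, and `count` are borrowed from Spark. Transformations merely record a step; nothing runs until an action is called.

```python
from typing import Callable, List, Tuple


class LazyDataset:
    """Hypothetical dataset that records transformations and runs them late."""

    def __init__(self, source: List[int],
                 steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._steps = steps

    # Transformations: return a new LazyDataset with one more recorded step.
    def map(self, func: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run every recorded step and return a value to the caller.
    def collect(self) -> List[int]:
        out = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def count(self) -> int:
        return len(self.collect())


calls: List[int] = []


def traced_square(x: int) -> int:
    calls.append(x)  # record that the function actually ran
    return x * x


pipeline = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4)
calls_before_action = len(calls)   # no work has been done yet

result = pipeline.collect()        # the action triggers the whole pipeline
calls_after_action = len(calls)
```

`traced_square` is not called while the pipeline is being defined; it only runs once `collect()` is invoked.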
Persistence
caches a dataset in memory for reuse in later operations; the storage level controls whether it is kept in RAM, on disk, or a mix of both
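Why persistence matters can be shown with a small toy model in plain Python. This is a sketch, not the Spark API: `CachedDataset`, `expensive`, and the `runs` counter are hypothetical, and storage levels are not modeled at all. Without `persist()`, every action recomputes the dataset; with it, the first result is kept and reused.

```python
from typing import Callable, List, Optional


class CachedDataset:
    """Hypothetical dataset whose result can optionally be kept in memory."""

    def __init__(self, compute_fn: Callable[[], List[int]]):
        self._compute_fn = compute_fn
        self._persisted = False
        self._cache: Optional[List[int]] = None

    def persist(self) -> "CachedDataset":
        # Only marks the dataset; the cache is filled by the first action.
        self._persisted = True
        return self

    def collect(self) -> List[int]:
        if self._persisted and self._cache is not None:
            return self._cache
        result = self._compute_fn()
        if self._persisted:
            self._cache = result
        return result


runs = {"n": 0}


def expensive() -> List[int]:
    runs["n"] += 1  # count how often the computation really happens
    return [x * x for x in range(5)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()
runs_without_persist = runs["n"]

runs["n"] = 0
cached = CachedDataset(expensive).persist()
cached.collect()
cached.collect()
runs_with_persist = runs["n"]
```

The unpersisted dataset runs `expensive` on every action, while the persisted one runs it only once.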
Example transformation functions
map(func)
filter(func)
distinct()
Example action functions
count()
reduce(func)
collect()
take(n)
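To make the meaning of these functions concrete, here is a rough analogy using an ordinary local Python list. It is not Spark code and nothing here is distributed or lazy; each line only mirrors what the corresponding transformation or action computes. The sample `data` list is made up for illustration.

```python
from functools import reduce

data = [3, 1, 3, 2, 1]

# Analogues of the transformations: each produces a new collection.
mapped = [x * 10 for x in data]        # like map(func)
filtered = [x for x in data if x > 1]  # like filter(func)
distinct = sorted(set(data))           # like distinct(); sorted only for a stable display order

# Analogues of the actions: each produces a plain value.
count = len(data)                          # like count()
total = reduce(lambda a, b: a + b, data)   # like reduce(func)
collected = list(data)                     # like collect(): all elements
first_two = data[:2]                       # like take(2): the first 2 elements
```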
Example persistence functions
persist()
cache()