
Spark Flashcards

(32 cards)

1
Q

Two common classes of analytics apps

A
  • Iterative algorithms (machine learning, graph)
  • Interactive data mining
2
Q

Enhance programmability

A
  • Integrate into Scala programming language
  • Allow interactive use from Scala interpreter
3
Q

What part of the machine does Hadoop access?

A

Disk (secondary storage such as a hard drive), not main memory; MapReduce reads and writes its data on disk between jobs

4
Q

What memory does Spark use?

A

RAM (main memory): Spark keeps working data in memory for faster processing

5
Q

What is Scala?

A

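A programming language that runs on the Java Virtual Machine (JVM) and interoperates with Java; it combines object-oriented and functional programming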

6
Q

What did MapReduce simplify?

A

“big data” analysis on large, unreliable clusters

7
Q

What did users want more of as MapReduce became more popular?

A

- More complex, multi-stage applications
- More interactive ad-hoc queries

8
Q

How to share data in MapReduce across jobs?

A

Through stable storage (e.g., a distributed file system), which is slow because of replication and disk I/O

9
Q

What do complex apps and interactive queries both need that MapReduce lacks?

A

efficient primitives for data sharing

10
Q

Resilient Distributed Datasets (RDD)

A
  • Restricted form of distributed shared memory
  • Efficient fault recovery using lineage (DAG)
11
Q

Immutable

A

Once an RDD has been created, its contents cannot be changed; any operation on it produces a new RDD instead

12
Q

How to modify RDD?

A

An RDD cannot be modified in place. It can only be built through coarse-grained deterministic transformations (map, filter, join…), each of which yields a new RDD
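The idea of building new datasets instead of editing old ones can be sketched in plain Python. This is a toy model only: tuples stand in for RDDs, and `map_dataset` and `filter_dataset` are hypothetical helper names, not Spark APIs.

```python
# Toy stand-in for an RDD: an immutable tuple of records.
# Plain Python only; none of these names are Spark APIs.

def map_dataset(data, func):
    """Apply func to every record and return a NEW tuple."""
    return tuple(func(x) for x in data)


def filter_dataset(data, pred):
    """Keep the records for which pred is true and return a NEW tuple."""
    return tuple(x for x in data if pred(x))


numbers = (1, 2, 3, 4)
doubled = map_dataset(numbers, lambda x: x * 2)   # coarse-grained: same func for all records
large = filter_dataset(doubled, lambda x: x > 4)  # derived from doubled, which stays intact
```

Each step leaves its input untouched, so `numbers` and `doubled` remain available after `large` is built.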

13
Q

Directed Acyclic Graph (DAG)

A

A graph of vertices joined by directed edges, with no cycles
- vertices: represent RDDs
- edges: represent the operations applied to the RDDs
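A lineage DAG of this kind can be sketched with a plain dictionary and Python's standard `graphlib` module (Python 3.9+). The dataset names and operations below are made up for illustration and do not come from a real Spark job.

```python
from graphlib import TopologicalSorter

# Each key is a vertex (a dataset). Its value maps every parent vertex
# to the label of the edge (the operation) that leads from that parent.
# All names here are hypothetical.
lineage = {
    "lines": {},
    "errors": {"lines": "filter"},
    "fields": {"errors": "map"},
    "codes": {"lines": "map"},
    "joined": {"fields": "join", "codes": "join"},
}

# TopologicalSorter expects node -> set of predecessors.
parents_of = {node: set(parents) for node, parents in lineage.items()}

# static_order() raises graphlib.CycleError if the graph has a cycle,
# so a successful call also confirms the graph is acyclic.
order = list(TopologicalSorter(parents_of).static_order())
```

The resulting order lists every dataset after all of its parents, which is the order in which the datasets could be computed.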

14
Q

RDD Recovery

A

Lost data can be recomputed by replaying the transformations recorded in the lineage graph
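Recovery from lineage can be illustrated with a small toy model. `ToyDataset` is a hypothetical class, not a Spark type, and losing a partition is simulated by setting it to `None`.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: partitions plus a record of how to rebuild them."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # lineage: the dataset this one was derived from
        self.func = func              # lineage: per-record function applied to the parent

    def map(self, func):
        """Build a new dataset and remember the parent and function used."""
        derived = [[func(x) for x in self.get(i)] for i in range(len(self.partitions))]
        return ToyDataset(derived, parent=self, func=func)

    def get(self, index):
        """Return one partition, rebuilding it from the parent if it was lost."""
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source data lost; lineage cannot rebuild it")
            self.partitions[index] = [self.func(x) for x in self.parent.get(index)]
        return self.partitions[index]


source = ToyDataset([[1, 2], [3, 4]])
squares = source.map(lambda x: x * x)
squares.partitions[1] = None    # simulate losing one partition
recovered = squares.get(1)      # recomputed from source using the recorded function
```

Only the lost partition is recomputed; partition 0 of `squares` is never touched.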

15
Q

SparkContext

A

Represents the connection to a Spark cluster; it is the main entry point in the driver program
- used to create RDDs, accumulators, and broadcast variables on that cluster

16
Q

Cluster Manager

A

Allocates resources to the worker nodes as needed and coordinates the nodes accordingly

17
Q

RDD Types

A

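- Parallelized collections: created from an existing collection in the driver program
- External datasets: created from files in external storage such as HDFS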

18
Q

What are the 3 types of operations programmers can perform on the RDD?

A

Transformations, Actions, Persistence

19
Q

Transformations

A

Create a new dataset from an existing one
- lazy in nature: they are only executed when an action is performed
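Lazy evaluation can be sketched with a toy class that only records transformations and runs them when an action is called. `LazyDataset` and `traced_double` are hypothetical names for this illustration; this is not how Spark is implemented internally.

```python
class LazyDataset:
    """Toy model: transformations are recorded, and only collect() executes them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet applied

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: run every recorded step and return the result as a list."""
        data = list(self._source)
        for kind, func in self._steps:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data


calls = []

def traced_double(x):
    calls.append(x)  # record each call so we can see when work happens
    return x * 2


pipeline = LazyDataset([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = pipeline.collect()       # the action triggers the recorded steps
```

Building `pipeline` costs almost nothing; the function calls happen only inside `collect()`.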

20
Q

Actions

A

Return a value to the driver program, or export data to a storage system, after performing a computation on the dataset

21
Q

Persistence

A

Caches a dataset in memory for reuse in future operations; a storage level selects whether it is kept on disk, in RAM, or a mix of both
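The effect of persistence can be sketched by counting how often a computation actually runs. `CachedComputation` is a hypothetical toy class; it only borrows the method names `persist` and `collect` and models a single in-memory storage level.

```python
class CachedComputation:
    """Toy model: recompute on every action unless persist() was requested."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._persist = False
        self.runs = 0  # how many times the computation really executed

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cached is not None:
            return self._cached  # reuse the stored result
        self.runs += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


uncached = CachedComputation(lambda: [x * x for x in range(5)])
uncached.collect()
uncached.collect()  # computed a second time

cached = CachedComputation(lambda: [x * x for x in range(5)]).persist()
cached.collect()
cached.collect()    # served from the cache
```

This is why persistence matters for iterative and interactive workloads that touch the same dataset repeatedly.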

22
Q

Example transformation functions

A

map(func)
filter(func)
distinct()

23
Q

Example action functions

A

count()
reduce(func)
collect()
take(n)

24
Q

Example persistence functions

A

persist()
cache()

25
Q

map(func)

A

Returns a new distributed dataset formed by passing each element of the source through the function func
26
Q

flatMap(func)

A

Applies func to each element, where func returns a sequence of items, and then flattens the results into a single dataset
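The difference between map and flatMap can be sketched with ordinary Python lists. This uses only the standard library and is meant to mirror the semantics, not Spark's distributed execution; the sample lines are made up.

```python
from itertools import chain

lines = ["to be", "or not"]

# map: exactly one output element per input element (here, a list of words per line)
mapped = [line.split() for line in lines]

# flatMap: apply the same function, then flatten the per-line lists into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
```

With map the result keeps one entry per input line; with flatMap the line boundaries disappear and only the words remain.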
27
Q

Which framework is better for larger data?

A

Hadoop: MapReduce works from disk, so it suits datasets too large to fit in the cluster's memory
28
Q

Log mining

A

Load error messages from a log into memory, then interactively search for various patterns
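The log mining pattern can be sketched on a small in-memory list. The log lines below are invented sample data, and plain list comprehensions stand in for distributed filter and map steps.

```python
# Invented sample log; fields are tab-separated: level, then the message.
log_lines = [
    "INFO\tserver started",
    "ERROR\tmysql\tconnection refused",
    "ERROR\tphp\tundefined index",
    "WARN\tdisk nearly full",
    "ERROR\tmysql\ttimeout",
]

# Step 1: keep only the error lines and strip the level field.
errors = [line for line in log_lines if line.startswith("ERROR")]
messages = [line.split("\t", 1)[1] for line in errors]

# Step 2: the messages now sit in memory, so several searches can reuse them
# without re-reading the log.
mysql_count = sum("mysql" in m for m in messages)
php_count = sum("php" in m for m in messages)
```

The point of the pattern is that the filtered messages are loaded once and then queried many times.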
29
Q

reduce(func)

A

Aggregates the elements of the dataset using a function func, which takes two arguments and returns one
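A reduce over partitioned data can be sketched with `functools.reduce`: each partition is reduced on its own, and the partial results are then combined. The partition layout is made up, and the approach assumes func is commutative and associative, so that the grouping does not change the result.

```python
from functools import reduce
from operator import add

# Made-up partitions of one dataset, as if spread over three workers.
partitions = [[1, 2, 3], [4, 5], [6]]

# Reduce inside each partition first...
partials = [reduce(add, part) for part in partitions]

# ...then combine the per-partition results into the final value.
total = reduce(add, partials)
```

Addition satisfies both properties, so the partitioned result matches a single pass over all six numbers.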
30
Q

What three options does Spark provide for persisting RDDs?

A

(1) In-memory storage as deserialized Java objects: fastest, because the JVM can access each RDD element natively
(2) In-memory storage as serialized data: a more memory-efficient representation for when space is limited, at the cost of lower performance
(3) On-disk storage: for RDDs that are too large to keep in memory but costly to recompute
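The three storage options can be loosely mirrored in plain Python, with `pickle` standing in for serialization. This is an analogy only: the records and the file name are hypothetical, and Spark's own serializers and storage levels work differently.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical records standing in for one partition of a dataset.
records = [{"id": i, "score": i * 0.5} for i in range(3)]

# (1) In memory as live objects: usable directly, no decoding step.
in_memory_objects = records

# (2) In memory as serialized bytes: must be decoded before each use.
in_memory_bytes = pickle.dumps(records)

# (3) On disk: written out and read back only when needed.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "partition.pkl"
    path.write_bytes(in_memory_bytes)
    from_disk = pickle.loads(path.read_bytes())
```

Options (2) and (3) both pay a decoding cost on access, which is the performance trade-off the card describes.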
31
Q

Fault recovery

A

RDDs maintain lineage information that can be used to reconstruct lost partitions
32
Q

What are the benefits of the RDD Model?

A

- Consistency is easy due to immutability
- Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
- Locality-aware scheduling of tasks on partitions
- Despite being restricted, the model seems applicable to a broad variety of applications