Week 6 - Apache Spark Flashcards

1
Q

What does RDD stand for?

A

Resilient Distributed Dataset

2
Q

Is an RDD read-only?

A

Yes

3
Q

RDDs can only be created through what two sources? (2)

A

1) Data in stable storage

2) Other RDDs

4
Q

An RDD is a restricted, distributed, shared ____ what?

A

Memory system (a cached dataset acting as shared memory)

5
Q

An RDD contains what pieces of the dataset?

A

Atomic pieces of the dataset

6
Q

An RDD contains dependencies on what?

A

Parent RDDs

for fault tolerance

7
Q

How does an RDD compute its dataset?

A

Based on its parents (for fault tolerance), plus metadata about its partitioning scheme and data placement

8
Q

RDDs are read-only and what else?

A

Partitioned collections of records

9
Q

Two important features of RDDs and Apache Spark

A

1) Fault Tolerance

2) Lazy Evaluation

10
Q

Describe RDD Fault Tolerance

A

It is achieved through lineage: a lost partition can be recomputed from its parent RDDs

11
Q

Describe RDD Lazy Evaluation

A

An RDD will not be materialized until a reduce-like or persist job (one that creates meaningful output) is triggered

12
Q

What two classes of operations can you do on RDDs?

A

1) Transformations

2) Actions

13
Q

RDD Transformations

A

Build RDDs through operations on other RDDs

1) map, filter, join
2) lazy operations

14
Q

RDD Actions

A

1) Count, Collect, save

2) trigger execution

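The transformation/action split on the cards above can be sketched in plain Python. This is a toy analogy, not Spark's actual API: the `ToyRDD` class and its methods are invented for illustration, but they show the key idea that transformations only build a description of the computation, while actions pull data through it.

```python
# Toy illustration (NOT Spark's real API) of lazy transformations vs. actions.
class ToyRDD:
    def __init__(self, source):
        # 'source' is a zero-argument function yielding records;
        # nothing is computed when the ToyRDD is built.
        self.source = source

    # --- transformations: return a new ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self.source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self.source() if pred(x)))

    # --- actions: actually pull data through the whole pipeline ---
    def collect(self):
        return list(self.source())

    def count(self):
        return sum(1 for _ in self.source())

rdd = ToyRDD(lambda: iter(range(10)))
# Building this pipeline processes no data at all:
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
print(evens_squared.count())    # 5
```

Because each ToyRDD only stores the function that would produce its data, re-running an action replays the whole chain from the source, which is also the intuition behind lineage-based fault tolerance (card 10).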
15
Q

What is HDFS?

A

1) Hadoop Distributed File System
2) A distributed file system
3) Can contain text files, log files, errors

16
Q

How do you find errors in HDFS files?

A

file.filter(_.contains("ERROR"))

17
Q

DAG Scheduler

A

Partitions the DAG into efficient stages (think narrow vs. wide dependencies)

18
Q

Narrow Dependencies

A

Transformation whose output needs input from only one partition (very little communication)

1) map
2) union

19
Q

Wide Dependencies

A

Transformation whose output needs data from multiple partitions (requires a shuffle)

1) groupByKey
2) join with inputs that are not co-partitioned

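The wide-dependency cases above all need a shuffle: every output partition may require records from every input partition. A minimal sketch, assuming nothing about Spark's internals (the function name and data are invented for illustration), is to hash each record's key to pick its destination partition:

```python
# Sketch of the shuffle behind a groupByKey-style wide dependency:
# records are redistributed so all values for a key land in one partition.
def shuffle_by_key(input_partitions, num_output_partitions):
    output = [dict() for _ in range(num_output_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            # Same key always hashes to the same destination partition.
            dest = hash(key) % num_output_partitions
            output[dest].setdefault(key, []).append(value)
    return output

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
grouped = shuffle_by_key(parts, 2)
# All values for "a" now live together in exactly one output partition,
# even though they started on different input partitions.
```

Contrast this with a narrow dependency like map, where each output partition reads only its own input partition and no data moves between machines.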
20
Q

Should wide dependencies come early or late in the DAG?

A

Late (less data to shuffle by then)

21
Q

Hadoop scheduler

A

Only 2

1) Map
2) Reduce

22
Q

Hadoop where is data stored?

A

Assumes all data is on disk (intermediate data has to be written to disk), which is how it gets fault tolerance

23
Q

Hadoop API

A

Only the Map and Reduce procedural programming model

24
Q

Hadoop storage

A

Only the HDFS file system (now extended)

25
Q

Does Hadoop use more or less memory than Spark?

A

Less; it stores data on local disk

26
Q

Apache Spark has what APIs?

A

1) Java
2) Scala
3) Python

27
Q

4 main Apache Spark libraries

A

1) Spark SQL
2) Spark Streaming
3) MLlib (machine learning)
4) GraphX

28
Q

What is a DataFrame?

A

Looks like a table (you can run SQL operations on it)
29
Q

What Spark library do Google PageRank and Shortest Path use?

A

GraphX

30
Q

What Spark library is used for streaming?

A

Spark Streaming (processes data in real time)

31
Q

Spark DStream

A

An abstraction that represents the streaming data source

32
Q

What does a Spark DStream do?

A

1) Chops the data into batches of x seconds
2) Processes the batches like RDDs
3) Returns the processed RDD results in batches
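The three DStream steps above amount to micro-batching. A toy sketch in plain Python (not Spark's API; for simplicity it batches by record count, whereas a real DStream batches by time interval):

```python
# Toy micro-batching analogy for a DStream: chop an incoming stream into
# fixed-size batches and process each batch as one unit (like one RDD).
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # each full batch is handed off for processing
            batch = []
    if batch:
        yield batch          # final partial batch

# "Process" each batch (here: just sum it) and collect batched results.
results = [sum(b) for b in micro_batches(range(10), 4)]
# batches [0,1,2,3], [4,5,6,7], [8,9] give sums [6, 22, 17]
```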
33
Q

Spark DStream batch sizes

A

1) As low as 1/2 second
2) Latency of about 1 second

34
Q

Apache Kafka

A

An open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. It aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds

35
Q

DStream operations: state / transformations

A

1) Window (timed window, counting, finding first or last)

36
Q

Apache Hadoop - Apache Hive

A

SQL-like: a data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL)

37
Q

Apache Hadoop - Apache Pig

A

A high-level data-flow language and execution framework for parallel computation (less rigid than plain Hadoop)
38
Q

Apache Hadoop - Apache HBase

A

A NoSQL database (based on BigTable): a scalable, distributed database that supports structured data storage for large tables

39
Q

Apache Hadoop - Apache ZooKeeper

A

Roughly comparable to Ansible: a high-performance coordination service for distributed architectures

40
Q

Apache Mahout

A

Machine-learning algorithms: a distributed linear algebra framework with a mathematically expressive Scala DSL

41
Q

What does GeoSpark do?

A

It takes the RDD layer (generic data processing) and extends it with spatial data processing operations
42
Q

Spatial query processing layer

A

Out-of-the-box implementations of the de facto spatial queries: range queries, KNN (k-nearest-neighbor) queries, and join queries

43
Q

What might cause data skew (GeoSpark)?

A

Creating a uniform grid and inserting data based on the grid coordinates; this creates a load-balancing problem

44
Q

Load-balancing problem

A

A grid problem caused by data skew: some boxes are empty, some boxes are too full

45
Q

Four types of grids

A

1) Uniform grid
2) Quad-tree (based on density)
3) KDB-tree (no overlap)
4) R-tree (based on clusters; allows overlap)
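The skew problem on the cards above is easy to demonstrate. A minimal sketch, assuming an invented cell function and made-up clustered data: with a uniform grid, clustered points pile into one cell while most cells stay empty, which is exactly why density-aware partitions (Quad-tree, KDB-tree) help.

```python
# Demonstrate uniform-grid data skew: clustered points all land in one
# cell, creating the load-balancing problem from cards 43-44.
from collections import Counter

def uniform_grid_cell(x, y, cell_size):
    # Map a point to its (column, row) cell in a uniform grid.
    return (int(x // cell_size), int(y // cell_size))

# Synthetic clustered data: 90 points near the origin, 2 far away.
points = [(0.1 * i % 1.0, 0.07 * i % 1.0) for i in range(90)] + \
         [(9.5, 9.5), (8.2, 1.1)]

load = Counter(uniform_grid_cell(x, y, 1.0) for x, y in points)
# Cell (0, 0) holds 90 points; almost every other cell is empty.
```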
46
Q

When do you build a local index (GeoSpark)?

A

When running the same computation many times: hundreds of thousands of times overall, or thousands of times per partition

47
Q

Spatial join query (GeoSpark)

A

Count the number of points within an area

48
Q

(GeoSpark) Is a join or a filter more expensive?

A

A join

49
Q

What are space-filling curves?

A

1) They map 2D data to a single number
2) Some detail is lost

A space-filling curve is a geometric/mathematical mapping of 2D data: partition the space into cells, map each cell to a single number, and cells whose numbers are close tend to be close in space
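One concrete space-filling curve is the Z-order (Morton) curve, which interleaves the bits of a cell's (x, y) coordinates into a single number. This sketch is illustrative (the function name and bit width are chosen here, not taken from any particular library):

```python
# Z-order (Morton) curve: interleave the bits of (x, y) so that cells
# close in 2D tend to get close single-number codes.
def morton_code(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code

# The four cells of a 2x2 grid trace the "Z" shape:
codes = [morton_code(x, y) for y in range(2) for x in range(2)]
# (0,0)->0, (1,0)->1, (0,1)->2, (1,1)->3
```

The mapping is not perfect, which matches the card: some neighboring cells in 2D end up with distant codes, so detail about 2D proximity is lost.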
50
Q

Spatial indexes

A

1) Similar to hash tables (uniform grid): index based on grid id
2) Quad-tree: partition the space into 4, then partition each quadrant into 4, and so on
3) R-tree: bounding rectangles
4) Voronoi diagram: partitions the space into cells with the property that any query point inside a cell is closest to that cell's point; useful for k-nearest-neighbor queries