spark1 Flashcards

(42 cards)

1
Q

Spark is best suited for _____ data.
* Real-time
* Virtual
* Structured
* All of the above

A

All of the above

2
Q

Which of the following is a feature of Apache Spark?
* Speed
* Supports multiple languages
* Advanced Analytics
* All of the above

A

All of the above

3
Q

What does the Spark engine do?
* Scheduling
* Distributing data across the cluster
* Monitoring data across the cluster
* All of the above

A

All of the above

4
Q

RDD can NOT be created from data stored on?
* LocalFS
* Oracle
* S3
* HDFS

A

Oracle

5
Q

For resource management, Spark can use?
* YARN
* Mesos
* Standalone cluster manager
* All of the above

A

All of the above

6
Q

Fault Tolerance in RDD is achieved using?
* Immutable nature of RDD
* DAG(Directed Acyclic Graph) or Data Lineage
* Both A&B
* Neither A nor B

A

Both A&B

7
Q

What is a transformation on a Spark RDD?
* Takes an RDD as input and produces one or more RDDs as output
* Returns the final results of RDD computations
* The way to send results from executors to the driver
* None of the above

A

Takes an RDD as input and produces one or more RDDs as output

8
Q

Which of the following is a feature of Spark RDD?
* In-memory computation
* Lazy evaluations
* Fault Tolerance
* All of the mentioned

A

All of the mentioned

9
Q

Four main components built on top of Spark Core

A
  • Spark ML
  • Spark SQL
  • Spark Streaming
  • Spark GraphX
10
Q

Describe Spark ML

A

Spark ML provides simple APIs for running machine-learning functions (classification, clustering, regression) and for creating execution pipelines

11
Q

Describe Spark SQL

A

Spark module for working with structured data

12
Q

Describe Spark Streaming

A

Large-scale, near-real-time stream processing framework

13
Q

Describe Spark GraphX

A

Spark API for graph-parallel computation; it includes
- a growing collection of graph algorithms
- builders to simplify graph analytics tasks

14
Q

Features of Hive

A

good abstraction
declarative language
less error-prone
easier to learn & analyze
compiles to Java MapReduce code

15
Q

Key components of the Hive architecture

A

Metastore
Thrift server
Driver
HiveQL
Hive CLI

16
Q

Different modes of execution in Apache Pig

A
17
Q

4 Pig vs SQL

18
Q

3 key components of HBase

A

HBase RegionServer
HBase Master
ZooKeeper

19
Q

6 HBase vs DBMS

20
Q

Role of ZooKeeper in the HBase architecture

A

Managing the state and configuration of the HBase cluster, and providing distributed coordination, leader election, synchronization, and locking services.

21
Q

4 How ZooKeeper achieves consistency, and how it achieves performance

22
Q

__ is a distributed graph processing framework on top of Spark.
* MLlib
* Spark Streaming
* GraphX
* None of the above

23
Q

Spark can be up to 100x faster than MapReduce due to?
* In-memory computing
* Development in Scala
* Stream processing
* Spark SQL

A

In-memory computing

24
Q

Creating an RDD

A

Load from an external dataset
Create an RDD from another RDD
Parallelize a centralized collection

25
Transformation operation examples
map, filter, join
26
Action operation examples
count, collect, reduce, save
27
Lazy evaluation
Transformations are not computed right away; computation runs only when an action is called
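A minimal sketch of lazy evaluation, written in plain Python rather than Spark so it runs anywhere. `ToyRDD`, `double`, and `calls` are hypothetical names made up for illustration; the class only imitates the idea that transformations are recorded and an action triggers the work.

```python
# Toy model of lazy evaluation in plain Python; ToyRDD is hypothetical, not Spark's API.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = list(data)   # source records
        self._ops = ops or []     # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new ToyRDD and only records the step.
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: also only recorded.
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: runs every recorded step now and returns the results.
        records = self._data
        for kind, fn in self._ops:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records

    def count(self):
        # Action: also forces the computation.
        return len(self.collect())


calls = []  # records every time the mapped function actually runs

def double(x):
    calls.append(x)
    return x * 2

rdd = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has been computed yet
result = rdd.collect()             # the action triggers the computation
calls_after_action = len(calls)
```

In this toy, `double` is never called while the chain is being built; it runs only once `collect()` is invoked.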
28
RDD fault tolerance
No replication in memory; lost data is recomputed using lineage
29
Lineage graph
Maintains the dependencies between RDDs; recovery goes back to the closest disk-based RDD
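The recovery idea on this card can be sketched with a toy lineage chain in plain Python. `Node`, `source`, `squared`, and `plus_one` are hypothetical names, and the "disk" is just a list that survives the simulated failure; none of this is Spark's API.

```python
# Toy model of lineage-based recovery; all names are hypothetical, not Spark's API.
class Node:
    def __init__(self, name, parent=None, fn=None, on_disk=None):
        self.name = name
        self.parent = parent      # dependency on the parent dataset
        self.fn = fn              # how to derive this dataset from the parent
        self.on_disk = on_disk    # data that survives a failure, if any
        self.in_memory = None     # computed result, lost on failure

    def compute(self, trace):
        if self.in_memory is not None:
            return self.in_memory
        if self.on_disk is not None:
            trace.append(f"read {self.name} from disk")
            return self.on_disk
        # Walk back along the lineage, then reapply this node's step.
        parent_data = self.parent.compute(trace)
        trace.append(f"recompute {self.name}")
        self.in_memory = [self.fn(x) for x in parent_data]
        return self.in_memory


source = Node("source", on_disk=[1, 2, 3])
squared = Node("squared", parent=source, fn=lambda x: x * x)
plus_one = Node("plus_one", parent=squared, fn=lambda x: x + 1)

plus_one.compute([])          # first run fills the in-memory results
squared.in_memory = None      # simulate losing the in-memory results
plus_one.in_memory = None

trace = []
recovered = plus_one.compute(trace)
```

Here `trace` shows the walk back to the closest disk-based dataset followed by the replay of each dependent step.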
30
Representation of an RDD
Data part: ----- Metadata information: ------
31
Data part: ----- Metadata information: ------
Data part: multiple partitions
Metadata information: dependencies on parent RDDs
32
Narrow dependency examples
filter, map
33
Wide dependency examples
join, grouping
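To make the narrow/wide distinction concrete, here is a rough sketch using lists of lists as stand-in partitions. `narrow_filter`, `wide_group_by_key`, and the `ord(key) % num_out` partitioner are assumptions made up for this example, not Spark functions.

```python
# Toy partitions in plain Python; function names are hypothetical, not Spark's API.
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

def narrow_filter(parts, pred):
    # Narrow: each output partition depends on exactly one input partition.
    return [[rec for rec in part if pred(rec)] for part in parts]

def wide_group_by_key(parts, num_out):
    # Wide: records with the same key may sit in different input partitions,
    # so an output partition may need data from every input partition.
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            out[ord(key) % num_out][key].append(value)  # deterministic toy partitioner
    return [dict(d) for d in out]

filtered = narrow_filter(partitions, lambda rec: rec[1] > 1)
grouped = wide_group_by_key(partitions, 2)
```

The filter never moves a record out of its partition, while the grouping has to bring both `"a"` records together even though they started in different partitions.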
34
DAG
Directed acyclic graph
35
Spark is faster than Hadoop especially for
Iterative algorithms
36
Spark needs more -------- than Hadoop
Memory
37
map() operates on ------
the entire record
38
mapValues() operates on ------
the second component of the record
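The difference between the two cards above can be illustrated with plain-Python stand-ins operating on a list of (key, value) tuples. `map_records` and `map_values` are hypothetical helpers that only mimic the behaviour described; they are not Spark's `map()` and `mapValues()`.

```python
# Plain-Python stand-ins; map_records and map_values are hypothetical helpers.
pairs = [("apple", 3), ("pear", 5)]

def map_records(records, fn):
    # Like map(): fn receives the whole record, here the full (key, value) tuple.
    return [fn(rec) for rec in records]

def map_values(records, fn):
    # Like mapValues(): fn receives only the value; the key is kept unchanged.
    return [(key, fn(value)) for key, value in records]

swapped = map_records(pairs, lambda rec: (rec[1], rec[0]))
doubled = map_values(pairs, lambda v: v * 2)
```

Because the whole-record version can rewrite the key, it can change which records belong together, whereas the value-only version cannot.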
39
cache() = persist(---------)
persist(MEMORY_ONLY)
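Why caching matters can be sketched with a toy dataset that counts how often it is rebuilt. `ToyDataset` and `compute_count` are hypothetical and only imitate the idea of keeping a computed result in memory; storage levels other than memory are not modelled.

```python
# Toy model of cache(); ToyDataset is hypothetical and is not Spark's API.
class ToyDataset:
    def __init__(self, source, fn):
        self._source = list(source)
        self._fn = fn
        self._use_cache = False
        self._cached = None
        self.compute_count = 0   # how many times the data was rebuilt

    def cache(self):
        # Only marks the dataset; the data is stored when an action first runs.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        data = [self._fn(x) for x in self._source]
        if self._use_cache:
            self._cached = data   # keep the result in memory for later actions
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


uncached = ToyDataset(range(5), lambda x: x * x)
uncached.count()
uncached.collect()

cached = ToyDataset(range(5), lambda x: x * x).cache()
cached.count()
squares = cached.collect()
```

In this toy, the uncached dataset is rebuilt for each of its two actions, while the cached one is built once and reused.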
40
Benefits of Spark SQL
Can execute SQL operators
Optimizations like an RDBMS
Built on top of and extends RDDs
Easy integration with other Spark libraries
41
Spark Streaming requirements
Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch & interactive processing
Efficient fault tolerance
42
Spark Streaming motivation
Many important applications process large streams of live data
They require large clusters to handle the workload
They require latencies of a few seconds