In Action Flashcards

1
Q

What are the components of Spark?

A

Spark Core, Spark SQL, Spark Streaming, Spark GraphX, and Spark MLlib

2
Q

RDD

A

Resilient Distributed Dataset

3
Q

How does Spark Streaming use DStreams?

A

It uses DStreams to periodically create RDDs from the incoming data, one RDD per batch interval.

4
Q

Spark MLlib models use ______ to represent data

A

DataFrames

5
Q

What are the data sources for Spark SQL?

A
  • Relational databases
  • NoSQL databases
  • Hive
  • JSON
  • Parquet files
  • Protocol Buffers
6
Q

What does DStreams stand for?

A

Discretised streams

7
Q

What are broadcast variables?

A

Read-only variables that are sent to every executor and cached there, so they are available whenever a task needs them.

8
Q

How do you ship a broadcast variable to the executors?

A

Call sc.broadcast() on the value you want to ship.

9
Q

How do you retrieve broadcast variables?

A

Use .value on the object returned by sc.broadcast().
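
A minimal sketch of the broadcast round trip, assuming a local SparkSession. The object name, the lookup table, and the station IDs are hypothetical and only serve as illustration.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  // Hypothetical lookup table: station ID -> display name.
  val stationNames: Map[String, String] =
    Map("S1" -> "Station A", "S2" -> "Station B")

  // Look up a name, falling back to "unknown" for IDs not in the table.
  def label(names: Map[String, String], id: String): String =
    names.getOrElse(id, "unknown")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("BroadcastSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Ship the lookup table to the executors.
    val namesBc = sc.broadcast(stationNames)

    val ids = sc.parallelize(Seq("S1", "S2", "S3"))

    // Inside the closure, read the broadcast payload through .value.
    val labelled = ids.map(id => (id, label(namesBc.value, id)))

    labelled.collect().foreach(println)
    spark.stop()
  }
}
```

The closure captures only the small broadcast handle, not the table itself, which is the reason to prefer this over referencing a large driver-side value directly.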

10
Q

If you want to use map to parse a group of lines, but there is a chance that some lines are missing or malformed, what would you do?

A

You would use flatMap instead, so that a line which yields no result contributes nothing to the output.
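
One common way to do this is to have the parser return an Option and let flatMap drop the empty results. The sketch below assumes comma-separated lines of the form stationId,entryType,temperature; the object name, the parseLine helper, and the sample lines are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object ParseSketch {
  // Returns Some(record) for a well-formed line and None otherwise.
  def parseLine(line: String): Option[(String, String, Float)] = {
    val fields = line.split(",")
    if (fields.length < 3) None
    else Try(fields(2).trim.toFloat).toOption.map(t => (fields(0), fields(1), t))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ParseSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up input: an empty line, a short line, and a non-numeric temperature.
    val raw = sc.parallelize(Seq(
      "S1,TMIN,-7.5", "", "S2,TMAX", "S1,TMAX,oops", "S2,TMIN,1.0"))

    // None contributes nothing to the result; Some(x) contributes x.
    val parsed = raw.flatMap(line => parseLine(line))

    parsed.collect().foreach(println)
    spark.stop()
  }
}
```

With map, the result would have to contain a placeholder for every bad line; with flatMap the bad lines simply disappear from the output RDD.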

11
Q

What are the different deploy modes for the Spark standalone cluster?

A

Cluster deploy mode: the driver runs inside the cluster.

Client deploy mode: the driver runs on the client machine that submits the application.

12
Q

Given an RDD lines consisting of tuples (stationId, entryType, temperature), where entryType is one of three values (TMIN, TMAX or TAVG), how would you create an RDD with only the TMIN entries?

A

val mins = lines.filter(x => x._2 == "TMIN")

13
Q

Given an RDD lines consisting of tuples (stationId, entryType, temperature), how would you return a new RDD with only stationId and temperature?

A

val newRdd = lines.map(x => (x._1, x._3.toFloat))

14
Q

Given the key-value pair RDD stationTemps, consisting of a stationId and a temperature, how would you compute the minimum temperature for each station?

A

stationTemps.reduceByKey((x, y) => math.min(x, y))
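
The filter, map, and reduceByKey steps from the last three cards can be chained into one small job. This is a sketch under assumptions: it runs against a local SparkSession, and the object name, the minByStation helper, and the sample records are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MinTempSketch {
  // Keep TMIN entries, project to (stationId, temperature), reduce per station.
  def minByStation(lines: RDD[(String, String, Float)]): RDD[(String, Float)] = {
    val mins = lines.filter(x => x._2 == "TMIN")
    val stationTemps = mins.map(x => (x._1, x._3))
    stationTemps.reduceByKey((x, y) => math.min(x, y))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MinTempSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample records: (stationId, entryType, temperature).
    val lines = sc.parallelize(Seq(
      ("S1", "TMIN", -7.5f), ("S1", "TMAX", 3.0f),
      ("S2", "TMIN", 1.0f), ("S1", "TMIN", -9.0f),
      ("S2", "TAVG", 4.0f)))

    minByStation(lines).collect().foreach(println)
    spark.stop()
  }
}
```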

15
Q

If you want a word count job to run on a cluster, why would you not use the RDD action countByValue?

A

It returns a Scala Map to the driver, whereas a job on a cluster should keep the result as an RDD. Use reduceByKey rather than countByValue.
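
The difference shows up in the return types. The sketch below, which assumes a local SparkSession and uses a hypothetical object name, helper, and word list, computes the same counts both ways.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Distributed count: the result is still an RDD of (word, count) pairs.
  def countWords(words: RDD[String]): RDD[(String, Int)] =
    words.map(w => (w, 1)).reduceByKey((x, y) => x + y)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WordCountSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val lowercaseWords = sc.parallelize(
      Seq("spark", "rdd", "spark", "map", "rdd", "spark"))

    // countByValue is an action: the full result lands in driver memory.
    val onDriver: scala.collection.Map[String, Long] =
      lowercaseWords.countByValue()
    println(onDriver)

    // reduceByKey is a transformation, so later steps such as sorting
    // by count can still run on the cluster.
    val sorted = countWords(lowercaseWords)
      .map(_.swap)
      .sortByKey(ascending = false)
    sorted.collect().foreach(println)

    spark.stop()
  }
}
```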

16
Q

What would this expression do?

lowercaseWords.map(x => (x, 1))

A

It would map each word to a key-value pair consisting of the word and the number 1.