quiz3 Flashcards

1
Q

what is a compute cluster

A

several computers working together to do some work

2
Q

define the job of YARN
HDFS
Spark

How do they all work together

A

YARN: manages the compute jobs in a cluster
HDFS: stores data on cluster nodes
Spark: a framework to do computation on YARN

we use Spark to express the computation we want in a way that can be sent to a cluster and done in parallel. YARN takes that job, organizes it across the cluster, and gets it done. HDFS stores all the pieces of our data files on the cluster nodes

3
Q

true or false: a Spark DataFrame is the same as a pandas DataFrame

A

false

4
Q

when writing out a Spark DataFrame, what does it create

A

it creates a directory containing several files (one per partition), not just a single file
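A minimal sketch of what this looks like (the output path 'output' and the session setup are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('write example').getOrCreate()

df = spark.range(1000)                    # small example DataFrame
df.write.csv('output', mode='overwrite')  # 'output' is a hypothetical path

# 'output' is now a directory containing part-* files (one per partition)
# plus a _SUCCESS marker, not a single .csv file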

5
Q

what does .filter() do on DataFrames

A

.filter() is similar to SQL's WHERE clause

it keeps only the rows that satisfy the filter condition
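A small sketch, assuming an existing DataFrame df with an 'age' column (both made up here):

from pyspark.sql import functions as F

# keep only rows where age >= 18, like SQL: SELECT * FROM df WHERE age >= 18
adults = df.filter(df['age'] >= 18)

# the same condition as a column expression or a SQL string also works
adults = df.filter(F.col('age') >= 18)
adults = df.filter('age >= 18')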

6
Q

what are the driver and executors in Spark

A

the driver is the program you write; the executors are the processes (and their threads) on the cluster that actually run the work

7
Q

true or false: parallelism is controlled by the way the data is partitioned

A

true

8
Q

there are a few ways you can control the number of partitions

describe them

spark.range(10000, numPartitions=6)

.coalesce(num)

.repartition(num)

A
  1. sets the number of partitions explicitly when the DataFrame is created
  2. coalesce: combines partitions if there are too many. It lowers the number of partitions, but not necessarily in the way you expect. It can also be used to clean up your output if you know only a small amount of data is left in each partition.
  3. repartition: rearranges the data into evenly-sized partitions, but it is expensive: getting perfect partitions means a lot of data movement (a full shuffle); see the sketch below.
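A sketch of all three, assuming an existing SparkSession named spark:

# set the number of partitions explicitly at creation time
df = spark.range(10000, numPartitions=6)
print(df.rdd.getNumPartitions())   # 6

# coalesce: merge existing partitions into fewer, cheaply (no full shuffle),
# but the resulting partitions may be unevenly sized
smaller = df.coalesce(2)

# repartition: full shuffle into evenly-sized partitions; more expensive
balanced = df.repartition(2)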
9
Q

name examples of shuffle operations

A

.repartition
.coalesce
.groupBy
.sort

10
Q

what is a pipeline operation

A

the opposite of a shuffle operation.

one where each partition can be handled completely independently of the others. Ideally each row is handled independently; most DataFrame operations are in this category.

11
Q

why is the shuffling caused by groupBy not as bad

A

because it reduces the number of rows before the shuffling occurs: each partition is aggregated locally first.

e.g. if we have a billion rows and group by a column with 10 distinct values, then only about 10 (partially aggregated) rows from each partition have to be shuffled
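A sketch of why, assuming a large DataFrame df with a 'category' column that has only a handful of distinct values (all made up):

from pyspark.sql import functions as F

counts = df.groupBy('category').agg(F.count('*').alias('n'))

# each partition is aggregated locally first, down to at most one row per
# distinct category, so only those small partial results are shuffled
counts.show()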

12
Q

why can Spark constantly create new DataFrames so cheaply

A

Spark uses lazy evaluation

when you create a DataFrame, you haven't actually done the calculation needed to produce it yet
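A tiny illustration, assuming an existing SparkSession named spark:

from pyspark.sql import functions as F

# these lines run almost instantly: they only build up a plan,
# no billion-row computation happens yet
big = spark.range(10**9)
doubled = big.withColumn('double', F.col('id') * 2)

# only an action (count, show, write, ...) triggers the actual work
print(doubled.count())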

13
Q

why is coalesce not as smart as repartition

A

coalesce is applied lazily, so it combines partitions without knowing how much data each one will hold

repartition waits for everything beforehand and redistributes it evenly

14
Q

what is the downfall of lazy evaluation

how do you solve it

A

Spark doesn't know whether you are going to use the same data twice, so it can throw intermediate values away before they're needed again; it can't keep every intermediate result just in case, because they could be large

solution: cache the result with the .cache() method, which tells Spark to store it because we are going to use it again later
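A sketch, assuming a DataFrame df with 'score' and 'category' columns (both made up):

# an intermediate result that two later computations both need
filtered = df.filter(df['score'] > 0).cache()

by_count = filtered.groupBy('category').count()
by_mean = filtered.groupBy('category').avg('score')

by_count.show()
by_mean.show()
# without .cache(), the filter would be recomputed for each of the two actions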

15
Q

what's the downfall of join

A

potentially a lot of data moving around: rows with matching keys have to be shuffled to the same place

16
Q

true or false: column expressions in Spark are not lazily evaluated

A

false; all column expressions on a Spark DataFrame are lazily evaluated

17
Q

in a Spark DataFrame, the actual implementation is in _____, which compiles to the ___________

A

Scala; the Java Virtual Machine (JVM)

18
Q

what are user-defined functions (UDFs) in Spark, why are they used, and what goes on behind the scenes to make them work

A

when you want to use Python to do the work, you can use functions.udf to wrap a Python function so it works on Column objects, similar to np.vectorize

the UDF is sent to the executors; data from the JVM is converted into Python objects, the function is called in a Python process, and the results are sent back to the JVM
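A minimal sketch, assuming a DataFrame df with a 'name' string column (made up):

from pyspark.sql import functions, types

# an ordinary Python function...
def clean_name(name):
    return name.strip().title()

# ...wrapped as a UDF so it can be applied to a Column
clean_name_udf = functions.udf(clean_name, returnType=types.StringType())

df = df.withColumn('clean_name', clean_name_udf(df['name']))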

19
Q

what is an RDD (resilient distributed dataset)

what are the key things to remember about RDDs

A

the fundamental Spark data structure: a distributed collection of row entries.

while working on an RDD, you do the work in Python

each row is treated as a string

generally slower than DataFrames

can be easier for extracting data that is in a non-DataFrame-friendly format

slower because you lose the JVM implementation and the optimizer

20
Q

which is faster, row-oriented or column-oriented?

A

column-oriented. Most operations you want to do are on columns, and you want good memory locality. Columns in memory are arrays, pre-created and stored; rows need to be assembled first.

21
Q

how can you turn a pandas DataFrame into a Spark DataFrame

how can you turn a Spark DataFrame into a pandas DataFrame

A

use spark.createDataFrame and give it a pandas DataFrame

use spkdata.toPandas()
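A sketch of both directions, assuming an existing SparkSession named spark (the data is made up):

import pandas as pd

pd_data = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# pandas -> Spark
spkdata = spark.createDataFrame(pd_data)

# Spark -> pandas (collects everything to the driver, so it must fit in memory)
back = spkdata.toPandas()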

22
Q

Polars is a new DataFrame tool. What is it implemented in and how is it evaluated?

what does it not have

A

implemented in Rust, and strictly (eagerly) evaluated, but you can also evaluate it lazily; you need to create a lazy DataFrame first, though. Use the read/write commands for strict evaluation and scan/collect/write for lazy evaluation.

no partitioning or clustering
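A sketch of both styles (the file name and columns are made up; the lazy method names here follow recent Polars versions):

import polars as pl

# strict (eager): read_csv does the work immediately
eager = pl.read_csv('data.csv')

# lazy: scan_csv only builds a query plan; nothing runs until .collect()
lazy_result = (
    pl.scan_csv('data.csv')
      .filter(pl.col('score') > 0)
      .group_by('category')
      .agg(pl.col('score').mean())
      .collect()
)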

23
Q

what is DuckDB and what is it good for

what does it not have

A

DuckDB is an in-process SQL database: it lets you create a relational database and do analytics with it without having to install an entire database server. It can also be used as a fast tool to manipulate tabular data.

it doesn't have a compute cluster
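A small sketch of using DuckDB as an in-process SQL engine over a pandas DataFrame (the data is made up):

import duckdb
import pandas as pd

pd_data = pd.DataFrame({'category': ['a', 'a', 'b'], 'score': [1, 2, 3]})

# query the DataFrame directly with SQL; no database server is involved
result = duckdb.query(
    'SELECT category, AVG(score) AS avg_score FROM pd_data GROUP BY category'
).to_df()
print(result)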

24
Q

what is Dask

A

a Python data tool that recreates as much of pandas/NumPy/etc. as possible, but with lazy evaluation and support for distributed computation like Spark. Dask can also be deployed on a cluster.
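A sketch, with hypothetical input files matching 'data-*.csv' and made-up column names:

import dask.dataframe as dd

# looks like pandas, but only builds a lazy task graph
ddf = dd.read_csv('data-*.csv')
result = ddf.groupby('category')['score'].mean()

# .compute() actually runs the (possibly distributed) work
print(result.compute())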

25
Q

what is broadcast used for in Spark

A

if you have one small DataFrame and want to join it with a large DataFrame, rather than shuffling a lot of data around, you can broadcast the small one so every executor essentially has it as a lookup table instead
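A sketch, assuming a large DataFrame events and a small lookup DataFrame cities that share a 'city_id' column (all made up):

from pyspark.sql import functions as F

# mark the small DataFrame as broadcastable: it is copied to every executor,
# so the large DataFrame never has to be shuffled for this join
joined = events.join(F.broadcast(cities), on='city_id')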

26
Q

for RDDs, explain these functions

df.rdd

rdd.take(n)

rdd.map(f)

rdd.filter(f)

A

df.rdd: get the equivalent RDD from a DataFrame

rdd.take(n): retrieve the first n elements from the RDD as a Python list.

rdd.map(f): apply function f to each element, creating a new RDD from the returned values.

rdd.filter(f): apply function f to each element, keep rows where it returned True.
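
A small sketch of all four, assuming a DataFrame df with 'name' and 'score' columns (made up):

rdd = df.rdd                                            # RDD of Row objects

first_three = rdd.take(3)                               # list of the first 3 Rows
name_lengths = rdd.map(lambda row: len(row['name']))    # new RDD of ints
passing = rdd.filter(lambda row: row['score'] >= 50)    # keep only passing rows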