PySpark Flashcards

1
Q

A distributed process has

A

access to the computational resources across a number of machines connected through a network.

Distributed machines also have the advantage of scaling easily: you can just add more machines.

They also provide fault tolerance…

2
Q

Hadoop is a way …

MapReduce

A

to distribute very large files across multiple machines.

MapReduce distributes a computational task across a distributed data set.

3
Q

Spark is

A

one of the latest technologies being used to quickly and easily handle Big Data.

You can think of Spark as a flexible alternative to MapReduce.
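
A minimal sketch of starting Spark from Python (assuming pyspark is installed; the app name is arbitrary):

from pyspark.sql import SparkSession

# the SparkSession is the entry point for DataFrames and Spark SQL
spark = SparkSession.builder.appName('flashcards').getOrCreate()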

4
Q

Spark vs MapReduce

A

MapReduce requires files to be stored in HDFS; Spark doesn’t.

Spark can also perform operations up to 100x faster than MapReduce.

5
Q

Scala

A

Spark itself is not a programming language. It is a framework for handling large data: distributing it and performing calculations across a distributed network. Spark itself is written in a programming language known as Scala.

So the Scala API for Spark is the one that gets the latest features, which makes sense because Spark is literally written in Scala.

Scala itself runs on the Java Virtual Machine (JVM).

6
Q

Databricks : AWS

A

Databricks basically runs on top of Amazon Web Services.

7
Q

show DataFrame

A

df.show()

display(df)  # available in Databricks notebooks

8
Q

schema

A

df.printSchema()

9
Q

df column names

A

df.columns

10
Q

df stat

A

df.describe().show()  # summary statistics: count, mean, stddev, min, max

11
Q

a column’s values

A

df.select('net_bkd_pax').show()

12
Q

two column values

A

df_gdd.select(['net_bkd_pax', 'net_ia_pax']).show()

13
Q

Add a new column

A

df.withColumn('new', df['net_ia_pax']*2).show()

These changes are not permanent on the original DataFrame; you have to save the result to a new variable, as in the sketch below.
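
A minimal sketch of persisting the change (df2 is a hypothetical variable name):

df2 = df.withColumn('new', df['net_ia_pax']*2)  # save the result to a new variable
df2.show()  # df itself is unchanged; DataFrames are immutable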

14
Q

Rename a column

A

df.withColumnRenamed('net_ia_pax', 'new_name').show()

15
Q

create temp view

A

df_gdd.createOrReplaceTempView('gdd')
result = spark.sql("SELECT * FROM gdd")
result.show()

16
Q

read csv

A

df = spark.read.csv("address", header=True, inferSchema=True)  # header: first row holds column names; inferSchema: infer column types

17
Q

head

A

df.head(3)[0]  # head(3) returns a list of the first three Rows; [0] picks the first

18
Q

filter

A

df.filter("runid<10").show()

19
Q

filter and show one column

A

df.filter("runid<10").select("forecastname").show()

df.filter(df["runid"]<10).select("forecastname").show()

The first form uses a SQL-style string, the second a column expression; both are equivalent.

20
Q

filter based on two conditions

A

df.filter((df["runid"]<10) & (df["start_date"]<20)).select("forecastname").show()

Use () around each condition; see the sketch below.
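
The () are needed because Python's & binds more tightly than comparisons like <, so leaving them out raises an error. A short sketch of the boolean operators on column conditions (reusing this card's hypothetical columns):

df.filter((df["runid"]<10) & (df["start_date"]<20)).show()  # AND
df.filter((df["runid"]<10) | (df["start_date"]<20)).show()  # OR
df.filter(~(df["runid"]<10)).show()                         # NOT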

21
Q

.collect

A

If we use .collect() instead of .show(), we get back a list of Row objects that can be used in later operations; .show() only prints the rows.
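
A short sketch (reusing the hypothetical runid column from earlier cards):

rows = df.filter("runid<10").collect()  # a list of Row objects on the driver
first = rows[0]                         # a single Row
first["runid"]                          # fields are accessible by column name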

22
Q

Row to dict

A

result.asDict()  # converts a Row into a Python dict keyed by column name
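
A short sketch tying this to the previous card (runid is a hypothetical column):

row = df.filter("runid<10").collect()[0]  # grab one Row
row.asDict()                              # returns a Python dict for that Row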