Cleaning Data with PySpark Flashcards

1
Q

How do you import different data types in PySpark?

A

from pyspark.sql.types import IntegerType, StringType, and so on.
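
For example (a minimal sketch; the file name and column names are made up), these types can be used to build an explicit schema when reading a file:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# explicit schema built from the imported types (column names are hypothetical)
schema = StructType([
    StructField('id', IntegerType(), nullable=False),
    StructField('name', StringType(), nullable=True),
])

df = spark.read.csv('people.csv', schema=schema)  # hypothetical file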

2
Q

What is the main reason for PySpark using immutability and lazy processing?

A

Spark takes advantage of data immutability to efficiently share and rebuild data representations throughout the cluster, while lazy processing lets it plan and optimize the full chain of transformations before executing any of them.

3
Q

What is the Parquet Format?

A

Parquet is a compressed, columnar data format developed for use in any Hadoop-based system (Spark, Hadoop, Apache Impala, and so forth). The format is structured so data is accessible in chunks, allowing efficient read/write operations without processing the entire file.
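
For example (a sketch; the paths are hypothetical and spark is the active SparkSession):

df = spark.read.parquet('events.parquet')  # column pruning: only the columns you use are read
df.select('user_id').write.mode('overwrite').parquet('user_ids.parquet')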

4
Q

If we want to run SQL queries against a PySpark DataFrame, what method should we call first?

A

dataframe.createOrReplaceTempView('custom_table_name')

then we can run SQL through the SparkSession with

spark.sql('SELECT * FROM custom_table_name')
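
Putting both steps together (a sketch; the table and column names are invented, and spark is the active SparkSession):

df.createOrReplaceTempView('voters')
title_counts = spark.sql('SELECT TITLE, COUNT(*) AS n FROM voters GROUP BY TITLE')
title_counts.show()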

5
Q

What is a DataFrame in PySpark?

A

a distributed collection of data organized into rows and named columns

immutable

uses transformations to deal with data

6
Q

What are the primary functions to filter data in PySpark?

A

.filter() and .where() (.where() is an alias for .filter())
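
Both accept a Column expression or a SQL-style string (a sketch; voter_df and its columns are invented):

voter_df.filter(voter_df['TITLE'] == 'Mayor')      # Column expression
voter_df.where("TITLE = 'Mayor'")                  # equivalent SQL-style string
voter_df.filter(~voter_df['VOTER_NAME'].isNull())  # keep rows with a name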

7
Q

What is the function of pyspark.sql.functions.split?

How does it work on a column with entries like "john williams"?

How would you split on whitespace?

A

Splits str around matches of the given pattern.

F.split('col_name', r'\s+')  # split on whitespace

8
Q

What is the role of Column.getItem(key)?

A

Column.getItem(key)

An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
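
Combining split and getItem (a sketch; voter_df and VOTER_NAME are assumed from the earlier examples):

import pyspark.sql.functions as F

voter_df = voter_df.withColumn('splits', F.split(voter_df['VOTER_NAME'], r'\s+'))
voter_df = voter_df.withColumn('first_name', voter_df['splits'].getItem(0))
voter_df = voter_df.withColumn('last_name', voter_df['splits'].getItem(1))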

9
Q

What are the two primary conditional DataFrame column operations in PySpark?

A

F.when and .otherwise (F.when() returns a Column; .otherwise() is chained onto that Column)

10
Q

Give an example using F.when and .otherwise in PySpark.

A

voter_df = voter_df.withColumn('random_val',
    F.when(voter_df.TITLE == 'Councilmember', F.rand())
     .when(voter_df.TITLE == 'Mayor', 2)
     .otherwise(0))

11
Q

How do you generate a random value with PySpark?

A

pyspark.sql.functions.rand()

12
Q

What is a UDF in PySpark?

A

User-defined functions: ordinary Python functions wrapped with pyspark.sql.functions.udf() so they can be applied to DataFrame columns.
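
A minimal sketch of defining and applying one (the function and column names are made up):

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def first_word(s):
    # return the first whitespace-separated token, or None for empty input
    return s.split()[0] if s and s.strip() else None

first_word_udf = F.udf(first_word, StringType())  # wrap the Python function
voter_df = voter_df.withColumn('first_name', first_word_udf(voter_df['VOTER_NAME']))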

13
Q

What is lazy processing?

A

Transformation operations are lazy: a transformation is more like a recipe than a command. It defines what should be done to a DataFrame rather than actually doing it; nothing runs until an action (such as .count() or .show()) is called.
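
For instance (a sketch; df and its 'value' column are invented), nothing below runs until the action on the last line:

df2 = df.withColumn('doubled', df['value'] * 2)  # transformation: just recorded
df2 = df2.filter(df2['doubled'] > 10)            # still just recorded
df2.count()                                      # action: the plan executes now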

14
Q

What is the role of F.monotonically_increasing_id?

A

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
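
Typical use is to attach row IDs (a sketch; df is an invented DataFrame):

import pyspark.sql.functions as F

df = df.withColumn('ROW_ID', F.monotonically_increasing_id())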

15
Q

What is caching?

A

Caching stores a DataFrame in memory (and/or on disk) the first time an action computes it, so later actions reuse the stored data instead of recomputing the full chain of transformations. Caching itself is lazy: df.cache() only takes effect when the next action runs.
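
A sketch of the typical calls:

df = df.cache()      # lazy: nothing is stored yet
df.count()           # the first action materialises and caches df
print(df.is_cached)  # True
df.unpersist()       # release the cached data when done
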
16
Q

What kind of operation can be used to recast a value's type in a PySpark DataFrame column?

A

For example, to recast a column to integers: dataframe.withColumn('col_name', dataframe['col_name'].cast(IntegerType()))

17
Q

How would you handle blank lines, headers, and comments when parsing a CSV in PySpark?

A

Use the optional arguments of the CSV reader: header=True consumes the header row, comment='#' skips lines beginning with that character, and blank lines are dropped by default.
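
For instance (a sketch; the file name is made up):

df = spark.read.csv('annotations.csv',
                    header=True,   # use the first row as column names
                    comment='#',   # skip lines starting with '#'
                    sep=',')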

18
Q

What kind of sep argument in the CSV parser can be used to handle nested columns?

A

A separator character that never occurs in the data, e.g. sep='*'. Each row then loads as a single column, so embedded delimiters inside nested fields no longer break the parse, and you can split the column yourself afterwards.
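
For instance (a sketch; the file name is made up, and _c0 is the default name Spark gives the first column of a headerless CSV):

import pyspark.sql.functions as F

raw = spark.read.csv('nested.csv', sep='*')  # '*' never occurs, so each row is one column
parts = F.split(raw['_c0'], ',')             # now split on the real delimiter manually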