Rows Flashcards

1
Q

How do you filter a DataFrame with where, keeping rows whose value is greater than a threshold?

A

.where(year($"birthdate") > 1980)

2
Q

How do you use filter to keep only rows from a specific month?

A

.filter(month($"birthdate") === 1)

3
Q

How do you filter a DataFrame using a SQL expression?

A

.where("day(birthdate) > 15")
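The filtering styles on the cards so far can be compared side by side. This is a minimal sketch, assuming a local SparkSession and a small made-up table; the object name, the sample rows, and the use of dayofmonth in the SQL string are illustrative assumptions, not part of the deck.

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{month, year}

object FilterSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("filter-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (name, birthdate).
  val people = Seq(
    ("Ann", Date.valueOf("1975-01-20")),
    ("Bob", Date.valueOf("1985-01-05")),
    ("Cid", Date.valueOf("1990-06-25"))
  ).toDF("name", "birthdate")

  // Column expression with where: keeps Bob and Cid.
  val bornAfter1980 = people.where(year($"birthdate") > 1980)

  // Column expression with filter (an alias of where): keeps Ann and Bob.
  val bornInJanuary = people.filter(month($"birthdate") === 1)

  // SQL expression string: keeps Ann (day 20) and Cid (day 25).
  val lateInMonth = people.where("dayofmonth(birthdate) > 15")
}
```

Note that equality inside a Column expression uses === rather than ==, since == would compare the Column objects themselves.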

4
Q

How do you check for inequality when filtering?

A

=!=

5
Q

How do you make sure a DataFrame contains only unique rows, taking all columns into account?

A

.distinct()

6
Q

How can you remove duplicates while taking only one column into account?

A

.dropDuplicates("column")

7
Q

How can you remove duplicates while taking only a specific set of columns into account?

A

.dropDuplicates(List("column1", "column2"))

8
Q

If dropDuplicates is called without any columns, which columns are used to determine distinct rows, or does it fail?

A

All columns; it is equivalent to .distinct()
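The deduplication cards can be checked against a tiny table. The following is a sketch under stated assumptions: a local SparkSession, invented sample rows, and a hypothetical object name.

```scala
import org.apache.spark.sql.SparkSession

object DedupSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("dedup-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data with one fully duplicated row.
  val rows = Seq(
    ("Ann", "Oslo"),
    ("Ann", "Oslo"),
    ("Ann", "Rome"),
    ("Bob", "Oslo")
  ).toDF("name", "city")

  // All columns considered: the repeated (Ann, Oslo) row collapses to one.
  val distinctRows = rows.distinct()
  val noArgs = rows.dropDuplicates()

  // Only "name" considered: one row per name survives
  // (which of Ann's rows is kept is not guaranteed).
  val byName = rows.dropDuplicates("name")

  // Explicit list of columns: here the same result as distinct().
  val byNameAndCity = rows.dropDuplicates(Seq("name", "city"))
}
```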

9
Q

How can you filter out rows with null values in a column of a DataFrame?

A

.where($"column_object".isNotNull)

10
Q

How can you drop rows in which every value is null?

A

.na.drop("all")

11
Q

How do you remove a row where any value in the row is null?

A

.na.drop("any")

12
Q

How do you remove a row only if two specific columns are both null?

A

.na.drop("all", Seq("column_a", "column_b"))

13
Q

How do you remove a row with a null value in either of two columns?

A

.na.drop("any", Seq("column_a", "column_b"))
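The difference between "any" and "all", with and without a column list, is easiest to see on a small table. This is a minimal sketch assuming a local SparkSession; the object name, the extra column_c, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object NaDropSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("na-drop-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data; None becomes null in the DataFrame.
  val rows = Seq[(Option[String], Option[Int], Option[String])](
    (Some("a"), Some(1), Some("x")), // no nulls
    (None, Some(2), Some("y")),      // column_a is null
    (None, None, Some("z")),         // column_a and column_b are null
    (None, None, None)               // every value is null
  ).toDF("column_a", "column_b", "column_c")

  // Drops only the row where every value is null.
  val dropAllNull = rows.na.drop("all")

  // Drops every row that has at least one null.
  val dropAnyNull = rows.na.drop("any")

  // Drops rows where column_a AND column_b are both null.
  val dropBothNull = rows.na.drop("all", Seq("column_a", "column_b"))

  // Drops rows where column_a OR column_b is null.
  val dropEitherNull = rows.na.drop("any", Seq("column_a", "column_b"))
}
```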

14
Q

How can you replace all null values with "nope"?

A

.na.fill("nope")

15
Q

True/False

.na.fill("Nope") replaces nulls only in columns whose type is string

A

True
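A short check of that behavior, as a sketch assuming a local SparkSession and an invented two-column table (one string column, one integer column); the object and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object NaFillSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("na-fill-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: a string column and an integer column,
  // each with one null.
  val rows = Seq[(Option[String], Option[Int])](
    (Some("a"), None),
    (None, Some(2))
  ).toDF("label", "qty")

  // A string fill value only touches string-typed columns,
  // so the null in "qty" is left as it is.
  val filled = rows.na.fill("nope")
}
```

To fill the integer column as well, a numeric fill such as .na.fill(0) would be needed in addition.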

16
Q

What is the default sort order?

A

Ascending

17
Q

How can you sort a DataFrame?

A

.sort("column_name")
.orderBy("column_name")
.sort(expr("column_name"))

18
Q

How do you sort using an expression?

A

.sort(expr("column_name"))

19
Q

How do you sort by the month of a date-typed column, in descending order?

A

.orderBy(expr("month(birthdate)").desc)
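A runnable sketch of that sort, assuming a local SparkSession and made-up birthdates chosen so that each row falls in a different month; the object name and data are illustrative only.

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object SortSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("sort-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: months 3, 11 and 6.
  val people = Seq(
    ("Ann", Date.valueOf("1975-03-20")),
    ("Bob", Date.valueOf("1985-11-05")),
    ("Cid", Date.valueOf("1990-06-25"))
  ).toDF("name", "birthdate")

  // Sort on the month extracted by a SQL expression, largest month first.
  val byMonthDesc = people.orderBy(expr("month(birthdate)").desc)

  // Names in sorted order: Bob (11), Cid (6), Ann (3).
  val names: Seq[String] = byMonthDesc.select("name").as[String].collect().toSeq
}
```

Without .desc, the same call would sort ascending, which is the default.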

20
Q

If you have two columns, one for customer IDs and one for item IDs, how can you show how many items are associated with each customer?

A

.groupBy("customer_id").agg(count("item_id").alias("total"))
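The grouping above can be tried on a few rows. This is a minimal sketch assuming a local SparkSession; the object name and the order data are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object GroupCountSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("group-count-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: customer 1 has two items, customer 2 has one.
  val orders = Seq(
    (1, "item_a"),
    (1, "item_b"),
    (2, "item_c")
  ).toDF("customer_id", "item_id")

  // One output row per customer, with the number of non-null item ids.
  val totals = orders
    .groupBy("customer_id")
    .agg(count("item_id").alias("total"))

  // Collect into a map from customer_id to total for easy inspection.
  val totalsByCustomer: Map[Int, Long] =
    totals.as[(Int, Long)].collect().toMap
}
```

Because groupBy has to bring all rows for a customer together, this is also an example of an operation that triggers a shuffle.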

21
Q

What is it called when an operation causes Spark to move data across the cluster?

A

A shuffle

22
Q

When one input partition contributes to multiple output partitions, what is it called?

A

A wide transformation (wide dependency)

23
Q

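What are some examples of wide transformations?

A

GroupBy
Join
Distinct
Repartition
OrderBy
Note: Coalesce is not a wide transformation; it merges existing partitions without a full shuffle, so it is a narrow dependency.
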
24
Q

For the SQL year function, what can be passed in?

A

A Column object only

25
Q

For the SQL max function, what can be passed in?

A

A String (column name) or a Column object

26
Q

How can you get the max of one column and the min of another column?

A

.agg(
  max("column_name"),
  min("other_column"))

27
Q

How do you count the rows that have a non-null value in the column "price"?

A

.select(count("price"))
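Counting a named column skips nulls, which is what makes the answer above work. The sketch below contrasts it with a count of all rows, assuming a local SparkSession and invented data; the object name, the aliases, and the count(lit(1)) comparison are illustrative additions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

object CountSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("count-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: one of three rows has no price.
  val items = Seq[(String, Option[Double])](
    ("pen", Some(1.5)),
    ("cup", None),
    ("bag", Some(30.0))
  ).toDF("name", "price")

  // count("price") ignores nulls; count(lit(1)) counts every row.
  val counts = items.select(
    count("price").alias("priced"),
    count(lit(1)).alias("all_rows")
  ).head()

  val priced: Long = counts.getAs[Long]("priced")
  val allRows: Long = counts.getAs[Long]("all_rows")
}
```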

28
Q

How can you get common summary statistics (count, mean, stddev, min, max) for the columns of a DataFrame?

A

.describe()

29
Q

What does .describe take as input?

A

Column names

30
Q

What input does the desc function take?

A

A column name