Joins in Spark Flashcards

1
Q

What is a join in PySpark?

A

A join in PySpark is an operation that combines rows from two DataFrames based on a join condition, usually equality on one or more related columns.

2
Q

True or False: PySpark supports inner, outer, left, and right joins.

A

True

3
Q

What is the syntax to perform an inner join in PySpark?

A

df1.join(df2, on='column_name', how='inner')

4
Q

Fill in the blank: The method used to perform a left join in PySpark is __________.

A

join, called with how='left'

5
Q

Which join type returns all records from both DataFrames?

A

Full outer join

6
Q

What keyword is used to specify the join type in PySpark?

A

how

7
Q

True or False: Left join returns only the matching records from the right DataFrame.

A

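False. A left join returns every record from the left DataFrame, with the matching values from the right DataFrame and nulls where there is no match.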

8
Q

Write the PySpark code to perform a right join on DataFrames df1 and df2.

A

df1.join(df2, on='column_name', how='right')

9
Q

What is the default join type in PySpark if none is specified?

A

Inner join

10
Q

Which join type would you use to keep all records from the left DataFrame?

A

Left join

11
Q

True or False: You can join DataFrames on multiple columns in PySpark.

A

True

12
Q

How do you specify multiple columns for a join in PySpark?

A

Use a list, e.g., on=['column1', 'column2']

13
Q

What will be the result of an outer join if there are no matching records?

A

It returns every row from both DataFrames, with nulls in the columns that come from the side without a match.

14
Q

Complete the PySpark join syntax: df1.join(df2, on='key', how=__________)

A

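'outer' (as a quoted string; any other valid join type, such as 'inner' or 'left', would also complete the call)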

15
Q

What is the command to display the schema of a DataFrame after a join operation?

A

df.printSchema()

16
Q

What happens to duplicate column names after a join in PySpark?

A

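PySpark does not add suffixes. When the join key is passed by name (on='key'), the key column appears once in the result. Any other columns that share a name are both kept under that same name, and referring to them by the bare name raises an ambiguous-reference error.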

17
Q

When using DataFrame aliases, how do you reference the columns in a join?

A

Give each DataFrame an alias with alias(), then reference columns as 'alias.column_name' through col(), e.g., df1.alias('a') and col('a.column_name')

18
Q

How can you avoid column name conflicts when joining DataFrames?

A

Rename the conflicting columns before the join, e.g., with select and alias, or with withColumnRenamed.

19
Q

What is the output of a left join if the right DataFrame has no matches?

A

It returns all records from the left DataFrame, with nulls in the columns that come from the right DataFrame.

20
Q

Write the code to perform a full outer join in PySpark.

A

df1.join(df2, on='column_name', how='full_outer')

21
Q

What method can be used to filter results after a join?

A

The filter or where method can be used.

22
Q

True or False: Joins in PySpark can be performed on non-equal conditions.

A

True

23
Q

What is the significance of the 'on' parameter in the join method?

A

It specifies what to join on: a column name, a list of column names, or a join expression.

24
Q

How do you perform a cross join in PySpark?

A

Use df1.crossJoin(df2)