Joins in PySpark Flashcards
What is a join in PySpark?
A join in PySpark is an operation that combines two DataFrames based on a related column.
True or False: PySpark supports inner, outer, left, and right joins.
True
What is the syntax to perform an inner join in PySpark?
df1.join(df2, on='column_name', how='inner')
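A minimal runnable sketch of the pattern above (the session setup, DataFrames, and column names are illustrative assumptions, not taken from the card):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "IT")], ["id", "dept"])

# Inner join keeps only rows whose 'id' appears in both DataFrames
df1.join(df2, on="id", how="inner").show()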
Fill in the blank: The method used to perform a left join in PySpark is __________.
join
Which join type returns all records from both DataFrames?
Full outer join
What keyword is used to specify the join type in PySpark?
how
True or False: Left join returns only the matching records from the right DataFrame.
False
Write the PySpark code to perform a right join on DataFrames df1 and df2.
df1.join(df2, on='column_name', how='right')
What is the default join type in PySpark if none is specified?
Inner join
Which join type would you use to keep all records from the left DataFrame?
Left join
True or False: You can join DataFrames on multiple columns in PySpark.
True
How do you specify multiple columns for a join in PySpark?
Use a list, e.g., on=['column1', 'column2']
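For illustration, a two-column join sketch (reusing the spark session from the inner-join sketch above; the table and column names are made up):
# Hypothetical DataFrames that share two key columns
orders = spark.createDataFrame([(1, "2024-01-01", 100)], ["cust_id", "day", "amount"])
visits = spark.createDataFrame([(1, "2024-01-01", "web")], ["cust_id", "day", "channel"])

# Pass a list to 'on' to match on both columns at once
orders.join(visits, on=["cust_id", "day"], how="inner").show()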
What will be the result of an outer join if there are no matching records?
Rows without a match are still returned, with nulls in the columns coming from the other DataFrame.
Complete the PySpark join syntax: df1.join(df2, on=’key’, how=__________)
'outer'
What is the command to display the schema of a DataFrame after a join operation?
df.printSchema()
What happens to duplicate column names after a join in PySpark?
They are kept with the same name rather than being renamed, so selecting one without qualification raises an ambiguity error; qualify them via the source DataFrame (or an alias) or rename them. A column passed to 'on' by name appears only once.
When using DataFrame aliases, how do you reference the columns in a join?
Alias each DataFrame, then qualify columns as '<alias>.<column>', e.g., col('a.column_name') after df1.alias('a').
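A sketch of alias-qualified references, reusing df1 and df2 from the inner-join sketch (the alias names 'a' and 'b' are arbitrary):
from pyspark.sql.functions import col

a = df1.alias("a")
b = df2.alias("b")

# Qualify columns as "<alias>.<column>" inside col()
joined = a.join(b, col("a.id") == col("b.id"), how="inner")
joined.select(col("a.id"), col("a.name"), col("b.dept")).show()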
How can you avoid column name conflicts when joining DataFrames?
Rename the clashing columns before the join, e.g., with select plus alias() or with withColumnRenamed().
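One possible sketch, assuming two hypothetical DataFrames that both carry a 'name' column (and reusing the spark session from the first sketch):
from pyspark.sql.functions import col

people = spark.createDataFrame([(1, "Alice")], ["id", "name"])
accounts = spark.createDataFrame([(1, "alice01")], ["id", "name"])

# Rename the clashing column on one side before joining
accounts = accounts.select(col("id"), col("name").alias("login_name"))
people.join(accounts, on="id", how="inner").show()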
What is the output of a left join if the right DataFrame has no matches?
It returns all records from the left DataFrame, with nulls in the columns from the right.
Write the code to perform a full outer join in PySpark.
df1.join(df2, on='column_name', how='full_outer')
What method can be used to filter results after a join?
The filter or where method can be used.
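For example, filtering the joined result from the inner-join sketch above (the 'HR' value is just sample data):
joined = df1.join(df2, on="id", how="inner")
joined.filter(joined["dept"] == "HR").show()  # .where() behaves the same as .filter()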
True or False: Joins in PySpark can be performed on non-equal conditions.
True
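A sketch of a non-equi (range) join; the DataFrames and bounds are invented for illustration, and the spark session comes from the first sketch:
from pyspark.sql.functions import col

events = spark.createDataFrame([(1, 15), (2, 55)], ["event_id", "value"])
bands = spark.createDataFrame([("low", 0, 50), ("high", 50, 100)], ["band", "lo", "hi"])

# The join condition can be any boolean column expression, not just equality
events.join(bands, (col("value") >= col("lo")) & (col("value") < col("hi")), how="inner").show()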
What is the significance of the 'on' parameter in the join method?
It specifies the column(s) to join on.
How do you perform a cross join in PySpark?
Use df1.crossJoin(df2)
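For example, reusing df1 and df2 from the inner-join sketch:
# Cartesian product: every row of df1 paired with every row of df2
df1.crossJoin(df2).show()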