Joins in PySpark Flashcards
What is a join in PySpark?
A join in PySpark is an operation that combines two DataFrames based on a related column.
True or False: PySpark supports inner, outer, left, and right joins.
True
What is the syntax to perform an inner join in PySpark?
df1.join(df2, on='column_name', how='inner')
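A minimal runnable sketch of the pattern above (the session setup, DataFrames, and column names are illustrative assumptions, not taken from the card):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "IT")], ["id", "dept"])

# Inner join keeps only rows whose 'id' appears in both DataFrames
df1.join(df2, on="id", how="inner").show()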
Fill in the blank: The method used to perform a left join in PySpark is __________.
join
Which join type returns all records from both DataFrames?
Full outer join
What keyword is used to specify the join type in PySpark?
how
True or False: Left join returns only the matching records from the right DataFrame.
False
Write the PySpark code to perform a right join on DataFrames df1 and df2.
df1.join(df2, on='column_name', how='right')
What is the default join type in PySpark if none is specified?
Inner join
Which join type would you use to keep all records from the left DataFrame?
Left join
True or False: You can join DataFrames on multiple columns in PySpark.
True
How do you specify multiple columns for a join in PySpark?
Use a list, e.g., on=['column1', 'column2']
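For illustration, a two-column join sketch (reusing the spark session from the inner-join sketch above; the table and column names are made up):
# Hypothetical DataFrames that share two key columns
orders = spark.createDataFrame([(1, "2024-01-01", 100)], ["cust_id", "day", "amount"])
visits = spark.createDataFrame([(1, "2024-01-01", "web")], ["cust_id", "day", "channel"])

# Pass a list to 'on' to match on both columns at once
orders.join(visits, on=["cust_id", "day"], how="inner").show()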
What will be the result of an outer join if there are no matching records?
Rows without a match are still returned, with nulls in the columns coming from the other DataFrame.
Complete the PySpark join syntax: df1.join(df2, on=’key’, how=__________)
'outer'
What is the command to display the schema of a DataFrame after a join operation?
df.printSchema()
What happens to duplicate column names after a join in PySpark?
They are kept with the same name rather than being renamed, so selecting one without qualification raises an ambiguity error; qualify them via the source DataFrame (or an alias) or rename them. A column passed to 'on' by name appears only once.
When using DataFrame aliases, how do you reference the columns in a join?
Alias each DataFrame, then qualify columns as '<alias>.<column>', e.g., col('a.column_name') after df1.alias('a').
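A sketch of alias-qualified references, reusing df1 and df2 from the inner-join sketch (the alias names 'a' and 'b' are arbitrary):
from pyspark.sql.functions import col

a = df1.alias("a")
b = df2.alias("b")

# Qualify columns as "<alias>.<column>" inside col()
joined = a.join(b, col("a.id") == col("b.id"), how="inner")
joined.select(col("a.id"), col("a.name"), col("b.dept")).show()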
How can you avoid column name conflicts when joining DataFrames?
Rename the clashing columns before the join, e.g., with select plus alias() or with withColumnRenamed().
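One possible sketch, assuming two hypothetical DataFrames that both carry a 'name' column (and reusing the spark session from the first sketch):
from pyspark.sql.functions import col

people = spark.createDataFrame([(1, "Alice")], ["id", "name"])
accounts = spark.createDataFrame([(1, "alice01")], ["id", "name"])

# Rename the clashing column on one side before joining
accounts = accounts.select(col("id"), col("name").alias("login_name"))
people.join(accounts, on="id", how="inner").show()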
What is the output of a left join if the right DataFrame has no matches?
It returns all records from the left DataFrame, with nulls in the columns from the right.
Write the code to perform a full outer join in PySpark.
df1.join(df2, on='column_name', how='full_outer')
What method can be used to filter results after a join?
The filter or where method can be used.
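For example, filtering the joined result from the inner-join sketch above (the 'HR' value is just sample data):
joined = df1.join(df2, on="id", how="inner")
joined.filter(joined["dept"] == "HR").show()  # .where() behaves the same as .filter()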
True or False: Joins in PySpark can be performed on non-equal conditions.
True
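A sketch of a non-equi (range) join; the DataFrames and bounds are invented for illustration, and the spark session comes from the first sketch:
from pyspark.sql.functions import col

events = spark.createDataFrame([(1, 15), (2, 55)], ["event_id", "value"])
bands = spark.createDataFrame([("low", 0, 50), ("high", 50, 100)], ["band", "lo", "hi"])

# The join condition can be any boolean column expression, not just equality
events.join(bands, (col("value") >= col("lo")) & (col("value") < col("hi")), how="inner").show()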
What is the significance of the 'on' parameter in the join method?
It specifies the column(s) to join on.
How do you perform a cross join in PySpark?
Use df1.crossJoin(df2)
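For example, reusing df1 and df2 from the inner-join sketch:
# Cartesian product: every row of df1 paired with every row of df2
df1.crossJoin(df2).show()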