Data Processing Flashcards

Discuss code snippets

1
Q

What are the different data abstraction layers in Spark?

A
  1. RDD (Resilient Distributed Dataset)
  2. DataFrame
  3. Dataset
  4. Spark tables

The DataFrame and Spark table abstractions are provided by the Spark SQL library.

2
Q

Spark DataFrames vs. Spark tables

A

Spark tables enable SQL-style data manipulation and querying, making it easy to work with structured data using familiar SQL syntax and semantics.
So, if the data from the data source is already in a structured, tabular form, or at least the data is registered in a data catalog that specifies the schema (such as the AWS Glue Data Catalog), then Spark tables can be preferred.

Spark DataFrames offer a more programmatic approach, in the sense that they expose transformation and filtering functions that can be composed in code. A Spark DataFrame can also read data in any supported format.

The DataFrame API offers more flexibility and is better suited for building ETL pipelines. It also integrates well with other Spark libraries such as Spark Streaming and MLlib.

3
Q

How can I apply SQL queries to a DataFrame?

A

First the DataFrame needs to be registered as a Spark table (a temporary view):

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

Then we can run SQL queries against the view:

result = spark.sql("SELECT * FROM people WHERE age > 30")
result.show()

4
Q

What are the different ways of reading data into a Spark DataFrame?

A

Syntax 1 — the generic reader:
fire_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data_location")

Syntax 2 — the format-specific shortcut:
fire_df = spark.read \
    .csv("csv_file_location",
         header="true", inferSchema="true")
