Spark DataFrame Commands Flashcards

1
Q

Describe a dataframe in your own words

A

A DataFrame is like a table: data organized into rows and named columns, which Spark distributes across the cluster.

2
Q

How do you read a table from Hive into a Spark DataFrame with a select statement? (Spark 2.x)

A

spark.sql("SELECT * FROM db.mytable")

3
Q

How do you read a table from Hive into a Spark DataFrame without a select statement? (Spark 2.x)

A

spark.table("db.myTable")

4
Q

How to display a dataframe

A

df.show()

5
Q

How to display exactly 100 rows of a dataframe

A

df.show(100)

6
Q

Why do we pass True/False to show()?

A

The True/False is the truncate argument: truncate=True (the default) shortens column values longer than 20 characters, while truncate=False displays the full column contents.
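A minimal sketch of both forms (df is any existing DataFrame):

```
df.show(100, truncate=False)  # 100 rows, full column contents
df.show(100, True)            # 100 rows, values longer than 20 chars are cut off
```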

7
Q

Select specific columns

A

df.select('col1', 'col2', 'col3')

8
Q

Can I pass a list of columns within the select statement

A

Yes. df.select(['col1', 'col2', 'col3'])

9
Q

How do I change the column name without using withColumnRenamed?

A

df.selectExpr("col1", "col2 as test_column")

10
Q

How do I pull specific rows from a dataframe - For example where a certain column in my dataframe is true

A

df.filter("col1 = true")

11
Q

import statement to import functions

A

from pyspark.sql import functions as func

12
Q

Get the total number of records in a dataframe

A

df.count()

13
Q

How do I get the count of distinct values in a column?

A

df.dropDuplicates(["col1"]).count()
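An equivalent way to get the same count, for reference, is to select the column and apply distinct():

```
df.select("col1").distinct().count()
```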

14
Q

What is the difference between df.dropDuplicates() and df.dropDuplicates(["col1"])?

A

dropDuplicates() removes rows that are duplicated across all columns, while dropDuplicates(["col1"]) compares only the listed column(s) when identifying duplicates; both still return full rows, keeping the first occurrence.
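A minimal sketch with made-up data to illustrate the difference:

```
data = [("A", 1), ("A", 1), ("A", 2)]
df = spark.createDataFrame(data, ["col1", "col2"])

df.dropDuplicates().count()          # 2 -- ("A", 1) and ("A", 2) remain
df.dropDuplicates(["col1"]).count()  # 1 -- only col1 is compared
```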

15
Q

How do I see the schema of a dataframe

A

df.printSchema()

16
Q

How do I see the column names along with the datatypes

A

df.printSchema() (or df.dtypes, which returns them as a list of (name, type) tuples)

17
Q

How do I retrieve the columns to a python list

A

df.columns

18
Q

df.columns() - Is this correct and what will be the output

A

No. The parentheses should not be present: columns is an attribute, not a method, so df.columns() raises a TypeError because the returned list is not callable.

19
Q

How do I drop a column from a dataframe?

A

df.drop("col1")

20
Q

df.drop(["col1", "col2"]) - Is this correct and why

A

No. Unlike select(), drop() does not accept a list; pass the column names as separate arguments instead: df.drop("col1", "col2")

21
Q

Groupby syntax with count

A

df.groupBy("col1").agg(func.count("col2"))

22
Q

Order the rows in a dataframe on a certain column.

A

df.orderBy(func.asc("col1"))

23
Q

Groupby on multiple columns syntax with count

A

df.groupBy("col1", "col2").agg(func.count("col2"))

24
Q

Order the rows in a dataframe on multiple columns.

A

df.orderBy(func.asc("col1"), func.desc("col2"))

25
case expression general syntax
case when col1 = 'Y' then 'True' when col1 = 'N' then 'False' else 'NA' end
26
drop multiple columns
df.drop("col1","col2")
27
drop duplicate values in multiple columns
df.dropDuplicates(["col1", "col2"])
28
Create a new column in the dataframe. The new column is a flag that has true or false. If a column value is > 100 then True else false
df.withColumn("flag", func.expr("case when col1 > 100 then true else false end"))
29
I have a dataframe with some records. I need to flag all the records as 'True' before I proceed further. How do I do that
df.withColumn("flag", func.lit(True))
30
Rename a column
df.withColumnRenamed("old_col_name", "new_col_name")
31
Two ways to rename a column
withColumnRenamed, selectExpr
32
Ways to create dataframe
1. Reading Hive tables
2. Reading CSV or JSON files
3. Creating a dataframe from a list
4. Creating a dataframe from an RDD
33
How to read csv files into dataframe?
1. df = spark.read.csv("file.csv") | 2. df = spark.read.format("csv").load("file.csv")
34
Column names for this dataframe - df = spark.read.format("csv").load("file.csv")
_c0, _c1, _c2...
35
How to load the header while reading a csv file
```
df2 = spark.read.option("header", True).csv("file.csv")
df2 = spark.read.options(header='True').csv("file.csv")
```
36
PySpark reads all columns as a ________ data type by default
string
37
Read multiple csv files into a single dataframe
df = spark.read.csv(["path1", "path2", "path3"])
38
Read all CSV files from a directory into DataFrame
df = spark.read.csv("Folder path")
39
Specify a specific delimiter while reading csv
```
df3 = spark.read.option("delimiter", ",").csv("test.csv")
df3 = spark.read.options(delimiter=',').csv("test.csv")
```
40
How to change the default (string) datatypes that Spark assigns when reading a csv
```
df3 = spark.read.option("inferSchema", True).csv("test.csv")
df3 = spark.read.options(inferSchema='True').csv("test.csv")
```
41
Set both delimiter and inferSchema
df3 = spark.read.option("delimiter", ",").option("inferSchema", True).csv("test.csv")
df3 = spark.read.options(inferSchema='True', delimiter='|').csv("test.csv")
42
how to import datatypes
from pyspark.sql.types import *
43
Read with a custom schema - I don't want the default string schema or inferSchema; I want to specify my own datatypes
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('id', IntegerType(), True)
])
df = spark.read.format("csv").option("header", True).schema(schema).load("file.csv")
df = spark.read.option("header", True).schema(schema).csv("file.csv")
44
Write a dataframe to a csv file with a header
df.write.format("csv").option("header", True).save("demo.csv")
df.write.option("header", True).csv("demo.csv")
45
Modes while saving a dataframe as a file
1. overwrite – overwrites the file if it already exists.
2. append – adds the data to the existing file.
3. ignore – skips the write operation when the file already exists.
4. error – the default; returns an error if the file already exists.
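A minimal sketch of applying one of these modes while writing (the path is just an example):

```
df.write.mode("overwrite").option("header", True).csv("demo.csv")
```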
46
append mode
df.write.mode("append").option("header", True).csv("demo.csv")
47
create spark session
spark = SparkSession \
    .builder \
    .appName("App1") \
    .getOrCreate()
48
Check the type of variable
type(df)
49
Read a json file
1. df = spark.read.json("file.json") | 2. df = spark.read.format("json").load("file.json")
50
import statement to import SparkSession
from pyspark.sql import SparkSession
51
```
[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]
```
Read the multiline json records
df = spark.read.options(multiLine=True).json("file.json")
52
Read multiple json files
df = spark.read.json(["json path1","json path2","json path3"])
53
Read all json files in a directory
df = spark.read.json("files/*.json")
54
Pass custom schema for each of the columns for json
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])
df = spark.read.schema(schema).json("file.json")
55
Write a dataframe as json file
df.write.json("file.json")
56
Write a dataframe as json file - append mode
df.write.mode("append").json("file.json")
57
Create a dataframe using parallelize
```
from pyspark.sql import Row

dept = [Row("A", 10), Row("B", 20), Row("C", 30)]
rdd = spark.sparkContext.parallelize(dept)
df = rdd.toDF(["col1", "col2"])
```
58
Create dataframe from list without using parallelize
```
dept = [("A", 10), ("B", 20), ("C", 30)]
col_names = ["col_1_name", "col_2_name"]
df = spark.createDataFrame(data=dept, schema=col_names)
```
59
I have two tables. One table has id and location. The second table has all the ids that are assigned a parking space. I need an output report with the id, the location, and whether a parking space is allocated - is_parking_allocated (Y/N)
A left outer join between the two tables on id, then derive the flag from whether a match was found.
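A minimal sketch under assumed DataFrame names (emp_df with id and location, parking_df with the ids that have a space); the same join pattern fits the next two cards as well:

```
from pyspark.sql import functions as func

report = (emp_df.alias("e")
          .join(parking_df.alias("p"), emp_df.id == parking_df.id, "left_outer")
          .withColumn("is_parking_allocated",
                      func.when(func.col("p.id").isNotNull(), "Y").otherwise("N"))
          .select("e.id", "e.location", "is_parking_allocated"))
```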
60
I have two tables. Table 1 - ids, location; Table 2 - ids, salary. Output: ids, location and salary
A join between the two tables on id.
61
I have two tables. Table 1 - parts and price; Table 2 - only the parts purchased in the last 2 months. Output: parts, price and a flag for whether the part was purchased in the last 2 months - flag_2_months
A left outer join from Table 1 to Table 2 on the part, then derive flag_2_months from whether a match was found.
62
Declare udf
```
from pyspark.sql import functions as func

def split_str(s):
    return s.split("_")[1]

split_str_udf = func.udf(split_str)  # udf registration

df1 = df.withColumn("last_name", split_str_udf("full_name"))
```
63
Read avro
df = spark.read.format("avro").load("avro_file_path")
64
save avro
df.write.format("avro").save("avro_file_path")
65
left outer join df1 with df2 with alias; join on id present in both tables and select two columns, one from each table
df1.alias("a").join(df2.alias("b"), df1.id == df2.id, "left_outer").select("a.col1","b.col2")
66
inner join df1 with df2 with alias; join on id present in both tables and select two columns, one from each table
df1.alias("a").join(df2.alias("b"), df1.id == df2.id, "inner").select("a.col1","b.col2")
67
right outer join df2 with df1 with alias; join on id present in both tables and select two columns, one from each table
df2.alias("a").join(df1.alias("b"), df2.id == df1.id, "right_outer").select("b.col1", "a.col2")