Practice Questions - exam-certified-associate-developer-for-apache-spark Flashcards
Which of the following statements about Spark’s stability is incorrect?
A. Spark is designed to support the loss of any set of worker nodes.
B. Spark will rerun any failed tasks due to failed worker nodes.
C. Spark will recompute data cached on failed worker nodes.
D. Spark will spill data to disk if it does not fit in memory.
E. Spark will reassign the driver to a worker node if the driver’s node fails.
E. Spark will reassign the driver to a worker node if the driver’s node fails.
Explanation:
The driver program in Spark is responsible for coordinating and controlling the Spark application. It runs on a separate node and is not automatically reassigned to another worker node if it fails. If the driver node fails, the entire Spark application typically fails and needs to be restarted.
Options A, B, C, and D are correct statements about Spark’s stability features:
* A: Spark is designed to handle worker node failures by redistributing tasks to other available workers.
* B: Spark will automatically rerun failed tasks due to worker node failures to ensure fault tolerance.
* C: Spark can recompute data cached on failed worker nodes using lineage information.
* D: Spark will spill data to disk if it exceeds available memory, preventing the application from crashing.
Which of the following operations fails to return a DataFrame with no duplicate rows?
A. DataFrame.dropDuplicates()
B. DataFrame.distinct()
C. DataFrame.drop_duplicates()
D. DataFrame.drop_duplicates(subset = None)
E. DataFrame.drop_duplicates(subset = "all")
E. DataFrame.drop_duplicates(subset = "all")
DISCUSSION:
The question asks which operation fails to return a DataFrame with no duplicate rows. Options A, B, C, and D all correctly remove duplicate rows: dropDuplicates(), distinct(), and drop_duplicates() are equivalent and remove duplicate rows across all columns, and drop_duplicates(subset=None) is likewise equivalent to removing duplicates across all columns. Option E, drop_duplicates(subset = "all"), is incorrect because the subset parameter expects a list or tuple of column names, not the string "all". This call causes an error, so it fails to return a DataFrame with no duplicate rows.
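For reference, a quick PySpark sketch of these deduplication calls, using a hypothetical SparkSession named spark and a made-up two-column DataFrame:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with one duplicate row
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "label"])

df.dropDuplicates().count()              # 2 -- duplicates across all columns removed
df.distinct().count()                    # 2 -- equivalent to dropDuplicates()
df.drop_duplicates(subset=None).count()  # 2 -- subset=None also means "all columns"
# df.drop_duplicates(subset="all")       # fails: subset expects a list of column names, not "all"
```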
Of the following situations, in which will it be most advantageous to store DataFrame df at the MEMORY_AND_DISK storage level rather than the MEMORY_ONLY storage level?
A. When all of the computed data in DataFrame df can fit into memory.
B. When the memory is full and it’s faster to recompute all the data in DataFrame df rather than read it from disk.
C. When it’s faster to recompute all the data in DataFrame df that cannot fit into memory based on its logical plan rather than read it from disk.
D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.
E. The storage level MEMORY_ONLY will always be more advantageous because it’s faster to read data from memory than it is to read data from disk.
D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.
DISCUSSION:
The correct answer is D.
- Why D is correct: MEMORY_AND_DISK spills data to disk when it doesn’t fit in memory. This is advantageous when recomputing the data (based on the DataFrame’s logical plan) is slower than reading it from disk.
- Why the other options are incorrect:
  - A: If all data fits in memory, MEMORY_ONLY is preferable as it avoids disk I/O.
  - B & C: If recomputation is faster than reading from disk, MEMORY_ONLY is better because the parts of the DataFrame that overflow memory won’t be stored on disk and will instead be recomputed when needed.
  - E: MEMORY_ONLY is not always more advantageous. When the data exceeds available memory and recomputation is expensive, MEMORY_AND_DISK provides a performance benefit.
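As a quick illustration of the two storage levels discussed above, here is a hedged PySpark sketch (assuming a SparkSession named spark and a throwaway DataFrame):
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)  # hypothetical DataFrame standing in for df

# MEMORY_ONLY: partitions that don't fit in memory are recomputed from the lineage when needed
df.persist(StorageLevel.MEMORY_ONLY)
df.count()      # an action materializes (and caches) the data
df.unpersist()

# MEMORY_AND_DISK: partitions that don't fit in memory are spilled to disk and read back later
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()
```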
Which of the following cluster configurations is most likely to experience an out-of-memory error in response to data skew in a single partition?
Note: each configuration has roughly the same compute power, using 100 GB of RAM and 200 cores.
A. Scenario #4
B. Scenario #5
C. Scenario #6
D. More information is needed to determine an answer.
E. Scenario #1
C. Scenario #6
Scenario #6 has the smallest executor size (12.5 GB). Data skew means one partition has significantly more data than others. With a small executor size, it’s more likely that a skewed partition will exceed the executor’s memory, resulting in an out-of-memory error.
The other scenarios have larger executor sizes, making them less susceptible to out-of-memory errors from a single skewed partition. Scenario #1 is the least likely to OOM because it has a single, very large executor.
Which of the following describes the relationship between nodes and executors?
A. Executors and nodes are not related.
B. A node is a processing engine running on an executor.
C. An executor is a processing engine running on a node.
D. There are always the same number of executors and nodes.
E. There are always more nodes than executors.
C. An executor is a processing engine running on a node.
DISCUSSION:
The correct answer is C. In Spark, a node is a machine in the cluster, and an executor is a process that runs on that node to perform tasks. Therefore, an executor runs on a node.
Option A is incorrect because executors and nodes are directly related in a Spark cluster. Executors operate within nodes.
Option B is incorrect because the opposite is true; executors run on nodes, not the other way around.
Option D is incorrect because the number of executors and nodes is usually different. A node can have multiple executors.
Option E is incorrect because typically, a node has one or more executors, so there are not always more nodes than executors.
Which of the following will occur if there are more slots than there are tasks?
A. The Spark job will likely not run as efficiently as possible.
B. The Spark application will fail – there must be at least as many tasks as there are slots.
C. Some executors will shut down and allocate all slots on larger executors first.
D. More tasks will be automatically generated to ensure all slots are being used.
E. The Spark job will use just one single slot to perform all tasks.
A. The Spark job will likely not run as efficiently as possible.
DISCUSSION:
If there are more slots than tasks, it means some slots will be idle, leading to underutilization of resources and reduced efficiency.
Option A is correct because the job will still run, but with wasted resources, making it less efficient.
Option B is incorrect because the job will not fail simply because there are more slots than tasks.
Option C is incorrect because executors don’t automatically shut down simply due to unused slots, though dynamic allocation can release executors after an idle timeout.
Option D is incorrect because Spark will not automatically generate more tasks to fill the slots.
Option E is incorrect because Spark will distribute the existing tasks across available slots, not consolidate everything into a single slot.
Which of the following code blocks returns a new DataFrame with column storeDescription where the pattern “Description: “ has been removed from the beginning of column storeDescription in DataFrame storesDF?
A sample of DataFrame storesDF is below:
(image not reproduced)
A. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: "))
B. storesDF.withColumn("storeDescription", col("storeDescription").regexp_replace("^Description: ", ""))
C. storesDF.withColumn("storeDescription", regexp_extract(col("storeDescription"), "^Description: ", ""))
D. storesDF.withColumn("storeDescription", regexp_replace("storeDescription", "^Description: ", ""))
E. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
E. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
DISCUSSION:
Option E is correct. It uses the withColumn function to create (or replace) the column named "storeDescription". The regexp_replace function is used correctly: the first argument is the column to operate on (obtained with col("storeDescription")), the second argument is the regex pattern to replace ("^Description: "), and the third argument is the replacement string (an empty string here).
Option A is incorrect because it is missing the replacement string argument in regexp_replace.
Option B is incorrect because regexp_replace is a function in pyspark.sql.functions and must be called as regexp_replace(col(...), pattern, replacement). It is not a method of the Column object.
Option C is incorrect because it uses regexp_extract, which extracts a string matching the regex instead of replacing it.
Option D is syntactically correct and often works (especially in later Spark versions) because the column name string is implicitly converted to a Column. However, Option E is more explicit and more broadly compatible, making it the better answer.
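For reference, a minimal sketch of option E in PySpark, using a hypothetical SparkSession named spark and a made-up sample of storesDF:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample mirroring storesDF
storesDF = spark.createDataFrame(
    [(1, "Description: Flagship store"), (2, "Description: Outlet store")],
    ["storeId", "storeDescription"],
)

# Remove the leading "Description: " prefix by replacing it with an empty string
cleanedDF = storesDF.withColumn(
    "storeDescription",
    regexp_replace(col("storeDescription"), "^Description: ", ""),
)
cleanedDF.show(truncate=False)
```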
The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.
A sample of DataFrame storesDF is displayed below:
(image not reproduced)
Code block:
```python
storesDF.na.fill(30000, col("sqft"))
```
A. The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object.
B. The na.fill() operation does not work and should be replaced by the dropna() operation.
C. The argument to the subset parameter of fill() should be the numerical position of the column rather than a Column object.
D. The na.fill() operation does not work and should be replaced by the nafill() operation.
E. The na.fill() operation does not work and should be replaced by the fillna() operation.
A. The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object.
DISCUSSION:
The correct answer is A. The na.fill() (or fillna()) method in PySpark expects a string or list of strings representing column names for the subset argument, not a Column object created by col(). Options B, D, and E are incorrect because na.fill() is a valid method for filling missing values (it is equivalent to fillna()), while dropna() removes rows with missing values instead of filling them. Option C is incorrect because the numerical position of the column is not the correct way to reference a column.
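A minimal sketch of the corrected call, assuming a SparkSession named spark and a made-up sample of storesDF with a missing sqft value:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample with one missing sqft value
storesDF = spark.createDataFrame([(1, 25000), (2, None)], ["storeId", "sqft"])

# subset takes string column names (or a list of them), not a Column object
filledDF = storesDF.na.fill(30000, subset=["sqft"])  # equivalently: storesDF.fillna(30000, subset=["sqft"])
filledDF.show()
```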
Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?
A. storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
B. storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct"))
C. storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))
D. storesDF.agg(approx_count_distinct(col("division"), 0.0).alias("divisionDistinct"))
E. storesDF.agg(approx_count_distinct(col("division"), 0.05).alias("divisionDistinct"))
C. storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))
DISCUSSION:
The question asks for the code block that will most quickly return an approximation. The approx_count_distinct function takes an optional second argument that specifies the maximum estimation error allowed. A larger error value allows for a faster, but less accurate, estimate. Option C has the largest error value (0.15), so it will be the fastest.
Options A, B, D, and E all use a smaller error value than Option C (or the default), and will therefore take longer to compute. Therefore, they are all incorrect.
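For reference, a hedged sketch of option C (assuming a SparkSession named spark and a tiny made-up storesDF); the second argument is the maximum allowed relative standard deviation:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, "west"), (2, "east"), (3, "west")], ["storeId", "division"]
)

# A larger rsd (here 0.15 vs. the 0.05 default) trades accuracy for speed
storesDF.agg(
    approx_count_distinct(col("division"), 0.15).alias("divisionDistinct")
).show()
```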
Which of the following code blocks returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory?
A.
```python
(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[1])
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[2]))
```
B.
```python
(storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[0])
    .withColumn("storeSizeCategory", col("storeCategory").split("_")[1]))
```
C.
```python
(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
```
D.
```python
(storesDF.withColumn("storeValueCategory", split("storeCategory", "_")[0])
    .withColumn("storeSizeCategory", split("storeCategory", "_")[1]))
```
E.
```python
(storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[1])
    .withColumn("storeSizeCategory", col("storeCategory").split("_")[2]))
```
C
DISCUSSION:
Option C is the correct answer. It uses the split function from pyspark.sql.functions along with col to correctly split the storeCategory column. The split function returns an array, and [0] and [1] access the first and second elements of the array, which are assigned to the new columns storeValueCategory and storeSizeCategory.
Option A is incorrect because it accesses the second and third elements of the split array using indices [1] and [2]. Since storeCategory only has two parts separated by an underscore, index [2] refers to a third element that does not exist, so the result will not contain the intended values.
Option B is incorrect because it calls a .split() method directly on the Column object, which is not how this operation is performed in Spark. The split function from pyspark.sql.functions should be used instead.
Option D is incorrect because it passes the string literal "storeCategory" to the split function rather than an explicit Column built with col("storeCategory"). As with the earlier regexp_replace question, later PySpark versions may accept a column-name string here, but Option C's explicit col() form is the expected answer.
Option E is incorrect for the same reasons as Options A and B: it uses the wrong indices [1] and [2], and it calls .split() directly on the Column object instead of using the split function from pyspark.sql.functions.
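A minimal sketch of option C, assuming a SparkSession named spark and hypothetical storeCategory values of the form value_size:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, "value_large"), (2, "discount_small")], ["storeId", "storeCategory"]
)

# split() returns an array column; [0] and [1] pick out the two parts
resultDF = (storesDF
    .withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
resultDF.show()
```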
The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF. Identify the error.
storesDF.join(employeesDF, "cross")
A. A cross join is not implemented by the DataFrame.join() operation – the standalone CrossJoin() operation should be used instead.
B. There is no direct cross join in Spark, but it can be implemented by performing an outer join on all columns of both DataFrames.
C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.
D. There is no key column specified – the key column “storeId” should be the second argument.
E. A cross join is not implemented by the DataFrame.join() operation – the standalone join() operation should be used instead.
C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.
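For reference, a minimal sketch of the corrected cross join, using hypothetical single-column stand-ins for storesDF and employeesDF:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two DataFrames
storesDF = spark.createDataFrame([(1,), (2,)], ["storeId"])
employeesDF = spark.createDataFrame([(10,), (20,)], ["employeeId"])

# Cartesian product: every store paired with every employee (4 rows here)
crossedDF = storesDF.crossJoin(employeesDF)
crossedDF.show()
```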
The code block shown below should create a single-column DataFrame from Python list years, which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block: _1_._2_(_3_, _4_)
A.
1. spark
2. createDataFrame
3. years
4. IntegerType
B.
1. DataFrame
2. create
3. [years]
4. IntegerType
C.
1. spark
2. createDataFrame
3. [years]
4. IntegertType
D.
1. spark
2. createDataFrame
3. [years]
4. IntegertType()
E.
1. spark
2. createDataFrame
3. years
4. IntegerType()
E.
Explanation:
Option E correctly uses the spark.createDataFrame() method to create a DataFrame from the Python list years, specifying the schema using IntegerType().
- spark is the SparkSession object.
- createDataFrame() is the method used to create a DataFrame.
- years is the Python list containing the data.
- IntegerType() specifies that the data type of the column should be integer.
Why other options are incorrect:
- A: IntegerType (without the parentheses) is a class and needs to be instantiated with ().
- B: DataFrame.create is not the correct method for creating a DataFrame from a Python list.
- C: Incorrectly uses IntegertType (misspelled) and [years], which would create a DataFrame with a single row containing a list. Also, the type class needs to be instantiated with ().
- D: Incorrectly uses IntegertType (misspelled) and [years], which would create a DataFrame with a single row containing a list.
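A minimal sketch of option E, assuming a SparkSession named spark and a hypothetical list of years:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

years = [2019, 2020, 2021, 2022]  # hypothetical list of integers

# The list is passed directly (not wrapped in another list) and IntegerType must be instantiated
yearsDF = spark.createDataFrame(years, IntegerType())
yearsDF.show()
```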
The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()
A. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
C. The storesDF DataFrame has not been checkpointed – it must have a checkpoint in order to be cached.
D. DataFrames themselves cannot be cached – DataFrame storesDF must be cached as a table.
E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.
A. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
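For reference, a minimal sketch of requesting memory-only storage in PySpark via persist() with an explicit StorageLevel (assuming a SparkSession named spark and a throwaway stand-in for storesDF):
```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.range(1000)  # hypothetical stand-in for storesDF

# Pin the DataFrame in memory only, then trigger materialization with an action
storesDF.persist(StorageLevel.MEMORY_ONLY)
storesDF.count()
```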
Which of the following code blocks returns a DataFrame containing a column dayOfYear, an integer representation of the day of the year from column openDate from DataFrame storesDF?
Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.
A sample of storesDF is displayed below:
A.
```python
(storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp"))
    .withColumn("dayOfYear", dayofyear(col("openTimestamp"))))
```
B.
```python
storesDF.withColumn("dayOfYear", get dayofyear(col("openDate")))
```
C.
```python
storesDF.withColumn("dayOfYear", dayofyear(col("openDate")))
```
D.
```python
(storesDF.withColumn("openDateFormat", col("openDate").cast("Date"))
    .withColumn("dayOfYear", dayofyear(col("openDateFormat"))))
```
E.
```python
storesDF.withColumn("dayOfYear", substr(col("openDate"), 4, 6))
```
A. First, the openDate column, which is in UNIX epoch format (seconds since January 1, 1970), needs to be converted to a Timestamp type. This is done using col("openDate").cast("Timestamp"). Then the dayofyear function can be applied to the Timestamp column to extract the day of the year.
Option B is incorrect because it contains invalid syntax (get dayofyear).
Option C is incorrect because dayofyear expects a Timestamp (or Date) column, not an integer representing seconds since the epoch.
Option D is incorrect because an integer of epoch seconds cannot simply be cast to Date (it must first be interpreted as a Timestamp), so this is not the correct approach given the initial data format.
Option E is incorrect because substr extracts a substring, which is not relevant to calculating the day of the year.
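A minimal sketch of option A, assuming a SparkSession named spark and hypothetical epoch-second values in openDate:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dayofyear

spark = SparkSession.builder.getOrCreate()

# openDate holds seconds since the UNIX epoch (made-up values)
storesDF = spark.createDataFrame([(1, 1577836800), (2, 1596240000)], ["storeId", "openDate"])

# Cast epoch seconds to a timestamp, then extract the day of the year
resultDF = (storesDF
    .withColumn("openTimestamp", col("openDate").cast("timestamp"))
    .withColumn("dayOfYear", dayofyear(col("openTimestamp"))))
resultDF.show()
```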
Which of the following is the most granular level of the Spark execution hierarchy?
A. Task
B. Executor
C. Node
D. Job
E. Slot
A. Task
DISCUSSION:
The Spark execution hierarchy, from highest to lowest level, is Job -> Stage -> Task. A Job is a high-level set of operations. A Job is broken down into Stages, which are groups of tasks that can be executed together. Stages are broken down into Tasks, which are the smallest unit of work that Spark can execute. An Executor is a process that runs tasks, and a Node is a machine in the cluster. A Slot is a unit of computation on an executor. Therefore, the most granular level is the Task.
Options B, C, D, and E are incorrect because they represent higher levels of abstraction in the Spark execution hierarchy or units within the executors.
Which of the following describes the Spark driver?
A. The Spark driver is responsible for performing all execution in all execution modes – it is the entire Spark application.
B. The Spark driver is fault tolerant – if it fails, it will recover the entire Spark application.
C. The Spark driver is the coarsest level of the Spark execution hierarchy – it is synonymous with the Spark application.
D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application.
E. The Spark driver is horizontally scaled to increase overall processing throughput of a Spark application.
D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application.
Explanation:
The Spark driver is the process that runs the main function of your Spark application and coordinates the execution of the Spark job. It’s responsible for creating the SparkContext, submitting tasks to the cluster, and monitoring their execution.
- A is incorrect because the Spark driver coordinates the execution, but it does not perform all execution itself. Executors on worker nodes do much of the processing.
- B is incorrect because the Spark driver is not inherently fault-tolerant. If the driver fails, the application typically fails.
- C is incorrect because the Spark driver is a component within the Spark application. It’s not synonymous with the entire application.
- E is incorrect because the Spark driver is generally not horizontally scaled. The executors are scaled to increase throughput.
Which of the following statements about Spark jobs is incorrect?
A. Jobs are broken down into stages.
B. There are multiple tasks within a single job when a DataFrame has more than one partition.
C. Jobs are collections of tasks that are divided up based on when an action is called.
D. There is no way to monitor the progress of a job.
E. Jobs are collections of tasks that are divided based on when language variables are defined.
D. There is no way to monitor the progress of a job.
Explanation:
Spark provides a web UI, metrics, and APIs to monitor job progress. Therefore, statement D is incorrect.
- A is correct: Spark jobs are indeed broken down into stages.
- B is correct: Each partition typically results in a task, so multiple partitions lead to multiple tasks within a job.
- C is correct: Actions trigger the execution of jobs, and tasks are divided based on these actions.
- E is incorrect: Task division is not based on when language variables are defined.
Which of the following operations is most likely to result in a shuffle?
A. DataFrame.join()
B. DataFrame.filter()
C. DataFrame.union()
D. DataFrame.where()
E. DataFrame.drop()
A. DataFrame.join()
Explanation:
DataFrame.join() is a wide transformation that often requires data shuffling. When joining two DataFrames based on a key, Spark needs to redistribute the data across the cluster to ensure that rows with the same key are located on the same partition. This redistribution process is called shuffling and involves significant data movement across the network.
The other options are less likely to cause a shuffle:
- DataFrame.filter(), DataFrame.where(): These operations filter rows based on a condition and can be performed within each partition without shuffling data.
- DataFrame.union(): This operation combines two DataFrames by appending the rows of one to the other. While it might involve some data movement, it doesn’t typically require a full shuffle.
- DataFrame.drop(): This operation removes columns from a DataFrame and can be performed within each partition.
Which of the following is the most complete description of lazy evaluation?
A. None of these options describe lazy evaluation
B. A process is lazily evaluated if its execution does not start until it is put into action by some type of trigger
C. A process is lazily evaluated if its execution does not start until it is forced to display a result to the user
D. A process is lazily evaluated if its execution does not start until it reaches a specified date and time
E. A process is lazily evaluated if its execution does not start until it is finished compiling
B. A process is lazily evaluated if its execution does not start until it is put into action by some type of trigger.
Explanation: Lazy evaluation means delaying the evaluation of an expression until its value is needed. Option B accurately describes this, as the execution is triggered only when the result is required. Options C, D, and E are incorrect because they describe specific types of triggers (displaying to user, reaching a date/time, finishing compiling) which are not the comprehensive definition of lazy evaluation. Option A is incorrect because option B is a valid description.
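A small PySpark illustration of this trigger-based behavior (assuming a SparkSession named spark): the transformations below only build a logical plan, and nothing executes until the action at the end.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # hypothetical DataFrame

# Transformations are recorded lazily; no job runs yet
filtered = df.filter(col("id") % 2 == 0).select((col("id") * 10).alias("scaled"))

# The action is the trigger: only now does Spark plan and execute the work
print(filtered.count())
```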
A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?
A. Either DataFrame can be broadcasted. Their results will be identical in result and efficiency.
B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
C. DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
E. DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of itself.
B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
DISCUSSION:
The correct answer is B. In a broadcast join, the smaller DataFrame is broadcasted to all executors. This avoids shuffling the smaller DataFrame, which is more efficient.
Option A is incorrect because broadcasting the larger DataFrame would be inefficient and could lead to memory issues. Also, the efficiency of the operation would not be identical.
Option C is incorrect because broadcasting the larger DataFrame would be inefficient and could lead to memory issues.
Option D is incorrect because, while DataFrame B should indeed be broadcasted, this option attributes the benefit to eliminating the shuffle of DataFrame A rather than the shuffle of DataFrame B itself.
Option E is incorrect because DataFrame A is the larger DataFrame, not the smaller one. Broadcasting a larger DataFrame is generally not a good practice.
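For reference, a hedged sketch of hinting a broadcast join in PySpark, using hypothetical stand-ins for the large and small DataFrames:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins: dfA is the large side, dfB the small side
dfA = spark.range(1_000_000).withColumnRenamed("id", "key")
dfB = spark.range(100).withColumnRenamed("id", "key")

# Ask Spark to broadcast the small DataFrame to every executor
joinedDF = dfA.join(broadcast(dfB), "key")
joinedDF.explain()  # the physical plan should show a BroadcastHashJoin
```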
Which of the following operations can be used to create a DataFrame with a subset of columns from DataFrame storesDF that are specified by name?
A. storesDF.subset()
B. storesDF.select()
C. storesDF.selectColumn()
D. storesDF.filter()
E. storesDF.drop()
B. storesDF.select()
The select() operation allows you to choose a subset of columns by specifying their names. Options A and C are not valid DataFrame operations. filter() is used to select rows based on a condition, not columns. While drop() can achieve a similar result to select() by specifying columns to exclude, select() is the more direct method for selecting a subset of columns.
The code block shown below contains an error. The code block is intended to return a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction. Identify the error.
storesDF.drop(sqft, customerSatisfaction)
A. The drop() operation only works if one column name is called at a time – there should be two calls in succession like storesDF.drop(“sqft”).drop(“customerSatisfaction”).
B. The drop() operation only works if column names are wrapped inside the col() function like storesDF.drop(col(sqft), col(customerSatisfaction)).
C. There is no drop() operation for storesDF.
D. The sqft and customerSatisfaction column names should be quoted like “sqft” and “customerSatisfaction”.
E. The sqft and customerSatisfaction column names should be subset from the DataFrame storesDF like storesDF.”sqft” and storesDF.”customerSatisfaction”.
D. The sqft and customerSatisfaction column names should be quoted like “sqft” and “customerSatisfaction”.
DISCUSSION:
The correct answer is D. In most DataFrame implementations (including Spark’s), when using the drop() function to remove columns by name, the column names must be provided as strings. Therefore, sqft and customerSatisfaction should be enclosed in quotes: "sqft" and "customerSatisfaction".
A is incorrect because while chaining drop() calls is a valid approach, it is not the fundamental error in the original code. The immediate error is the unquoted column names.
B is incorrect because the col() function is not required when simply specifying column names as strings to be dropped.
C is incorrect because drop() is a standard DataFrame operation.
E is incorrect because storesDF."sqft" is not valid syntax for referencing column names in the drop() function.
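A minimal sketch contrasting select() and drop() with quoted column names, using a made-up one-row storesDF and a SparkSession named spark:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, 25000, 90, "west")], ["storeId", "sqft", "customerSatisfaction", "division"]
)

# Keep a subset of columns by name
storesDF.select("storeId", "division").show()

# Or drop columns by name -- note the quoted string column names
storesDF.drop("sqft", "customerSatisfaction").show()
```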
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000?
A. storesDF.filter("sqft" <= 25000)
B. storesDF.filter(sqft > 25000)
C. storesDF.where(storesDF[sqft] > 25000)
D. storesDF.where(sqft > 25000)
E. storesDF.filter(col("sqft") <= 25000)
E. storesDF.filter(col("sqft") <= 25000)
Explanation:
Option E is correct because it uses the filter() method along with the col() function to properly reference the sqft column and applies the correct condition (less than or equal to 25,000).
- A: Incorrect. "sqft" <= 25000 is a string comparison, not a column comparison.
- B: Incorrect. sqft > 25000 is not how you reference a column within filter without col(). It also filters for values greater than 25,000, not less than or equal to.
- C: Incorrect. storesDF[sqft] is not the correct way to reference a column in PySpark, and the condition is also reversed. where is an alias for filter, but it still requires a column object.
- D: Incorrect. sqft > 25000 is not the correct way to reference a column in PySpark, and the condition is reversed.
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?
A. storesDF.filter(col("sqft") <= 25000 | col("customerSatisfaction") >= 30)
B. storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
C. storesDF.filter(sqft <= 25000 or customerSatisfaction >= 30)
D. storesDF.filter(col(sqft) <= 25000 | col(customerSatisfaction) >= 30)
E. storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))
E. storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))
The correct answer is E because it correctly uses the col() function to reference the column names and uses the bitwise OR operator | to combine the two conditions. The parentheses are also correctly placed to ensure the intended order of operations.
Option A is incorrect because it does not have parentheses around the conditions (| binds more tightly than the comparisons). Option B is incorrect because it uses the Python or operator instead of the bitwise | operator needed for Spark column expressions. Option C is incorrect because it does not use the col() function to reference the columns and also uses Python’s or. Option D is incorrect because it uses col(sqft) instead of col("sqft").