Practice Questions - exam-certified-associate-developer-for-apache-spark Flashcards

(188 cards)

1
Q

Which of the following statements about Spark’s stability is incorrect?
A. Spark is designed to support the loss of any set of worker nodes.
B. Spark will rerun any failed tasks due to failed worker nodes.
C. Spark will recompute data cached on failed worker nodes.
D. Spark will spill data to disk if it does not fit in memory.
E. Spark will reassign the driver to a worker node if the driver’s node fails.

A

E. Spark will reassign the driver to a worker node if the driver’s node fails.

Explanation:

The driver program in Spark is responsible for coordinating and controlling the Spark application. It runs on a separate node and is not automatically reassigned to another worker node if it fails. If the driver node fails, the entire Spark application typically fails and needs to be restarted.

Options A, B, C, and D are correct statements about Spark’s stability features:
* A: Spark is designed to handle worker node failures by redistributing tasks to other available workers.
* B: Spark will automatically rerun failed tasks due to worker node failures to ensure fault tolerance.
* C: Spark can recompute data cached on failed worker nodes using lineage information.
* D: Spark will spill data to disk if it exceeds available memory, preventing the application from crashing.

2
Q

Which of the following operations fails to return a DataFrame with no duplicate rows?
A.
DataFrame.dropDuplicates()
B.
DataFrame.distinct()
C.
DataFrame.drop_duplicates()
D.
DataFrame.drop_duplicates(subset = None)
E.
DataFrame.drop_duplicates(subset = "all")

A

E. DataFrame.drop_duplicates(subset = "all")

DISCUSSION:
The question asks which operation fails to return a DataFrame with no duplicate rows. Options A, B, C, and D all correctly remove duplicate rows. dropDuplicates(), distinct(), and drop_duplicates() are all equivalent and remove duplicate rows across all columns. drop_duplicates(subset=None) is also equivalent to removing duplicates across all columns. Option E, drop_duplicates(subset = "all"), is incorrect because the subset parameter expects a list or tuple of column names, not the string “all”. This will cause an error, thus failing to return a DataFrame with no duplicate rows.
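
To see the equivalence concretely, here is a minimal sketch; the SparkSession and the tiny throwaway DataFrame are assumptions made for illustration, not part of the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "label"])

# These three calls are interchangeable and drop full-row duplicates.
df.dropDuplicates().show()
df.distinct().show()
df.drop_duplicates(subset=None).show()

# Passing the string "all" instead of a list of column names fails;
# the exact exception depends on the Spark version.
# df.drop_duplicates(subset="all")
```
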

3
Q

Of the following situations, in which will it be most advantageous to store DataFrame df at the MEMORY_AND_DISK storage level rather than the MEMORY_ONLY storage level?
A.
When all of the computed data in DataFrame df can fit into memory.
B.
When the memory is full and it’s faster to recompute all the data in DataFrame df rather than read it from disk.
C.
When it’s faster to recompute all the data in DataFrame df that cannot fit into memory based on its logical plan rather than read it from disk.
D.
When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.
E.
The storage level MEMORY_ONLY will always be more advantageous because it’s faster to read data from memory than it is to read data from disk.

A

D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.

DISCUSSION:
The correct answer is D.

  • Why D is correct: MEMORY_AND_DISK spills data to disk when it doesn’t fit in memory. This is advantageous when recomputing the data (based on the DataFrame’s logical plan) is slower than reading it from disk.
  • Why other options are incorrect:
    • A: If all data fits in memory, MEMORY_ONLY is preferable as it avoids disk I/O.
    • B & C: If recomputation is faster than reading from disk, MEMORY_ONLY is better because the parts of the DataFrame that overflow memory won’t be stored on disk, and will instead be recomputed when needed.
    • E: MEMORY_ONLY is not always more advantageous. When the data exceeds available memory and recomputation is expensive, MEMORY_AND_DISK provides a performance benefit.
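
To make the trade-off concrete, a minimal sketch follows; it assumes an active SparkSession named spark and uses a synthetic DataFrame in place of df:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # synthetic stand-in for an expensive-to-recompute DataFrame

# MEMORY_AND_DISK spills partitions that do not fit in memory to disk,
# so they are read back rather than recomputed from the logical plan.
df.persist(StorageLevel.MEMORY_AND_DISK)

# With MEMORY_ONLY, partitions that do not fit are simply not cached and are
# recomputed from lineage when needed (call unpersist() first to switch levels).
# df.persist(StorageLevel.MEMORY_ONLY)

df.count()  # an action materializes the cache
```
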
4
Q

Which of the following cluster configurations is most likely to experience an out-of-memory error in response to data skew in a single partition?

Image

Note: each configuration has roughly the same compute power using 100 GB of RAM and 200 cores.

A. Scenario #4
B. Scenario #5
C. Scenario #6
D. More information is needed to determine an answer.
E. Scenario #1

A

C. Scenario #6

Scenario #6 has the smallest executor size (12.5 GB). Data skew means one partition has significantly more data than others. With a small executor size, it’s more likely that a skewed partition will exceed the executor’s memory, resulting in an out-of-memory error.

The other scenarios have larger executor sizes, making them less susceptible to out-of-memory errors from a single skewed partition. Scenario #1 is the least likely to OOM because it has a single, very large executor.

5
Q

Which of the following describes the relationship between nodes and executors?
A.
Executors and nodes are not related.
B.
A node is a processing engine running on an executor.
C.
An executor is a processing engine running on a node.
D.
There are always the same number of executors and nodes.
E.
There are always more nodes than executors.

A

C. An executor is a processing engine running on a node.

DISCUSSION:
The correct answer is C. In Spark, a node is a machine in the cluster, and an executor is a process that runs on that node to perform tasks. Therefore, an executor runs on a node.

Option A is incorrect because executors and nodes are directly related in a Spark cluster. Executors operate within nodes.
Option B is incorrect because the opposite is true; executors run on nodes, not the other way around.
Option D is incorrect because the number of executors and nodes is usually different. A node can have multiple executors.
Option E is incorrect because typically, a node has one or more executors, so there are not always more nodes than executors.

6
Q

Which of the following will occur if there are more slots than there are tasks?
A. The Spark job will likely not run as efficiently as possible.
B. The Spark application will fail – there must be at least as many tasks as there are slots.
C. Some executors will shut down and allocate all slots on larger executors first.
D. More tasks will be automatically generated to ensure all slots are being used.
E. The Spark job will use just one single slot to perform all tasks.

A

A. The Spark job will likely not run as efficiently as possible.

DISCUSSION:
If there are more slots than tasks, it means some slots will be idle, leading to underutilization of resources and reduced efficiency.
Option A is correct because the job will still run, but with wasted resources, making it less efficient.
Option B is incorrect because the job will not fail simply because there are more slots than tasks.
Option C is incorrect because executors don’t automatically shut down simply due to unused slots, though dynamic allocation can release executors after an idle timeout.
Option D is incorrect because Spark will not automatically generate more tasks to fill the slots.
Option E is incorrect because Spark will distribute the existing tasks across available slots, not consolidate everything into a single slot.

7
Q

Which of the following code blocks returns a new DataFrame with column storeDescription where the pattern “Description: “ has been removed from the beginning of column storeDescription in DataFrame storesDF?
A sample of DataFrame storesDF is below:
Image
A.

storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: "))

B.

storesDF.withColumn("storeDescription", col("storeDescription").regexp_replace("^Description: ", ""))

C.

storesDF.withColumn("storeDescription", regexp_extract(col("storeDescription"), "^Description: ", ""))

D.

storesDF.withColumn("storeDescription", regexp_replace("storeDescription", "^Description: ", ""))

E.

storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
A

E.

storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))

DISCUSSION:
Option E is correct. It uses the withColumn function to create a new column named “storeDescription” (replacing the existing one). The regexp_replace function is used correctly here: the first argument is the column to operate on (obtained using col("storeDescription")), the second argument is the regex pattern to match (“^Description: “), and the third argument is the replacement string (here the empty string “”).

Option A is incorrect because it is missing the replacement string argument in regexp_replace.

Option B is incorrect because regexp_replace is a function in pyspark.sql.functions and needs to be called as regexp_replace(col(...), pattern, replacement). It is not a method of the Column object.

Option C is incorrect because it uses regexp_extract which extracts a string that matches the regex instead of replacing it.

Option D is syntactically correct and often works (especially in later Spark versions) because it implicitly converts the column name to a column object. However, Option E is more explicit and compatible, making it the better answer for backward compatibility.
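
A runnable sketch of the correct option is shown below; the SparkSession and the two sample rows standing in for storesDF are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame(
    [(1, "Description: Flagship store"), (2, "Description: Outlet store")],
    ["storeId", "storeDescription"],
)

# Remove the "Description: " prefix; the third argument is the replacement string.
cleanedDF = storesDF.withColumn(
    "storeDescription",
    regexp_replace(col("storeDescription"), "^Description: ", ""),
)
cleanedDF.show(truncate=False)
```
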

8
Q

The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.
A sample of DataFrame storesDF is displayed below:
Image
Code block:

```python
storesDF.na.fill(30000, col("sqft"))
```

A.
The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object.
B.
The na.fill() operation does not work and should be replaced by the dropna() operation.
C.
The argument to the subset parameter of fill() should be the numerical position of the column rather than a Column object.
D.
The na.fill() operation does not work and should be replaced by the nafill() operation.
E.
The na.fill() operation does not work and should be replaced by the fillna() operation.

A

A. The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object.

DISCUSSION:
The correct answer is A. The na.fill() (or fillna()) method in PySpark expects a string or list of strings representing column names for the subset argument, not a Column object created by col(). Options B, D, and E are incorrect because na.fill() is a valid function for filling missing values (and is often an alias for fillna()), and dropna() removes rows with missing values instead of filling them. Option C is incorrect because the numerical position of the column is not the correct way to reference the column.
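
For reference, a minimal sketch of the corrected call; the SparkSession and the sample rows standing in for storesDF are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF with a missing sqft value.
storesDF = spark.createDataFrame([(1, 25000), (2, None)], ["storeId", "sqft"])

# subset takes a column name (or list of names) as strings, not a Column object.
filledDF = storesDF.na.fill(30000, subset="sqft")
# Equivalent: storesDF.fillna(30000, subset=["sqft"])
filledDF.show()
```
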

9
Q

Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?

A.
storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
B.
storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct"))
C.
storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))
D.
storesDF.agg(approx_count_distinct(col("division"), 0.0).alias("divisionDistinct"))
E.
storesDF.agg(approx_count_distinct(col("division"), 0.05).alias("divisionDistinct"))

A

C. storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))

DISCUSSION:
The question asks for the code block that will most quickly return an approximation. The approx_count_distinct function takes an optional second argument that specifies the maximum estimation error allowed. A larger error value allows for a faster, but less accurate, estimation. Option C has the largest error value (0.15), so it will be the fastest.

Options A, B, D, and E all have smaller error values than Option C, or use the default error value, and therefore will take longer to compute. Therefore, they are all incorrect.
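
A small sketch of the fastest variant, with an assumed SparkSession and a toy stand-in for storesDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame(
    [(1, "North"), (2, "South"), (3, "North"), (4, "East")],
    ["storeId", "division"],
)

# The second argument is the maximum allowed relative standard deviation;
# a larger value trades accuracy for speed.
storesDF.agg(
    approx_count_distinct(col("division"), 0.15).alias("divisionDistinct")
).show()
```
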

10
Q

Which of the following code blocks returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory?

Image

A.

```python
(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[1])
  .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[2]))
```

B.

```python
(storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[0])
  .withColumn("storeSizeCategory", col("storeCategory").split("_")[1]))
```

C.

```python
(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
  .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
```

D.

```python
(storesDF.withColumn("storeValueCategory", split("storeCategory", "_")[0])
  .withColumn("storeSizeCategory", split("storeCategory", "_")[1]))
```

E.

```python
(storesDF.withColumn("storeValueCategory", col("storeCategory").split("_")[1])
  .withColumn("storeSizeCategory", col("storeCategory").split("_")[2]))
```

A

C

DISCUSSION:
Option C is the correct answer. It uses the split function from pyspark.sql.functions along with col to correctly split the storeCategory column. The split function returns an array, and [0] and [1] are used to access the first and second elements of the array, respectively, which are assigned to the new columns storeValueCategory and storeSizeCategory.

Option A is incorrect because it uses indices [1] and [2], which select the second and third elements of the split array. Since storeCategory only has two parts separated by an underscore, storeValueCategory would receive the size portion and storeSizeCategory would not be populated (index [2] yields null, or an error when ANSI mode is enabled), so the intended columns are not produced.

Option B is incorrect because it uses the Python string .split() method directly on the column object, which is not the correct way to perform this operation in Spark. It should use the split function from pyspark.sql.functions.

Option D is incorrect because it passes the string literal “storeCategory” to the split function instead of the column itself (using col("storeCategory")). This will result in splitting the string “storeCategory” instead of the values in the storeCategory column.

Option E is incorrect for the same off-by-one reason as option A: indices [1] and [2] do not line up with a two-part value. Additionally, like option B, it calls .split() directly on the Column object, which is not a Column method – the split function from pyspark.sql.functions should be used instead.
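
A runnable sketch of the correct option; the SparkSession and the underscore-delimited sample values standing in for storesDF are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame(
    [(1, "value_large"), (2, "premium_small")],
    ["storeId", "storeCategory"],
)

# split() returns an array column; [0] and [1] pick out the two parts.
splitDF = (
    storesDF
    .withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1])
)
splitDF.show()
```
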

11
Q

The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF. Identify the error.

storesDF.join(employeesDF, "cross")

A.
A cross join is not implemented by the DataFrame.join() operations – the standalone CrossJoin() operation should be used instead.
B.
There is no direct cross join in Spark, but it can be implemented by performing an outer join on all columns of both DataFrames.
C.
A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.
D.
There is no key column specified – the key column “storeId” should be the second argument.
E.
A cross join is not implemented by the DataFrame.join() operations – the standalone join() operation should be used instead.

A

C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.

Explanation:
In the code shown, the string "cross" is passed as the second positional argument of join(), which is the join key (on) parameter, so it does not request a cross join. The dedicated DataFrame.crossJoin() operation returns the Cartesian product of the two DataFrames.

12
Q

The code block shown below should create a single-column DataFrame from Python list years which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.

Code block:
_1_._2_(_3_, _4_)

A.
1. spark
2. createDataFrame
3. years
4. IntegerType

B.
1. DataFrame
2. create
3. [years]
4. IntegerType

C.
1. spark
2. createDataFrame
3. [years]
4. IntegertType

D.
1. spark
2. createDataFrame
3. [years]
4. IntegertType()

E.
1. spark
2. createDataFrame
3. years
4. IntegerType()

A

E.

Explanation:

Option E correctly uses the spark.createDataFrame() method to create a DataFrame from the Python list years, specifying the schema using IntegerType().

  • spark is the SparkSession object.
  • createDataFrame() is the method used to create a DataFrame.
  • years is the Python list containing the data.
  • IntegerType() specifies that the data type of the column should be integer.

Why other options are incorrect:

  • A: IntegerType (without the parentheses) is a class and needs to be instantiated with ().
  • B: DataFrame.create is not the correct method for creating a DataFrame from a Python list.
  • C: Incorrectly uses IntegertType (misspelled) and [years] which would create a DataFrame with a single row containing a list. Also, IntegertType is a class and needs to be instantiated with ().
  • D: Incorrectly uses IntegertType (misspelled) and [years] which would create a DataFrame with a single row containing a list.
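
A minimal runnable sketch of the completed code block, assuming an active SparkSession named spark and a hypothetical years list:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

years = [2019, 2020, 2021, 2022]  # hypothetical input list

# Passing the plain list plus an instantiated IntegerType() yields a
# single-column DataFrame (the column is named "value" by default).
yearsDF = spark.createDataFrame(years, IntegerType())
yearsDF.show()
```
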
13
Q

The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()
A.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
B.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
C.
The storesDF DataFrame has not been checkpointed – it must have a checkpoint in order to be cached.
D.
DataFrames themselves cannot be cached – DataFrame storesDF must be cached as a table.
E.
The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.

A

A.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().

14
Q

Which of the following code blocks returns a DataFrame containing a column dayOfYear, an integer representation of the day of the year from column openDate from DataFrame storesDF?

Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.

A sample of storesDF is displayed below:

Image

A.

```python
(storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp"))
  .withColumn("dayOfYear", dayofyear(col("openTimestamp"))))
```

B.

```python
storesDF.withColumn("dayOfYear", get dayofyear(col("openDate")))
```

C.

```python
storesDF.withColumn("dayOfYear", dayofyear(col("openDate")))
```

D.

```python
(storesDF.withColumn("openDateFormat", col("openDate").cast("Date"))
  .withColumn("dayOfYear", dayofyear(col("openDateFormat"))))
```

E.

```python
storesDF.withColumn("dayOfYear", substr(col("openDate"), 4, 6))
```

A

A. First, the openDate column, which is in UNIX epoch format (seconds since January 1, 1970), needs to be converted to a Timestamp type. This is done using col("openDate").cast("Timestamp"). Then, the dayofyear function can be applied to the Timestamp column to extract the day of the year.

Option B is incorrect because it contains invalid syntax (get dayofyear).
Option C is incorrect because dayofyear expects a Timestamp column, not an integer representing seconds since the epoch.
Option D is incorrect because casting to Date loses the time component, and while dayofyear might work on a Date, it’s not the correct approach given the initial data format (seconds since epoch).
Option E is incorrect because substr extracts a substring, which is not relevant to calculating the day of the year.
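
A runnable sketch of the correct option; the SparkSession and the epoch-second sample values standing in for storesDF are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dayofyear

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF; openDate holds seconds since the UNIX epoch.
storesDF = spark.createDataFrame(
    [(1, 1577836800), (2, 1593561600)],
    ["storeId", "openDate"],
)

# Casting an epoch-seconds integer to a timestamp interprets it as seconds,
# after which dayofyear() can be applied.
resultDF = (
    storesDF
    .withColumn("openTimestamp", col("openDate").cast("Timestamp"))
    .withColumn("dayOfYear", dayofyear(col("openTimestamp")))
)
resultDF.show()
```
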

15
Q

Which of the following is the most granular level of the Spark execution hierarchy?
A. Task
B. Executor
C. Node
D. Job
E. Slot

A

A. Task

DISCUSSION:
The Spark execution hierarchy, from highest to lowest level, is Job -> Stage -> Task. A Job is a high-level set of operations. A Job is broken down into Stages, which are groups of tasks that can be executed together. Stages are broken down into Tasks, which are the smallest unit of work that Spark can execute. An Executor is a process that runs tasks, and a Node is a machine in the cluster. A Slot is a unit of computation on an executor. Therefore, the most granular level is the Task.
Options B, C, D, and E are incorrect because they represent higher levels of abstraction in the Spark execution hierarchy or units within the executors.

16
Q

Which of the following describes the Spark driver?
A. The Spark driver is responsible for performing all execution in all execution modes – it is the entire Spark application.
B. The Spark driver is fault tolerant – if it fails, it will recover the entire Spark application.
C. The Spark driver is the coarsest level of the Spark execution hierarchy – it is synonymous with the Spark application.
D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application.
E. The Spark driver is horizontally scaled to increase overall processing throughput of a Spark application.

A

D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application.

Explanation:
The Spark driver is the process that runs the main function of your Spark application and coordinates the execution of the Spark job. It’s responsible for creating the SparkContext, submitting tasks to the cluster, and monitoring their execution.

  • A is incorrect because the Spark driver coordinates the execution, but it does not perform all execution itself. Executors on worker nodes do much of the processing.
  • B is incorrect because the Spark driver is not inherently fault-tolerant. If the driver fails, the application typically fails.
  • C is incorrect because the Spark driver is a component within the Spark application. It’s not synonymous with the entire application.
  • E is incorrect because the Spark driver is generally not horizontally scaled. The executors are scaled to increase throughput.
17
Q

Which of the following statements about Spark jobs is incorrect?
A. Jobs are broken down into stages.
B. There are multiple tasks within a single job when a DataFrame has more than one partition.
C. Jobs are collections of tasks that are divided up based on when an action is called.
D. There is no way to monitor the progress of a job.
E. Jobs are collections of tasks that are divided based on when language variables are defined.

A

D. There is no way to monitor the progress of a job.

Explanation:

Spark provides a web UI, metrics, and APIs to monitor job progress. Therefore, statement D is incorrect.

  • A is correct: Spark jobs are indeed broken down into stages.
  • B is correct: Each partition typically results in a task, so multiple partitions lead to multiple tasks within a job.
  • C is correct: Actions trigger the execution of jobs, and tasks are divided based on these actions.
  • E is also an incorrect statement: jobs and tasks are not divided based on when language variables are defined.
18
Q

Which of the following operations is most likely to result in a shuffle?
A.
DataFrame.join()
B.
DataFrame.filter()
C.
DataFrame.union()
D.
DataFrame.where()
E.
DataFrame.drop()

A

A. DataFrame.join()

Explanation:

DataFrame.join() is a wide transformation that often requires data shuffling. When joining two DataFrames based on a key, Spark needs to redistribute the data across the cluster to ensure that rows with the same key are located on the same partition. This redistribution process is called shuffling and involves significant data movement across the network.

The other options are less likely to cause a shuffle:

  • DataFrame.filter(), DataFrame.where(): These operations filter rows based on a condition and can be performed within each partition without shuffling data.
  • DataFrame.union(): This operation combines two DataFrames by appending the rows of one to the other. While it might involve some data movement, it doesn’t typically require a full shuffle.
  • DataFrame.drop(): This operation removes columns from a DataFrame and can be performed within each partition.
19
Q

Which of the following is the most complete description of lazy evaluation?
A.
None of these options describe lazy evaluation
B.
A process is lazily evaluated if its execution does not start until it is put into action by some type of trigger
C.
A process is lazily evaluated if its execution does not start until it is forced to display a result to the user
D.
A process is lazily evaluated if its execution does not start until it reaches a specified date and time
E.
A process is lazily evaluated if its execution does not start until it is finished compiling

A

B. A process is lazily evaluated if its execution does not start until it is put into action by some type of trigger.

Explanation: Lazy evaluation means delaying the evaluation of an expression until its value is needed. Option B accurately describes this, as the execution is triggered only when the result is required. Options C, D, and E are incorrect because they describe specific types of triggers (displaying to user, reaching a date/time, finishing compiling) which are not the comprehensive definition of lazy evaluation. Option A is incorrect because option B is a valid description.

20
Q

A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?

A.
Either DataFrame can be broadcasted. Their results will be identical in result and efficiency.
B.
DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
C.
DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
D.
DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
E.
DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of itself.

A

B.
DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.

DISCUSSION:
The correct answer is B. In a broadcast join, the smaller DataFrame is broadcasted to all executors. This avoids shuffling the smaller DataFrame, which is more efficient.

Option A is incorrect because broadcasting the larger DataFrame would be inefficient and could lead to memory issues. Also, the efficiency of the operation would not be identical.

Option C is incorrect because broadcasting the larger DataFrame would be inefficient and could lead to memory issues.

Option D is incorrect because while it’s true DataFrame B should be broadcasted, the primary reason is to avoid shuffling DataFrame B itself, not to eliminate shuffling of DataFrame A. DataFrame A will still be shuffled.

Option E is incorrect because DataFrame A is the larger DataFrame, not the smaller one. Broadcasting a larger DataFrame is generally not a good practice.

21
Q

Which of the following operations can be used to create a DataFrame with a subset of columns from DataFrame storesDF that are specified by name?
A.
storesDF.subset()
B.
storesDF.select()
C.
storesDF.selectColumn()
D.
storesDF.filter()
E.
storesDF.drop()

A

B.
storesDF.select()
The select() operation allows you to choose a subset of columns by specifying their names. Options A and C are not valid DataFrame operations. filter() is used to select rows based on a condition, not columns. While drop() can achieve a similar result to select() by specifying columns to exclude, select() is the more direct method for selecting a subset of columns.

22
Q

The code block shown below contains an error. The code block is intended to return a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction. Identify the error.

storesDF.drop(sqft, customerSatisfaction)

A. The drop() operation only works if one column name is called at a time – there should be two calls in succession like storesDF.drop(“sqft”).drop(“customerSatisfaction”).
B. The drop() operation only works if column names are wrapped inside the col() function like storesDF.drop(col(sqft), col(customerSatisfaction)).
C. There is no drop() operation for storesDF.
D. The sqft and customerSatisfaction column names should be quoted like “sqft” and “customerSatisfaction”.
E. The sqft and customerSatisfaction column names should be subset from the DataFrame storesDF like storesDF.”sqft” and storesDF.”customerSatisfaction”.

A

D. The sqft and customerSatisfaction column names should be quoted like “sqft” and “customerSatisfaction”.

DISCUSSION:
The correct answer is D. In most DataFrame implementations (including Spark’s), when using the drop() function to remove columns by name, the column names must be provided as strings. Therefore, sqft and customerSatisfaction should be enclosed in quotes like "sqft" and "customerSatisfaction".

A is incorrect because while chaining drop() calls is a valid approach, it is not the fundamental error in the original code. The immediate error is the unquoted column names.
B is incorrect because the col() function is not required when simply specifying column names as strings to be dropped.
C is incorrect because the drop() function is a standard DataFrame operation.
E is incorrect because accessing columns using storesDF."sqft" is not the correct syntax for referencing column names for the drop() function.
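
A short sketch of the corrected call, with an assumed SparkSession and a one-row stand-in for storesDF:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame(
    [(1, 25000, 4.5, "North")],
    ["storeId", "sqft", "customerSatisfaction", "division"],
)

# Column names are passed as strings; every other column is kept.
trimmedDF = storesDF.drop("sqft", "customerSatisfaction")
trimmedDF.printSchema()
```
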

23
Q

Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000?

A.
storesDF.filter("sqft" <= 25000)
B.
storesDF.filter(sqft > 25000)
C.
storesDF.where(storesDF[sqft] > 25000)
D.
storesDF.where(sqft > 25000)
E.
storesDF.filter(col("sqft") <= 25000)

A

E. storesDF.filter(col("sqft") <= 25000)

Explanation:

Option E is correct because it uses the filter() method along with the col() function to properly reference the ‘sqft’ column and applies the correct condition (less than or equal to 25000).

  • A: Incorrect. "sqft" <= 25000 compares a Python string literal to an integer instead of building a column expression, so it does not filter on the column (and raises a TypeError).
  • B: Incorrect. sqft is a bare, undefined Python name rather than a column reference, and the condition selects values greater than 25000 instead of less than or equal to it.
  • C: Incorrect. Because sqft is unquoted, storesDF[sqft] looks up an undefined Python variable rather than the column (storesDF["sqft"] would work), and the condition is also reversed. where() is an alias for filter(), but it still requires a valid column expression.
  • D: Incorrect. sqft is again an undefined bare name rather than a column reference, and the condition is reversed.
24
Q

Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?
A.
storesDF.filter(col("sqft") <= 25000 | col("customerSatisfaction") >= 30)
B.
storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
C.
storesDF.filter(sqft <= 25000 or customerSatisfaction >= 30)
D.
storesDF.filter(col(sqft) <= 25000 | col(customerSatisfaction) >= 30)
E.
storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))

A

E. storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))

The correct answer is E because it correctly uses the col() function to reference the column names and uses the bitwise OR operator | to combine the two conditions. The parentheses are also correctly placed to ensure the intended order of operations.

Option A is incorrect because it lacks parentheses around the two conditions; in Python the | operator binds more tightly than the comparison operators, so the expression is not evaluated as intended (in practice it raises an error). Option B is incorrect because it uses the Python or operator instead of the bitwise | operator which is needed for Spark column expressions. Option C is incorrect because it does not use the col() function to reference the columns, and also uses Python’s or. Option D is incorrect because it uses col(sqft) instead of col("sqft").
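
A runnable sketch of the correct option; the SparkSession and the three sample rows standing in for storesDF are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame(
    [(1, 20000, 35), (2, 60000, 10), (3, 30000, 45)],
    ["storeId", "sqft", "customerSatisfaction"],
)

# Each condition is parenthesized and the two are combined with the bitwise | operator.
filteredDF = storesDF.filter(
    (col("sqft") <= 25000) | (col("customerSatisfaction") >= 30)
)
filteredDF.show()
```
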

25
Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column storeId is of the type string?
A.
storesDF.withColumn("storeId, cast(col("storeId"), StringType()))
B.
storesDF.withColumn("storeId, col("storeId").cast(StringType()))
C.
storesDF.withColumn("storeId, cast(storeId).as(StringType)
D.
storesDF.withColumn("storeId, col(storeId).cast(StringType)
E.
storesDF.withColumn("storeId, cast("storeId").as(StringType()))
The correct answer is B.

Option B, `storesDF.withColumn("storeId", col("storeId").cast(StringType()))`, is the closest to the correct syntax for casting the column "storeId" to a StringType. It uses `withColumn` to replace the existing column (or create a new one if it doesn't exist) named "storeId". It then references the existing "storeId" column using `col("storeId")` and chains the `.cast(StringType())` method to perform the type conversion. Although the option as written contains a typo (the closing quotation mark after storeId is missing), it is the closest to the correct code.

Option A is incorrect because `cast()` is not a standalone function in pyspark.sql.functions – it is a method of the Column object, so it cannot be called with the column as its first argument.

Option C is incorrect because it doesn't use `col()` to refer to the column and has syntax errors with `as(StringType)`.

Option D is incorrect because it's missing quotes around `storeId` in the `col()` function.

Option E is incorrect because it attempts to cast the string literal `"storeId"` rather than the column.
26
Which of the following code blocks returns a new DataFrame with a new column `employeesPerSqft` that is the quotient of column `numberOfEmployees` and column `sqft`, both of which are from DataFrame `storesDF`? Note that column `employeesPerSqft` is not in the original DataFrame `storesDF`.

A.

```python
storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
```

B.

```python
storesDF.withColumn("employeesPerSqft", "numberOfEmployees" / "sqft")
```

C.

```python
storesDF.select("employeesPerSqft", "numberOfEmployees" / "sqft")
```

D.

```python
storesDF.select("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
```

E.

```python
storesDF.withColumn(col("employeesPerSqft"), col("numberOfEmployees") / col("sqft"))
```
A.

```python
storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
```

DISCUSSION:
Option A is correct because `withColumn` is the correct method to create a new column in a Spark DataFrame. It takes the new column name as its first argument and the expression to compute the column as its second argument. Using `col("numberOfEmployees") / col("sqft")` correctly refers to the DataFrame columns `numberOfEmployees` and `sqft` and divides them.

Option B is incorrect because it attempts to perform the division using the string literals `"numberOfEmployees"` and `"sqft"` instead of referencing the actual DataFrame columns using `col()`.

Option C is incorrect because `select` is used to select existing columns, not to create new ones. Also, `employeesPerSqft` does not exist, so it cannot be selected.

Option D is incorrect because, similar to option C, it uses `select` instead of `withColumn` to create a new column. While it correctly refers to the existing columns using `col()`, it will still fail because `employeesPerSqft` does not yet exist.

Option E is incorrect because the first argument of `withColumn` should be the name of the new column as a string literal (e.g., `"employeesPerSqft"`), not a column object created using `col()`.
27
The code block shown below should return a new DataFrame from DataFrame storesDF where column modality is the constant string "PHYSICAL". Assume DataFrame storesDF is the only defined language variable. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.

Code block:
storesDF._1_(_2_, _3_(_4_))

A.
1. withColumn
2. "modality"
3. col
4. "PHYSICAL"

B.
1. withColumn
2. "modality"
3. lit
4. PHYSICAL

C.
1. withColumn
2. "modality"
3. lit
4. "PHYSICAL"

D.
1. withColumn
2. "modality"
3. SrtringType
4. "PHYSICAL"

E.
1. newColumn
2. modality
3. SrtringType
4. PHYSICAL
C.
1. withColumn
2. "modality"
3. lit
4. "PHYSICAL"

Explanation:
The `withColumn` function is used to add a new column to a Spark DataFrame. The first argument is the name of the new column ("modality" in this case), and the second argument is the column expression that defines the values for the new column. To create a column with a constant literal value, the `lit` function should be used. The argument to `lit` is the literal value you want to assign to each row in the new column. Since the desired value is the string "PHYSICAL", it needs to be enclosed in quotes.

  • Why C is correct: This option correctly uses `withColumn` to add the new column "modality", uses `lit` to specify a literal value, and correctly encloses the string "PHYSICAL" in quotes.
  • Why A is incorrect: Option A uses `col("PHYSICAL")`. `col` is used to reference an existing column, not to create a constant value.
  • Why B is incorrect: Option B uses `PHYSICAL` without quotes, which would be interpreted as an undefined variable name.
  • Why D is incorrect: Option D uses `StringType`, which is a data type; the constant value must be supplied through the `lit` function instead.
  • Why E is incorrect: Option E uses `newColumn`, which is not a Spark function, and passes `modality` without quotes, which would be interpreted as an undefined variable name. Additionally, `StringType` is a data type; the constant value must be supplied through the `lit` function instead.
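
A minimal runnable sketch of the completed code block; the SparkSession and the tiny stand-in for storesDF are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame([(1,), (2,)], ["storeId"])

# lit() wraps the constant so every row receives the same string value.
storesDF.withColumn("modality", lit("PHYSICAL")).show()
```
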
28
Which of the following code blocks returns a new DataFrame where column division from DataFrame storesDF has been replaced and renamed to column state and column managerName from DataFrame storesDF has been replaced and renamed to column managerFullName?

A.

```python
(storesDF.withColumnRenamed(["division", "state"], ["managerName", "managerFullName"])
```

B.

```python
(storesDF.withColumn("state", col("division"))
  .withColumn("managerFullName", col("managerName")))
```

C.

```python
(storesDF.withColumn("state", "division")
  .withColumn("managerFullName", "managerName"))
```

D.

```python
(storesDF.withColumnRenamed("state", "division")
  .withColumnRenamed("managerFullName", "managerName"))
```

E.

```python
(storesDF.withColumnRenamed("division", "state")
  .withColumnRenamed("managerName", "managerFullName"))
```
E.

```python
(storesDF.withColumnRenamed("division", "state")
  .withColumnRenamed("managerName", "managerFullName"))
```

DISCUSSION:
Option E is correct. The `withColumnRenamed` function is used to rename a column in a DataFrame. The question requires renaming the "division" column to "state" and the "managerName" column to "managerFullName". Option E achieves this by first renaming "division" to "state" and then renaming "managerName" to "managerFullName".

Option A is incorrect because it uses `withColumnRenamed` with lists, which is not the correct way to rename multiple columns. It also incorrectly maps "division" to "managerName" and "state" to "managerFullName".

Option B is incorrect because it uses `withColumn`, which creates new columns rather than renaming existing ones, so the original "division" and "managerName" columns remain in the result alongside the copies.

Option C is incorrect because it uses `withColumn` with the plain strings "division" and "managerName" where column expressions are required, so it neither renames the existing columns nor copies their contents.

Option D is incorrect because it renames "state" to "division" and "managerFullName" to "managerName", which is the opposite of what the question asks for.
29
Which of the following code blocks returns a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF?

A sample of storesDF is displayed below:
[Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image3.png)

A.

```python
storesDF.withColumn("productCategories", explode(col("productCategories")))
```

B.

```python
storesDF.withColumn("productCategories", split(col("productCategories")))
```

C.

```python
storesDF.withColumn("productCategories", col("productCategories").explode())
```

D.

```python
storesDF.withColumn("productCategories", col("productCategories").split())
```

E.

```python
storesDF.withColumn("productCategories", explode("productCategories"))
```
A.

```python
storesDF.withColumn("productCategories", explode(col("productCategories")))
```

DISCUSSION:
The question asks for code that transforms a DataFrame column containing arrays into multiple rows, one for each element of the array. The `explode` function is designed specifically for this purpose. Option A correctly uses `explode` with `col("productCategories")` to specify the column to be exploded within the `withColumn` transformation.

Option E is also potentially correct, since `explode` accepts a column name given as a string. However, using `col()` is generally considered better practice for compatibility and clarity, so A is slightly preferred.

Options B and D use the `split` function, which turns a string into an array of substrings based on a delimiter pattern; it does not create new rows (and the calls shown are missing the required pattern argument).

Option C attempts to call `explode` as a method on a Column object, which is not the correct syntax. `explode` is a function in `pyspark.sql.functions`.
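
A runnable sketch of the correct option; the SparkSession and the array-valued sample rows standing in for storesDF are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for storesDF; productCategories is an array column.
storesDF = spark.createDataFrame(
    [(1, ["grocery", "pharmacy"]), (2, ["electronics"])],
    ["storeId", "productCategories"],
)

# explode() emits one output row per array element.
explodedDF = storesDF.withColumn("productCategories", explode(col("productCategories")))
explodedDF.show()
```
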
30
The code block shown below contains an error. The code block is intended to return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Identify the error.

```
storesDF.agg(mean("sqft").alias("sqftMean"))
```

A. The argument to the mean() operation should be a Column object rather than a string column name.
B. The argument to the mean() operation should not be quoted.
C. The mean() operation is not a standalone function – it’s a method of the Column object.
D. The agg() operation is not appropriate here – the withColumn() operation should be used instead.
E. The only way to compute a mean of a column is with the mean() method from a DataFrame.
A. The argument to the mean() operation should be a Column object rather than a string column name. DISCUSSION: The `mean()` function, when used with `agg()`, expects a Column object as its argument. Passing a string directly as the column name is not the correct way to specify the column for which the mean is to be calculated. Instead, you should use `col("sqft")` to create a Column object. Option B is incorrect because while it might appear syntactically permissible in some contexts, it is not the standard or recommended way to pass the column name to `mean()`. Option C is incorrect because `mean()` is a standalone function when used with `agg()`. Option D is incorrect because `agg()` is the appropriate function to use when calculating aggregate statistics like the mean. `withColumn()` is used to add a new column, not to compute aggregate values. Option E is incorrect because the `mean()` function can be used as a method of the DataFrame or within `agg()` as shown in the question.
31
Which of the following code blocks returns a collection of summary statistics for all columns in DataFrame storesDF? A. `storesDF.summary("mean")` B. `storesDF.describe(all = True)` C. `storesDF.describe("all")` D. `storesDF.summary("all")` E. `storesDF.describe()`
E.
32
Which of the following code blocks returns a 15 percent sample of rows from DataFrame storesDF without replacement? A. storesDF.sample(fraction = 0.10) B. storesDF.sampleBy(fraction = 0.15) C. storesDF.sample(True, fraction = 0.10) D. storesDF.sample() E. storesDF.sample(fraction = 0.15)
E
33
Which of the following code blocks returns all the rows from DataFrame storesDF? A. storesDF.head() B. storesDF.collect() C. storesDF.count() D. storesDF.take() E. storesDF.show()
B
34
Which of the following code blocks applies the function `assessPerformance()` to each row of DataFrame `storesDF`? A. `[assessPerformance(row) for row in storesDF.take(3)]` B. `[assessPerformance() for row in storesDF]` C. `storesDF.collect().apply(lambda: assessPerformance)` D. `[assessPerformance(row) for row in storesDF.collect()]` E. `[assessPerformance(row) for row in storesDF]`
D. `[assessPerformance(row) for row in storesDF.collect()]` DISCUSSION: Option D is correct because it first collects all rows of the `storesDF` DataFrame into a list using `.collect()`. Then, it uses a list comprehension to iterate through each `row` in the collected list and applies the `assessPerformance()` function to it. Option A is incorrect because `.take(3)` only processes the first three rows. Option B is incorrect because it doesn't pass the `row` to the `assessPerformance()` function. Option C is incorrect because `.apply()` is not the correct method to apply a function to each row after using `.collect()`. Also, the `lambda` function is not correctly implemented. Option E is incorrect because it attempts to iterate over the DataFrame directly, which is not the intended way to process rows in Spark DataFrames. The `.collect()` method is needed to bring the data to the driver node as a list. While it might work, it's generally less efficient than using Spark's built-in transformations when dealing with large datasets.
35
The code block shown below contains an error. The code block is intended to print the schema of DataFrame storesDF. Identify the error. Code block: storesDF.printSchema A. There is no printSchema member of DataFrame – schema and the print() function should be used instead. B. The entire line needs to be a string – it should be wrapped by str(). C. There is no printSchema member of DataFrame – the getSchema() operation should be used instead. D. There is no printSchema member of DataFrame – the schema() operation should be used instead. E. The printSchema member of DataFrame is an operation and needs to be followed by parentheses.
E. The printSchema member of DataFrame is an operation and needs to be followed by parentheses. DISCUSSION: The `printSchema` member is a method and needs to be called with parentheses: `storesDF.printSchema()`. Options A, C, and D are incorrect because `printSchema` is a valid method. Option B is incorrect because the line of code doesn't need to be a string.
36
The code block shown below should create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function assessPerformance() and apply it to column customerSatisfaction in table stores. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: spark._1_._2_(_3_, _4_) spark.sql("SELECT customerSatisfaction, _5_(customerSatisfaction) AS result FROM stores") A. 1. udf 2. register 3. "ASSESS_PERFORMANCE" 4. assessPerformance 5. ASSESS_PERFORMANCE B. 1. udf 2. register 3. assessPerformance 4. "ASSESS_PERFORMANCE" 5. "ASSESS_PERFORMANCE" C. 1. udf 2. register 3."ASSESS_PERFORMANCE" 4. assessPerformance 5. "ASSESS_PERFORMANCE" D. 1. register 2. udf 3. "ASSESS_PERFORMANCE" 4. assessPerformance 5. "ASSESS_PERFORMANCE" E. 1. udf 2. register 3. ASSESS_PERFORMANCE 4. assessPerformance 5. ASSESS_PERFORMANCE
A. 1. udf 2. register 3. "ASSESS_PERFORMANCE" 4. assessPerformance 5. ASSESS_PERFORMANCE DISCUSSION: The correct answer is A. The `udf.register` method is used to register a UDF. The first argument is the name of the UDF (a string), and the second argument is the Python function to use. When calling the UDF in SQL, you use the name that was registered. Option B is incorrect because it swaps the name and the function in the register call, and uses quotes incorrectly in the SQL call. Option C is the closest incorrect answer, but it is incorrect because you do not put quotes around the UDF name when calling it in SQL. Option D is incorrect because it reverses the order of udf and register. Option E is incorrect because the registered UDF name needs to be a string literal.
37
The code block shown below contains an error. The code block is intended to create a Python UDF `assessPerformanceUDF()` using the integer-returning Python function `assessPerformance()` and apply it to column `customerSatisfaction` in DataFrame `storesDF`. Identify the error.

```
assessPerformanceUDF = udf(assessPerformance)
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
```

A. The `assessPerformance()` operation is not properly registered as a UDF.
B. The `withColumn()` operation is not appropriate here – UDFs should be applied by iterating over rows instead.
C. UDFs can only be applied via SQL and not through the DataFrame API.
D. The return type of the `assessPerformanceUDF()` is not specified in the `udf()` operation.
E. The `assessPerformance()` operation should be used on column `customerSatisfaction` rather than the `assessPerformanceUDF()` operation.
D. The return type of the `assessPerformanceUDF()` is not specified in the `udf()` operation. DISCUSSION: The PySpark `udf()` function requires the return type to be specified, otherwise it defaults to `StringType()`. Since the problem states that the function `assessPerformance()` returns an integer, the `udf()` function needs to be explicitly told to expect an integer return type. Therefore, option D is correct. Option A is incorrect because the `assessPerformance()` function *is* being passed to the `udf()` function. Option B is incorrect because `withColumn()` is the correct method for applying a UDF to a DataFrame column. Option C is incorrect because UDFs *can* be applied through the DataFrame API. Option E is incorrect because you must first create the UDF with `udf()` and then apply it to the column, not apply the original Python function directly.
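
A corrected sketch follows; the scoring logic, SparkSession, and sample rows are hypothetical, but it shows the return type being passed to udf():

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

def assessPerformance(satisfaction):
    # Hypothetical integer-returning logic, for illustration only.
    return 1 if satisfaction is not None and satisfaction >= 30 else 0

# Declare the return type explicitly; without it, udf() defaults to StringType.
assessPerformanceUDF = udf(assessPerformance, IntegerType())

# Hypothetical stand-in for storesDF.
storesDF = spark.createDataFrame([(1, 35), (2, 10)], ["storeId", "customerSatisfaction"])
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction"))).show()
```
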
38
The code block shown below contains an error. The code block is intended to use SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF. Identify the error.

```python
storesDF.createOrReplaceTempView("stores")
storesDF.sql("SELECT storeId, managerName FROM stores")
```

A. The createOrReplaceTempView() operation does not make a DataFrame accessible via SQL.
B. The sql() operation should be accessed via the spark variable rather than DataFrame storesDF.
C. There is no sql() operation in DataFrame storesDF – the query() operation should be used instead.
D. This cannot be accomplished using SQL – the DataFrame API should be used instead.
E. The createOrReplaceTempView() operation should be accessed via the spark variable rather than DataFrame storesDF.
B. The sql() operation should be accessed via the spark variable rather than DataFrame storesDF. DISCUSSION: Option B is correct because the `sql()` function is a method of the SparkSession object (typically named `spark`), not a method of the DataFrame object. Therefore, to execute a SQL statement against a temporary view, you need to call `spark.sql("SELECT storeId, managerName FROM stores")`. Option A is incorrect because `createOrReplaceTempView()` does indeed make a DataFrame accessible via SQL. Option C is incorrect because there is no `query()` operation directly available on a DataFrame for executing SQL-like queries; the correct approach is to use `spark.sql()`. Option D is incorrect because using SQL to query a DataFrame's temporary view is a valid approach in Spark. Option E is incorrect because `createOrReplaceTempView()` is correctly called on the DataFrame.
39
Which of the following operations can be used to return a new DataFrame from DataFrame `storesDF` without inducing a shuffle? A. `storesDF.intersect()` B. `storesDF.repartition(1)` C. `storesDF.union()` D. `storesDF.coalesce(1)` E. `storesDF.rdd.getNumPartitions()`
D. `storesDF.coalesce(1)` DISCUSSION: The question asks for an operation that returns a new DataFrame without inducing a shuffle. * **A. `storesDF.intersect()`**: This operation finds the common rows between two DataFrames and requires a shuffle to compare the data across partitions. Thus, it's incorrect. * **B. `storesDF.repartition(1)`**: This operation repartitions the DataFrame into a single partition, which requires a full shuffle of the data. Thus, it's incorrect. * **C. `storesDF.union()`**: While `union` itself is a narrow transformation and doesn't necessarily *always* induce a shuffle, it requires another DataFrame as an argument to union with. The question implies using only `storesDF`. Furthermore, the documentation indicates that the behavior of `union` regarding shuffling depends on the specific implementation and data characteristics. * **D. `storesDF.coalesce(1)`**: This operation aims to reduce the number of partitions in a DataFrame. When reducing the number of partitions, `coalesce` avoids a full shuffle if possible. In this case, reducing to a single partition can be done without a shuffle, making it the most suitable answer. * **E. `storesDF.rdd.getNumPartitions()`**: This operation simply returns the number of partitions in the RDD and does not return a new DataFrame or induce a shuffle. Thus, it's incorrect. Therefore, `coalesce(1)` is the best answer because it can reduce the number of partitions to 1 without necessarily inducing a full shuffle, unlike `repartition(1)` or `intersect()`.
40
The code block shown below contains an error. The code block is intended to return a new 12-partition DataFrame from the 8-partition DataFrame storesDF by inducing a shuffle. Identify the error. `storesDF.coalesce(12)` A. The `coalesce()` operation cannot guarantee the number of target partitions – the `repartition()` operation should be used instead. B. The `coalesce()` operation does not induce a shuffle and cannot increase the number of partitions – the `repartition()` operation should be used instead. C. The `coalesce()` operation will only work if the DataFrame has been cached to memory – the `repartition()` operation should be used instead. D. The `coalesce()` operation requires a column by which to partition rather than a number of partitions – the `repartition()` operation should be used instead. E. The number of resulting partitions, 12, is not achievable for an 8-partition DataFrame.
B. The `coalesce()` operation does not induce a shuffle and cannot increase the number of partitions – the `repartition()` operation should be used instead.
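
The difference is easy to see in a small sketch; the SparkSession and the synthetic 8-partition DataFrame are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic 8-partition stand-in for storesDF.
storesDF = spark.range(1000).repartition(8)

# repartition() performs a shuffle and can increase the partition count.
print(storesDF.repartition(12).rdd.getNumPartitions())  # 12

# coalesce() only merges existing partitions; asking for more has no effect.
print(storesDF.coalesce(12).rdd.getNumPartitions())     # stays at 8
```
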
41
Which of the following Spark properties is used to configure whether DataFrame partitions that do not meet a minimum size threshold are automatically coalesced into larger partitions during a shuffle? A. spark.sql.shuffle.partitions B. spark.sql.autoBroadcastJoinThreshold C. spark.sql.adaptive.skewJoin.enabled D. spark.sql.inMemoryColumnarStorage.batchSize E. spark.sql.adaptive.coalescePartitions.enabled
E. spark.sql.adaptive.coalescePartitions.enabled
42
Which of the following operations can perform an outer join on two DataFrames? A. DataFrame.crossJoin() B. Standalone join() function C. DataFrame.outerJoin() D. DataFrame.join() E. DataFrame.merge()
D. DataFrame.join()
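A minimal sketch of an outer join with `DataFrame.join()`, assuming both DataFrames share a `storeId` column:

```python
# how="outer" keeps unmatched rows from both sides, filling gaps with null
outerJoinedDF = storesDF.join(employeesDF, on="storeId", how="outer")
```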
43
The below code block contains a logical error resulting in inefficiency. The code block is intended to efficiently perform a broadcast join of DataFrame `storesDF` and the much larger DataFrame `employeesDF` using key column `storeId`. Identify the logical error. ``` storesDF.join(broadcast(employeesDF), "storeId") ``` A. The larger DataFrame `employeesDF` is being broadcasted rather than the smaller DataFrame `storesDF`. B. There is never a need to call the `broadcast()` operation in Apache Spark 3. C. The entire line of code should be wrapped in `broadcast()` rather than just DataFrame `employeesDF`. D. The `broadcast()` operation will only perform a broadcast join if the Spark property `spark.sql.autoBroadcastJoinThreshold` is manually set. E. Only one of the DataFrames is being broadcasted rather than both of the DataFrames.
A. The larger DataFrame `employeesDF` is being broadcasted rather than the smaller DataFrame `storesDF`. DISCUSSION: The purpose of a broadcast join is to optimize performance by broadcasting the smaller DataFrame to all worker nodes. This avoids shuffling a large DataFrame across the network. In this case, `employeesDF` is the larger DataFrame and `storesDF` is the smaller DataFrame. Broadcasting the larger DataFrame defeats the purpose of the broadcast join. Therefore, option A is correct. Option B is incorrect because, while Spark can automatically perform broadcast joins under certain conditions, explicitly using `broadcast()` is still useful for ensuring that a specific DataFrame is broadcasted, especially when automatic broadcasting isn't triggered. Option C is incorrect. The `broadcast()` function is intended to wrap the smaller DataFrame that you want to broadcast, not the entire join operation. Option D is incorrect. While `spark.sql.autoBroadcastJoinThreshold` controls the size threshold for automatic broadcast joins, explicitly using `broadcast()` overrides this setting for the specified DataFrame. Option E is incorrect. Broadcast join involves broadcasting only the smaller table, not both.
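The corrected call, sketched in PySpark, wraps the smaller DataFrame in `broadcast()`:

```python
from pyspark.sql.functions import broadcast

# Broadcast the small storesDF so the large employeesDF is never shuffled
joinedDF = employeesDF.join(broadcast(storesDF), "storeId")
```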
44
The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF. Identify the error. ``` storesDF.unionByName(acquiredStoresDF) ``` A. There is no DataFrame.unionByName() operation – the concat() operation should be used instead with both DataFrames as arguments. B. There are no key columns specified – similar column names should be the second argument. C. The DataFrame.unionByName() operation does not union DataFrames based on column position – it uses column name instead. D. The unionByName() operation is a standalone operation rather than a method of DataFrame – it should have both DataFrames as arguments. E. There are no column positions specified – the desired column positions should be the second argument.
C. The DataFrame.unionByName() operation does not union DataFrames based on column position – it uses column name instead. `unionByName()` performs a union based on column names, not positions. The question specifies that a position-wise union is desired, making `unionByName()` the wrong choice. Options A and D are incorrect because `unionByName()` is a valid method of the DataFrame. Options B and E are incorrect because `unionByName` does not accept column positions or similar column names as arguments.
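A short sketch contrasting the two operations (assuming the DataFrames have compatible schemas):

```python
# Position-wise union: column i of one DataFrame lines up with column i of the other
combinedDF = storesDF.union(acquiredStoresDF)

# Name-wise union: columns are matched by name, regardless of position
combinedByNameDF = storesDF.unionByName(acquiredStoresDF)
```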
45
Which of the following code blocks writes DataFrame storesDF to file path filePath as JSON? A. storesDF.write.option("json").path(filePath) B. storesDF.write.json(filePath) C. storesDF.write.path(filePath) D. storesDF.write(filePath) E. storesDF.write().json(filePath)
B
46
The code block shown below contains an error. The code block is intended to read a parquet file at the file path filePath into a DataFrame. Identify the error. Code block: ``` spark.read.load(filePath, source = "parquet") ``` A. There is no source parameter to the load() operation – the schema parameter should be used instead. B. There is no load() operation – it should be parquet() instead. C. The spark.read operation should be followed by parentheses to return a DataFrameReader object. D. The filePath argument to the load() operation should be quoted. E. There is no source parameter to the load() operation – it can be removed.
E. There is no source parameter to the load() operation – it can be removed. DISCUSSION: The error is the `source` parameter itself: `load()` accepts the data source name through the `format` parameter (or via `DataFrameReader.format()`), not through `source`. Removing the invalid argument also lets the code run, because parquet is the default data source, which is why option E identifies the error. A is incorrect because the `schema` parameter describes the column structure of the data; it is not how the data source format is specified. B is incorrect because the `load()` operation does exist. C is incorrect because `spark.read` correctly returns a `DataFrameReader` object without parentheses. D is incorrect because `filePath` is a variable holding the path, so it does not need to be quoted.
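For illustration, a few equivalent ways to read the parquet file correctly (assuming `filePath` points at parquet data):

```python
# Specify the data source explicitly with the format parameter...
df1 = spark.read.load(filePath, format="parquet")

# ...rely on parquet being the default data source...
df2 = spark.read.load(filePath)

# ...or use the format-specific shortcut
df3 = spark.read.parquet(filePath)
```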
47
Which of the following DataFrame operations is classified as a wide transformation? A. DataFrame.filter() B. DataFrame.join() C. DataFrame.select() D. DataFrame.drop() E. DataFrame.union()
B. DataFrame.join() DataFrame.join() is a wide transformation because it requires shuffling data across the network to combine data from different partitions based on a common key. This contrasts with narrow transformations like filter, select, drop, and union, which operate on individual partitions without requiring data redistribution.
48
Which of the following operations can be used to create a new DataFrame that has 12 partitions from an original DataFrame `df` that has 8 partitions? A. `df.repartition(12)` B. `df.cache()` C. `df.partitionBy(1.5)` D. `df.coalesce(12)` E. `df.partitionBy(12)`
A. `df.repartition(12)` DISCUSSION: The correct answer is A. `df.repartition(12)` can increase or decrease the number of partitions. In this case, it increases the number of partitions from 8 to 12. Option B, `df.cache()`, is incorrect because it caches the DataFrame but doesn't change the number of partitions. Option C, `df.partitionBy(1.5)`, is incorrect because `partitionBy` requires column names, not a numerical value, as its argument. Option D, `df.coalesce(12)`, is incorrect because `DataFrame.coalesce()` can only reduce the number of partitions; unlike `RDD.coalesce()`, it has no shuffle flag, so it cannot raise the partition count above the current 8. `repartition()` is the appropriate operation to *increase* the number of partitions. Option E, `df.partitionBy(12)`, is incorrect because `partitionBy` requires column names, not an integer representing the number of partitions. It partitions by the values in the specified columns.
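A small sketch of the difference, assuming `df` currently has 8 partitions:

```python
# repartition() shuffles and can move to any partition count
print(df.repartition(12).rdd.getNumPartitions())  # 12

# DataFrame.coalesce() can only lower the count, so this stays at 8
print(df.coalesce(12).rdd.getNumPartitions())     # 8
```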
49
Which of the following DataFrame operations is classified as an action? A. DataFrame.drop() B. DataFrame.coalesce() C. DataFrame.take() D. DataFrame.join() E. DataFrame.filter()
C. DataFrame.take()
50
The code block shown below contains an error. The code block is intended to return a DataFrame containing a column openDateString, a string representation of Java’s SimpleDateFormat. Identify the error. Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970. An example of Java’s SimpleDateFormat is "Sunday, Dec 4, 2008 1:05 PM". A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image6.png) Code block: ```python storesDF.withColumn("openDateString", from_unixtime(col("openDate"), "EEE, MMM d, yyyy h:mm a", TimestampType())) ``` A. The from_unixtime() operation only accepts two parameters – the TimestampType() arguments not necessary. B. The from_unixtime() operation only works if column openDate is of type long rather than integer – column openDate must first be converted. C. The second argument to from_unixtime() is not correct – it should be a variant of TimestampType() rather than a string. D. The from_unixtime() operation automatically places the input column in java’s SimpleDateFormat – there is no need for a second or third argument. E. The column openDate must first be converted to a timestamp, and then the Date() function can be used to reformat to java’s SimpleDateFormat.
A. The from_unixtime() operation only accepts two parameters – the TimestampType() arguments not necessary. DISCUSSION: The `from_unixtime()` function in Spark SQL's `pyspark.sql.functions` module accepts a timestamp column and an optional format string. The `TimestampType()` argument in the provided code block is unnecessary and causes an error, as `from_unixtime` only takes the column to convert and the format. Therefore, option A is the correct answer. Option B is incorrect because `from_unixtime()` can work with integer types representing Unix timestamps. Option C is incorrect because the second argument *is* a string. Option D is incorrect because `from_unixtime()` *requires* a format string if you don't want the default. Option E suggests an alternative approach but doesn't identify the error in the given code.
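The corrected call, sketched in PySpark, simply drops the third argument:

```python
from pyspark.sql.functions import col, from_unixtime

# from_unixtime(column, format) takes at most two arguments
withDateDF = storesDF.withColumn(
    "openDateString",
    from_unixtime(col("openDate"), "EEE, MMM d, yyyy h:mm a"),
)
```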
51
The code block shown below contains an error. The code block intended to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId. Identify the error. `StoresDF.join(employeesDF, "inner", "storeID")` A. The key column storeID needs to be wrapped in the col() operation. B. The key column storeID needs to be in a list like `["storeID"]`. C. The key column storeID needs to be specified in an expression of both DataFrame columns like `storesDF.storeId == employeesDF.storeId`. D. There is no DataFrame.join() operation – DataFrame.merge() should be used instead. E. The column key is the second parameter to join() and the type of join in the third parameter to join() – the second and third arguments should be switched.
E. The column key is the second parameter to join() and the type of join in the third parameter to join() – the second and third arguments should be switched. DISCUSSION: The `join()` method in Spark DataFrame expects the join condition (column name) as the second argument and the join type as the third argument. The given code has these arguments in the wrong order. Options A, B, and C suggest alternative ways to specify the join column, but the fundamental error is the incorrect order of arguments. Option D is incorrect because `DataFrame.join()` is a valid operation.
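With the arguments in the documented order, the corrected call looks like this (sketch):

```python
# join(other, on, how): key column second, join type third
joinedDF = storesDF.join(employeesDF, "storeId", "inner")
```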
52
Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns? A. `on = [a.column1 == b.column1, a.column2 == b.column2]` B. `on = [col("column1"), col("column2")]` C. `on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]` D. All of these options can be used to perform an inner join with two key columns. E. `on = ["column1", "column2"]`
B DISCUSSION: Option B is the correct answer. When using `col("column1")` and `col("column2")` without specifying the DataFrame alias, Spark becomes ambiguous as to which DataFrame the columns belong. This will throw an `AnalysisException` because it doesn't know if `column1` is from DataFrame `a` or `b`. Options A, C, and E all provide enough information for Spark to determine which columns to join on. * Option A uses the DataFrame aliases directly (a.column1 == b.column1). * Option C explicitly specifies the DataFrame aliases within the `col()` function (col("a.column1") == col("b.column1")). * Option E assumes that you join columns with the same name from both tables, which is a valid approach when the column names are unambiguous.
53
In what order should the below lines of code be run in order to read a JSON file at the file path `filePath` into a DataFrame with the specified schema `schema`? Lines of code: 1. `.json(filePath, schema = schema)` 2. `storesDF` 3. `spark \` 4. `.read() \` 5. `.read \` 6. `.json(filePath, format = schema)` A. 3, 5, 6 B. 2, 4, 1 C. 3, 5, 1 D. 2, 5, 1 E. 3, 4, 1
C. The correct order is 3, 5, 1. The correct syntax to read a JSON file into a DataFrame with a specified schema is `spark.read.json(filePath, schema=schema)`. Thus, you start with `spark` (line 3), then access the DataFrameReader (line 5), and then specify that it is a JSON file with a specified schema (line 1). Option A is incorrect because line 6 uses `format = schema`, which is not a valid parameter for the `.json()` method. Options B and D are incorrect because they start with `storesDF` (line 2), which is not the correct starting point for this operation. Option E is incorrect because it uses `.read()` (line 4) instead of `.read` (line 5); in PySpark, `read` is a property of the SparkSession rather than a method, so calling it with parentheses raises an error.
54
The default value of spark.sql.shuffle.partitions is 200. Which of the following describes what that means? A. By default, all DataFrames in Spark will be split to perfectly fill the memory of 200 executors. B. By default, new DataFrames created by Spark will be split to perfectly fill the memory of 200 executors. C. By default, Spark will only read the first 200 partitions of DataFrames to improve speed. D. By default, all DataFrames in Spark, including existing DataFrames, will be split into 200 unique segments for parallelization. E. By default, DataFrames will be split into 200 unique partitions when data is being shuffled.
E. By default, DataFrames will be split into 200 unique partitions when data is being shuffled. DISCUSSION: The correct answer is E. The `spark.sql.shuffle.partitions` parameter controls the number of partitions used when shuffling data, which occurs during operations like joins or aggregations. A value of 200 means that by default, data will be divided into 200 partitions during shuffle operations. Option A is incorrect because the number of partitions is not directly tied to the memory of the executors. Options B and D are incorrect because the setting only applies during shuffles, not to all DataFrames. Option C is incorrect because the parameter does not limit the number of partitions read.
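A quick sketch of reading and overriding the setting at runtime (the value 64 is purely illustrative):

```python
# Inspect the current shuffle partition count, then override it
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" by default
spark.conf.set("spark.sql.shuffle.partitions", "64")   # illustrative value
```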
55
Which of the following object types cannot be contained within a column of a Spark DataFrame? A. DataFrame B. String C. Array D. null E. Vector
A
56
Which of the following Spark properties is used to configure whether skewed partitions are automatically detected and subdivided into smaller partitions when joining two DataFrames together? A. spark.sql.adaptive.skewedJoin.enabled B. spark.sql.adaptive.coalescePartitions.enable C. spark.sql.adaptive.skewHints.enabled D. spark.sql.shuffle.partitions E. spark.sql.shuffle.skewHints.enabled
A. spark.sql.adaptive.skewedJoin.enabled **Explanation:** * **A. spark.sql.adaptive.skewedJoin.enabled:** This property controls whether Spark automatically detects and handles skewed partitions during joins by splitting them into smaller partitions. While the property name in the question contains a typo ("skewedJoin" instead of "skewJoin"), it's the closest and most relevant option. * **B. spark.sql.adaptive.coalescePartitions.enable:** This property enables or disables the coalescing of partitions after a shuffle, which is a different optimization technique. * **C. spark.sql.adaptive.skewHints.enabled:** This is not a valid Spark property and therefore incorrect. Also, "skewHints" are not the same as automated skew handling. * **D. spark.sql.shuffle.partitions:** This property sets the default number of partitions to use when shuffling data, but it doesn't specifically address skew. * **E. spark.sql.shuffle.skewHints.enabled:** This is not a valid Spark property and therefore incorrect. Also, "skewHints" are not the same as automated skew handling.
57
Which of the following code blocks returns the first 3 rows of DataFrame storesDF? A. storesDF.top_n(3) B. storesDF.n(3) C. storesDF.take(3) D. storesDF.head(3) E. storesDF.collect(3)
D
58
The code block shown below should efficiently perform a broadcast join of DataFrame `storesDF` and the much larger DataFrame `employeesDF` using key column `storeId`. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. ``` __1__.join(__2__(__3__), "storeId") ``` A. 1. `employeesDF` 2. `broadcast` 3. `storesDF` B. 1. `broadcast(employeesDF)` 2. `broadcast` 3. `storesDF` C. 1. `broadcast` 2. `employeesDF` 3. `storesDF` D. 1. `storesDF` 2. `broadcast` 3. `employeesDF` E. 1. `broadcast(storesDF)` 2. `broadcast` 3. `employeesDF`
A. 1. `employeesDF` 2. `broadcast` 3. `storesDF` **Explanation:** The goal is to perform a broadcast join where the smaller DataFrame (`storesDF`) is broadcast to all nodes to be joined with the larger DataFrame (`employeesDF`). The correct pattern in Spark is `largerDF.join(broadcast(smallerDF), joinKey)`. * Option A fills the blanks as `employeesDF.join(broadcast(storesDF), "storeId")`, which calls `join` on the larger DataFrame and broadcasts the smaller one, with "storeId" as the key. * Option B fills the blanks as `broadcast(employeesDF).join(broadcast(storesDF), "storeId")`, which broadcasts the much larger DataFrame as well, defeating the purpose of the broadcast join. * Option C fills the blanks as `broadcast.join(employeesDF(storesDF), "storeId")`, which is not valid syntax. * Option D fills the blanks as `storesDF.join(broadcast(employeesDF), "storeId")`, which broadcasts the larger DataFrame. * Option E fills the blanks as `broadcast(storesDF).join(broadcast(employeesDF), "storeId")`, which is syntactically valid but broadcasts both DataFrames, including the much larger one.
59
Which of the following code blocks fails to return a DataFrame reverse sorted alphabetically based on column division? A. `storesDF.orderBy("division", ascending – False)` B. `storesDF.orderBy(["division"], ascending = [0])` C. `storesDF.orderBy(col("division").asc())` D. `storesDF.sort("division", ascending – False)` E. `storesDF.sort(desc("division"))`
C. **Explanation:** The question asks for the code block that *fails* to return a DataFrame reverse sorted alphabetically. Reverse sorted alphabetically is equivalent to descending order. * **A, B, D, and E:** These options all specify a descending order. Options A and D have a typo (`ascending – False`), but are intended to mean `ascending = False`, which sorts in descending order. Option B `ascending = [0]` also sorts descending, where `0` represents `False`. Option E uses `desc("division")` which explicitly sorts the "division" column in descending order. * **C:** This option uses `col("division").asc()`, which explicitly sorts the "division" column in ascending order. Thus, it does not return a reverse sorted (descending) DataFrame. Therefore, option C is the correct answer because it sorts in ascending order, failing to meet the requirement of reverse alphabetical sorting (descending order).
60
Which of the following describes the difference between cluster and client execution modes? A. The cluster execution mode runs the driver on a worker node within a cluster, while the client execution mode runs the driver on the client machine (also known as a gateway machine or edge node). B. The cluster execution mode is run on a local cluster, while the client execution mode is run in the cloud. C. The cluster execution mode distributes executors across worker nodes in a cluster, while the client execution mode runs a Spark job entirely on one client machine. D. The cluster execution mode runs the driver on the cluster machine (also known as a gateway machine or edge node), while the client execution mode runs the driver on a worker node within a cluster. E. The cluster execution mode distributes executors across worker nodes in a cluster, while the client execution mode submits a Spark job from a remote machine to be run on a remote, unconfigurable cluster.
A. The cluster execution mode runs the driver on a worker node within a cluster, while the client execution mode runs the driver on the client machine (also known as a gateway machine or edge node). Explanation: Option A correctly describes the difference between cluster and client execution modes. In cluster mode, the driver process runs on one of the worker nodes within the cluster, managed by the cluster manager. In client mode, the driver process runs on the client machine that submits the Spark application. Option B is incorrect because execution modes do not dictate whether the cluster runs locally or in the cloud. Both modes can be used in either environment. Option C is incorrect because in both cluster and client modes, the executors run on worker nodes in the cluster, not just the client machine. Option D is incorrect because it reverses the roles of the driver in cluster and client modes. Option E is incorrect because it misrepresents the behavior of client mode. Client mode still involves executors running in the cluster; it's the driver that runs on the client machine.
61
The code block shown below contains an error. The code block is intended to return a new DataFrame where column managerName from DataFrame storesDF is split at the space character into column managerFirstName and column managerLastName. Identify the error. A sample of DataFrame storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image8.png) Code block: storesDF.withColumn("managerFirstName", col("managerName").split(" ").getItem(0)) .withColumn("managerLastName", col("managerName").split(" ").getItem(1)) A. The index values of 0 and 1 are not correct – they should be 1 and 2, respectively. B. The index values of 0 and 1 should be provided as second arguments to the split() operation rather than indexing the result. C. The split() operation comes from the imported functions object. It accepts a string column name and split character as arguments. It is not a method of a Column object. D. The split() operation comes from the imported functions object. It accepts a Column object and split character as arguments. It is not a method of a Column object. E. The withColumn operation cannot be called twice in a row.
D. The split() operation comes from the imported functions object. It accepts a Column object and split character as arguments. It is not a method of a Column object. DISCUSSION: The error lies in the fact that the `split()` function in PySpark is part of the `pyspark.sql.functions` module (often aliased as `F` or `functions`) and should be called as `split(col("columnName"), "delimiter")`, not as a method directly on a Column object. Option A is incorrect because the index values 0 and 1 are correct for accessing the first and second elements of the array resulting from the split operation (first name and last name respectively). Option B is incorrect as there are no index values that can be provided as arguments to the split() operation. Option C is similar to the correct answer, but it incorrectly states that the split() function accepts a string column name. It accepts a Column object which can be obtained via `col("columnName")`. Option E is incorrect because `withColumn` can be chained multiple times to add or modify multiple columns.
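The corrected version, sketched in PySpark, imports `split` from the functions module and passes the Column to it:

```python
from pyspark.sql.functions import col, split

# split() is a standalone function, not a method on Column
namesDF = (
    storesDF
    .withColumn("managerFirstName", split(col("managerName"), " ").getItem(0))
    .withColumn("managerLastName", split(col("managerName"), " ").getItem(1))
)
```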
62
Which of the following code blocks returns a new DataFrame where column division from DataFrame storesDF has been replaced and renamed to column state and column managerName from DataFrame storesDF has been replaced and renamed to column managerFullName? A. ``` storesDF.withColumnRenamed("division", "state") .withColumnRenamed("managerName", "managerFullName") ``` B. ``` storesDF.withColumn("state", "division") .withColumn("managerFullName", "managerName") ``` C. ``` storesDF.withColumn("state", col("division")) .withColumn("managerFullName", col("managerName")) ``` D. ``` storesDF.withColumnRenamed(Seq("division", "state"), Seq("managerName", "managerFullName")) ``` E. ``` storesDF.withColumnRenamed("state", "division") .withColumnRenamed("managerFullName", "managerName") ```
A
63
Which of the following code blocks returns a DataFrame sorted alphabetically based on column division? A. ```python storesDF.sort("division") ``` B. ```python storesDF.orderBy(desc("division")) ``` C. ```python storesDF.orderBy(col("division").desc()) ``` D. ```python storesDF.orderBy("division", ascending - true) ``` E. ```python storesDF.sort(desc("division")) ```
A. ```python storesDF.sort("division") ``` DISCUSSION: Option A is correct because the `sort` method in PySpark, by default, sorts in ascending order (alphabetically for strings). Option B is incorrect because `desc("division")` sorts in descending order (reverse alphabetical). Option C is incorrect because `col("division").desc()` also sorts in descending order. Option D is incorrect due to a syntax error: `ascending - true` is not valid Python syntax for specifying ascending order. It should be `ascending=True`. Option E is incorrect because `desc("division")` sorts in descending order (reverse alphabetical).
64
The code block shown below contains an error. The code block intended to create a single-column DataFrame from Scala List years which is made up of integers. Identify the error. Code block: ```scala spark.createDataset(years) ``` A. The years list should be wrapped in another list like List(years) to make clear that it is a column rather than a row. B. The data type is not specified – the second argument to createDataset should be IntegerType. C. There is no operation createDataset – the createDataFrame operation should be used instead. D. The result of the above is a Dataset rather than a DataFrame – the toDF operation must be called at the end. E. The column name must be specified as the second argument to createDataset.
D. The result of the above is a Dataset rather than a DataFrame – the toDF operation must be called at the end. DISCUSSION: Option D is the correct answer. In Scala with Spark, `createDataset()` creates a Dataset rather than a DataFrame. To obtain a DataFrame, the `.toDF()` operation needs to be called on the resulting Dataset. Option A is incorrect because wrapping the list in another list is not necessary for creating a single-column DataFrame. Option B is incorrect because Spark can automatically infer the data type of the elements in the list. Specifying `IntegerType` is not mandatory. Option C is incorrect because `createDataset` is a valid operation in Spark's Scala API. Option E is incorrect because while specifying the column name is good practice, it is not the reason the code has an error. The primary issue is that `createDataset` returns a Dataset, not a DataFrame.
65
The code block shown below should return a new DataFrame that is the result of an inner join between DataFrame storeDF and DataFrame employeesDF on column storeId. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. ``` storesDF.__1__(__2__, __3__, __4__) ``` A. 1. join 2. employeesDF 3. "inner" 4. storesDF.storeId === employeesDF.storeId B. 1. join 2. employeesDF 3. "storeId" 4. "inner" C. 1. merge 2. employeesDF 3. "storeId" 4. "inner" D. 1. join 2. employeesDF 3. "inner" 4. "storeId" E. 1. join 2. employeesDF 3. "inner" 4. "storeDF.storeId === employeesDF.storeId"
B. 1. join 2. employeesDF 3. "storeId" 4. "inner" DISCUSSION: The correct way to perform an inner join in Spark is to use the `join` method, specifying the DataFrame to join with, the join column, and the join type, in that order. Thus, option B is correct: `storesDF.join(employeesDF, "storeId", "inner")`. Option A is incorrect because it places the join type where the join key (a column name or join expression) belongs; the two arguments are swapped. Option C is incorrect because `merge` is not the correct function name for joining. Option D is incorrect because it places the join type `"inner"` in the wrong position. Option E is incorrect because it also swaps the key and the join type, and it passes the join expression as a quoted string rather than an actual expression.
66
The code block shown below contains an error. The code block is intended to create and register a SQL UDF named “ASSESS_PERFORMANCE” using the Scala function assessPerformance() and apply it to column customerSatisfaction in the table stores. Identify the error. ``` spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores") ``` A. The customerSatisfaction column cannot be called twice inside the SQL statement. B. Registered UDFs cannot be applied inside of a SQL statement. C. The order of the arguments to spark.udf.register() should be reversed. D. The wrong SQL function is used to compute column result - it should be ASSESS_PERFORMANCE instead of assessPerformance. E. There is no sql() operation - the DataFrame API must be used to apply the UDF assessPerformance().
D. The wrong SQL function is used to compute column result - it should be ASSESS_PERFORMANCE instead of assessPerformance. Explanation: The error lies in calling `assessPerformance` instead of the registered UDF name `ASSESS_PERFORMANCE` within the SQL statement. After registering the UDF with `spark.udf.register`, you must use the registered name (in this case, "ASSESS_PERFORMANCE") in your SQL queries to invoke the function. Option A is incorrect because it's perfectly valid to reference a column multiple times in a SELECT statement. Option B is incorrect because registering UDFs is specifically done so they *can* be used in SQL statements. Option C is incorrect because the argument order `spark.udf.register(name, function)` is correct. Option E is incorrect; `spark.sql()` is a valid way to execute SQL queries.
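The question is framed in Scala, but the same pattern can be sketched in PySpark; `assess_performance` below is a hypothetical stand-in for the original function, and a `stores` view is assumed to exist:

```python
from pyspark.sql.types import StringType

def assess_performance(score):
    # Hypothetical logic standing in for assessPerformance()
    return "good" if score is not None and score >= 50 else "poor"

# Register the function under the name used in SQL
spark.udf.register("ASSESS_PERFORMANCE", assess_performance, StringType())

# Call the UDF by its registered name
spark.sql(
    "SELECT customerSatisfaction, "
    "ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores"
).show()
```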
67
The code block shown below should return a new DataFrame where single quotes in column storeSlogan have been replaced with double quotes. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. A sample of DataFrame storesDF is below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image9.png) Code block: storesDF.__1__(__2__, __3__(__4__, __5__, __6__)) A. 1. withColumn 2. "storeSlogan" 3. regexp_extract 4. col("storeSlogan") 5. "\"" 6. "'" B. 1. newColumn 2. storeSlogan 3. regexp_extract 4. col(storeSlogan) 5. "\"" 6. "'" C. 1. withColumn 2. "storeSlogan" 3. regexp_replace 4. col("storeSlogan") 5. "\"" 6. "'" D. 1. withColumn 2. "storeSlogan" 3. regexp_replace 4. col("storeSlogan") 5. "'" 6. "\"" E. 1. withColumn 2. "storeSlogan" 3. regexp_extract 4. col("storeSlogan") 5. "'" 6. "\""
D. 1. withColumn 2. "storeSlogan" 3. regexp_replace 4. col("storeSlogan") 5. "'" 6. "\"" DISCUSSION: The correct answer is D. The `withColumn` function is used to add a new column or replace an existing one. The `regexp_replace` function is used to replace substrings within a column based on a regular expression. In this case, we want to replace all occurrences of single quotes (') with double quotes (\") in the "storeSlogan" column. Option A and E are incorrect because they use `regexp_extract`, which extracts substrings based on a regular expression, rather than replacing them. Option B is incorrect because it uses `newColumn` which is not a valid Spark DataFrame function and because the column name `storeSlogan` should be a string. Option C is incorrect because it tries to replace double quotes with single quotes, when the prompt specified the reverse operation.
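Put together, the completed code block would look roughly like this (sketch):

```python
from pyspark.sql.functions import col, regexp_replace

# Replace every single quote in storeSlogan with a double quote
cleanedDF = storesDF.withColumn(
    "storeSlogan", regexp_replace(col("storeSlogan"), "'", "\"")
)
```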
68
The code block shown below should return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__(__3__).alias("sqftMean")) A. 1. agg 2. mean 3. col("sqft") B. 1. withColumn 2. mean 3. col("sqft") C. 1. agg 2. average 3. col("sqft") D. 1. mean 2. col 3. "sqft" E. 1. agg 2. mean 3. "sqft"
A. **Explanation:** The correct answer is A. * `agg` is used to perform aggregation operations. * `mean` is the correct function to calculate the mean. * `col("sqft")` correctly references the 'sqft' column. **Incorrect Options:** * B: `withColumn` is used to add or replace a column, not for aggregation. * C: `average` is not the correct function in Spark to calculate the mean; `mean` should be used. * D: The syntax and order are incorrect. `mean` is not a DataFrame transformation function. * E: `col("sqft")` is needed to specify the column. Using `"sqft"` directly as an argument to `mean` will produce an error.
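The completed code block, sketched in PySpark:

```python
from pyspark.sql.functions import col, mean

# Aggregate the whole DataFrame down to a single-row mean
sqftMeanDF = storesDF.agg(mean(col("sqft")).alias("sqftMean"))
```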
69
Which of the following code blocks returns a 10 percent sample of rows from DataFrame storesDF with replacement? A. storesDF.sample(true) B. storesDF.sample(true, fraction = 0.1) C. storesDF.sample(true, fraction = 0.15) D. storesDF.sampleBy(fraction = 0.1) E. storesDF.sample(false, fraction = 0.1)
B
70
The code block shown below contains an error. The code block is intended to print the schema of DataFrame storesDF. Identify the error. Code block: storesDF.printSchema.getAs[String] A. There is no printSchema member of DataFrame – the getSchema() operation should be used instead. B. There is no printSchema member of DataFrame – the schema() operation should be used instead. C. The entire line needs to be a string – it should be wrapped by str(). D. The printSchema member of DataFrame is an operation that prints the DataFrame's schema – there is no need to call getAs. E. There is no printSchema member of DataFrame – schema and the print() function should be used instead.
D. The printSchema member of DataFrame is an operation that prints the DataFrame's schema – there is no need to call getAs.
71
The code block shown below should return a new DataFrame that is the result of an outer join between DataFrame storesDF and DataFrame employeesDF on column storeId. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__, __3__, __4__) A. 1. join 2. employeesDF 3. "outer" 4. Seq("storeId") B. 1. merge 2. employeesDF 3. "outer" 4. Seq("storeId") C. 1. join 2. employeesDF 3. "outer" 4. storesDF.storeId === employeesDF.storeId D. 1. merge 2. employeesDF 3. Seq("storeId") 4. "outer" E. 1. join 2. employeesDF 3. Seq("storeId") 4. "outer"
E. The correct fill is `storesDF.join(employeesDF, Seq("storeId"), "outer")`. In the Scala API, `join` takes the other DataFrame first, then the join columns (as a `Seq` of column names) or a join expression, and finally the join type. Option A is incorrect because it swaps the last two arguments, placing the join type before the join columns. Options B and D are incorrect because `merge` is not a DataFrame operation in Spark. Option C is incorrect because, although `storesDF.storeId === employeesDF.storeId` would be a valid join expression, it is placed after the join type instead of before it.
72
The code block shown below should extract the integer value for column sqft from the first row of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__.__3__[Int](__4__) A. 1. storesDF 2. first() 3. getAs() 4. “sqft” B. 1. storesDF 2. first 3. getAs 4. sqft C. 1. storesDF 2. first() 3. getAs 4. col(“sqft”) D. 1. storesDF 2. first 3. getAs 4. “sqft”
A. The correct code to extract the integer value for the column 'sqft' from the first row of the DataFrame `storesDF` is `storesDF.first().getAs[Int]("sqft")`. * `storesDF` refers to the DataFrame. * `.first()` retrieves the first row as a Row object. * `.getAs[Int]("sqft")` extracts the value from the column named "sqft" in the Row object and casts it to an Integer. Option B is incorrect because it is missing the parentheses after `first`, making it an attempt to reference the method, not execute it. It also lacks quotes around `sqft`. Option C is incorrect because it uses `col("sqft")` which is not how you specify the column name in the `getAs` method. Option D is incorrect because it is missing the parentheses after `first`, making it an attempt to reference the method, not execute it. getAs() requires parentheses.
73
The code block shown below should print the schema of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__ A. 1. storesDF 2. printSchema(“all”) B. 1. storesDF 2. schema C. 1. storesDF 2. getAs[str] D. 1. storesDF 2. printSchema(true) E. 1. storesDF 2. printSchema
E. 1. storesDF 2. printSchema DISCUSSION: The question asks for the code that prints the schema of a DataFrame called `storesDF`. The correct method to print the schema in Spark (Scala or Python) is `printSchema()`. Therefore, the correct code should be `storesDF.printSchema`. Option A is incorrect because `printSchema("all")` is not a valid method call. Option B is incorrect because `storesDF.schema` returns the schema object but doesn't print it. Option C is incorrect because `getAs[str]` is used to extract data from a specific column as a string. Option D is incorrect because `printSchema(true)` is not a valid method call.
74
The code block shown below contains an error. The code block intended to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId. Identify the error. ``` StoresDF.join(employeesDF, Seq("storeId") ``` A. The key column storeId needs to be a string like “storeId”. B. The key column storeId needs to be specified in an expression of both Data Frame columns like storesDF.storeId ===employeesDF.storeId. C. The default argument to the joinType parameter is “inner” - an additional argument of “left” must be specified. D. There is no DataFrame.join() operation - DataFrame.merge() should be used instead. E. The key column storeId needs to be wrapped in the col() operation.
A. The key column storeId needs to be a string like “storeId”. As transcribed, the code block (aside from its missing closing parenthesis) already passes the key as the string `"storeId"` inside a `Seq`, so the snippet appears garbled; the original question most likely passed the key column as an unquoted identifier or a `col()` expression, which is why option A – the key column must be given as a string like `"storeId"` – is the published answer. B is incorrect because an expression such as `storesDF.storeId === employeesDF.storeId` is only one valid way to specify the join; a shared column name works as well for an inner join. C is incorrect because the default join type is already "inner", so no extra argument is needed, and "left" would change the join semantics. D is incorrect because `DataFrame.join()` is a valid operation. E is incorrect because wrapping the key in `col()` is needed only when building a join expression, not when passing the column name as a string.
75
Which of the following describes the difference between DataFrame.repartition(n) and DataFrame.coalesce(n)? A. DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions. B. While the results are similar, DataFrame.repartition(n) will be more efficient than DataFrame.coalesce(n) because it can partition a Data Frame by the column. C. DataFrame.repartition(n) will split a Data Frame into any number of new partitions while minimizing shuffling. DataFrame.coalesce(n) will split a DataFrame onto any number of new partitions utilizing a full shuffle. D. While the results are similar, DataFrame.repartition(n) will be less efficient than DataFrame.coalesce(n) because it can partition a Data Frame by the column. E. DataFrame.repartition(n) will combine the existing partitions of a DataFrame but may result in an uneven distribution of data across the new partitions. DataFrame.coalesce(n) will more slowly split a Data Frame into n number of new partitions with data distributed evenly.
A. DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions.
76
Which of the following code blocks returns a new DataFrame with a new column `customerSatisfactionAbs` that is the absolute value of column `customerSatisfaction` in DataFrame `storesDF`? Note that column `customerSatisfactionAbs` is not in the original DataFrame `storesDF`. A. `storesDF.withColumn(“customerSatisfactionAbs”, abs(col(“customerSatisfaction”)))` B. `storesDF.withColumnRenamed(“customerSatisfactionAbs”, abs(col(“customerSatisfaction”)))` C. `storesDF.withColumn(col(“customerSatisfactionAbs”, abs(col(“customerSatisfaction”)))` D. `storesDF.withColumn(“customerSatisfactionAbs”, abs(col(customerSatisfaction)))` E. `storesDF.withColumn(“customerSatisfactionAbs”, abs(“customerSatisfaction”))`
A. `storesDF.withColumn(“customerSatisfactionAbs”, abs(col(“customerSatisfaction”)))` DISCUSSION: Option A is correct because it uses the `withColumn` function to create a new column named "customerSatisfactionAbs" and assigns it the absolute value of the "customerSatisfaction" column. The `abs()` function from `pyspark.sql.functions` is used correctly with `col("customerSatisfaction")` to specify the column to operate on. Option B is incorrect because `withColumnRenamed` is used to rename an existing column, not create a new one. Option C is incorrect because the first argument to `withColumn` should be the new column's name as a string, not a Column object. Option D is incorrect because `customerSatisfaction` isn't wrapped in `col()`, meaning it will be interpreted as a literal value rather than a column name. Option E is incorrect because it tries to take the absolute value of the literal string "customerSatisfaction" rather than the values in the column.
77
Which of the following statements about the Spark driver is true? A. Spark driver is horizontally scaled to increase overall processing throughput. B. Spark driver is the most coarse level of the Spark execution hierarchy. C. Spark driver is fault tolerant — if it fails, it will recover the entire Spark application. D. Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode. E. Spark driver is only compatible with its included cluster manager.
D. Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode. **Explanation:** * **D is correct:** The Spark driver is responsible for coordinating the execution of a Spark application. This includes analyzing, distributing, and scheduling work across the executors on the worker nodes. While the cluster manager handles resource allocation, the driver determines the specific tasks that are sent to each worker. * **A is incorrect:** The Spark driver is a single process and is not horizontally scaled. The executors on the worker nodes are the components that are scaled horizontally. * **B is incorrect:** The Spark driver is a key component in the Spark architecture, but not a level of execution hierarchy itself. * **C is incorrect:** If the Spark driver fails, the entire Spark application typically fails. While there are mechanisms for driver fault tolerance (e.g., using YARN), it doesn't automatically recover the entire application in all cases. * **E is incorrect:** The Spark driver can work with different cluster managers (YARN, Mesos, Kubernetes, Spark's standalone cluster manager), it is not bound to only one.
78
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 AND the value in column customerSatisfaction is greater than or equal to 30? A. storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30) B. storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30) C. storesDF.filter(sqft) <= 25000 and customerSatisfaction >= 30) D. storesDF.filter(col("sqft") <= 25000 & col("customerSatisfaction") >= 30) E. storesDF.filter(sqft <= 25000) & customerSatisfaction >= 30)
D. storesDF.filter(col("sqft") <= 25000 & col("customerSatisfaction") >= 30) DISCUSSION: The correct answer is D. In PySpark, when filtering with multiple conditions, you should use bitwise operators like `&` (and), `|` (or), and `~` (not) instead of the standard Python `and`, `or`, and `not`. Also, the `col()` function must be used to reference column names within the `filter()` function. Option A is incorrect because it uses the Python `and` operator instead of the bitwise `&` operator. Option B is incorrect because it uses the Python `or` keyword. Also, it would return rows where either condition is true, not where both are true. Option C is incorrect because it doesn't use the `col()` function to reference the column names and uses the Python `and` operator. Option E is incorrect because it doesn't use the `col()` function for the `sqft` column in the filter, and also applies the bitwise `&` outside of the filter.
79
The code block shown below contains an error. The code block is intended to adjust the number of partitions used in wide transformations like join() to 32. Identify the error. ``` spark.conf.set("spark.default.parallelism", "32") ``` A. spark.default.parallelism is not the right Spark configuration parameter – spark.sql.shuffle.partitions should be used instead. B. There is no way to adjust the number of partitions used in wide transformations – it defaults to the number of total CPUs in the cluster. C. Spark configuration parameters cannot be set in runtime. D. Spark configuration parameters are not set with spark.conf.set(). E. The second argument should not be the string version of "32" – it should be the integer 32.
A. spark.default.parallelism is not the right Spark configuration parameter – spark.sql.shuffle.partitions should be used instead. DISCUSSION: The correct answer is A. The configuration parameter `spark.default.parallelism` affects the number of partitions for transformations after reading data but not wide transformations like joins. `spark.sql.shuffle.partitions` is the correct parameter to adjust the number of partitions in wide transformations. B is incorrect because it *is* possible to adjust the number of partitions used in wide transformations. C is incorrect because Spark configuration parameters *can* be set at runtime using `spark.conf.set()`. D is incorrect because `spark.conf.set()` *is* the correct way to set Spark configuration parameters. E is incorrect because Spark configuration parameters can be set as strings.
80
The code block shown below contains an error. The code block is intended to return the exact number of distinct values in column division in DataFrame storesDF. Identify the error. Code block: ``` storesDF.agg(approx_count_distinct(col(“division”)).alias(“divisionDistinct”)) ``` A. The `approx_count_distinct()` operation needs a second argument to set the rsd parameter to ensure it returns the exact number of distinct values. B. There is no `alias()` operation for the `approx_count_distinct()` operation's output. C. There is no way to return an exact distinct number in Spark because the data is distributed across partitions. D. The `approx_count_distinct()` operation is not a standalone function - it should be used as a method from a Column object. E. The `approx_count_distinct()` operation cannot determine an exact number of distinct values in a column.
E. The `approx_count_distinct()` operation cannot determine an exact number of distinct values in a column. DISCUSSION: The question states that the code is intended to return the *exact* number of distinct values. The function `approx_count_distinct()` is designed to provide an *approximate* count, not an exact count. Therefore, the fundamental error is using the wrong function for the intended purpose. Option A is incorrect because even with the 'rsd' parameter set, `approx_count_distinct()` still provides an approximation. Option B is incorrect because the `alias()` operation is valid and used to name the resulting column. Option C is incorrect because while distributed processing can complicate exact distinct counts, Spark provides the `countDistinct()` function for this purpose. Option D is incorrect because `approx_count_distinct` is used as a function that takes the column as an argument, as shown in the code block.
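For an exact count, `countDistinct()` is the function to reach for; a minimal sketch:

```python
from pyspark.sql.functions import col, countDistinct

# countDistinct() returns the exact number of distinct values in the column
exactDF = storesDF.agg(countDistinct(col("division")).alias("divisionDistinct"))
```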
81
Which of the following operations can be used to return a new DataFrame from DataFrame `storesDF` without columns that are specified by name? A. `storesDF.filter()` B. `storesDF.select()` C. `storesDF.drop()` D. `storesDF.subset()` E. `storesDF.dropColumn()`
C. **Explanation:** * **C. `storesDF.drop()`** is the correct answer. In the Spark DataFrame API, `drop()` takes one or more column names and returns a new DataFrame without those columns. * **A. `storesDF.filter()`** is incorrect. `filter()` selects rows that satisfy a condition; it does not remove columns by name. * **B. `storesDF.select()`** is incorrect. `select()` returns only the columns you name, i.e. the columns to *keep*, not the ones to remove. * **D. `storesDF.subset()`** is incorrect. There is no `subset()` operation on a Spark DataFrame. * **E. `storesDF.dropColumn()`** is incorrect. There is no `dropColumn()` operation on a Spark DataFrame; the correct operation is `drop()`.
82
Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer. A. DataFrame.distinct() B. DataFrame.dropDuplicates() and DataFrame.distinct() C. DataFrame.dropDuplicates() D. DataFrame.drop_duplicates() E. DataFrame.dropDuplicates(), DataFrame.distinct() and DataFrame.drop_duplicates()
E. DataFrame.dropDuplicates(), DataFrame.distinct(), and DataFrame.drop_duplicates() drop_duplicates() is an alias for dropDuplicates() in PySpark, and DataFrame.distinct() also removes duplicate rows. Since the question asks for the most complete answer, option E, which includes all three methods, is the most appropriate. Options A, B, C, and D are not as complete as option E because they do not include all the possible methods.
83
Which of the following Spark properties is used to configure the maximum size of an automatically broadcasted DataFrame when performing a join? A. spark.sql.broadcastTimeout B. spark.sql.autoBroadcastJoinThreshold C. spark.sql.shuffle.partitions D. spark.sql.inMemoryColumnarStorage.batchSize E. spark.sql.adaptive.skewedJoin.enabled
B. spark.sql.autoBroadcastJoinThreshold
86
Which of the following storage levels should be used to store as much data as possible in memory on two cluster nodes while storing any data that does not fit in memory on disk to be read in when needed? A. MEMORY_ONLY_2 B. MEMORY_AND_DISK_SER C. MEMORY_AND_DISK D. MEMORY_AND_DISK_2 E. MEMORY_ONLY
D. MEMORY_AND_DISK_2 MEMORY_AND_DISK_2 stores data in memory, spills what does not fit to disk, and replicates the cached partitions across two cluster nodes. This satisfies the requirement of storing as much data as possible in memory, using disk when necessary, on two nodes. Option A, MEMORY_ONLY_2, stores data only in memory (replicated on two nodes); partitions that do not fit are simply not cached and must be recomputed when needed. Option B, MEMORY_AND_DISK_SER, uses memory and disk but serializes the data, which the prompt does not ask for, and it is not replicated to two nodes. Option C, MEMORY_AND_DISK, uses memory and disk but is not replicated to two nodes. Option E, MEMORY_ONLY, stores data only in memory, does not use disk when memory is full, and is not replicated to two nodes.
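A minimal PySpark sketch of requesting this storage level explicitly (the trailing `count()` simply forces materialization):

```python
from pyspark import StorageLevel

# Memory first, spill to disk, replicate the cached partitions on two nodes
storesDF.persist(StorageLevel.MEMORY_AND_DISK_2)
storesDF.count()  # an action is needed before the cache is populated
```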
87
In what order should the below lines of code be run in order to write DataFrame storesDF to file path filePath as parquet and partition by values in column division? Lines of code: 1. .write() \ 2. .partitionBy("division") \ 3. .parquet(filePath) 4. storesDF \ 5. .repartition("division") 6. .write 7. .path(filePath, "parquet") A. 4, 1, 2, 3 B. 4, 1, 5, 7 C. 4, 6, 2, 3 D. 4, 1, 5, 3 E. 4, 6, 2, 7
C. The correct order is 4, 6, 2, 3. This corresponds to: 4. `storesDF` (start with the DataFrame), 6. `.write` (access the DataFrameWriter), 2. `.partitionBy("division")` (specify partitioning), 3. `.parquet(filePath)` (specify the output format and file path). Option A is incorrect because it uses `.write()` (line 1), which is not valid directly on the DataFrame; `write` is accessed without parentheses. Option B is incorrect for the same reason, and because `.path(filePath, "parquet")` (line 7) is not the correct way to write a parquet file. Option D is incorrect because it also uses `.write()` (line 1). Option E is incorrect because `.path(filePath, "parquet")` (line 7) is not the correct way to write a parquet file.
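Assembled in order, the write would look roughly like this (assuming `filePath` is a writable location):

```python
# write is accessed as a property; partitionBy comes before the format call
storesDF.write.partitionBy("division").parquet(filePath)
```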
88
Which of the following operations can be used to return the number of rows in a DataFrame? A. DataFrame.numberOfRows() B. DataFrame.n() C. DataFrame.sum() D. DataFrame.count() E. DataFrame.countDistinct()
D. DataFrame.count() DataFrame.count() is the correct method to return the number of rows in a DataFrame. The other options are either invalid methods or perform different operations.
89
Which of the following operations returns a GroupedData object? A. DataFrame.GroupBy() B. DataFrame.cubed() C. DataFrame.group() D. DataFrame.groupBy() E. DataFrame.grouping_id()
D. DataFrame.groupBy()
90
Which of the following code blocks fails to return a new DataFrame that is the result of an inner join between DataFrame `storesDF` and DataFrame `employeesDF` on column `storeId` and column `employeeId`? A. `storesDF.join(employeesDF, Seq(col("storeId"), col("employeeId")))` B. `storesDF.join(employeesDF, Seq("storeId", "employeeId"))` C. `storesDF.join(employeesDF, storesDF("storeId") === employeesDF("storeId") and storesDF("employeeId") === employeesDF("employeeId"))` D. `storesDF.join(employeesDF, Seq("storeId", "employeeId"), "inner")` E. `storesDF.alias("s").join(employeesDF.alias("e"), col("s.storeId") === col("e.storeId") and col("s.employeeId") === col("e.employeeId"))`
A. `storesDF.join(employeesDF, Seq(col("storeId"), col("employeeId")))` DISCUSSION: Option A fails because when the join keys are given as a `Seq`, they must be column *names* (strings), not `Column` objects built with `col()`; there is no `join` overload that accepts a `Seq` of Columns. Option B works: `Seq("storeId", "employeeId")` is the standard way to join on two shared column names. Option C works: it builds an explicit join expression, and in the Scala API `and` is a valid `Column` method (`&&` is simply more idiomatic). Option D works: it is option B with the join type "inner" stated explicitly, which is already the default. Option E works: it aliases both DataFrames and joins on a fully qualified expression. Since A is the only block that does not run, it is the answer.
91
The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF. Which code block contains the error? A. ``` concat(storesDF, acquiredStoresDF) ``` B. ``` storesDF.unionByName(acquiredStoresDF) ``` C. ``` union(storesDF, acquiredStoresDF) ``` D. ``` unionAll(storesDF, acquiredStoresDF) ``` E. ``` storesDF.union(acquiredStoresDF) ```
A. ``` concat(storesDF, acquiredStoresDF) ``` DISCUSSION: A position-wise union combines two DataFrames row by row based on column position, which is what `DataFrame.union()` (option E) does; `unionByName()` (option B) matches columns by name instead, and `unionAll()` (option D) is the legacy alias for `union()`. The code block that contains the error is `concat(storesDF, acquiredStoresDF)`: in Spark, `concat` concatenates column values into a single column; it is not an operation for combining DataFrames. The remaining options all reference union operations on DataFrames, so A is the answer.
92
Which of the following cluster configurations is most likely to experience delays due to garbage collection of a large Dataframe? [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image14.png) Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores. A. More information is needed to determine an answer. B. Scenario #5 C. Scenario #4 D. Scenario #1 E. Scenario #2
D. Scenario #1 Scenario #1 is most likely to experience delays due to garbage collection because it has the largest heap space per executor (50GB), leading to longer garbage collection times when managing large DataFrames. The other scenarios have smaller heap sizes per executor, allowing for more parallelism and potentially faster garbage collection. A larger heap means the garbage collector has more objects to scan and process, increasing the likelihood of delays.
93
The code block shown below should cache DataFrame storesDF only in Spark's memory. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__).count() A. 1. storesDF 2. cache 3. StorageLevel.MEMORY_ONLY B. 1. storesDF 2. storageLevel 3. cache C. 1. storesDF 2. cache 3. Nothing D. 1. storesDF 2. persist 3. Nothing E. 1. storesDF 2. persist 3. StorageLevel.MEMORY_ONLY
E
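A minimal sketch of the completed code block (option E), assuming `storesDF` is defined as in the question:

```python
from pyspark import StorageLevel

# cache() takes no storage-level argument (it uses the memory-and-disk default),
# so persist() is needed to request MEMORY_ONLY explicitly
storesDF.persist(StorageLevel.MEMORY_ONLY).count()
```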
94
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30? A. `storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30)` B. `storesDF.filter(col("sqft") <= 25000 | col("customerSatisfaction") >= 30)` C. `storesDF.filter(col(sqft) <= 25000 or col(customerSatisfaction) >= 30)` D. `storesDF.filter(sqft <= 25000 | customerSatisfaction >= 30)` E. `storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)`
B DISCUSSION: Option B is correct because it uses the correct syntax for filtering a Spark DataFrame on two conditions joined by OR: it uses `col("sqft")` and `col("customerSatisfaction")` to refer to the columns and the `|` operator for the OR condition. Option A is incorrect because it uses `and`, which would keep only rows satisfying both conditions rather than either one (and Python's boolean keywords cannot be applied to Column expressions). Option C is incorrect because the column names passed to `col()` are not quoted. Option D is incorrect because it does not use the `col()` function to refer to the DataFrame columns. Option E is incorrect because it uses Python's `or` keyword, which does not work on Column expressions; the `|` operator must be used instead.
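As a hedged sketch of option B in practice, note that the comparisons are usually wrapped in parentheses, since Python's `|` operator binds more tightly than `<=` and `>=`:

```python
from pyspark.sql.functions import col

# Keep rows where sqft <= 25,000 OR customerSatisfaction >= 30
filteredDF = storesDF.filter(
    (col("sqft") <= 25000) | (col("customerSatisfaction") >= 30)
)
```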
95
The code block shown below contains an error. The code block is intended to return a new DataFrame from DataFrame storesDF where column storeId is of the type string. Identify the error. ``` storesDF.withColumn(“storeId”, cast(col(“storeId”), StringType())) ``` A. Calls to withColumn() cannot create a new column of the same name on which it is operating. B. DataFrame columns cannot be converted to a new type inside of a call to withColumn(). C. The call to StringType should not be followed by parentheses. D. The column name storeId inside the col() operation should not be quoted. E. The cast() operation is a method in the Column class rather than a standalone function.
E. The `cast()` operation is a method in the Column class rather than a standalone function. `cast()` is indeed a method of the Column class in Spark, not a standalone function. Therefore, it should be called on a Column object, like `col("storeId").cast(StringType())`. Options A, B, C, and D are incorrect because `withColumn()` can replace existing columns, DataFrame columns can be converted to new types, `StringType()` should be called with parentheses, and column names inside `col()` should be quoted.
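A minimal sketch of the corrected code block:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# cast() is a Column method, so it is called on col("storeId")
storesDF = storesDF.withColumn("storeId", col("storeId").cast(StringType()))
```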
96
The code block shown below should return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF. __1__(__2__, __3__).__4__(__5__, __6__) A. 1. withColumnRenamed 2. "state" 3. "division" 4. withColumnRenamed 5. "managerFullName" 6. "managerName" B. 1. withColumnRenamed 2. division 3. col("state") 4. withColumnRenamed 5. "managerName" 6. col("managerFullName") C. 1. withColumnRenamed 2. "division" 3. "state" 4. withColumnRenamed 5. "managerName" 6. "managerFullName" D. 1. withColumn 2. "division" 3. "state" 4. withcolumn 5. "managerName" 6. "managerFullName" E. 1. withColumn 2. "division" 3. "state" 4. withColumn 5. "managerName" 6. "managerFullName"
C. 1. withColumnRenamed 2. "division" 3. "state" 4. withColumnRenamed 5. "managerName" 6. "managerFullName" DISCUSSION: Option C is the correct answer because the `withColumnRenamed` function takes the existing column name as the first argument and the new column name as the second argument. The code first renames the "division" column to "state" and then renames the "managerName" column to "managerFullName". Option A reverses the arguments for renaming, which is incorrect. Option B uses `col("state")` and `col("managerFullName")` which is unnecessary and not the intended usage of `withColumnRenamed`. Options D and E use `withColumn` instead of `withColumnRenamed`, which is used to create a new column or replace an existing one, not to rename a column.
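A minimal sketch of the completed code block:

```python
# withColumnRenamed(existingName, newName)
renamedDF = (storesDF
    .withColumnRenamed("division", "state")
    .withColumnRenamed("managerName", "managerFullName"))
```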
97
The code block shown below should read a CSV at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__.__3__(__4__).format("csv").__5__(__6__) A. 1. spark 2. read() 3. schema 4. schema 5. json 6. filePath B. 1. spark 2. read() 3. schema 4. schema 5. load 6. filePath C. 1. spark 2. read 3. format 4. "json" 5. load 6. filePath D. 1. spark 2. read() 3. json 4. filePath 5. format 6. schema E. 1. spark 2. read 3. schema 4. schema 5. load 6. filePath
E. 1. spark 2. read 3. schema 4. schema 5. load 6. filePath **Explanation:** The correct way to read a CSV file into a DataFrame with a specified schema in Spark is as follows: * `spark.read`: initiates the read operation from the SparkSession; `read` is an attribute, not a method, so no parentheses are needed. * `.schema(schema)`: applies the provided schema to the DataFrame being read. * `.format("csv")`: specifies the format of the input file as CSV. * `.load(filePath)`: loads the CSV file from the specified file path. Therefore, option E correctly fills in the blanks: `spark.read.schema(schema).format("csv").load(filePath)`. **Why other options are incorrect:** * A and B are incorrect because `read` is an attribute of `spark`, not a method, so parentheses are not used. * C is incorrect because it specifies the format as "json" when the goal is to read a CSV, and it never applies the schema. * D is incorrect because the schema and filePath are reversed, and `read` is written with parentheses when it should not be.
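A minimal sketch, assuming `schema` and `filePath` are defined as in the question:

```python
# read is an attribute (no parentheses); the schema is applied before load()
df = (spark.read
      .schema(schema)
      .format("csv")
      .load(filePath))
```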
98
Which of the following code blocks returns a new DataFrame where column division is the first two characters of column division in DataFrame storesDF? A. `storesDF.withColumn(“division”, substr(col(“division”), 0, 2))` B. `storesDF.withColumn(“division”, susbtr(col(“division”), 1, 2))` C. `storesDF,withColumn(“division”, col(“division”).substr(0, 3))` D. `storesDF.withColumn(“division”, col(“division”).substr(0, 2))` E. `storesDF.withColumn(“division”, col(“division”).substr(l, 2))`
D. `storesDF.withColumn(“division”, col(“division”).substr(0, 2))` DISCUSSION: Option D is correct. `Column.substr(startPos, length)` takes a starting position and a length; Spark's substring positions are 1-based, but a start position of 0 is treated the same as 1, so `substr(0, 2)` returns the first two characters. Option A is incorrect as written because `substr` is not available as a standalone function in that form (the standalone function in `pyspark.sql.functions` is `substring`). Option B has a typo (`susbtr` instead of `substr`). Option C has a typo (`storesDF,withColumn` should be `storesDF.withColumn`) and requests three characters instead of two. Option E passes the letter `l` instead of the number `1` as the first argument, which is not valid.
99
The code block shown below should return a collection of summary statistics for column sqft in DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__) A. 1. summary 2. col("sqft") B. 1. describe 2. col("sqft") C. 1. summary 2. "sqft" D. 1. describe 2. "sqft" E. 1. summary 2. "all"
D. In Spark, the `describe()` method generates summary statistics and accepts column names as strings, so `storesDF.describe("sqft")` returns the summary statistics for the sqft column. Options A and C are incorrect because `summary()` takes the names of the statistics to compute (for example "count" or "mean") as its arguments, not column names. Option B is incorrect because `describe()` expects a column name string, not a Column object. Option E is incorrect because "all" is neither a column name nor a valid argument here.
100
The code block shown below should return a new DataFrame where rows in DataFrame storesDF with missing values in every column have been dropped. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__.__2__(__3__ = __4__) A. 1. na 2. drop 3. how 4. "any" B. 1. na 2. drop 3. subset 4. "all" C. 1. na 2. drop 3. subset 4. "any" D. 1. na 2. drop 3. how 4. "all" E. 1. drop 2. na 3. how 4. "all"
D. The blanks complete to `storesDF.na.drop(how="all")`, which drops a row only when all of its values are missing. Option A is incorrect because `how="any"` drops a row when any value is missing. Options B and C are incorrect because the `subset` parameter expects a list of column names, not the strings "all" or "any". Option E is incorrect because it reverses the order of the accessors: the `na` attribute must come before `drop`.
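A minimal sketch of the completed code block:

```python
# Drop a row only when every one of its values is null
cleanedDF = storesDF.na.drop(how="all")
```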
101
Which of the following code blocks returns a new DataFrame with column storeReview where the pattern " End" has been removed from the end of column storeReview in DataFrame storesDF? A sample DataFrame storesDF is below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image19.png) A. ```python storesDF.withColumn("storeReview", col("storeReview").regexp_replace(" End$", "")) ``` B. ```python storesDF.withColumn("storeReview", regexp_replace(col("storeReview"), " End$", "")) ``` C. ```python storesDF.withColumn("storeReview”, regexp_replace(col("storeReview"), " End$")) ``` D. ```python storesDF.withColumn("storeReview", regexp_replace("storeReview", " End$", "")) ``` E. ```python storesDF.withColumn("storeReview", regexp_extract(col("storeReview"), " End$", "")) ```
B. `storesDF.withColumn("storeReview", regexp_replace(col("storeReview"), " End$", ""))` **Explanation:** * The goal is to remove the pattern " End" from the end of the `storeReview` column. * `withColumn` creates a new column or replaces an existing one. * `regexp_replace` is the correct function for replacing a pattern using regular expressions. * `col("storeReview")` refers to the column `storeReview`. * `" End$"` is the regular expression pattern that matches " End" at the end of the string (`$` anchors the end). * `""` is the replacement string, which removes the matched pattern. **Why other options are incorrect:** * **A:** `regexp_replace` is a function in `pyspark.sql.functions`, not a method on the Column class, and as written it is also missing a comma between its arguments. * **C:** The call is missing the replacement argument and contains a stray curly quotation mark after `storeReview`. * **D:** Passes the column name as a bare string rather than a Column object, which the question treats as incorrect. * **E:** `regexp_extract` extracts a substring that matches a pattern; it does not replace it.
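A minimal sketch of option B:

```python
from pyspark.sql.functions import regexp_replace, col

# "$" anchors the pattern to the end of the string
storesDF = storesDF.withColumn(
    "storeReview",
    regexp_replace(col("storeReview"), " End$", "")
)
```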
102
The code block shown below contains an error. The code block is intended to create a single-column DataFrame from Python list years which is made up of integers. Identify the error. ``` spark.createDataFrame(years, IntegerType) ``` A. The column name must be specified. B. The years list should be wrapped in another list like `[years]` to make clear that it is a column rather than a row. C. There is no `createDataFrame` operation in spark. D. The `IntegerType` call must be followed by parentheses. E. The `IntegerType` call should not be present — Spark can tell that list `years` is full of integers.
D. The `IntegerType` call must be followed by parentheses.
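A minimal sketch with the parentheses added; the values in `years` are hypothetical example data:

```python
from pyspark.sql.types import IntegerType

years = [2012, 2015, 2021]  # hypothetical example data
yearsDF = spark.createDataFrame(years, IntegerType())  # single-column DataFrame
```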
103
Which of the following operations will fail to trigger evaluation? A. DataFrame.collect() B. DataFrame.count() C. DataFrame.first() D. DataFrame.join() E. DataFrame.take()
D. DataFrame.join()
104
Which of the following code blocks returns a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000? A sample of DataFrame storesDF is below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image10.png) A. ```python storesDF.na.fill(30000, Seq("sqft")) ``` B. ```python storesDF.nafill(30000, col("sqft")) ``` C. ```python storesDF.na.fill(30000, col("sqft")) ``` D. ```python storesDF.fillna(30000, col("sqft")) ``` E. ```python storesDF.na.fill(30000, "sqft") ```
E. ```python storesDF.na.fill(30000, "sqft") ``` DISCUSSION: Option E is the correct answer. The `na.fill()` method in PySpark's DataFrame API replaces missing values. It takes the value to fill with and the column(s) to apply the fill to, given as a column name string or a list of strings. * **A:** Incorrect. `Seq` is a Scala construct, not a valid PySpark function. * **B:** Incorrect. There is no `nafill` method on a PySpark DataFrame. * **C:** Incorrect. `na.fill()` expects a column name string (or list of strings), not a Column object. * **D:** Incorrect. `fillna()` is equivalent to `na.fill()`, but it likewise expects a column name string (or list of strings), not a Column object.
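A minimal sketch of option E:

```python
# Replace nulls in column "sqft" with 30000; other columns are left untouched
filledDF = storesDF.na.fill(30000, "sqft")
```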
105
Which of the following statements about the Spark DataFrame is true? A. Spark DataFrames are mutable unless they've been collected to the driver. B. A Spark DataFrame is rarely used aside from the import and export of data. C. Spark DataFrames cannot be distributed into partitions. D. A Spark DataFrame is a tabular data structure that is the most common Structured API in Spark. E. A Spark DataFrame is exactly the same as a data frame in Python or R.
D
106
Which of the following code blocks returns the number of rows in DataFrame storesDF for each unique value in column division? A. storesDF.groupBy("division").agg(count()) B. storesDF.agg(groupBy("division").count()) C. storesDF.groupby.count("division") D. storesDF.groupBy().count("division") E. storesDF.groupBy("division").count()
E. **Explanation:** Option E, `storesDF.groupBy("division").count()`, is correct: it groups `storesDF` by the unique values in the "division" column, and `count()` then returns the number of rows in each group, i.e. the row count for each unique division. Option A is incorrect because `pyspark.sql.functions.count()` must be given a column to count (for example `count("*")`), so `agg(count())` with no argument fails; it also requires importing `count`. Option B is incorrect because `agg()` is called after `groupBy()`, not the other way around. Option C is syntactically incorrect: `groupBy` must be called with parentheses, and `count` cannot be called this way. Option D is incorrect because it groups without specifying a column, so it does not produce a count per division.
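A minimal sketch of option E:

```python
# One row per distinct division, with the number of rows in each
countsDF = storesDF.groupBy("division").count()
```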
107
Which of the following code blocks applies the function assessPerformance() to each row of DataFrame storesDF? A. storesDF.collect.foreach(assessPerformance(row)) B. storesDF.collect().apply(assessPerformance) C. storesDF.collect.apply(row => assessPerformance(row)) D. storesDF.collect.map(assessPerformance(row)) E. storesDF.collect.foreach(row => assessPerformance(row))
E. storesDF.collect.foreach(row => assessPerformance(row)) collect() retrieves all the rows of the DataFrame and returns them as an array, and foreach() applies the specified function to each element of that array. Therefore foreach(row => assessPerformance(row)) applies assessPerformance() to each row of DataFrame storesDF. Option A is incorrect because `assessPerformance(row)` is evaluated immediately and `row` is not defined; foreach needs a function literal such as `row => assessPerformance(row)`. Options B and C are incorrect because `apply` on a Scala Array performs index lookup (it takes an integer index), not element-wise function application. Option D is incorrect because, in addition to having the same problem as option A, `map` would build a new array of results rather than simply applying the function to each row.
108
Which of the following code blocks will always return a new 4-partition DataFrame from the 8-partition DataFrame storesDF without inducing a shuffle? A. `storesDF.repartition(4, "sqft")` B. `storesDF.repartition()` C. `storesDF.coalesce(4)` D. `storesDF.repartition(4)` E. `storesDF.coalesce`
C
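A minimal sketch of option C; `coalesce()` only merges existing partitions, so it reduces the partition count without a shuffle:

```python
# From the original 8 partitions down to 4, no shuffle required
fourPartitionDF = storesDF.coalesce(4)
```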
109
Which of the following code blocks returns a DataFrame containing a column openDateString, a string representation of Java’s SimpleDateFormat? Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1st, 1970. An example of Java's SimpleDateFormat is "Sunday, Dec 4, 2008 1:05 pm". A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image16.png) A. `storesDF.withColumn("openDatestring", from unixtime(col("openDate“), “EEEE, MMM d, yyyy h:mm a"))` B. `storesDF.withColumn("openDateString", from_unixtime(col("openDate“), "EEEE, MMM d, yyyy h:mm a", TimestampType()))` C. `storesDF.withColumn("openDateString", date(col("openDate"), "EEEE, MMM d, yyyy h:mm a"))` D. `storesDF.newColumn(col("openDateString"), from_unixtime("openDate", "EEEE, MMM d, yyyy h:mm a"))` E. `storesDF.withColumn("openDateString", date(col("openDate“), "EEEE, MMM d, yyyy h:mm a", TimestampType))`
A. Option A is the intended answer. The `from_unixtime` function converts a UNIX epoch timestamp (seconds since 1970-01-01 00:00:00 UTC) to a string representation, and the specified format "EEEE, MMM d, yyyy h:mm a" matches the example SimpleDateFormat. (As transcribed, the option contains a typo: `from unixtime` is missing its underscore, but the intended call is correct.) B. Incorrect because `from_unixtime` does not take a `TimestampType()` argument; when a format is specified it already returns a string. C and E. Incorrect because the `date` function does not accept a format string the way `from_unixtime` does, and it does not convert from UNIX epoch time. D. Incorrect because `newColumn` is not a valid DataFrame method, and it passes the string "openDate" rather than `col("openDate")`.
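A minimal sketch of the intended call in option A:

```python
from pyspark.sql.functions import from_unixtime, col

storesDF = storesDF.withColumn(
    "openDateString",
    from_unixtime(col("openDate"), "EEEE, MMM d, yyyy h:mm a")
)
```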
110
The code block shown below should read a parquet at the file path filePath into a DataFrame. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__.__3__(__4__) A. 1. spark 2. read() 3. parquet 4. filePath B. 1. spark 2. read() 3. load 4. filePath C. 1. spark 2. read 3. load 4. filePath, source = "parquet" D. 1. storesDF 2. read() 3. load 4. filePath E. 1. spark 2. read 3. load 4. filePath
E. **Explanation:** The blanks complete to `spark.read.load(filePath)`. * `spark.read`: initiates the DataFrameReader; `read` is an attribute, not a method, so no parentheses are used. * `load(filePath)`: reads the data from the specified file path; parquet is Spark's default source, so no explicit format is needed. **Why other options are incorrect:** * A: `spark.read.parquet(filePath)` would be a valid way to read parquet, but option A writes `read()` with parentheses, which is invalid. * B: Incorrect for the same reason; `read` is an attribute of the SparkSession, not a method. * C: The extra `source = "parquet"` argument does not fit the blank structure given in the question (and it is unnecessary, since parquet is the default format). * D: Incorrect because reading starts from the SparkSession (`spark.read`), not from an existing DataFrame, and `read` is again written with parentheses.
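A minimal sketch, assuming `filePath` points at parquet data:

```python
# Parquet is Spark's default source, so load() reads it without an explicit format;
# spark.read.parquet(filePath) is an equivalent shortcut
df = spark.read.load(filePath)
```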
111
Which of the following code blocks fails to return the number of rows in DataFrame storesDF for each distinct combination of values in column division and column storeCategory? A. storesDF.groupBy((col("division"), col("storeCategory")]).count() B. storesDF.groupBy("division").groupBy("storeCategory").count() C. storesDF.groupBy(["division", "storeCategory"]).count() D. storesDF.groupBy("division", "storeCategory").count() E. storesDF.groupBy(col("division“), col("storeCategory")).count()
B. storesDF.groupBy("division").groupBy("storeCategory").count() DISCUSSION: The correct answer is B. In Spark, the groupBy operation returns a GroupedData object. You can perform aggregations like count() on this object, but you cannot call groupBy() on it again. Thus, option B will result in an error because it attempts to call groupBy() on a GroupedData object. Options A, C, D and E are all valid ways to group by multiple columns and then count the number of rows in each group. A is syntactically incorrect due to the parentheses and square brackets. E is syntactically incorrect due to the invalid quotation mark. C and D are correct. However, only B fails to return the number of rows.
112
The code block shown below should use SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__("stores") __3__.__4__("SELECT storeId, managerName FROM stores") A. 1. spark 2. createOrReplaceTempView 3. storesDF 4. query B. 1. spark 2. createTable 3. storesDF 4. sql C. 1. storesDF 2. createOrReplaceTempView 3. spark 4. query D. 1. spark 2. createOrReplaceTempView 3. storesDF 4. sql E. 1. storesDF 2. createOrReplaceTempView 3. spark 4. sql
E. **Explanation:** The correct order of operations is to first create a temporary view from the DataFrame `storesDF` using `createOrReplaceTempView()`, and then use `spark.sql()` to execute the SQL query against that view. * **storesDF.createOrReplaceTempView("stores")**: creates a temporary view named "stores" from the `storesDF` DataFrame so it can be queried with SQL. * **spark.sql("SELECT storeId, managerName FROM stores")**: executes the SQL query against the "stores" temporary view, selecting the `storeId` and `managerName` columns. **Why other options are incorrect:** * **A, B, and D:** the view must be created from the DataFrame, not the SparkSession, because `createOrReplaceTempView()` is a DataFrame method (and `createTable` in option B is not how a temporary view is registered); in addition, the query must be run with `spark.sql()`, not `storesDF.query()` or `storesDF.sql()`. * **C:** creates the view correctly with `storesDF.createOrReplaceTempView()`, but `spark.query()` does not exist; `spark.sql()` must be used.
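A minimal sketch of option E:

```python
# Register the DataFrame as a temporary view, then query it with SQL
storesDF.createOrReplaceTempView("stores")
resultDF = spark.sql("SELECT storeId, managerName FROM stores")
```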
113
The code block shown below contains an error. The code block is intended to return a collection of summary statistics for column sqft in Data Frame storesDF. Identify the error. Code block: storesDF.describes(col("sgft ")) A. The describe() operation doesn't compute summary statistics for a single column — the summary() operation should be used instead. B. The column sqft should be subsetted from DataFrame storesDF prior to computing summary statistics on it alone. C. The describe() operation does not accept a Column object as an argument outside of a list — the list [col("sqft")] should be specified instead. D. The describe() operation does not accept a Column object as an argument — the column name string "sqft" should be specified instead. E. The describe() operation doesn't compute summary statistics for numeric columns — the sumwary() operation should be used instead.
D. The describe() operation does not accept a Column object as an argument — the column name string "sqft" should be specified instead.
114
Which of the following code blocks returns a DataFrame with column `storeSlogan` where single quotes in column `storeSlogan` in DataFrame `storesDF` have been replaced with double quotes? A sample of DataFrame `storesDF` is below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image21.png) A. ```python storesDF.withColumn("storeSlogan", col("storeSlogan").regexp_replace("’" "\"")) ``` B. ```python storesDF.withColumn("storeSlogan", regexp_replace(col("storeSlogan"), "’")) ``` C. ```python storesDF.withColumn("storeSlogan", regexp_replace(col("storeSlogan"), "’", "\"")) ``` D. ```python storesDF.withColumn("storeSlogan", regexp_replace("storeSlogan", "’", "\"")) ``` E. ```python storesDF.withColumn("storeSlogan", regexp_extract(col("storeSlogan"), "’", "\"")) ```
C. DISCUSSION: Option C is the correct answer. `regexp_replace` takes the column to operate on, the pattern to replace, and the replacement string as arguments, so it correctly replaces single quotes (’) with double quotes ("). Option A is incorrect because `regexp_replace` is a function in `pyspark.sql.functions`, not a method on the Column class, and as written it is also missing a comma between its two string arguments. Option B is incorrect because `regexp_replace` requires three arguments: the column, the pattern to replace, and the replacement string. Option D passes the column name as a bare string rather than a Column object, which the question treats as incorrect. Option E is incorrect because it uses `regexp_extract`, which extracts a string matching a regex instead of replacing it.
115
Which of the following DataFrame operations is classified as a transformation? A. DataFrame.select() B. DataFrame.count() C. DataFrame.show() D. DataFrame.first() E. DataFrame.collect()
A. DataFrame.select() DataFrame.select() is a transformation because it returns a new DataFrame with selected columns. DataFrame.count(), DataFrame.show(), DataFrame.first(), and DataFrame.collect() are actions that trigger computation and return non-DataFrame results.
116
Which of the following code blocks extracts the value for column sqft from the first row of DataFrame storesDF? A. storesDF.first()[col("sqft")] B. storesDF[0]["sqft"] C. storesDF.collect(l)[0]["sqft"] D. storesDF.first.sqft E. storesDF.first().sqft
E
117
Which of the following code blocks writes DataFrame storesDF to file path filePath as CSV? A. storesDF.write().csv(filePath) B. storesDF.write(filePath) C. storesDF.write.csv(filePath) D. storesDF.write.option("csv").path(filePath) E. storesDF.write.path(filePath)
C
118
Which of the following operations performs a cross join on two DataFrames? A. DataFrame.join() B. The standalone join() function C. The standalone crossJoin() function D. DataFrame.crossJoin() E. DataFrame.merge()
D. DataFrame.crossJoin()
119
Which of the following code blocks writes DataFrame storesDF to file path filePath as parquet and partitions by values in column division? A. storesDF.write.partitionBy(col("division")).path(filePath) B. storesDF.write.option("parquet").partitionBy("division").path(filePath) C. storesDF.write.option("parquet").partitionBy(col("division")).path(filePath) D. storesDF.write.partitionBy("division").parquet(filePath) E. storesDF.write().partitionBy("division").parquet(filePath)
D
120
Of the following, which is the coarsest level in the Spark execution hierarchy? A. Slot B. Job C. Task D. Stage E. Executor
B. Job DISCUSSION: The Spark execution hierarchy, from coarsest to finest, is Job -> Stage -> Task; executors and slots are the resources that tasks run on rather than units of work, and they sit at the finest end of the scale. 'Job' therefore represents the highest and broadest level of abstraction, making it the coarsest level. * **A. Slot:** Slots are the per-core task slots within an executor and are the smallest unit of execution here. * **C. Task:** Tasks are units of work that are executed on a single partition of data. * **D. Stage:** Stages are collections of tasks that run the same computation on different partitions and are separated by shuffle boundaries. * **E. Executor:** Executors are processes running on worker nodes that execute tasks.
121
Which of the following code blocks returns the number of rows in DataFrame storesDF for each distinct combination of values in column division and column storeCategory? A. storesDF.groupBy(Seq(col(“division”), col(“storeCategory”))).count() B. storesDF.groupBy(division, storeCategory).count() C. storesDF.groupBy(“division”, “storeCategory”).count() D. storesDF.groupBy(“division”).groupBy(“StoreCategory”).count() E. storesDF.groupBy(Seq(“division”, “storeCategory”)).count()
C. storesDF.groupBy(“division”, “storeCategory”).count() **Explanation:** Option C is correct because it uses the standard syntax for grouping by multiple columns in a Spark DataFrame: the column names are passed as strings to `groupBy`, and `count()` then returns the number of rows for each distinct (division, storeCategory) combination. * **A & E:** `groupBy` expects varargs (`Column*` or `String*`), so passing a `Seq` directly does not match the method signature unless the sequence is expanded with `: _*`; option C matches the signature directly. * **B:** `storesDF.groupBy(division, storeCategory).count()` would require `division` and `storeCategory` to be previously defined variables or Column objects, not string literals representing column names. * **D:** `groupBy` returns a RelationalGroupedDataset, which does not support a further `groupBy` call, so chaining two `groupBy` calls fails.
122
Which of the following code blocks returns a DataFrame where column divisionDistinct is the approximate number of distinct values in column division from DataFrame storesDF? A. storesDF.withColumn("divisionDistinct", approx_count_distinct(col("division"))) B. storesDF.agg(col("division").approx_count_distinct("divisionDistinct")) C. storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct")) D. storesDF.withColumn("divisionDistinct", col("division").approx_count_distinct()) E. storesDF.agg(col("division").approx_count_distinct().alias("divisionDistinct"))
C. Explanation: Option A is incorrect because `withColumn` is used to add a new column based on an existing column. However, `approx_count_distinct` is an aggregate function and should be used with `agg`. Option B is incorrect because the syntax for `agg` is incorrect. The `approx_count_distinct` function needs to be called directly, not as a method on the column. Option C is correct because it uses the `agg` function with `approx_count_distinct` to calculate the approximate number of distinct values in the 'division' column and then aliases the resulting column as 'divisionDistinct'. Option D is incorrect because `approx_count_distinct` is not a method of a column. Also `withColumn` requires the second argument to be a Column. Option E is incorrect because `approx_count_distinct` is not a method of a column.
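A minimal sketch of option C:

```python
from pyspark.sql.functions import approx_count_distinct, col

distinctDF = storesDF.agg(
    approx_count_distinct(col("division")).alias("divisionDistinct")
)
```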
123
The code block shown below contains an error. The code block is intended to return a collection of summary statistics for column sqft in Data Frame storesDF. Identify the error. Code block: storesDF.describes(col(“sgft”)) A. The column sqft should be subsetted from DataFrame storesDF prior to computing summary statistics on it alone. B. The describe() operation does not accept a Column object as an argument outside of a sequence — the sequence Seq(col(“sqft”)) should be specified instead. C. The describe()operation doesn’t compute summary statistics for a single column — the summary() operation should be used instead. D. The describe()operation doesn't compute summary statistics for numeric columns — the summary() operation should be used instead. E. The describe()operation does not accept a Column object as an argument — the column name string “sqft” should be specified instead.
E. The describe()operation does not accept a Column object as an argument — the column name string “sqft” should be specified instead.
124
The code block shown below contains an error. The code block is intended to create the Scala UDF assessPerformanceUDF() and apply it to the integer column customers1t1sfaction in Data Frame storesDF. Identify the error. Code block: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image12.png) A. The input type of customerSatisfaction is not specified in the udf() operation. B. The return type of assessPerformanceUDF() must be specified. C. The withColumn() operation is not appropriate here - UDFs should be applied by iterating over rows instead. D. The assessPerformanceUDF() must first be defined as a Scala function and then converted to a UDF. E. UDFs can only be applied via SQL and not through the Data Frame API.
B. The return type of assessPerformanceUDF() must be specified. The `udf()` function in Spark requires the return type of the UDF to be explicitly specified. In the provided code, the return type (IntegerType) is missing when defining `assessPerformanceUDF`. A is incorrect because the input type of customerSatisfaction is already specified as Int within the UDF's definition: `(customerSatisfaction: Int)`. C is incorrect because `withColumn()` is the correct way to apply a UDF to a DataFrame column in Spark. D is incorrect because the code already defines the UDF as a Scala function and implicitly converts it to a UDF using `udf()`. E is incorrect because UDFs can be applied using both SQL and the DataFrame API. `withColumn()` is a DataFrame API method.
125
The code block shown below should create a single-column DataFrame from Scala list years which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__).__4__ A. 1. spark 2. createDataFrame 3. years 4. IntegerType B. 1. spark 2. createDataset 3. years 4. IntegerType C. 1. spark 2. createDataset 3. List(years) 4. toDF D. 1. spark 2. createDataFrame 3. List(years) 4. IntegerType
C. 1. spark 2. createDataset 3. List(years) 4. toDF The correct answer is C. In Scala, to create a single-column DataFrame from an existing list, you should use `spark.createDataset(List(years)).toDF`. This first creates a Dataset from the list and then converts it into a DataFrame. Option A is incorrect because `spark.createDataFrame(years, IntegerType)` expects `years` to be an RDD[Row] or a Java/Scala Bean class, not a simple List. Also, `IntegerType` is not the correct way to define the schema when creating a DataFrame directly from data. Option B is incorrect because `IntegerType` is not a valid method to use after creating a dataset. Option D is incorrect because, similar to option A, `spark.createDataFrame(List(years), IntegerType)` doesn't align with the correct usage pattern for creating a DataFrame from a Scala List with a specified schema in this manner.
126
Which of the following code blocks returns a DataFrame containing a column month, an integer representation of the day of the year from column openDate from DataFrame storesDF. Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970. A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image13.png) Code block: stored.withColumn(“openTimestamp”, col(“openDate”).cast(__1__)) .withColumn(__2__, __3__(__4__)) A. 1. “Data” 2. month 3. “month” 4. “openTimestamp” B. 1. “Timestamp” 2. month 3. “month” 4. col(“openTimestamp”) C. 1. “Timestamp” 2. month 3. getMonth 4. col(“openTimestamp”) D. 1. “Timestamp” 2. “month” 3. month 4. col(“openTimestamp”)
D. **Explanation:** The goal is to convert the `openDate` column (which is in UNIX epoch format) to a timestamp and then extract the month as an integer. * **1. “Timestamp”:** The `cast()` function needs to convert the `openDate` to a Timestamp type, so "Timestamp" is correct. "Data" is not a valid type to cast to. * **2. “month”:** We want to create a new column named "month", so this is correct. * **3. month:** This refers to the Spark SQL function `month()` that extracts the month from a timestamp. * **4. col(“openTimestamp”)**: This refers to the `openTimestamp` column created in the previous `withColumn` transformation. Option A is incorrect because casting to "Data" is invalid. Option B is incorrect because `"month"` in the third position would be a string literal instead of the SQL function. Option C is incorrect because `getMonth` is not a valid Spark SQL function.
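A minimal sketch of the completed code block (option D), assuming `storesDF` is defined as in the question:

```python
from pyspark.sql.functions import col, month

storesDF = (storesDF
    .withColumn("openTimestamp", col("openDate").cast("Timestamp"))
    .withColumn("month", month(col("openTimestamp"))))
```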
127
Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns column1 and column2? A. joinExprs = col(“a.column1”) === col(“b.column1”) and col(“a.column2”) === col(“b.column2”) B. usingColumns = Seq(col(“column1”), col(“column2”)) C. All of these options can be used to perform an inner join with two key columns. D. joinExprs = storesDF(“column1”) === employeesDF(“column1”) and storesDF(“column2”) === employeesDF (“column2”) E. usingColumns = Seq(“column1”, “column2”)
B. DISCUSSION: Option B is the correct answer because when using `usingColumns`, you should provide a sequence of column names as strings, not column objects created using `col()`. Options A, D, and E are all valid ways to specify join columns, either using join expressions with qualified column names or using a sequence of column names. Note that option D, although perhaps less common, will work assuming `storesDF` and `employeesDF` refer to the dataframes aliased as "a" and "b" in the question.
128
Which of the following code blocks writes DataFrame storesDF to file path filePath as parquet overwriting any existing files in that location? A. storesDF.write(filePath, mode = “overwrite”) B. storesDF.write().mode(“overwrite”).parquet(filePath) C. storesDF.write.mode(“overwrite”).parquet(filePath) D. storesDF.write.option(“parquet”, “overwrite”).path(filePath) E. storesDF.write.mode(“overwrite”).path(filePath)
C. Explanation: Option C is correct: `storesDF.write.mode("overwrite").parquet(filePath)` accesses the DataFrameWriter through the `write` attribute, sets the save mode to "overwrite", and writes the data as parquet to the specified path. Option A is incorrect because `write` is an attribute that returns a DataFrameWriter; it cannot be called with a path and a mode. Option B is incorrect because `write` is an attribute, not a method, so `write()` is invalid. Option D is incorrect because `option()` takes a key and a value for reader/writer options, not the output format. Option E is incorrect because it never specifies the parquet format.
129
Which of the following code blocks reads a CSV at the file path `filePath` into a Data Frame with the specified schema `schema`? A. ```python spark.read().csv(filePath) ``` B. ```python spark.read().schema(“schema”).csv(filePath) ``` C. ```python spark.read.schema(schema).csv(filePath) ``` D. ```python spark.read.schema(“schema”).csv(filePath) ``` E. ```python spark.read().schema(schema).csv(filePath) ```
C **Explanation:** Option C is the correct way to specify the schema when reading a CSV file into a DataFrame using Spark. `spark.read.schema(schema).csv(filePath)` chains the `schema()` method (to define the schema) and the `csv()` method (to read the CSV file) using the provided `filePath`. * **Why other options are incorrect:** * **A:** `spark.read().csv(filePath)` writes `read` as a method call and omits the schema, so Spark would infer one instead of using `schema`. * **B & D:** both pass the literal string `"schema"` as the schema instead of the schema object stored in the variable `schema` (and B also writes `read()`). * **E:** `read` is an attribute, not a method, so `spark.read()` is invalid; the correct syntax is `spark.read.schema(schema).csv(filePath)`.
130
Which of the following sets of DataFrame methods will both return a new DataFrame only containing rows that meet a specified logical condition? A. drop(), where() B. filter(), select() C. filter(), where() D. select(), where() E. filter(), drop()
C
131
The code block shown below should return a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__) A. 1. drop 2. storesDF 3. col(“sqft”), col(“customerSatisfaction”) B. 1. storesDF 2. drop 3. sqft, customerSatisfaction C. 1. storesDF 2. drop 3. “sqft”, “customerSatisfaction” D. 1. storesDF 2. drop 3. col(sqft), col(customerSatisfaction) E. 1. drop 2. storesDF 3. col(sqft), col(customerSatisfaction)
C. 1. storesDF 2. drop 3. “sqft”, “customerSatisfaction” **Explanation:** The correct way to drop columns from a DataFrame is to use the `drop` method on the DataFrame object itself. Therefore, the DataFrame `storesDF` should come first, followed by `.drop()`. The `drop` method accepts a sequence of column names (strings) that need to be dropped. * **Why C is correct:** `storesDF.drop("sqft", "customerSatisfaction")` correctly specifies that the `drop` operation should be performed on the `storesDF` DataFrame, and that the columns "sqft" and "customerSatisfaction" should be dropped. * **Why other options are incorrect:** * **A and E:** Incorrect order. `drop.storesDF` is syntactically incorrect. * **B:** `sqft` and `customerSatisfaction` without quotes are interpreted as variable names, not column names (strings). * **D:** `col(sqft)` and `col(customerSatisfaction)` attempts to use a `col` function which is unnecessary when simply referring to columns by name in a `drop` operation. Also, `sqft` and `customerSatisfaction` without quotes are interpreted as variable names, not column names (strings).
132
Which of the following describes a partition? A. A partition is the amount of data that fits in a single executor. B. A partition is an automatically-sized segment of data that is used to create efficient logical plans. C. A partition is the amount of data that fits on a single worker node. D. A partition is a portion of a Spark application that is made up of similar jobs. E. A partition is a collection of rows of data that fit on a single machine in a cluster.
E. A partition is a collection of rows of data that fit on a single machine in a cluster. **Explanation:** The correct answer is **E**. In Spark, a partition is the fundamental unit of data division: a subset of the rows that resides on a single machine (node) within the cluster. Spark operations are performed in parallel on these partitions, enabling distributed computing. Here's why the other options are incorrect: * **A:** While executors process partitions, a partition isn't defined by the amount of data that fits in an executor; partition sizes vary. * **B:** Partitions are the data itself, not "automatically-sized segments" created to build logical plans. * **C:** A single worker node typically holds many partitions, so a partition is not the amount of data that fits on a node; option E captures the concept correctly. * **D:** Partitions are a data-division concept; they are not related to application structure or job organization.
133
The code block shown below should read a JSON at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__.__3__(__4__).format("csv").__5__(__6__) A. 1. spark 2. read() 3. schema 4. schema 5. json 6. filePath B. 1. spark 2. read() 3. json 4. filePath 5. format 6. schema C. 1. spark 2. read() 3. schema 4. schema 5. load 6. filePath D. 1. spark 2. read 3. schema 4. schema 5. load 6. filePath E. 1. spark 2. read 3. format 4. "json" 5. load 6. filePath
D. 1. spark 2. read 3. schema 4. schema 5. load 6. filePath DISCUSSION: The correct code to read a JSON file into a DataFrame with a specified schema is: `spark.read.schema(schema).format("csv").load(filePath)`. This means: * `spark.read` is the starting point for reading data. * `.schema(schema)` specifies the schema to be used for the DataFrame. * `.format("csv")` specifies the format of the data being read. Note: While the question states that the code block should read a JSON, the `.format` part indicates that it is intended to read a CSV. * `.load(filePath)` specifies the path to the file being read. Therefore, option D correctly fills in the blanks. Options A, B, C, and E are incorrect because they do not follow the correct syntax for reading a file with a specified schema and format in Spark. Specifically, they have syntax errors (`read()` instead of `read`) or incorrect ordering or method names (`json` instead of `load`, `format` instead of `schema`, `"json"` instead of `"csv"`).
134
The code block shown below should write DataFrame storesDF to file path filePath as parquet and partition by values in column division. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__.__2__(__3__).__4__(__5__) A. 1. write 2. partitionBy 3. “division” 4. path 5. filePath, node = parquet B. 1. write 2. partitionBy 3. “division” 4. parquet 5. filePath C. 1. write 2. partitionBy 3. col(“division”) 4. parquet 5. filePath D. 1. write() 2. partitionBy 3. col(“division”) 4. parquet 5. filePath E. 1. write 2. repartition 3. “division” 4. path 5. filePath, mode = “parquet”
B. 1. write 2. partitionBy 3. “division” 4. parquet 5. filePath **Explanation:** The completed code block is `storesDF.write.partitionBy("division").parquet(filePath)`: * `write`: accesses the DataFrameWriter (as an attribute, with no parentheses). * `partitionBy("division")`: specifies the column to partition the output by, passed as a string. * `parquet(filePath)`: writes the data in parquet format to the specified path. **Why other options are incorrect:** * A: there is no `path(filePath, node = parquet)` call on the writer; the format and path are specified together via `parquet(filePath)`. * C and D: `partitionBy` takes column names as strings, not `col("division")`, and D additionally writes `write()` as a method call, which is invalid. * E: `repartition` changes the number of partitions of the DataFrame; it does not partition the written output by column values, and `path(filePath, mode = "parquet")` is not a valid writer call.
135
Which of the following cluster configurations will induce the least network traffic during a shuffle operation? [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image15.png) Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores. A. This cannot be determined without knowing the number of partitions. B. Scenario 5 C. Scenario 1 D. Scenario 4 E. Scenario 6
C. Scenario 1 DISCUSSION: Scenario 1 involves a single node. During a shuffle operation, data is redistributed among the nodes based on the partitioning scheme. If all data remains on a single node, there will be minimal to no network traffic, as all operations occur locally. Options B, D, and E (Scenarios 5, 4, and 6) all involve multiple nodes, thus incurring network traffic during shuffling as data moves between nodes. Option A is incorrect because the scenario with the least network traffic can be determined. A single node configuration eliminates network traffic during shuffles.
136
The code block shown below should return a new DataFrame from DataFrame storesDF where column storeId is of the type string. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. ``` storesDF.__1__("storeId", __2__("storeId").__3__(__4__)) ``` A. 1. withColumn 2. col 3. cast 4. StringType() B. 1. withColumn 2. cast 3. col 4. StringType() C. 1. newColumn 2. col 3. cast 4. StringType() D. 1. withColumn 2. cast 3. col 4. StringType E. 1. withColumn 2. col 3. cast 4. StringType
A. **Explanation:** The completed code block is `storesDF.withColumn("storeId", col("storeId").cast(StringType()))`: * `withColumn`: creates a new column or replaces an existing one in the DataFrame. * `col`: refers to the existing column by name. * `cast`: a Column method that converts the column to a new data type. * `StringType()`: the target type, instantiated with parentheses. **Incorrect options:** * B: fills the blanks as `cast("storeId").col(StringType())`, which is invalid: `cast` is a Column method, not a standalone function, and `col` is not a method on a Column. * C: `newColumn` is not a valid DataFrame method. * D: has the same `cast`/`col` ordering problem as B and also omits the parentheses on `StringType`. * E: has the correct order but uses `StringType` without parentheses; the constructor must be called as `StringType()`.
137
Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column modality is the constant string "PHYSICAL"? Assume DataFrame storesDF is the only defined language variable. A. storesDF.withColumn("modality", lit(PHYSICAL)) B. storesDF.withColumn("modality", col("PHYSICAL")) C. storesDF.withColumn("modality", lit("PHYSICAL")) D. storesDF.withColumn("modality", StringType("PHYSICAL")) E. storesDF.withColumn("modality", "PHYSICAL")
C. Option C correctly uses the `lit` function to create a literal column with the string value "PHYSICAL". This ensures that every row in the new 'modality' column will have the value "PHYSICAL". Option A is incorrect because it assumes `PHYSICAL` is a defined variable, which it is not. Option B is incorrect because it attempts to create a column named "modality" using the values from a column literally named "PHYSICAL", but the question requires "PHYSICAL" to be a constant. Option D is incorrect because `StringType` is a datatype, not a value. It cannot be passed to withColumn in this manner. Option E is incorrect because it directly passes the string "PHYSICAL" without using `lit`, which is the proper way to create a literal column.
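A minimal sketch of option C:

```python
from pyspark.sql.functions import lit

# Every row receives the constant string "PHYSICAL"
storesDF = storesDF.withColumn("modality", lit("PHYSICAL"))
```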
138
Which of the following code blocks creates and registers a SQL UDF named "ASSESS_PERFORMANCE" using the Scala function assessPerformance() and applies it to column customerSatisfaction in table stores? A. ```scala spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) spark.sql("SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores") ``` B. ```scala spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) ``` C. ```scala spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores") ``` D. ```scala spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) storesDF.withColumn("result", assessPerformance(col("customerSatisfaction"))) ``` E. ```scala spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) storesDF.withColumn("result", ASSESS_PERFORMANCE(col("customerSatisfaction"))) ```
A
139
The code block shown below should return a new 12-partition DataFrame from DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__) A. 1. storesDF 2. coalesce 3. 4 B. 1. storesDF 2. coalesce 3. 4, "storeId" C. 1. storesDF 2. repartition 3. "storeId" D. 1. storesDF 2. repartition 3. 12 E. 1. storesDF 2. repartition 3. Nothing
D. The question asks for a new 12-partition DataFrame. `repartition(12)` will create the desired number of partitions. `coalesce` reduces the number of partitions. Passing `"storeId"` to `repartition` repartitions by that column, but does not guarantee 12 partitions. `Nothing` is not a valid argument. Therefore, option D is the only option that correctly fills in the blanks.
140
Which of the following code blocks returns a new Data Frame from DataFrame storesDF with no duplicate rows? A. storesDF.removeDuplicates() B. storesDF.getDistinct() C. storesDF.duplicates.drop() D. storesDF.duplicates() E. storesDF.dropDuplicates()
E
141
Which of the following types of processes induces a stage boundary? A. Shuffle B. Caching C. Executor failure D. Job delegation E. Application failure
A. Shuffle DISCUSSION: A shuffle operation in Spark induces a stage boundary because it requires data to be redistributed across executors, forming new partitions for downstream operations. This redistribution marks a clear separation in the execution flow, hence a stage boundary. Caching, executor failure, job delegation, and application failure do not inherently cause a stage boundary, although failures can lead to recomputation that might involve shuffles.
142
Which of the following identifies multiple narrow operations that are executed in sequence? A. Slot B. Job C. Stage D. Task E. Executor
C. Stage
143
Spark's execution/deployment mode determines where the driver and executors are physically located when a Spark application is run. Which of the following Spark execution/deployment modes does not exist? If they all exist, please indicate so with Response E. A. Client mode B. Cluster mode C. Standard mode D. Local mode E. All of these execution/deployment modes exist
C
144
Which of the following will cause a Spark job to fail? A. Never pulling any amount of data onto the driver node. B. Trying to cache data larger than an executor's memory. C. Data needing to spill from memory to disk. D. A failed worker node. E. A failed driver node.
E. **Explanation:** * **E is correct:** If the driver node fails, the entire Spark application fails because the driver is responsible for coordinating and managing the execution of the job. * **A is incorrect:** Spark jobs don't necessarily require pulling data onto the driver node. Many operations are performed in a distributed manner on the executors. * **B is incorrect:** While trying to cache data larger than an executor's memory can cause performance issues, Spark will attempt to spill data to disk or evict older cached data, so the job may still succeed if memory management is configured correctly. * **C is incorrect:** Data spilling to disk is a normal part of Spark's operation when memory is constrained. It can slow down performance, but it doesn't necessarily cause the job to fail. * **D is incorrect:** Spark is designed to be fault-tolerant. If a worker node fails, Spark will attempt to reschedule the tasks that were running on that node to other available worker nodes.
145
Which of the following best describes the similarities and differences between the MEMORY_ONLY storage level and the MEMORY_AND_DISK storage level? A. The MEMORY_ONLY storage level will store as much data as possible in memory and will store any data that does not fit in memory on disk and read it as it's called. The MEMORY_AND_DISK storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it's called. B. The MEMORY_ONLY storage level will store as much data as possible in memory on two cluster nodes and will recompute any data that does not fit in memory as it's called. The MEMORY_AND_DISK storage level will store as much data as possible in memory on two cluster nodes and will store any data that does not fit in memory on disk and read it as it's called. C. The MEMORY_ONLY storage level will store as much data as possible in memory on two cluster nodes and will store any data that does not fit in memory on disk and read it as it's called. The MEMORY_AND_DISK storage level will store as much data as possible in memory on two cluster nodes and will recompute any data that does not fit in memory as it's called. D. The MEMORY_ONLY storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it's called. The MEMORY_AND_DISK storage level will store as much data as possible in memory and will store any data that does not fit in memory on disk and read it as it's called. E. The MEMORY_ONLY storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it's called. The MEMORY_AND_DISK storage level will store half of the data in memory and store half of the memory on disk. This provides quick preview and better logical plan design.
D. The MEMORY_ONLY storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it's called. The MEMORY_AND_DISK storage level will store as much data as possible in memory and will store any data that does not fit in memory on disk and read it as it's called. **Explanation:** * **MEMORY\_ONLY:** This storage level attempts to store all data in memory. If the data exceeds the available memory, the excess data is recomputed when needed. * **MEMORY\_AND\_DISK:** This storage level also tries to store as much data as possible in memory. However, when the data exceeds available memory, the excess data is stored on disk and read from disk when needed, instead of being recomputed. **Why other options are incorrect:** * **A:** Incorrect because it states MEMORY_ONLY stores overflow on disk and MEMORY_AND_DISK recomputes overflow, which reverses the actual behavior. * **B & C:** Storing data "on two cluster nodes" describes the replicated levels (e.g., MEMORY_ONLY_2), not MEMORY_ONLY or MEMORY_AND_DISK. * **E:** The claim about storing half the data in memory and half on disk, along with "quick preview and better logical plan design", does not describe MEMORY_AND_DISK; it stores as much as it can in memory and spills the remainder to disk.
146
Which of the following code blocks returns a 15 percent sample of rows from DataFrame storesDF without replacement? A. storesDF.sample(True, fraction = 0.15) B. storesDF.sample(fraction = 0.15) C. storesDF.sampleBy(fraction = 0.15) D. storesDF.sample(fraction = 0.10) E. storesDF.sample()
B
147
Which of the following code blocks uses SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF? A. ``` storesDF.createOrReplaceTempView() spark.sql("SELECT storeId, managerName FROM stores") ``` B. ``` storesDF.query(”SELECT storeid, managerName from stores") ``` C. ``` spark.createOrReplaceTempView("storesDF") storesDF.sql("SELECT storeId, managerName from stores") ``` D. ``` storesDF.createOrReplaceTempView("stores") spark.sql("SELECT storeId, managerName FROM stores") ``` E. ``` storesDF.createOrReplaceTempView("stores") storesDF.query("SELECT storeId, managerName FROM stores") ```
D
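A small sketch of option D end to end; the sample data is invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["storeId", "managerName"]
)

# Register the DataFrame under the table name "stores", then query it with SQL.
storesDF.createOrReplaceTempView("stores")
resultDF = spark.sql("SELECT storeId, managerName FROM stores")
resultDF.show()
```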
148
The code block shown below should adjust the number of partitions used in wide transformations like join() to 32. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__(__2__, __3__) A. 1. spark.conf.get 2. "spark.sql.shuffle.partitions" 3. "32" B. 1. spark.conf.set 2. "spark.default.parallelism" 3. 32 C. 1. spark.conf.text 2. "spark.default.parallelism" 3. "32" D. 1. spark.conf.set 2. "spark.default.parallelism" 3. "32" E. 1. spark.conf.set 2. "spark.sql.shuffle.partitions" 3. "32"
E. 1. spark.conf.set 2. "spark.sql.shuffle.partitions" 3. "32" DISCUSSION: The correct way to set the number of partitions for wide transformations like `join()` is to use `spark.conf.set("spark.sql.shuffle.partitions", "32")`. This configures the Spark SQL engine to use 32 partitions during shuffle operations. Option A is incorrect because `spark.conf.get` is used to retrieve a configuration value, not set one. Option B is incorrect because it attempts to set `spark.default.parallelism` which primarily affects RDD operations and not the shuffle partitions for Spark SQL. Also, it uses an integer instead of a string for the value. Option C is incorrect because `spark.conf.text` is not a valid Spark configuration method. Also, it attempts to set `spark.default.parallelism` which primarily affects RDD operations and not the shuffle partitions for Spark SQL. Option D is incorrect because it attempts to set `spark.default.parallelism` which primarily affects RDD operations and not the shuffle partitions for Spark SQL.
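A minimal sketch of the keyed answer; the value is passed as a string to match the option text (an integer is also accepted by `spark.conf.set`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wide transformations (join, groupBy, ...) will now shuffle into 32 partitions.
spark.conf.set("spark.sql.shuffle.partitions", "32")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '32'
```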
149
Which of the following code blocks returns a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF? A. storesDF.crossJoin(employeesDF) B. storesDF.join(employeesDF, "storeId", "cross") C. crossJoin(storesDF, employeesDF) D. join(storesDF, employeesDF, "cross") E. storesDF.join(employeesDF, "cross")
A. storesDF.crossJoin(employeesDF) The `crossJoin` method is the correct way to perform a cross join in Spark. Options B, D, and E use the `join` method incorrectly for a cross join. Option C has an incorrect syntax.
150
Which of the following code blocks returns a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF? A. storesDF.unionByName(acquiredStoresDF) B. unionAll(storesDF, acquiredStoresDF) C. union(storesDF, acquiredStoresDF) D. concat(storesDF, acquiredStoresDF) E. storesDF.union(acquiredStoresDF)
E. storesDF.union(acquiredStoresDF)
151
In what order should the below lines of code be run in order to read a parquet at the file path filePath into a DataFrame? Lines of code: 1. storesDF 2. .load(filePath, source = "parquet") 3. .read \ 4. spark \ 5. .read() \ 6. .parquet(filePath) A. 1, 5, 2 B. 4, 5, 2 C. 4, 3, 6 D. 4, 5, 6 E. 4, 3, 2
C
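A runnable sketch of the keyed order, `spark.read.parquet(filePath)`; the write step and the path are invented here only so the read has something to load:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
filePath = "/tmp/stores_parquet_demo"  # hypothetical path for the example

# Create a small parquet file to read back (not part of the original question).
spark.range(10).write.mode("overwrite").parquet(filePath)

# The keyed answer: spark -> .read -> .parquet(filePath)
storesDF = spark.read.parquet(filePath)
storesDF.show()
```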
152
Which of the following describes slots? A. Slots are the most coarse level of execution in the Spark execution hierarchy. B. Slots are resource threads that can be used for parallelization within a Spark application. C. Slots are resources that are used to run multiple Spark applications at once on a single cluster. D. Slots are the most granular level of execution in the Spark execution hierarchy. E. Slots are unique segments of data from a DataFrame that are split up by row.
B
153
Which of the following operations is least likely to result in a shuffle? A. DataFrame.join() B. DataFrame.filter() C. DataFrame.orderBy() D. DataFrame.distinct() E. DataFrame.intersect()
B
154
Which of the following cluster configurations is least likely to experience delays due to garbage collection of a large DataFrame? [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image17.png) Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores. A. Scenario #4 B. Scenario #1 C. Scenario #5 D. More information is needed to determine an answer. E. Scenario #6
E. Scenario #6 DISCUSSION: Scenario #6 spreads the same total resources across many small executors (the largest number of executors, 100, with only 2 cores each), so each executor's JVM heap is small. Small heaps are quick to garbage-collect, which keeps GC pauses on any single executor short. The other scenarios concentrate the cached DataFrame in fewer, larger executors with large heaps, and collecting a large heap produces the long pauses that delay a job. Scenario #6 is therefore the configuration least likely to experience garbage-collection delays.
155
The code block shown below should return a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image18.png) Code block: storesDF.__1__(__2__, __3__(__4__(__5__))) A. 1. newColumn 2. "productCategories" 3. col 4. split 5. "productCategories" B. 1. withColumn 2. "productCategory" 3. split 4. col 5. "productCategories" C. 1. withColumn 2. "productCategory" 3. explode 4. col 5. "productCategories" D. 1. newColumn 2. "productCategory" 3. explode 4. col 5. "productCategories" E. 1. withColumn 2. "productCategories" 3. explode 4. col 5. "productCategories"
E. 1. withColumn 2. "productCategories" 3. explode 4. col 5. "productCategories" DISCUSSION: The correct answer is E. Here's why: * `withColumn` adds or replaces a column in a DataFrame, filling blank 1. * The question asks for the existing "productCategories" column to be transformed in place, so "productCategories" is the correct name for blank 2. * `explode` turns each element of an array (or map) into its own row, producing the many additional rows the question describes. This fills blank 3. * `col` references an existing column, filling blank 4. * `col("productCategories")` is the column to explode, filling blank 5. Options A and D are incorrect because `newColumn` is not a valid DataFrame method. Option B is incorrect because `split` returns an array inside a single row rather than producing new rows, and it writes the result to a new "productCategory" column. Option C uses `explode` but writes the result to a new "productCategory" column, leaving the original "productCategories" column as an array instead of one word per row.
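A small sketch of the keyed answer, with an invented productCategories array column standing in for the sample shown in the image:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, ["produce", "bakery"]), (2, ["dairy"])],
    ["storeId", "productCategories"],
)

# explode() emits one output row per array element, replacing the array column
# with a single-value column of the same name.
exploded = storesDF.withColumn("productCategories", explode(col("productCategories")))
exploded.show()  # 3 rows: produce, bakery, dairy
```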
156
The code block shown below should return a new DataFrame where rows in DataFrame storesDF containing at least one missing value have been dropped. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: StoresDF.__1__.__2__(__3__ = __4__) A. 1. na 2. drop 3. subset 4. "any" B. 1. na 2. drop 3. how 4. "all" C. 1. na 2. drop 3. subset 4. "all" D. 1. na 2. drop 3. how 4. "any" E. 1. drop 2. na 3. how 4. "any"
D. 1. na 2. drop 3. how 4. "any" Explanation: The completed code block is `storesDF.na.drop(how="any")`, which drops every row of the Spark DataFrame that contains at least one missing value. * `na` accesses the DataFrame's missing-value (DataFrameNaFunctions) operations. * `drop` removes rows containing nulls. * `how="any"` drops a row if any of its values are missing; `how="all"` would only drop rows in which every value is missing. * `subset` restricts the check to particular columns, which is not appropriate here because all columns should be considered. Therefore, option D is the correct answer.
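A minimal sketch of the keyed answer on an invented DataFrame containing nulls, contrasting how="any" with how="all":

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, 10000), (2, None), (None, None)],
    ["storeId", "sqft"],
)

# how="any": drop a row if at least one column is null.
storesDF.na.drop(how="any").show()   # keeps only (1, 10000)

# how="all": drop a row only if every column is null.
storesDF.na.drop(how="all").show()   # keeps (1, 10000) and (2, None)
```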
157
The code block shown below should return a 25 percent sample of rows from DataFrame storesDF with reproducible results. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: StoresDF.__1__(__2__ = __3__, __4__ = __5__) A. 1. sample 2. fraction 3. 0.25 4. seed 5. True B. 1. sample 2. withReplacement 3. True 4. seed 5. True C. 1. sample 2. fraction 3. 0.25 4. seed 5. 1234 D. 1. sample 2. fraction 3. 0.15 4. seed 5. 1234 E. 1. sample 2. withReplacement 3. True 4. seed 5. 1234
C. 1. sample 2. fraction 3. 0.25 4. seed 5. 1234 **Explanation:** The `sample` method is used to get a random sample of rows from a DataFrame. To get a 25% sample, we use the `fraction` parameter and set it to 0.25. To ensure reproducible results, we set the `seed` parameter to a specific value (e.g., 1234). * **Why C is correct:** This option correctly uses the `sample` method with `fraction=0.25` to get 25% of the rows and `seed=1234` for reproducible results. * **Why A is incorrect:** `seed=True` does not provide a specific seed for reproducible sampling. * **Why B is incorrect:** `withReplacement=True` would allow the same row to be sampled multiple times, which is not the intention of getting a 25% sample. Also, `seed=True` is not a valid seed. * **Why D is incorrect:** `fraction=0.15` would return a 15% sample, not a 25% sample. * **Why E is incorrect:** `withReplacement=True` would allow the same row to be sampled multiple times, which is not the intention of getting a 25% sample.
158
Which of the following code blocks creates a Python UDF `assessPerformanceUDF()` using the integer-returning Python function `assessPerformance()` and applies it to Column `customerSatisfaction` in DataFrame `storesDF`? A. ```python assessPerformanceUDF = udf(assessPerformance, IntegerType) storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction"))) ``` B. ```python assessPerformanceUDF = udf(assessPerformance, IntegerType()) storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction"))) ``` C. ```python assessPerformanceUDF - udf(assessPerformance) storesDF.withColumn("result", assessPerformance(col(“customerSatisfaction"))) ``` D. ```python assessPerformanceUDF = udf(assessPerformance) storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction"))) ``` E. ```python assessPerformanceUDF = udf(assessPerformance, IntegerType()) storesDF.withColumn("result", assessPerformance(col("customerSatisfaction"))) ```
B. **Explanation:** * **Option B is correct:** It defines the UDF with `udf(assessPerformance, IntegerType())` and then applies it to the `customerSatisfaction` column using `storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))`. Passing `IntegerType()` types the UDF as returning an integer. * **Option A is incorrect:** `IntegerType` must be instantiated as `IntegerType()`, not passed as the class itself. * **Option C is incorrect:** It uses `-` instead of `=` when creating the UDF and then calls the underlying Python function directly instead of the UDF. * **Option D is incorrect:** It omits the return type, so the UDF defaults to returning `StringType`; the result column would hold strings rather than the required integers. * **Option E is incorrect:** It defines the UDF correctly but then calls the underlying Python function directly instead of the UDF.
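A sketch of option B with a made-up assessPerformance() function, since the real one is not shown in the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, 82), (2, 45)], ["storeId", "customerSatisfaction"])

def assessPerformance(satisfaction):
    # Hypothetical integer-returning logic, invented for the example.
    return 1 if satisfaction >= 50 else 0

# Wrap the Python function as a UDF with an explicit IntegerType() return type,
# then apply the UDF (not the raw function) to the column.
assessPerformanceUDF = udf(assessPerformance, IntegerType())
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction"))).show()
```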
159
The code block shown below should return a DataFrame containing a column openDateString, a string representation of Java’s SimpleDateFormat. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1st, 1970. An example of Java’s SimpleDateFormat is "Sunday, Dec 4, 2008 1:05 pm". A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image20.png) Code block: storesDF.__1__("openDateString", __2__(__3__, __4__)) A. 1. withColumn 2. from_unixtime 3. col("openDate") 4. “EEEE, MMM d, yyyy h:mm a" B. 1. withColumn 2. date_format 3. col("openDate") 4. "EEEE, mmm d, yyyy h:mm a" C. 1. newColumn 2. from_unixtinie 3. "openDate" 4. "EEEE, MMM d, yyyy h:mm a" D. 1. withColumn 2. from_unixtlme 3. col("openDate") 4. SimpleDateFormat E. 1. withColumn 2. from_unixtime 3. col("openDate") 4. "dw, MMM d, yyyy h:mm a"
A. 1. withColumn 2. from_unixtime 3. col("openDate") 4. "EEEE, MMM d, yyyy h:mm a" DISCUSSION: The correct answer is A. `withColumn` adds the new column to the DataFrame. `from_unixtime` converts the UNIX epoch time (seconds since January 1, 1970) into a formatted string, and `col("openDate")` identifies the column holding the epoch time. In the pattern "EEEE, MMM d, yyyy h:mm a", EEEE is the full day-of-week name, MMM is the abbreviated month name (e.g., "Dec"), d is the day of the month, yyyy is the year, h is the hour (1-12), mm is the minute, and a is the AM/PM indicator. Option B is incorrect because `date_format` expects a date or timestamp column rather than an epoch integer, and lowercase `mmm` is not a valid month specifier (lowercase m denotes minutes). Option C is incorrect because `newColumn` is not a valid DataFrame method and `from_unixtinie` is misspelled. Option D is incorrect because `from_unixtlme` is misspelled and the format must be supplied as a pattern string, not as `SimpleDateFormat`. Option E is incorrect because `dw` is not a valid specifier for the day of the week; it should be `EEEE`.
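A sketch of the keyed answer with an invented epoch value; the pattern mirrors the "Sunday, Dec 4, 2008 1:05 pm" example from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, 1228395900)], ["storeId", "openDate"])

# from_unixtime converts seconds-since-epoch into a formatted string.
# EEEE = day-of-week name, MMM = abbreviated month, h:mm a = 12-hour clock with am/pm.
result = storesDF.withColumn(
    "openDateString", from_unixtime(col("openDate"), "EEEE, MMM d, yyyy h:mm a")
)
result.show(truncate=False)
```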
160
Which of the following operations can be used to perform a left join on two DataFrames? A. DataFrame.join() B. DataFrame.crossJoin() C. DataFrame.merge() D. DataFrame.leftJoin() E. Standalone join() function
A. DataFrame.join()
161
The code block shown below should return a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__) A. 1. DataFrame 2. union 3. storesDF, acquiredStoresDF B. 1. DataFrame 2. concat 3. storesDF, acqulredStoresDF C. 1. storesDF 2. union 3. acquiredStoresDF D. 1. storesDF 2. unionByName 3. acquiredStoresDF E. 1. DataFrame 2. unionAll 3. storesDF, acquiredStoresDF
C. 1. storesDF 2. union 3. acquiredStoresDF DISCUSSION: The question asks for a position-wise union between two DataFrames. Option C, `storesDF.union(acquiredStoresDF)`, correctly performs this operation. `storesDF` is the DataFrame on which the `union` operation is called, and `acquiredStoresDF` is the DataFrame to be unioned with `storesDF`. The `union` method performs a simple union of the rows of the two DataFrames, appending the rows of `acquiredStoresDF` to `storesDF`. Option A is incorrect because `DataFrame.union(storesDF, acquiredStoresDF)` is not the correct syntax. The `union` method is called on a DataFrame object. Option B is incorrect because `concat` is not a DataFrame method and `acqulredStoresDF` is a typo. Option D, `storesDF.unionByName(acquiredStoresDF)`, would perform a union based on column names, not position. Option E is incorrect because `DataFrame.unionAll(storesDF, acquiredStoresDF)` is not the correct syntax. Also, `unionAll` is deprecated, and `union` should be used instead.
162
The code block shown below should return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId and column employeeId. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. ``` storesDF.join(employeesDF, [__1__ == __2__, __3__ == __4__]) ``` A. 1. storesDF.storeId 2. storesDF.employeeId 3. employeesDF.storeId 4. employeesDF.employeeId B. 1. col("storeId") 2.col("storeId") 3.col("employeeId") 4. col("employeeId") C. 1. storeId 2. storeId 3. employeeId 4. employeeId D. 1. col("storeId") 2. col("employeeId") 3. col("employeeId") 4. col(''storeId") E. 1. storesDF.storeId 2. employeesDF.storeId 3. storesDF.employeeId 4. employeesDF.employeeId
E. 1. storesDF.storeId 2. employeesDF.storeId 3. storesDF.employeeId 4. employeesDF.employeeId **Explanation:** The join operation requires specifying the columns to join on from each DataFrame. The goal is to join `storesDF` on `storeId` with `employeesDF` on `storeId`, and `storesDF` on `employeeId` with `employeesDF` on `employeeId`. Therefore, the correct comparisons should be `storesDF.storeId == employeesDF.storeId` and `storesDF.employeeId == employeesDF.employeeId`. * **Why E is correct:** This option correctly specifies the column names from each DataFrame to be compared for the join. * **Why other options are incorrect:** * Options A, B, C, and D all have incorrect column comparisons, which would not result in the desired join. For instance, Option A incorrectly tries to compare `storesDF.storeId` with `storesDF.employeeId` and `employeesDF.storeId` with `employeesDF.employeeId`, which are not the columns intended for joining. Option B incorrectly uses the `col()` function but compares storeId to itself and employeeId to itself which is not a join between the two dataframes. Option C is missing the dataframe names. Option D incorrectly compares storeId with employeeId and vice versa.
163
Which of the following code blocks writes DataFrame storesDF to file path filePath as text files overwriting any existing files in that location? A. storesDF.write(filePath, mode = "overwrite", source = "text") B. storesDF.write.mode("overwrite").text(filePath) C. storesDF.write.mode("overwrite").path(filePath) D. storesDF.write.option("text", "overwrite").path(filePath) E. storesDF.write().mode("overwrite").text(filePath)
B. storesDF.write.mode("overwrite").text(filePath) Explanation: Option B correctly chains the `write` attribute with `.mode("overwrite")` to enable overwriting and `.text(filePath)` to write the data as text files at the given path. Option A is incorrect because `write` does not accept arguments in this manner. Option C is incorrect because it never specifies the text format (and `path()` is not a DataFrameWriter method). Option D is incorrect because `.option()` is not how the output format is specified. Option E is incorrect because `write` is a property of the DataFrame, not a method, so `storesDF.write()` raises an error.
164
The code block shown below contains an error. The code block is intended to read JSON at the file path filePath into a DataFrame with the specified schema schema. Identify the error. Code block: ``` spark.read.schema("schema").format("json").load(filePath) ``` A. The schema operation from read takes a schema object rather than a string — the argument should be schema. B. There is no load() operation for DataFrameReader — it should be replaced with the json() operation. C. The spark.read operation should be followed by parentheses in order to return a DataFrameReader object. D. There is no read property of spark — spark should be replaced with DataFrame. E. The schema operation from read takes a column rather than a string — the argument should be col("schema").
A. The schema operation from read takes a schema object rather than a string — the argument should be schema.
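A sketch showing the fix: build a StructType and pass the object itself, not a string, to schema(). The JSON file is written first only so the example runs end to end; the path is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
filePath = "/tmp/stores_json_demo"  # hypothetical path

# Write a tiny JSON file so there is something to read back.
spark.createDataFrame([(1, "Alice")], ["storeId", "managerName"]) \
    .write.mode("overwrite").json(filePath)

schema = StructType([
    StructField("storeId", IntegerType(), True),
    StructField("managerName", StringType(), True),
])

# Corrected version of the code block: schema(schema), not schema("schema").
storesDF = spark.read.schema(schema).format("json").load(filePath)
storesDF.printSchema()
```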
165
Which of the following describes executors? A. Executors are the communication pathways from the driver node to the worker nodes. B. Executors are the most granular level of execution in the Spark execution hierarchy. C. Executors always have a one-to-one relationship with worker nodes. D. Executors are synonymous with worker nodes. E. Executors are processing engine instances for performing data computations which run on a worker node.
E
166
The code block shown below should return a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 AND the value in column customerSatisfaction is greater than or equal to 30. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__ __3__ __4__) A. 1. filter 2. (col("sqft") <= 25000) 3. & 4. (col("customerSatisfaction") >= 30) B. 1. filter 2. (col("sqft") <= 25000 3. & 4. col("customerSatisfaction") >= 30 C. 1. filter 2. (col("sqft") <= 25000) 3. and 4. (col("customerSatisfaction") >= 30) D. 1. drop 2. (col(sqft) <= 25000) 3. & 4. (col(customerSatisfaction) >= 30) E. 1. filter 2. col("sqft") <= 25000 3. and 4. col("customerSatisfaction") >= 30
A. 1. filter 2. (col("sqft") <= 25000) 3. & 4. (col("customerSatisfaction") >= 30) **Explanation:** The `filter` function selects rows based on a boolean condition. In PySpark, column expressions are created with `col("column_name")`, the logical AND operator for Column expressions is `&`, and each comparison must be wrapped in parentheses so the operators evaluate in the intended order. Option A is correct because it uses `filter` with correctly constructed `col()` references, the `&` operator, and parenthesized conditions. Option B is incorrect because the first condition `(col("sqft") <= 25000` is missing its closing parenthesis and the second condition is not parenthesized. Option C is incorrect because it uses the Python `and` keyword instead of the `&` operator. Option D is incorrect because `drop` removes columns rather than filtering rows and does not accept a filtering condition; it also omits the quotes around the sqft and customerSatisfaction column names. Option E is incorrect because it uses `and` instead of `&` and does not parenthesize the conditions.
167
Which of the following operations can be used to rename and replace an existing column in a DataFrame? A. DataFrame.renamedColumn() B. DataFrame.withColumnRenamed() C. DataFrame.wlthColumn() D. col() E. DataFrame.newColumn()
B
168
The code block shown below should print the schema of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__) A. 1. storesDF 2. schema 3. Nothing B. 1. storesDF 2. str 3. schema C. 1. storesDF 2. printSchema 3. True D. 1. storesDF 2. printSchema 3. Nothing E. 1. storesDF 2. printSchema 3. "all"
D. 1. storesDF 2. printSchema 3. Nothing Explanation: The `printSchema()` method is the correct way to display the schema of a DataFrame in PySpark. The `printSchema()` method does not require any arguments. Therefore, 'Nothing' is appropriate for blank 3. Option A is incorrect because `.schema` would return a schema object, but wouldn't print it. Option B is incorrect because `.str` is not a valid method for printing the schema. Option C is incorrect because the `printSchema()` method does not take a boolean argument. Option E is incorrect because the `printSchema()` method does not take a string argument.
169
The code block shown below contains an error. The code block is intended to create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function assessPerformance() and apply it to column customerSatistfaction in table stores. Identify the error. ``` spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores") ``` A. There is no sql() operation — the DataFrame API must be used to apply the UDF assessPerformance(). B. The order of the arguments to spark.udf.register() should be reversed. C. The customerSatisfaction column cannot be called twice inside the SQL statement. D. Registered UDFs cannot be applied inside of a SQL statement. E. The wrong SQL function is used to compute column result — it should be ASSESS_PERFORMANCE instead of assessPerformance.
E. The wrong SQL function is used to compute column result — it should be ASSESS_PERFORMANCE instead of assessPerformance. DISCUSSION: The error lies in the SQL statement where the UDF is invoked. When a UDF is registered using `spark.udf.register()`, it is registered with a specific name (in this case, "ASSESS_PERFORMANCE"). This registered name is what should be used within SQL queries to call the UDF, not the original Python function name (`assessPerformance`). Option A is incorrect because `spark.sql()` can be used with UDFs. Option B is incorrect because the order of arguments in `spark.udf.register()` is correct. Option C is incorrect because a column can be called multiple times in a SQL statement. Option D is incorrect because registered UDFs *can* be applied inside SQL statements.
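A corrected sketch: the UDF is invoked in SQL by its registered name. assessPerformance() itself is not given in the question, so a stand-in is defined here:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([(1, 82), (2, 45)], ["storeId", "customerSatisfaction"])
storesDF.createOrReplaceTempView("stores")

def assessPerformance(satisfaction):
    # Hypothetical logic, invented for the example.
    return 1 if satisfaction >= 50 else 0

# Register under the SQL name "ASSESS_PERFORMANCE" ...
spark.udf.register("ASSESS_PERFORMANCE", assessPerformance, IntegerType())

# ... and call it by that registered name inside the SQL statement.
spark.sql(
    "SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores"
).show()
```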
170
Which of the following code blocks attempts to cache the partitions of DataFrame storesDF only in Spark’s memory? A. storesDF.cache(StorageLevel.MEMORY_ONLY).count() B. storesDF.persist().count() C. storesDF.cache().count() D. storesDF.persist(StorageLevel.MEMORY_ONLY).count() E. storesDF.persist("MEMORY_ONLY").count()
D. storesDF.persist(StorageLevel.MEMORY_ONLY).count() DISCUSSION: Option D is the correct answer. The `persist()` method with `StorageLevel.MEMORY_ONLY` explicitly specifies that the DataFrame should be cached only in memory. The `.count()` action triggers the caching. Option A is incorrect because `cache()` takes no arguments, so a storage level cannot be passed to it; `cache()` always uses the default `MEMORY_AND_DISK` level. Option B is incorrect because `.persist()` without arguments defaults to `MEMORY_AND_DISK`. Option C is incorrect because `.cache()` without arguments defaults to `MEMORY_AND_DISK`. Option E is incorrect because the `persist()` method expects a `StorageLevel` object, not a string.
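A minimal sketch of the keyed answer on a stand-in DataFrame; count() is simply an action that materializes the cache:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
storesDF = spark.range(1000).withColumnRenamed("id", "storeId")  # stand-in DataFrame

# persist() with an explicit StorageLevel keeps partitions in memory only;
# an action such as count() triggers the actual caching.
storesDF.persist(StorageLevel.MEMORY_ONLY)
storesDF.count()
print(storesDF.storageLevel)  # shows the memory-only storage level
```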
171
Which of the following operations will always return a new DataFrame with updated partitions from DataFrame storesDF by inducing a shuffle? A. storesDF.coalesce() B. storesDF.rdd.getNumPartitions() C. storesDF.repartition() D. storesDF.union() E. storesDF.intersect()
C. storesDF.repartition() **Explanation:** * **C. storesDF.repartition():** This operation is specifically designed to change the number of partitions in a DataFrame. It always induces a shuffle to redistribute the data evenly across the new partitions. * **A. storesDF.coalesce():** This operation is used to decrease the number of partitions. While it can avoid a full shuffle if you are only reducing partitions, it doesn't *always* induce a shuffle. * **B. storesDF.rdd.getNumPartitions():** This is not a DataFrame operation; it only returns the number of partitions in the RDD and doesn't modify the DataFrame. * **D. storesDF.union():** This operation combines two DataFrames. While it might result in a new DataFrame, it doesn't necessarily induce a shuffle for repartitioning. * **E. storesDF.intersect():** This operation returns the common rows between two DataFrames. Similar to union, it doesn't inherently trigger a shuffle for repartitioning.
172
Which of the following code blocks returns a DataFrame containing a column month, an integer representation of the month from column openDate from DataFrame storesDF? Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1 st, 1970. A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image20.png) A. ``` storesDF.withColumn("month", getMonth(col("openDate"))) ``` B. ``` storesDF.withColumn("month", substr(col("openDate"), 4, 2)) ``` C. ``` (storesDF.withColumn("openDateFormat", col("openDate").cast("Date")) .withColumn("month", month(col("openDateFormat")))) ``` D. ``` (storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp")) .withColumn("month", month(col("openTimestamp")))) ``` E. ``` storesDF.withColumn("month", month(col("openDate"))) ```
D. ``` (storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp")) .withColumn("month", month(col("openTimestamp")))) ``` DISCUSSION: Option D is correct because the `month` function requires a Date or Timestamp column as input. The `openDate` column is in UNIX epoch format (integer representing seconds since 1970-01-01), so it needs to be cast to a Timestamp first. Option A is incorrect because there is no standard `getMonth` function in Spark SQL. Option B is incorrect because `substr` would treat the `openDate` as a string, which is not the correct way to extract the month from a UNIX epoch timestamp. Option C is incorrect because casting directly to "Date" might not handle the UNIX epoch format correctly, and even if it did, the `month` function expects a Timestamp or Date type. Option E is incorrect because `month` function expects a Date or Timestamp type column, not an integer representing the UNIX epoch.
173
Which of the following operations calculates the simple average of a group of values, like a column? A. simpleAvg() B. mean() C. agg() D. average() E. approxMean()
B
174
Which of the following code blocks returns a DataFrame where rows in DataFrame storesDF containing missing values in every column have been dropped? A. storesDF.na.drop() B. storesDF.dropna() C. storesDF.na.drop("all", subset = "sqft") D. storesDF.na.drop("all") E. storesDF.nadrop("all")
D. storesDF.na.drop("all") Explanation: The function `na.drop()` is used to drop rows with missing values in a DataFrame. The argument "all" specifies that rows will only be dropped if all columns in that row contain missing values. If "all" is not specified, the default behavior is "any", which means rows with missing values in any column will be dropped. Option A is incorrect because it drops rows with any NA values. Option B is incorrect because `dropna()` also defaults to how="any", so it drops rows containing any missing value rather than only rows that are missing every value. Option C is incorrect because it only considers the 'sqft' column for missing values when using the "all" argument. Option E is incorrect because the function name `nadrop` is invalid. The correct function is `na.drop`.
175
Which of the following Spark properties is used to configure whether DataFrames found to be below a certain size threshold at runtime will be automatically broadcasted? A. spark.sql.broadcastTimeout B. spark.sql.autoBroadcastJoinThreshold C. spark.sql.shuffle.partitions D. spark.sql.inMemoryColumnarStorage.batchSize E. spark.sql.adaptive.localShuffleReader.enabled
B. spark.sql.autoBroadcastJoinThreshold **Explanation:** The property `spark.sql.autoBroadcastJoinThreshold` is used to configure the threshold (in bytes) below which Spark will automatically broadcast a DataFrame to all executor nodes when performing a join operation. This can significantly improve performance for smaller DataFrames as it avoids shuffling data. * **A. spark.sql.broadcastTimeout:** This property defines the timeout for broadcast waits in seconds. It's related to broadcasting but doesn't control the automatic broadcasting behavior based on size. * **C. spark.sql.shuffle.partitions:** This property controls the number of partitions to use when shuffling data. It's not directly related to broadcasting. * **D. spark.sql.inMemoryColumnarStorage.batchSize:** This property configures the batch size for in-memory columnar storage, which is related to caching data in memory but not broadcasting. * **E. spark.sql.adaptive.localShuffleReader.enabled:** This property enables or disables the local shuffle reader in adaptive query execution, which is a different optimization technique than broadcasting.
176
Which of the following describes why garbage collection in Spark is important? A. Spark logical results will be incorrect if inaccurate data is not collected and removed from the Spark job. B. Spark jobs will fail or run slowly if inaccurate data is not collected and removed from the Spark job. C. Spark jobs will fail or run slowly if memory is not available for new objects to be created. D. Spark jobs will produce inaccurate results if there are too many different transformations called before a single action. E. Spark jobs will produce inaccurate results if memory is not available for new tasks to run and complete.
C
177
Which of the following statements describing a difference between transformations and actions is incorrect? A. There are wide and narrow transformations but there are not wide and narrow actions. B. Transformations do not trigger execution while actions do trigger execution. C. Transformations work on DataFrames/Datasets while actions are reserved for native language objects. D. Some actions can be used to return data objects in a format native to the programming language being used to access the Spark API while transformations do not provide this ability. E. Transformations are typically logic operations while actions are typically focused on returning results.
C. Transformations work on DataFrames/Datasets while actions are reserved for native language objects. **Explanation:** The incorrect statement is C. Actions also work on DataFrames/Datasets, not just transformations. Actions trigger the execution of the transformations performed on these DataFrames/Datasets. * **A is correct:** Transformations can be wide (shuffle data across partitions) or narrow (operate within a partition), while actions don't have this wide/narrow distinction. * **B is correct:** Transformations are lazy and only define the operations. Actions trigger the actual computation. * **D is correct:** Actions like `collect()` or `take()` return data to the driver program in a format (e.g., a Python list) that is native to the language being used. Transformations don't directly return data in this way. * **E is correct:** Transformations specify the logic (e.g., filter, map), while actions are about getting results (e.g., count, save).
178
The code block shown below contains an error. The code block is intended to return a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF. Identify the error and how to fix it. A sample of storesDF is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image22.png) `storesDF.withColumn("productCategories", split(col("productCategories")))` A. The `split()` operation does not accomplish the requested task in the way that it is used. It should be used provided an alias. B. The `split()` operation does not accomplish the requested task. The `broadcast()` operation should be used instead. C. The `split()` operation does not accomplish the requested task in the way that it is used. It should be used as a column object method instead. D. The `split()` operation does not accomplish the requested task. The `explode()` operation should be used instead. E. The `split()` operation does not accomplish the requested task. The `array_distinct()` operation should be used instead.
D. The `split()` operation does not accomplish the requested task. The `explode()` operation should be used instead.
179
Which of the following code blocks returns a DataFrame where column `managerName` from DataFrame `storesDF` is split at the space character into column `managerFirstName` and `managerLastName`? A sample of DataFrame `storesDF` is displayed below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image23.png) A. ``` (storesDF.withColumn("managerFirstName", split(col("managerName"), " ")[0]) .withColumn("managerLastName", split(col("managerName"), " ")[1])) ``` B. ``` (storesDF.withColumn("managerFirstName", col("managerName"). split(" ")[1]) .withColumn("managerLastName", col("managerName").split(" ")[2])) ``` C. ``` (storesDF.withColumn("managerFirstName", split(col("managerName"), " ")[1]) .withColumn("managerLastName", split(col("managerName"), " ")[2])) ``` D. ``` (storesDF.withColumn("managerFirstName", col("managerName").split(" ")[0]) .withColumn("managerLastName", col("managerName").split(" ")[1])) ``` E. ``` (storesDF.withColumn("managerFirstName", split("managerName"), " ")[0]) .withColumn("managerLastName", split("managerName"), " ")[1])) ```
A. ``` (storesDF.withColumn("managerFirstName", split(col("managerName"), " ")[0]) .withColumn("managerLastName", split(col("managerName"), " ")[1])) ``` DISCUSSION: Option A is the correct answer. `split(col("managerName"), " ")` splits the `managerName` column into an array of strings on the space character; `[0]` selects the first element (the first name), `[1]` selects the second element (the last name), and `withColumn` adds each as a new column. Options B and D are incorrect because PySpark Column objects have no `split()` method; `split()` must be imported from `pyspark.sql.functions` and called as a function. Option B additionally uses indices `[1]` and `[2]`, skipping the zero-indexed first element. Option C calls the function correctly but also uses indices `[1]` and `[2]`, so it skips the first name and references a third element that does not exist for a two-word name. Option E is not valid syntax: its parentheses are unbalanced.
180
Which of the following cluster configurations will fail to ensure completion of a Spark application in light of a worker node failure? [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image24.png) Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores. A. Scenario #5 B. Scenario #4 C. Scenario #6 D. Scenario #1 E. They should all ensure completion because worker nodes are fault tolerant
D. Scenario #1 Scenario #1 runs a single executor on a single worker node. If that worker node fails, there is no other worker left to take over its tasks, so the application cannot complete. The configurations in options A, B, and C spread multiple executors across multiple worker nodes, so the surviving nodes can rerun the failed node's tasks. Option E is incorrect because worker-node fault tolerance only helps when other worker nodes are available to pick up the work.
181
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 AND the value in column customerSatisfaction is greater than or equal to 30? A. storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30) B. storesDF.filter(col(sqft) <= 25000 & col(customerSatisfaction) >= 30) C. storesDF.filter(col("sqft") <= 25000 & col("customerSatisfaction") >= 30) D. storesDF.filter((col("sqft") <= 25000) & (col("customerSatisfaction") >= 30)) E. storesDF.filter(sqft <= 25000 and customerSatisfaction >= 30)
D. storesDF.filter((col("sqft") <= 25000) & (col("customerSatisfaction") >= 30)) DISCUSSION: Option D is correct because it uses the `filter()` method along with the bitwise AND operator (`&`) to combine the two conditions. Each condition is also enclosed in parentheses, ensuring correct order of operations. The `col()` function is used to properly reference the columns by name. Option A is incorrect because it uses the Python `and` operator instead of the bitwise `&` operator required for Spark DataFrames. Option B is incorrect because it does not put the column names `sqft` and `customerSatisfaction` in quotes within the `col()` function. Option C is incorrect because it doesn't wrap each condition in parentheses. Although it might work in some cases, explicitly using parentheses improves readability and avoids potential operator precedence issues. Option E is incorrect because it directly references the column names `sqft` and `customerSatisfaction` without using the `col()` function to create Column objects. This is not the correct way to refer to columns within a Spark DataFrame filter.
182
Which of the following statements about Spark DataFrames is incorrect? A. Spark DataFrames are the same as a data frame in Python or R. B. Spark DataFrames are built on top of RDDs. C. Spark DataFrames are immutable. D. Spark DataFrames are distributed. E. Spark DataFrames have common Structured APIs.
A. Spark DataFrames are the same as a data frame in Python or R. **Explanation** Spark DataFrames, although conceptually similar to data frames in Python (pandas) or R, are not the same. Spark DataFrames are distributed, immutable, and built on top of RDDs, designed to handle large-scale data processing across a cluster. In contrast, data frames in Python (pandas) or R are typically in-memory, single-node constructs. Options B, C, D, and E are correct statements about Spark DataFrames. Spark DataFrames are built on top of RDDs, they are immutable, they are distributed across a cluster, and they provide a common set of Structured APIs for data manipulation.
183
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000? A. storesDF.where(storesDF[sqft] > 25000) B. storesDF.filter(sqft > 25000) C. storesDF.filter("sqft" <= 25000) D. storesDF.filter(col("sqft") <= 25000) E. storesDF.where(sqft > 25000)
D. Option D is correct because it uses the `filter` method with the `col` function to reference column "sqft" and applies the required condition (<= 25000). Option A is incorrect because it tests the opposite condition (> 25000) and references the column as `storesDF[sqft]`, where `sqft` is an undefined name rather than a quoted column name (`where` itself is simply an alias for `filter`). Option B is incorrect because `sqft` is an undefined name; the column must be referenced with `col("sqft")`, and the comparison is reversed. Option C is incorrect because `"sqft" <= 25000` is evaluated by Python as a comparison between a string and an integer, which raises an error; a SQL expression string would have to be written as `filter("sqft <= 25000")`. Option E is incorrect for the same reasons as option B.
184
Which of the following code blocks returns a new DataFrame where column managerName from DataFrame storesDF has had its missing values replaced with the value "No Manager"? A sample of DataFrame storesDF is below: [Image](https://img.examtopics.com/certified-associate-developer-for-apache-spark/image25.png) A. ```python storesDF.na.fill("No Manager", "managerName") ``` B. ```python storesDF.nafill("No Manager", col("managerName")) ``` C. ```python storesDF.na.fill("No Manager", col("managerName")) ``` D. ```python storesDF.fillna("No Manager", col("managerName")) ``` E. ```python storesDF.nafill("No Manager", "managerName") ```
A. ```python storesDF.na.fill("No Manager", "managerName") ``` **Explanation:** Option A is the correct way to fill missing values in a specific column ("managerName") of a DataFrame (storesDF) with the string "No Manager" using the `na.fill` method. * `storesDF.na.fill()` is the correct syntax for using the fill method. * The first argument is the value to fill the missing values with. * The second argument specifies the column to apply the fill to. Options B and E are incorrect because `nafill` is not a valid method in PySpark DataFrame API. Options C and D are incorrect because the second argument in `na.fill` or `fillna` should be the column name as a string, not a `col` object. While `fillna` is an alias for `na.fill`, it also requires the column name as a string, not a `col` object, when filling a specific column.
185
Which of the following code blocks prints the schema of DataFrame storesDF? A. print(storesDF) B. storesDF.printSchema() C. print(storesDF.schema()) D. storesDF.schema E. storesDF.schema()
B
186
The code block shown below should return a new 4-partition DataFrame from the 8-partition DataFrame storesDF without inducing a shuffle. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: _1_._2_(_3_) A. 1. storesDF 2. coalesce 3. Nothing B. 1. storesDF 2. coalesce 3. 4 C. 1. storesDF 2. coalesce 3. 4, "storeId" D. 1. storesDF 2. coalesce 3. "storeId"
B. The `coalesce()` transformation reduces the number of partitions in a DataFrame. In this case, it reduces the number of partitions of `storesDF` to 4. This operation avoids a full shuffle. Option A is incorrect because it provides no argument to the `coalesce()` transformation. Options C and D are incorrect because they include a column name as an argument, which is not part of the `coalesce()` syntax.
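A small sketch of the keyed answer; repartition(8) is used first only to create the 8-partition starting point described in the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

storesDF = spark.range(100).repartition(8)   # 8-partition starting point
print(storesDF.rdd.getNumPartitions())       # 8

# coalesce merges existing partitions without a full shuffle.
coalescedDF = storesDF.coalesce(4)
print(coalescedDF.rdd.getNumPartitions())    # 4
```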
187
Which of the following code blocks returns a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean? A. storesDF.withColumn(mean(col("sqft")).alias("sqftMean")) B. storesDF.agg(col("sqft").mean().alias("sqftMean")) C. storesDF.agg(mean("sqft").alias("sqftMean")) D. storesDF.agg(mean(col("sqft")).alias("sqftMean")) E. storesDF.withColumn("sqftMean", mean(col("sqft")))
D. **Explanation:** * **D is correct:** `storesDF.agg(mean(col("sqft")).alias("sqftMean"))` calculates the mean of the "sqft" column with the `mean()` function from `pyspark.sql.functions`, referencing the column via `col()`. `agg()` performs the DataFrame-wide aggregation, and `.alias("sqftMean")` names the resulting column. * **A is incorrect:** `withColumn` expects a column name as its first argument and computes values row-wise; passing only an aggregated column expression raises an error. * **B is incorrect:** a Column object has no `mean()` method, so `col("sqft").mean()` fails; the `mean` function from `pyspark.sql.functions` must be used. * **C:** `mean("sqft")` also runs in PySpark because the function accepts a column name string; the keyed answer is D, which is simply more explicit about building the column reference with `col("sqft")`. * **E is incorrect:** `withColumn` adds a row-wise column, and using an aggregate expression such as `mean(col("sqft"))` there without a grouping raises an analysis error.
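A minimal sketch of the keyed answer on an invented sqft column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, col

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame(
    [(1, 10000), (2, 25000), (3, 40000)], ["storeId", "sqft"]
)

# agg() computes a DataFrame-wide aggregate; alias() names the resulting column.
storesDF.agg(mean(col("sqft")).alias("sqftMean")).show()
```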
188
Which of the following code blocks fails to return a DataFrame sorted alphabetically based on column division? A. `storesDF.sort(asc("division"))` B. `storesDF.orderBy(["division"], ascending = [1])` C. `storesDF.orderBy(col("division").desc())` D. `storesDF.orderBy("division")` E. `storesDF.sort("division")`
C. `storesDF.orderBy(col("division").desc())` DISCUSSION: Option C is the correct answer because `.desc()` sorts the column in descending order, not alphabetically (ascending). Options A, D, and E will sort the 'division' column in ascending order (alphabetically) by default. Option B also sorts in ascending order, as `ascending = [1]` indicates ascending order.