Practice Questions - exam-certified-machine-learning-professional Flashcards

(59 cards)

1
Q

A machine learning engineer is migrating a machine learning pipeline to use Databricks Machine Learning. They have programmatically identified the best run from an MLflow Experiment and stored its URI in the model_uri variable and its Run ID in the run_id variable. They have also determined that the model was logged with the name “model”. Now, the machine learning engineer wants to register that model in the MLflow Model Registry with the name “best_model”.

Which of the following lines of code can they use to register the model to the MLflow Model Registry?

A.

```python
mlflow.register_model(model_uri, "best_model")
```

B.

```python
mlflow.register_model(run_id, "best_model")
```

C.

```python
mlflow.register_model(f"runs:/{run_id}/best_model", "model")
```

D.

```python
mlflow.register_model(model_uri, "model")
```

E.

```python
mlflow.register_model(f"runs:/{run_id}/model")
```

A

A. mlflow.register_model(model_uri, "best_model")

The mlflow.register_model function requires the model_uri (location of the model) and the desired registered model name as arguments. Option A correctly passes the model_uri and the desired name “best_model”.

Option B is incorrect because it uses the run_id instead of the model_uri.
Option C is incorrect because it constructs a URI using the run_id but also incorrectly uses “best_model” within the URI when the model was logged as “model”. It also uses the wrong name, “model” instead of “best_model”.
Option D is incorrect because it uses the model_uri correctly, but it uses the wrong name, “model” instead of “best_model”.
Option E is incorrect because it constructs a URI using the run_id and doesn’t specify the registered model name.
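For reference, a minimal sketch of the correct call, assuming model_uri was built from the best run and that the model was logged under the artifact path "model":

```python
import mlflow

# model_uri points at the best run's logged model, e.g. "runs:/<run_id>/model"
model_uri = f"runs:/{run_id}/model"

# Creates the registered model "best_model" if needed and adds a new version to it
model_version = mlflow.register_model(model_uri, "best_model")
print(model_version.name, model_version.version)
```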

2
Q

A machine learning engineer wants to move their model version model_version for the MLflow Model Registry model model from the Staging stage to the Production stage using MLflow Client client.
Which of the following code blocks can they use to accomplish the task?
A.

B.

C.

D.

E.

A

C.

The transition_model_version_stage method is the correct method to promote a model version to a new stage. Option C correctly uses this method, passing the model name, model version, and the target stage “Production”.

Options A, B, D, and E use incorrect methods or parameters and therefore are not the correct answer.
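A minimal sketch of what the correct option looks like, assuming an MlflowClient instance client and the model_version variable from the question:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="model",
    version=model_version,
    stage="Production",
    archive_existing_versions=False,  # optionally archive versions already in Production
)
```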

3
Q

A machine learning engineer is manually refreshing a model in an existing machine learning pipeline. The pipeline uses the MLflow Model Registry model “project”. The machine learning engineer would like to add a new version of the model to “project”.

Which of the following MLflow operations can the machine learning engineer use to accomplish this task?

A.
mlflow.register_model

B.
MlflowClient.update_registered_model

C.
mlflow.add_model_version

D.
MlflowClient.get_model_version

E.
The machine learning engineer needs to create an entirely new MLflow Model Registry model

A

A. mlflow.register_model

The question states that the engineer wants to add a new version of the model to “project”. mlflow.register_model will create a new model version in the model registry for the model files specified by model_uri.

Option B is incorrect because MlflowClient.update_registered_model updates the metadata for the registered model (like the description), not the model version.
Option C is incorrect because mlflow.add_model_version is not a valid MLflow function.
Option D is incorrect because MlflowClient.get_model_version retrieves information about a specific model version rather than creating a new one.
Option E is incorrect because the model already exists, so a new one does not need to be created.

4
Q

A machine learning engineer has developed a random forest model using scikit-learn, logged the model using MLflow as random_forest_model, and stored its run ID in the run_id Python variable. They now want to deploy that model by performing batch inference on a Spark DataFrame spark_df.
Which of the following code blocks can they use to create a function called predict that they can use to complete the task?
A.

B.
It is not possible to deploy a scikit-learn model on a Spark DataFrame.
C.

D.

E.

A

E.

Explanation:

Option E is correct because it demonstrates the proper usage of mlflow.pyfunc.spark_udf to deploy an MLflow model for batch inference on a Spark DataFrame. The mlflow.pyfunc.spark_udf function requires the SparkSession as its first argument, and the model URI as the second. The result is a Spark UDF that can be applied to the Spark DataFrame to generate predictions.

Option A is incorrect because it passes spark_df (the Spark DataFrame) as the first argument to mlflow.pyfunc.spark_udf, which is incorrect. The first argument must be the SparkSession object.
Option B is incorrect because it is possible to deploy scikit-learn models on Spark DataFrames using MLflow.
Option C is incorrect because it attempts to load the model using mlflow.spark.load_model, which is intended for models trained within Spark MLlib, not generic Python models. The loaded model is also not properly used as a UDF.
Option D is incorrect because it incorrectly assumes that mlflow.pyfunc.load_model can directly operate on Spark DataFrames, which is not the correct approach for batch inference. This method loads the model into the driver’s memory but doesn’t distribute the prediction workload across the Spark cluster.
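A hedged sketch of the pattern option E describes, assuming the model was logged under the artifact path "random_forest_model" as stated in the question:

```python
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Build a Spark UDF from the logged model; the SparkSession comes first, then the model URI
predict = mlflow.pyfunc.spark_udf(spark, f"runs:/{run_id}/random_forest_model")

# Apply the UDF across the feature columns to compute predictions in parallel
preds_df = spark_df.withColumn("prediction", predict(struct(*spark_df.columns)))
```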

5
Q

Which of the following describes the purpose of the context parameter in the predict method of Python models for MLflow?

A. The context parameter allows the user to specify which version of the registered MLflow Model should be used based on the given application’s current scenario
B. The context parameter allows the user to document the performance of a model after it has been deployed
C. The context parameter allows the user to include relevant details of the business case to allow downstream users to understand the purpose of the model
D. The context parameter allows the user to provide the model with completely custom if-else logic for the given application’s current scenario
E. The context parameter allows the user to provide the model access to objects like preprocessing models or custom configuration files

A

E. The context parameter in the predict method of Python models for MLflow is used to provide the model with access to external objects like preprocessing models or custom configuration files. This allows the model to utilize necessary resources for making accurate predictions. Options A, B, C, and D describe functionalities that are not the primary purpose of the context parameter. The context parameter is not meant to determine model version (A), document model performance (B), provide business case details (C), or inject custom logic (D).
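A minimal sketch of how the context parameter is typically used in a custom pyfunc model; the artifact keys "preprocessor" and "model" and the joblib files they point to are hypothetical:

```python
import joblib
import mlflow.pyfunc

class CustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local paths of files logged with the model
        self.preprocessor = joblib.load(context.artifacts["preprocessor"])
        self.model = joblib.load(context.artifacts["model"])

    def predict(self, context, model_input):
        # Apply the preprocessing object made available through the context before predicting
        return self.model.predict(self.preprocessor.transform(model_input))
```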

6
Q

A machine learning engineer has developed a model and registered it using the FeatureStoreClient fs. The model has model URI model_uri. The engineer now needs to perform batch inference on customer-level Spark DataFrame spark_df, but it is missing a few of the static features that were used when training the model. The customer_id column is the primary key of spark_df and the training set used when training and logging the model.

Which of the following code blocks can be used to compute predictions for spark_df when the missing feature values can be found in the Feature Store by searching for features by customer_id?

A.

```python
df = fs.get_missing_features(spark_df, model_uri)
fs.score_model(model_uri, df)
```

B.

```python
fs.score_model(model_uri, spark_df)
```

C.

```python
df = fs.get_missing_features(spark_df, model_uri)
fs.score_batch(model_uri, df)
```

D.

```python
df = fs.get_missing_features(spark_df)
fs.score_batch(model_uri, df)
```

E.

```python
fs.score_batch(model_uri, spark_df)
```

A

E.
The score_batch method of the FeatureStoreClient automatically retrieves missing features from the Feature Store during batch inference, given that the primary key is available in the input DataFrame. Therefore, it is sufficient to call fs.score_batch(model_uri, spark_df) to perform batch inference. Options A, C, and D include the method get_missing_features which is not a valid method of the FeatureStoreClient. Option B uses the score_model method which is not designed for batch inference.
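A minimal sketch of option E, assuming the model was logged through the Feature Store so its feature lookups travel with it:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# spark_df only needs the primary key (customer_id) plus any features not in the Feature Store;
# score_batch looks up the missing features by customer_id and appends a prediction column
predictions_df = fs.score_batch(model_uri, spark_df)
```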

7
Q

Which of the following describes the concept of MLflow Model flavors?

A. A convention that deployment tools can use to wrap preprocessing logic into a Model
B. A convention that MLflow Model Registry can use to version models
C. A convention that MLflow Experiments can use to organize their Runs by project
D. A convention that deployment tools can use to understand the model
E. A convention that MLflow Model Registry can use to organize its Models by project

A

D. A convention that deployment tools can use to understand the model

The MLflow Model flavor is a convention that allows deployment tools to understand the structure and requirements of a model, enabling efficient deployment across different platforms. Flavors provide a standardized way to package models, including necessary metadata and environment details, so they can be loaded and used consistently.

Options A, B, C, and E are incorrect because they do not accurately describe the purpose of MLflow Model flavors. Flavors are not primarily for wrapping preprocessing logic, versioning models in the Model Registry, organizing runs in Experiments, or organizing models in the Model Registry. Their primary function is to enable deployment tools to understand and deploy models effectively.

8
Q

In a continuous integration, continuous deployment (CI/CD) process for machine learning pipelines, which of the following events commonly triggers the execution of automated testing?

A. The launch of a new cost-efficient SQL endpoint
B. CI/CD pipelines are not needed for machine learning pipelines
C. The arrival of a new feature table in the Feature Store
D. The launch of a new cost-efficient job cluster
E. The arrival of a new model version in the MLflow Model Registry

A

E. The arrival of a new model version in the MLflow Model Registry

Automated testing in a CI/CD pipeline for ML is triggered when a new model version is registered. This ensures the new model performs as expected before deployment.

Option A is incorrect because SQL endpoint launches are related to data access, not model performance. Option B is incorrect because CI/CD pipelines are crucial for automating and managing ML model deployment. Option C is incorrect because while new feature tables are important, they don’t directly trigger model testing. Option D is incorrect because job cluster launches are related to resource allocation, not model validation.

9
Q

A machine learning engineering team has written predictions computed in a batch job to a Delta table for querying. However, the team has noticed that the querying is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the query condition are sparsely located throughout each of the data files.

Based on the scenario, which of the following optimization techniques could speed up the query by colocating similar records while considering values in multiple columns?

A. Z-Ordering
B. Bin-packing
C. Write as a Parquet file
D. Data skipping
E. Tuning the file size

A

A. Z-Ordering

Z-Ordering is a data locality technique used in Delta Lake to optimize query performance. It achieves this by clustering similar records together based on the values of multiple columns. This co-location of related data reduces the amount of data that needs to be scanned to satisfy a query, especially when filtering on multiple columns, thus speeding up queries.

Options B, C, D, and E are incorrect because:
* Bin-packing (file compaction) evens out data file sizes but does not colocate records by column values, and the file sizes have already been tuned.
* Writing as a Parquet file is a general storage format and doesn’t, by itself, address the problem of sparse data distribution.
* Data skipping relies on metadata about data distribution within files, but doesn’t re-organize the data for co-location.
* Tuning the file size has already been done, as stated in the problem description.
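A hedged example of applying Z-Ordering to the predictions table; the table name and columns are hypothetical:

```python
# Colocate rows by the columns most often used in query filters (hypothetical names)
spark.sql("OPTIMIZE predictions ZORDER BY (customer_id, prediction_date)")
```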

10
Q

A machine learning engineer needs to deliver predictions of a machine learning model in real-time. However, the feature values needed for computing the predictions are available one week before the query time.
Which of the following is a benefit of using a batch serving deployment in this scenario rather than a real-time serving deployment where predictions are computed at query time?

A. Batch serving has built-in capabilities in Databricks Machine Learning
B. There is no advantage to using batch serving deployments over real-time serving deployments
C. Computing predictions in real-time provides more up-to-date results
D. Testing is not possible in real-time serving deployments
E. Querying stored predictions can be faster than computing predictions in real-time

A

E. Querying stored predictions can be faster than computing predictions in real-time

Explanation:
Since the feature values are available a week in advance, predictions can be pre-computed and stored using batch serving. When a prediction is needed, it can be quickly retrieved from storage, which is faster than computing it on-demand in real-time.

  • A is incorrect because while Databricks may offer capabilities for batch serving, it’s not the primary benefit in this specific scenario.
  • B is incorrect because there is a clear advantage (speed) to batch serving in this scenario.
  • C is incorrect because batch serving allows for pre-computation with data available in advance, negating the need for up-to-the-minute real-time computation.
  • D is incorrect because testing is possible in both real-time and batch serving deployments.
11
Q

Which of the following tools can assist in real-time deployments by packaging software with its own application, tools, and libraries?

A. Cloud-based compute
B. None of these tools
C. REST APIs
D. Containers
E. Autoscaling clusters

A

D. Containers
Containers, like Docker, package an application with all its dependencies (libraries, tools, etc.) into a single, portable unit. This ensures consistent execution across different environments, crucial for real-time deployments.

Option A is incorrect because while cloud-based compute provides the infrastructure, it doesn’t handle the packaging of software and its dependencies. Option B is incorrect as containers are a valid tool. Option C is incorrect because REST APIs are used for communication between software systems, not for packaging applications. Option E is incorrect because autoscaling clusters adjust resources based on demand, but they don’t package software with its dependencies.

12
Q

A machine learning engineer has registered an sklearn model in the MLflow Model Registry using the sklearn model flavor with URI model_uri.
Which of the following operations can be used to load the model as an sklearn object for batch deployment?

A.
mlflow.spark.load_model(model_uri)
B.
mlflow.pyfunc.read_model(model_uri)
C.
mlflow.sklearn.read_model(model_uri)
D.
mlflow.pyfunc.load_model(model_uri)
E.
mlflow.sklearn.load_model(model_uri)

A

E. mlflow.sklearn.load_model(model_uri)

The question specifies that the model was saved using the sklearn flavor. Therefore, the mlflow.sklearn.load_model function should be used to load it back as a scikit-learn object. Options A, B, C, and D use incorrect functions (mlflow.spark, mlflow.pyfunc.read_model, mlflow.sklearn.read_model, mlflow.pyfunc.load_model) which are not appropriate for loading a scikit-learn model.
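A minimal sketch of the correct call; model_uri could be a registry URI such as "models:/<name>/<version>" or a runs URI:

```python
import mlflow.sklearn

# Returns the native scikit-learn estimator, so sklearn-specific methods are available
model = mlflow.sklearn.load_model(model_uri)
batch_predictions = model.predict(X_batch)  # X_batch is a hypothetical pandas DataFrame of features
```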

13
Q

A data scientist set up a machine learning pipeline to automatically log a data visualization with each run. They now want to view the visualizations in Databricks.

Which of the following locations in Databricks will show these data visualizations?

A. The MLflow Model Registry Model page
B. The Artifacts section of the MLflow Experiment page
C. Logged data visualizations cannot be viewed in Databricks
D. The Artifacts section of the MLflow Run page
E. The Figures section of the MLflow Run page

A

D. The Artifacts section of the MLflow Run page

Explanation: When data visualizations are logged in MLflow, they are stored as artifacts associated with a specific run. These artifacts can be accessed and viewed in the Artifacts section of the corresponding MLflow Run page within Databricks.

  • A is incorrect because the Model Registry is for managing MLflow models, not run artifacts.
  • B is incorrect because while the Experiment page provides an overview of runs, the artifacts are found within the individual Run pages.
  • C is incorrect because logged data visualizations can be viewed in Databricks.
  • E is incorrect because while some visualizations may be displayed as figures, the artifacts tab is the general destination for visualizations.
14
Q

A data scientist has developed a scikit-learn model sklearn_model and they want to log the model using MLflow.
They write the following incomplete code block:

Which of the following lines of code can be used to fill in the blank so the code block can successfully complete the task?
A.
mlflow.spark.track_model(sklearn_model, "model")
B.
mlflow.sklearn.log_model(sklearn_model, "model")
C.
mlflow.spark.log_model(sklearn_model, "model")
D.
mlflow.sklearn.load_model("model")
E.
mlflow.sklearn.track_model(sklearn_model, "model")

A

B.
mlflow.sklearn.log_model(sklearn_model, "model")

The goal is to log a scikit-learn model using MLflow. mlflow.sklearn.log_model is the correct function for this purpose. It takes the model object and a path (artifact_path) as arguments.
Option A is incorrect because mlflow.spark is for logging Spark models, not scikit-learn models. track_model is not a valid function in mlflow.spark.
Option C is incorrect because mlflow.spark is for logging Spark models, not scikit-learn models.
Option D is incorrect because mlflow.sklearn.load_model is used to load a previously logged model, not to log a model.
Option E is incorrect because track_model is not a valid function in mlflow.sklearn.
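A minimal sketch of how option B is typically used; the surrounding run block is assumed to be the blank's context in the question:

```python
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # Logs the fitted estimator under the artifact path "model"
    mlflow.sklearn.log_model(sklearn_model, "model")
```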

15
Q

A machine learning engineer has deployed a model recommender using MLflow Model Serving. They now want to query the version of that model that is in the Production stage of the MLflow Model Registry.
Which of the following model URIs can be used to query the described model version?

A. https://<databricks-instance>/model-serving/recommender/Production/invocations
B. The version number of the model version in Production is necessary to complete this task.
C. https://<databricks-instance>/model/recommender/stage-production/invocations
D. https://<databricks-instance>/model-serving/recommender/stage-production/invocations
E. https://<databricks-instance>/model/recommender/Production/invocations

A

E. The correct URI structure to query a model in a specific stage of the MLflow Model Registry using MLflow Model Serving is https://<databricks-instance>/model/<model_name>/<stage>/invocations. Therefore, option E is correct. Options A and D are incorrect because they use /model-serving/ instead of /model/. Option C is incorrect because it uses /stage-production/ instead of /Production/. Option B is incorrect because the version number is not necessary when querying by stage.
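A hedged sketch of querying that URI over REST; the workspace URL placeholder, the api_token variable, and the payload format are assumptions, since the accepted input format depends on the serving version:

```python
import requests

url = "https://<databricks-instance>/model/recommender/Production/invocations"
headers = {"Authorization": f"Bearer {api_token}"}  # api_token is a hypothetical personal access token

# A pandas-style records payload; other orientations may be required depending on the server version
payload = {"dataframe_records": [{"user_id": 123, "item_id": 456}]}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```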

16
Q

A data scientist has created a Python function compute_features that returns a Spark DataFrame with the following schema:

A

17
Q

Which of the following is a benefit of logging a model signature with an MLflow model?

A. The model will have a unique identifier in the MLflow experiment
B. The schema of input data can be validated when serving models
C. The model can be deployed using real-time serving tools
D. The model will be secured by the user that developed it
E. The schema of input data will be converted to match the signature

A

B. The schema of input data can be validated when serving models
Logging a model signature in MLflow defines the expected schema for the model’s inputs and outputs. This allows MLflow’s deployment tools to validate incoming data during serving, ensuring it matches the expected format. This validation helps prevent errors and unexpected behavior during inference.

Option A is incorrect because while MLflow tracks models within experiments, the signature itself doesn’t provide a unique identifier for the experiment. Option C is incorrect because while MLflow can be used with real-time serving tools, the signature doesn’t directly enable deployment. Option D is incorrect because model signatures don’t provide security features related to user access control. Option E is incorrect because the signature allows validation against a schema; it does not convert the input data to match the signature.
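A minimal sketch of logging a signature so serving can validate inputs; X_train and model are assumed to exist from training:

```python
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

signature = infer_signature(X_train, model.predict(X_train))

with mlflow.start_run():
    # The signature is stored with the model and used to validate request schemas at serving time
    mlflow.sklearn.log_model(model, "model", signature=signature)
```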

18
Q

A machine learning engineer wants to deploy a model for real-time serving using MLflow Model Serving. For the model, the machine learning engineer currently has one model version in each of the stages in the MLflow Model Registry. The engineer wants to know which model versions can be queried once Model Serving is enabled for the model.

Which of the following lists all of the MLflow Model Registry stages whose model versions are automatically deployed with Model Serving?

A.
Staging, Production, Archived
B.
Production
C.
None, Staging, Production, Archived
D.
Staging, Production
E.
None, Staging, Production

A

D. Staging, Production

The MLflow Model Serving automatically deploys model versions that are in the Staging or Production stages. Model versions in the Archived or None stages are not automatically deployed. Therefore, option D is correct.

19
Q

A machine learning engineering team wants to build a continuous pipeline for data preparation of a machine learning application. The team would like the data to be fully processed and made ready for inference in a series of equal-sized batches.
Which of the following tools can be used to provide this type of continuous processing?
A.
Spark UDFs
B.
Structured Streaming
C.
MLflow
D.
Delta Lake
E.
AutoML

A

B. Structured Streaming

Structured Streaming in Spark allows continuous processing of incoming data in incremental micro-batches (for example, a fixed number of files per trigger), which makes it suitable for building a continuous data preparation pipeline that keeps data ready for inference.

Incorrect Options:
A. Spark UDFs (User Defined Functions) are for custom data transformations, not continuous processing.
C. MLflow is a platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
D. Delta Lake is a storage layer that brings reliability to data lakes.
E. AutoML automates the process of selecting, training, and tuning machine learning models.
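A hedged sketch of the idea, assuming a Delta source table named "raw_features", a sink table "prepared_features", and that equal-sized batches are approximated by limiting files per trigger; names and the checkpoint path are hypothetical:

```python
# Read the source incrementally as a stream
stream_df = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1)  # roughly equal-sized micro-batches
    .table("raw_features")
)

# ... feature preparation transformations would go here ...

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/prepared_features")  # hypothetical path
    .toTable("prepared_features")
)
```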

20
Q

A data scientist has written a function to track the runs of their random forest model. The data scientist is changing the number of trees in the forest across each run.

Which of the following MLflow operations is designed to log single values like the number of trees in a random forest?

A.
mlflow.log_artifact
B.
mlflow.log_model
C.
mlflow.log_metric
D.
mlflow.log_param
E.
There is no way to store values like this.

A

D. mlflow.log_param

The question is asking which MLflow function is appropriate for logging a parameter, like the number of trees. mlflow.log_param is designed specifically for this purpose.

  • A. mlflow.log_artifact: This is used for logging files or directories, not single values.
  • B. mlflow.log_model: This is used for logging an entire machine learning model.
  • C. mlflow.log_metric: This is used for logging metrics (evaluation results), which are typically numeric but represent performance, not hyperparameters.
  • E. There is no way to store values like this: MLflow provides functionality for logging parameters.
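A minimal sketch, assuming the number of trees is held in a hypothetical n_estimators variable:

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("num_trees", n_estimators)
```
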
21
Q

Which of the following deployment paradigms can centrally compute predictions for a single record with exceedingly fast results?
A. Streaming
B. Batch
C. Edge/on-device
D. None of these strategies will accomplish the task.
E. Real-time

A

E. Real-time deployment is designed for low-latency inference on individual records, computing predictions centrally and returning them immediately at query time. Streaming, batch, and edge/on-device deployments are not designed to centrally compute and return results for single records this quickly.

22
Q

A machine learning engineer and data scientist are working together to convert a batch deployment to an always-on streaming deployment. The machine learning engineer has expressed that rigorous data tests must be put in place as a part of their conversion to account for potential changes in data formats.

Which of the following describes why these types of data type tests and checks are particularly important for streaming deployments?

A. Because the streaming deployment is always on, all types of data must be handled without producing an error
B. All of these statements
C. Because the streaming deployment is always on, there is no practitioner to debug poor model performance
D. Because the streaming deployment is always on, there is a need to confirm that the deployment can autoscale
E. None of these statements

A

B. All of these statements

Streaming deployments are always on, so all incoming data types must be handled without producing errors. Continuous operation also means there is less opportunity for a practitioner to manually debug poor model performance, and the deployment must be able to autoscale to handle varying data volumes. Statements A, C, and D are therefore each valid but individually incomplete, which makes option B correct. Option E is incorrect because A, C, and D are all correct statements.

23
Q

A data scientist has developed a scikit-learn random forest model model, but they have not yet logged model with MLflow. They want to obtain the input schema and the output schema of the model so they can document what type of data is expected as input.
Which of the following MLflow operations can be used to perform this task?
A.
mlflow.models.schema.infer_schema
B.
mlflow.models.signature.infer_signature
C.
mlflow.models.Model.get_input_schema
D.
mlflow.models.Model.signature
E.
There is no way to obtain the input schema and the output schema of an unlogged model.

A

B. mlflow.models.signature.infer_signature

The mlflow.models.signature.infer_signature function can be used to infer the input and output schema of an unlogged model by inspecting example input data and the corresponding model output (predictions). This information is encapsulated in a ModelSignature object.

Option A is incorrect because mlflow.models.schema.infer_schema is used to infer schema from data, not directly from a model.
Options C and D are incorrect because they are methods that operate on already logged MLflow models.
Option E is incorrect because mlflow.models.signature.infer_signature provides a way to obtain the schema of unlogged models.
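A minimal sketch, assuming X_train holds the training features for the unlogged model:

```python
from mlflow.models.signature import infer_signature

# Infers the input schema from example inputs and the output schema from the model's predictions
signature = infer_signature(X_train, model.predict(X_train))
print(signature.inputs)   # input schema
print(signature.outputs)  # output schema
```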

24
Q

Which of the following describes label drift?
A. Label drift is when there is a change in the distribution of the predicted target given by the model
B. None of these describe label drift
C. Label drift is when there is a change in the distribution of an input variable
D. Label drift is when there is a change in the relationship between input variables and target variables
E. Label drift is when there is a change in the distribution of a target variable

A

E. Label drift refers to a change in the distribution of the target variable itself. Options A, C, and D describe other types of drift, such as model prediction drift, feature drift, and concept drift, respectively.

25
Q

A data scientist would like to enable MLflow Autologging for all machine learning libraries used in a notebook. They want to ensure that MLflow Autologging is used no matter what version of the Databricks Runtime for Machine Learning is used to run the notebook and no matter what workspace-wide configurations are selected in the Admin Console.

Which of the following lines of code can they use to accomplish this task?

A. mlflow.sklearn.autolog()
B. mlflow.spark.autolog()
C. spark.conf.set("autologging", True)
D. It is not possible to automatically log MLflow runs.
E. mlflow.autolog()
A

E. mlflow.autolog()

mlflow.autolog() enables MLflow Autologging for all supported libraries without needing to specify each library individually. This ensures that all relevant libraries are autologged, regardless of the Databricks Runtime version or workspace configurations.

Option A is incorrect because mlflow.sklearn.autolog() only enables autologging for scikit-learn models. Option B is incorrect because mlflow.spark.autolog() only enables autologging of Spark data source information. Option C is incorrect because spark.conf.set("autologging", True) is not a valid way to enable MLflow Autologging. Option D is incorrect because MLflow Autologging is indeed possible.
26
Q

A data scientist has developed a model model and computed the RMSE of the model on the test set. They have assigned this value to the variable rmse. They now want to manually store the RMSE value with the MLflow run. They write the following incomplete code block:

![Image](https://img.examtopics.com/certified-machine-learning-professional/image9.png)

Which of the following lines of code can be used to fill in the blank so the code block can successfully complete the task?

A. log_artifact
B. log_model
C. log_metric
D. log_param
E. There is no way to store values like this.
A

C. log_metric

RMSE is a metric used to evaluate the performance of a model, and MLflow provides the log_metric function to log metrics with a run.

Option A is incorrect because log_artifact is used to log data files or plots. Option B is incorrect because log_model is used to log the model itself. Option D is incorrect because log_param is used to log input parameters, not metrics. Option E is incorrect because MLflow does provide a way to store values like this through the log_metric function.
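A minimal sketch of the completed pattern, assuming rmse has already been computed:

```python
import mlflow

with mlflow.start_run():
    mlflow.log_metric("rmse", rmse)
```
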
27
Q

Which of the following is a probable response to identifying drift in a machine learning application?

A. None of these responses
B. Retraining and deploying a model on more recent data
C. All of these responses
D. Rebuilding the machine learning application with a new label variable
E. Sunsetting the machine learning application
A

B. Retraining and deploying a model on more recent data

When drift (a change in the data that affects model performance) is detected, retraining the model on more recent data is a standard and effective response. This allows the model to learn the new patterns and maintain accuracy.

Option A is incorrect because there is a probable response. Option C is incorrect because not all of the responses are probable. Option D is incorrect because rebuilding with a new label variable is not a typical response to drift; it would imply a fundamental change in the problem being solved, not just a shift in the existing data. Option E is incorrect because sunsetting the application is an extreme measure taken only when the model is no longer useful or cost-effective to maintain; retraining is attempted first.
28
Q

A data scientist has computed updated feature values for all primary key values stored in the Feature Store table features. In addition, feature values for some new primary key values have also been computed. The updated feature values are stored in the DataFrame features_df. They want to replace all data in features with the newly computed data.

Which of the following code blocks can they use to perform this task using the Feature Store Client fs?

A. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image4.png)
B. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image5.png)
C. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image6.png)
D. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image7.png)
E. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image8.png)
A

D.

The data scientist wants to replace all of the data in the feature table with the new data. fs.write_table is the appropriate method for writing to an existing table, and setting mode="overwrite" ensures that the existing data is replaced.

Option A is incorrect because it uses fs.create_table, which is used to create a new table, not to overwrite an existing one. Option B is incorrect because it uses mode="merge", which would merge the new data with the existing data instead of overwriting it. Option C is incorrect because it uses fs.create_table and mode="merge", which are both inappropriate for this task. Option E is incorrect because the Feature Store Client fs does not have a replace_table method.
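A minimal sketch of the overwrite pattern option D describes; the table name "features" comes from the question:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Replaces the existing contents of the feature table with the newly computed rows
fs.write_table(name="features", df=features_df, mode="overwrite")
```
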
29
Q

A data scientist is utilizing MLflow to track their machine learning experiments. After completing a series of runs for the experiment with experiment ID exp_id, the data scientist wants to programmatically work with the experiment run data in a Spark DataFrame. They have an active MLflow Client client and an active Spark session spark.

Which of the following lines of code can be used to obtain run-level results for exp_id in a Spark DataFrame?

A. client.list_run_infos(exp_id)
B. spark.read.format("delta").load(exp_id)
C. There is no way to programmatically return row-level results from an MLflow Experiment.
D. mlflow.search_runs(exp_id)
E. spark.read.format("mlflow-experiment").load(exp_id)
A

E. spark.read.format("mlflow-experiment").load(exp_id)

The "mlflow-experiment" data source format in Spark is specifically designed to load MLflow experiment run data into a Spark DataFrame.

Option A is incorrect because client.list_run_infos(exp_id) returns a list of RunInfo objects, not a Spark DataFrame. Option B is incorrect because spark.read.format("delta").load(exp_id) attempts to read a Delta table, but exp_id is not a path to a Delta table. Option C is incorrect because it is possible to programmatically return run-level results. Option D is incorrect because mlflow.search_runs(exp_id) returns a pandas DataFrame by default, not a Spark DataFrame.
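A minimal sketch of option E:

```python
# Loads run-level results (run IDs, status, params, metrics, tags, ...) into a Spark DataFrame
runs_df = spark.read.format("mlflow-experiment").load(exp_id)
runs_df.show()
```
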
30
Q

A data scientist has developed and logged a scikit-learn random forest model model, and then they ended their Spark session and terminated their cluster. After starting a new cluster, they want to review the feature_importances_ of the original model object.

Which of the following lines of code can be used to restore the model object so that feature_importances_ is available?

A. mlflow.load_model(model_uri)
B. client.list_artifacts(run_id)["feature-importances.csv"]
C. mlflow.sklearn.load_model(model_uri)
D. This can only be viewed in the MLflow Experiments UI
E. client.pyfunc.load_model(model_uri)
A

C. mlflow.sklearn.load_model(model_uri)

The model was logged as a scikit-learn model, and mlflow.sklearn.load_model() loads it back as a native scikit-learn object, making attributes like feature_importances_ available.

Option A is incorrect because mlflow.load_model() is not a valid top-level MLflow function. Option B is incorrect because client.list_artifacts() only lists run artifacts; while feature importances might have been logged as an artifact, this approach does not restore the model object. Option D is incorrect because the model object can be restored programmatically. Option E is incorrect because client.pyfunc.load_model() is not a valid MlflowClient method, and even mlflow.pyfunc.load_model would return a generic pyfunc wrapper rather than the underlying scikit-learn object with its attributes.
31
Q

A data scientist has developed a model to predict ice cream sales using the expected temperature and expected number of hours of sun in the day. However, the expected temperature is dropping beneath the range of the input variable on which the model was trained.

Which of the following types of drift is present in the above scenario?

A. Label drift
B. None of these
C. Concept drift
D. Prediction drift
E. Feature drift
A

E. Feature drift

Feature drift occurs when the distribution of input features changes over time. In this case, the expected temperature, an input feature, is dropping below the range seen during training, indicating a shift in its distribution.

Option A is incorrect because label drift refers to changes in the distribution of the target variable, which is not the issue here. Option B is incorrect because feature drift is present. Option C is incorrect because concept drift refers to changes in the relationship between input features and the target variable, not a change in the input features themselves. Option D is incorrect because prediction drift refers to changes in the model's output, which may be a consequence of other types of drift but is not the primary issue here.
32
Q

A data scientist wants to remove the star_rating column from the Delta table at the location path. To do this, they need to load in the data and drop the star_rating column.

Which of the following code blocks accomplishes this task?

A. spark.read.format("delta").load(path).drop("star_rating")
B. spark.read.format("delta").table(path).drop("star_rating")
C. Delta tables cannot be modified
D. spark.read.table(path).drop("star_rating")
E. spark.sql("SELECT * EXCEPT star_rating FROM path")
A

A. spark.read.format("delta").load(path).drop("star_rating")

This option explicitly specifies the "delta" format when reading from the location path using .load(), and then drops the star_rating column from the resulting DataFrame.

Option B is incorrect because .table() expects a table name, not a file path. Option C is incorrect because Delta tables can be modified. Option D is incorrect because it does not specify the "delta" format, which is necessary when reading a Delta table from a path; it would be valid only if path were the name of a registered table. Option E is incorrect because it executes a SQL query but does not update the Delta table; the result would need to be written back.
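A hedged sketch that also writes the result back so the column is actually removed from the stored table; the write-back step goes beyond what the question asks and assumes overwriting the schema is acceptable:

```python
# Load the Delta table from its path and drop the unwanted column
df = spark.read.format("delta").load(path).drop("star_rating")

# Persist the change back to the same location (the schema change requires overwriteSchema)
(df.write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .save(path))
```
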
33
Q

Which of the following operations in Feature Store Client fs can be used to return a Spark DataFrame of a data set associated with a Feature Store table?

A. fs.create_table
B. fs.write_table
C. fs.get_table
D. There is no way to accomplish this task with fs
E. fs.read_table
A

E. fs.read_table

fs.read_table is the correct operation to return the data set associated with a Feature Store table as a Spark DataFrame.

Option A is incorrect because fs.create_table is used to create a new table, not read data. Option B is incorrect because fs.write_table is used to write data to a table, not read data. Option C is incorrect because fs.get_table retrieves the feature table's metadata, not its data as a DataFrame. Option D is incorrect because there is a way to accomplish the task.
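A minimal sketch; the table name is hypothetical:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
features_df = fs.read_table(name="ml.customer_features")  # returns a Spark DataFrame
```
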
34
Q

Which of the following describes concept drift?

A. Concept drift is when there is a change in the distribution of an input variable
B. Concept drift is when there is a change in the distribution of a target variable
C. Concept drift is when there is a change in the relationship between input variables and target variables
D. Concept drift is when there is a change in the distribution of the predicted target given by the model
E. None of these describe Concept drift
A

C. Concept drift is when there is a change in the relationship between input variables and target variables

Concept drift occurs when the relationship between the input features and the target variable changes over time, leading to a decrease in the model's performance.

Options A and B describe changes in the distributions of input or target variables individually (feature drift and label drift), but concept drift is specifically about a change in their relationship. Option D refers to a change in the model's output (prediction drift), which can be a consequence of concept drift but is not its definition. Option E is incorrect because option C provides a valid description.
35
Q

A machine learning engineer is monitoring categorical input variables for a production machine learning application. The engineer believes that missing values are becoming more prevalent in more recent data for a particular value in one of the categorical input variables.

Which of the following tools can the machine learning engineer use to assess their theory?

A. Kolmogorov-Smirnov (KS) test
B. One-way Chi-squared Test
C. Two-way Chi-squared Test
D. Jensen-Shannon distance
E. None of these
A

B. One-way Chi-squared Test

The engineer wants to compare the distribution of a single categorical variable (including a "missing" category) between older and more recent data to see whether the prevalence of missing values has changed. A one-way Chi-squared test compares the observed category frequencies in the new data to the expected frequencies derived from the old data and indicates whether there is a statistically significant difference.

Option C is incorrect because a two-way Chi-squared test is used to determine whether there is an association between two categorical variables, which is not the goal here. Option A is incorrect because the KS test is designed for continuous distributions. Option D is incorrect because, while Jensen-Shannon distance could potentially be used, the Chi-squared test is the standard choice for this type of categorical comparison.
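A hedged sketch of the one-way test with hypothetical category counts; the expected frequencies are derived from the proportions observed in the older data window:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([220, 140, 90, 50])              # counts per category (incl. "missing") in new data
expected_props = np.array([0.45, 0.30, 0.20, 0.05])  # proportions seen in the older data
expected = expected_props * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(p_value)  # a small p-value suggests the category distribution has shifted
```
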
36
Q

A data scientist is using MLflow to track their machine learning experiment. As a part of each MLflow run, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. They are using the following code block:

![Image](https://img.examtopics.com/certified-machine-learning-professional/image1.png)

The code block is not nesting the runs in MLflow as they expected.

Which of the following changes does the data scientist need to make to the above code block so that it successfully nests the child runs under the parent run in MLflow?

A. Indent the child run blocks within the parent run block
B. Add the nested=True argument to the parent run
C. Remove the nested=True argument from the child runs
D. Provide the same name to the run_name parameter for all three run blocks
E. Add the nested=True argument to the parent run and remove the nested=True arguments from the child runs
A

A. Indent the child run blocks within the parent run block

Indentation in Python defines code blocks. By indenting the child run blocks within the parent run's with block, the child runs are started while the parent run is still active, so MLflow nests them under the parent. The with statement relies on indentation to define the scope of the code executed within the context of the managed resource (in this case, the MLflow run).

Options B, C, D, and E are incorrect because they do not address the fundamental issue of defining the code hierarchy correctly. While the nested=True argument belongs on the child runs, it does not remove the need for proper indentation to place the child runs inside the parent run's context.
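A minimal sketch of the corrected structure, with hypothetical hyperparameter values:

```python
import mlflow

with mlflow.start_run(run_name="tuning_parent"):
    for n_estimators in [50, 100, 200]:
        # The child run block is indented inside the parent run's block and marked nested=True
        with mlflow.start_run(run_name=f"trees_{n_estimators}", nested=True):
            mlflow.log_param("n_estimators", n_estimators)
```
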
37
Q

A machine learning engineer wants to log feature importance data from a CSV file at path importance_path with an MLflow run for model model.

Which of the following code blocks will accomplish this task inside of an existing MLflow run block?

A. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image2.png)
B. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image3.png)
C. mlflow.log_data(importance_path, "feature-importance.csv")
D. mlflow.log_artifact(importance_path, "feature-importance.csv")
E. None of these code blocks can accomplish the task.
A

D. mlflow.log_artifact(importance_path, "feature-importance.csv")

The mlflow.log_artifact() function is used to log files, such as the feature importance CSV, as artifacts within an MLflow run. This ensures the file is tracked and associated with the run.

Options A and B are incorrect because they are not valid MLflow calls for this task. Option C is incorrect because mlflow.log_data is not a valid MLflow function. Option E is incorrect because option D accomplishes the task.
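A minimal sketch inside an existing run block:

```python
import mlflow

with mlflow.start_run():
    # importance_path is the local path to the feature importance CSV from the question
    mlflow.log_artifact(importance_path)
```
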
38
Q

A machine learning engineer is in the process of implementing a concept drift monitoring solution. They are planning to use the following steps:

1. Deploy a model to production and compute predicted values
2. Obtain the observed (actual) label values
3. _____
4. Run a statistical test to determine if there are changes over time

Which of the following should be completed as Step #3?

A. Obtain the observed (actual) feature values
B. Measure the latency of the prediction time
C. Retrain the model
D. None of these should be completed as Step #3
E. Compute the evaluation metric using the observed and predicted values
A

E. Compute the evaluation metric using the observed and predicted values

Concept drift is detected by observing the degradation of model performance metrics. After obtaining the predicted and actual values, calculating metrics like accuracy, precision, recall, or F1-score allows model performance to be assessed over time. The statistical test in Step #4 then determines whether the change in these metrics is statistically significant, indicating concept drift.

Option A is incorrect because, while feature values are important for model training and understanding data distributions, they are not directly used to evaluate model performance in concept drift monitoring. Option B is incorrect because latency measures prediction speed, which is a performance concern but not indicative of concept drift. Option C is incorrect because retraining the model would be a response to detecting concept drift, not a step in the monitoring process itself. Option D is incorrect because one of the provided options is correct.
39
Q

Which of the following MLflow operations can be used to automatically calculate and log a Shapley feature importance plot?

A. mlflow.shap.log_explanation
B. None of these operations can accomplish the task.
C. mlflow.shap
D. mlflow.log_figure
E. client.log_artifact
A

A. mlflow.shap.log_explanation

The mlflow.shap.log_explanation operation automatically calculates and logs Shapley feature importance plots. It computes explanations of a model's output and logs them as a directory of artifacts containing base values, SHAP values, and a summary bar plot.

Option B is incorrect because mlflow.shap.log_explanation does accomplish the task. Option C is only partially correct because mlflow.shap is the module; mlflow.shap.log_explanation is the specific function used for logging explanations. Option D is incorrect because mlflow.log_figure logs a figure but does not automatically calculate Shapley values or create the associated plots. Option E is incorrect because client.log_artifact logs an existing artifact but does not calculate or create a Shapley plot.
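A minimal sketch, assuming a fitted model and a small sample of feature rows X_sample:

```python
import mlflow
import mlflow.shap

with mlflow.start_run():
    # Computes SHAP values for X_sample and logs them, with a summary bar plot, as run artifacts
    mlflow.shap.log_explanation(model.predict, X_sample)
```
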
40
Q

A machine learning engineer is converting a Hyperopt-based hyperparameter tuning process from manual MLflow logging to MLflow Autologging. They are trying to determine how to manage nested Hyperopt runs with MLflow Autologging.

Which of the following approaches will create a single parent run for the process and a child run for each unique combination of hyperparameter values when using Hyperopt and MLflow Autologging?

A. Starting a manual parent run before calling fmin
B. Ensuring that a built-in model flavor is used for the model logging
C. Starting a manual child run within the objective_function
D. There is no way to accomplish nested runs with MLflow Autologging and Hyperopt
E. MLflow Autologging will automatically accomplish this task with Hyperopt
A

A. Starting a manual parent run before calling fmin

Starting a manual parent run before calling fmin establishes a parent run under which all hyperparameter tuning iterations (handled by fmin) are logged as child runs, creating the desired nested structure.

Option B is incorrect because using a built-in model flavor affects how the model is logged, not the structure of MLflow runs. Option C is incorrect because starting a manual child run within the objective_function could work but is not the standard recommended way to manage nested runs with Hyperopt and Autologging, as it requires manual management within each iteration. Option D is incorrect because nested runs are possible with MLflow Autologging and Hyperopt. Option E is incorrect because Autologging alone will not automatically create a parent run; a parent run needs to be manually initiated.
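A hedged sketch of option A; the objective function, the train_and_score helper, and the search space are hypothetical:

```python
import mlflow
from hyperopt import fmin, tpe, hp, SparkTrials

def objective(params):
    # Train and evaluate a model with these hyperparameters, returning the loss (hypothetical helper)
    return train_and_score(params)

search_space = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}

# The manual parent run; each Hyperopt evaluation is then logged as a nested child run
with mlflow.start_run(run_name="hyperopt_tuning"):
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=16,
        trials=SparkTrials(parallelism=4),
    )
```
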
41
Q

A machine learning engineer is using the following code block as part of a batch deployment pipeline:

![Image](https://img.examtopics.com/certified-machine-learning-professional/image19.png)

Which of the following changes needs to be made so this code block will work when the inference table is a stream source?

A. Replace "inference" with the path to the location of the Delta table
B. Replace schema(schema) with option("maxFilesPerTrigger", 1)
C. Replace spark.read with spark.readStream
D. Replace format("delta") with format("stream")
E. Replace predict with a stream-friendly prediction function
A

C. Replace spark.read with spark.readStream

The code currently uses spark.read, which is for reading static data. To process a stream source, it needs to use spark.readStream.

Option A is incorrect because, while providing the correct path is important, it does not address the fundamental issue of reading a stream. Option B is incorrect because maxFilesPerTrigger is a streaming option that could be added, but it is not a substitute for specifying the schema. Option D is incorrect because "delta" is the correct format for reading Delta tables, whether static or streaming; there is no "stream" format. Option E is incorrect because, while the prediction function might need to be adapted for streaming, the read operation must be configured correctly first for the code to work with a stream source.
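A hedged reconstruction of what the corrected read might look like; the table name "inference" comes from the question, and the predict UDF is assumed to exist from earlier in the pipeline:

```python
# Read the Delta table as a stream instead of a static DataFrame
stream_df = (
    spark.readStream
    .format("delta")
    .table("inference")
)

# The same prediction UDF can then be applied to the streaming DataFrame
preds_df = stream_df.withColumn("prediction", predict(*stream_df.columns))
```
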
42
Q

Which of the following machine learning model deployment paradigms is the most common for machine learning projects?

A. On-device
B. Streaming
C. Real-time
D. Batch
E. None of these deployments
A

D. Batch

Batch deployment is the most common paradigm for machine learning projects; Databricks training material notes that roughly 80-90% of deployments are batch. Options A, B, C, and E are incorrect because those paradigms are less common than batch deployment.
43
Q

Which of the following is an advantage of using the python_function (pyfunc) model flavor over the built-in library-specific model flavors?

A. python_function provides no benefits over the built-in library-specific model flavors
B. python_function can be used to deploy models in a parallelizable fashion
C. python_function can be used to deploy models without worrying about which library was used to create the model
D. python_function can be used to store models in an MLmodel file
E. python_function can be used to deploy models without worrying about whether they are deployed in batch, streaming, or real-time environments
A

C. python_function can be used to deploy models without worrying about which library was used to create the model

The python_function flavor provides a generic interface, allowing models from various ML frameworks to be loaded and deployed without being tied to a specific library. This versatility is its key advantage.

Option A is incorrect because python_function does provide benefits. Option D is incorrect because every logged model is described by an MLmodel file, so this is not an advantage specific to python_function. Option E is incorrect because the flavor does not determine whether a model is deployed in batch, streaming, or real-time environments. Option B is incorrect because, while parallelization may be possible, it is not the main advantage.
44
Q

Which of the following lists all of the model stages that are available in the MLflow Model Registry?

A. Development, Staging, Production
B. None, Staging, Production
C. Staging, Production, Archived
D. None, Staging, Production, Archived
E. Development, Staging, Production, Archived
A

D. None, Staging, Production, Archived

The MLflow Model Registry defines the stages a model version can be in as None, Staging, Production, and Archived. The other options either omit valid stages or include stages, such as Development, that are not part of the registry.
45
Q

Which of the following MLflow Model Registry use cases requires the use of an HTTP Webhook?

A. Starting a testing job when a new model is registered
B. Updating data in a source table for a Databricks SQL dashboard when a model version transitions to the Production stage
C. Sending an email alert when an automated testing Job fails
D. None of these use cases require the use of an HTTP Webhook
E. Sending a message to a Slack channel when a model version transitions stages
A

E. Sending a message to a Slack channel when a model version transitions stages

Sending a Slack notification requires calling an external HTTP endpoint (Slack's incoming webhook URL), which is exactly what an HTTP registry webhook does; the MLflow documentation uses notifying team members through Slack as a canonical example. Options A and B can be handled with Job registry webhooks or Databricks Jobs rather than HTTP webhooks, and option C can be handled with job failure alerts, so none of them require an HTTP webhook. Option D is incorrect because use case E does require one.
46
Q

A machine learning engineer wants to log and deploy a model as an MLflow pyfunc model. They have custom preprocessing that needs to be completed on feature variables prior to fitting the model or computing predictions using that model. They decide to wrap this preprocessing in a custom model class ModelWithPreprocess, where the preprocessing is performed when calling fit and when calling predict. They then log the fitted model of the ModelWithPreprocess class as a pyfunc model.

Which of the following is a benefit of this approach when loading the logged pyfunc model for downstream deployment?

A. The pyfunc model can be used to deploy models in a parallelizable fashion
B. The same preprocessing logic will automatically be applied when calling fit
C. The same preprocessing logic will automatically be applied when calling predict
D. This approach has no impact when loading the logged pyfunc model for downstream deployment
E. There is no longer a need for pipeline-like machine learning objects
A

C. The same preprocessing logic will automatically be applied when calling predict

Because the preprocessing logic is wrapped inside the custom model class that was logged as a pyfunc model, the primary benefit when loading the model for deployment is that the preprocessing is automatically applied during prediction.

Option A is incorrect because pyfunc models do not inherently guarantee parallelizability. Option B is incorrect because the model has already been fitted; this question is about deployment. Option D is incorrect because wrapping the preprocessing does have an impact. Option E is incorrect because pipeline-like objects may still be useful for managing more complex workflows.
47
Which of the following Databricks-managed MLflow capabilities is a centralized model store? A. Models B. Model Registry C. Model Serving D. Feature Store E. Experiments
B. Model Registry The Model Registry is a centralized model store that helps manage the full lifecycle of MLflow Models, providing model versioning, annotation, and stage transitions. Options A, C, D, and E are incorrect because Models is a general term, Model Serving is for deploying models, the Feature Store is for managing features, and Experiments is for tracking ML runs.
53
A machine learning engineer has created a webhook with the following code block: ![Image](https://img.examtopics.com/certified-machine-learning-professional/image32.png) Which of the following code blocks will trigger this webhook to run the associated job? A. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image33.png) B. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image34.png) C. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image35.png) D. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image36.png) E. ![Image](https://img.examtopics.com/certified-machine-learning-professional/image37.png)
E. The webhook is configured to trigger on the `transition_model_version_stage` event for a model named "model". Option E correctly calls the `transition_model_version_stage` method on the "model" model, specifying a target stage. Option A is incorrect because it uses the wrong model name ("new_model"). Option B is incorrect because it calls the `transition_model_version_stage` method with "from" and "to" arguments, which are not supported. Option C is incorrect because the model name is wrong, and option D is incorrect because it doesn't call the `transition_model_version_stage` method.
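For reference, a call of roughly this shape is what emits the stage-transition event the webhook listens for; the version number and target stage below are placeholders:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transitioning any version of "model" fires webhooks registered for the
# MODEL_VERSION_TRANSITIONED_STAGE event on that model.
client.transition_model_version_stage(
    name="model",
    version=1,
    stage="Staging",
)
```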
54
A machine learning engineer wants to view all of the active MLflow Model Registry Webhooks for a specific model. They are using the following code block: ![Image](https://img.examtopics.com/certified-machine-learning-professional/image31.png) Which of the following changes does the machine learning engineer need to make to this code block so it will successfully accomplish the task? A. There are no necessary changes B. Replace list with view in the endpoint URL C. Replace POST with GET in the call to http_request D. Replace list with webhooks in the endpoint URL E. Replace POST with PUT in the call to http_request
C. Listing webhooks is a retrieval operation, so the request must use the GET method; the code block currently issues a POST. Replacing POST with GET in the call to `http_request` fixes it. Option A is incorrect because a change is required, and options B, D, and E propose URL or HTTP-method changes that do not match the webhook list endpoint.
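A rough equivalent using the requests library, with the workspace host and token as placeholders, to show the GET call against the list endpoint:

```python
import requests

host = "https://<databricks-instance>"   # assumption: your workspace URL
token = "<personal-access-token>"        # assumption: a valid Databricks PAT

# Listing is a read operation, so the request is a GET; the model name
# restricts the results to webhooks registered for that model.
resp = requests.get(
    f"{host}/api/2.0/mlflow/registry-webhooks/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"model_name": "model"},
)
resp.raise_for_status()
print(resp.json())
```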
55
A machine learning engineering manager has asked all of the engineers on their team to add text descriptions to each of the model projects in the MLflow Model Registry. They are starting with the model project "model" and they'd like to add the text in the model_description variable. The team is using the following line of code: ![Image](https://img.examtopics.com/certified-machine-learning-professional/image25.png) Which of the following changes does the team need to make to the above code block to accomplish the task? A. Replace update_registered_model with update_model_version B. There are no changes necessary C. Replace description with artifact D. Replace client.update_registered_model with mlflow E. Add a Python model as an argument to update_registered_model
B. The provided code snippet correctly uses `client.update_registered_model` to update the description of the registered model. According to the MLflow documentation, this is the appropriate method to update the model's description. * Option A is incorrect because `update_model_version` is used to update the properties of a specific model version, not the registered model itself. * Option C is incorrect because the parameter is called `description`. * Option D is incorrect because `client.update_registered_model` is the correct way to access this functionality through the MLflow client. * Option E is incorrect because `update_registered_model` takes the description directly as a parameter.
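The call in question has roughly this shape; the description text below is a placeholder:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_description = "Placeholder description of the 'model' project."

# Updates metadata on the registered model itself, not on any single version.
client.update_registered_model(name="model", description=model_description)
```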
56
Which of the following MLflow operations can be used to delete a model from the MLflow Model Registry? A. client.transition_model_version_stage B. client.delete_model_version C. client.update_registered_model D. client.delete_model E. client.delete_registered_model
E. `client.delete_registered_model` The correct answer is E. The `client.delete_registered_model` operation is used to delete a model from the MLflow Model Registry. Option A is incorrect because `client.transition_model_version_stage` is used to transition a model version's stage (e.g., from "Staging" to "Production"). Option B is incorrect because `client.delete_model_version` only deletes a specific version of a registered model, not the entire model. Option C is incorrect because `client.update_registered_model` is used to update the metadata of a registered model (e.g., description, tags). Option D is incorrect because there is no MLflow operation called `client.delete_model`.
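For contrast, a short sketch of the two delete operations; the model name and version number are placeholders:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Removes only one version of the registered model.
client.delete_model_version(name="model", version=2)

# Removes the registered model "model" and all of its remaining versions.
client.delete_registered_model(name="model")
```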
57
Which of the following is a reason for using Jensen-Shannon (JS) distance over a Kolmogorov-Smirnov (KS) test for numeric feature drift detection? A. All of these reasons B. JS is not normalized or smoothed C. None of these reasons D. JS is more robust when working with large datasets E. JS does not require any manual threshold or cutoff determinations
D. JS is more robust when working with large datasets With very large samples, the KS test's p-value becomes extremely sensitive, so even trivial differences between the two distributions are flagged as statistically significant drift; the JS distance is a bounded, smoothed divergence measure whose value does not inflate with sample size, making it more practical at scale. Option B is incorrect because JS distance *is* normalized (it falls between 0 and 1 when computed with a base-2 logarithm) and the compared distributions are typically smoothed to avoid zero-probability issues. Option E is incorrect because a drift threshold still has to be chosen for the JS distance, just as a significance level must be chosen for the KS test. Options A and C are incorrect because only one of the listed reasons (D) holds.
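A minimal sketch of numeric drift detection with JS distance, using synthetic samples as stand-ins for the reference and current feature values; the two windows are binned on shared edges and compared with scipy's jensenshannon:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Assumption: synthetic stand-ins for the baseline and newly observed samples.
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=100_000)
current = rng.normal(loc=0.1, scale=1.0, size=100_000)

# Bin both windows on shared edges so the histograms are comparable.
edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=50)
p, _ = np.histogram(reference, bins=edges, density=True)
q, _ = np.histogram(current, bins=edges, density=True)

# With base=2 the JS distance is bounded in [0, 1]; 0 means identical
# distributions, and values near 1 indicate severe drift.
js = jensenshannon(p, q, base=2)
print(f"JS distance: {js:.4f}")
```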
58
Which of the following is a simple statistic to monitor for categorical feature drift? A. Mode B. None of these C. Mode, number of unique values, and percentage of missing values D. Percentage of missing values E. Number of unique values
A. Mode The question asks for a single, simple statistic. The mode, the most frequent category, is one such statistic: a change in the mode over time is a quick signal that the distribution of a categorical feature has shifted. Option C lists several statistics rather than one, and the number of unique values and the percentage of missing values (options E and D), while useful to track, do not directly capture shifts among the existing category values.
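A toy sketch of the idea, with made-up category values for the two windows:

```python
import pandas as pd

# Hypothetical reference (training-time) and current (serving-time) samples.
reference = pd.Series(["a", "a", "b", "c", "a"])
current = pd.Series(["b", "b", "b", "a", "c"])

# Comparing the most frequent category across windows is a cheap drift check.
if reference.mode()[0] != current.mode()[0]:
    print("Possible categorical drift: the mode has changed.")
```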
59
A machine learning engineer needs to select a deployment strategy for a new machine learning application. The feature values are not available until the time of delivery, and results are needed exceedingly fast for one record at a time. Which of the following deployment strategies can be used to meet these requirements? A. Edge/on-device B. Streaming C. None of these strategies will meet the requirements. D. Batch E. Real-time
E. Real-time Real-time deployment is the best choice because it provides immediate predictions for each record as it arrives. This aligns with the requirement for exceedingly fast results and the fact that feature values are only available at the time of delivery. Option A is incorrect because edge/on-device deployment might not always guarantee the fastest results, especially if the device has limited processing power. Option B is incorrect because while streaming can be used in real-time, it doesn't necessarily guarantee the fastest response for single records, as it often involves processing data in micro-batches. Option C is incorrect because real-time deployment *can* meet the requirements. Option D is incorrect because batch deployment is designed for processing large volumes of data at once, which contradicts the need for immediate results for single records.