Practice Questions - Certified Machine Learning Professional Exam Flashcards
A machine learning engineer is migrating a machine learning pipeline to use Databricks Machine Learning. They have programmatically identified the best run from an MLflow Experiment and stored its URI in the model_uri variable and its Run ID in the run_id variable. They have also determined that the model was logged with the name "model". Now, the machine learning engineer wants to register that model in the MLflow Model Registry with the name "best_model".
Which of the following lines of code can they use to register the model to the MLflow Model Registry?
A.
```python
mlflow.register_model(model_uri, "best_model")
```
B.
```python
mlflow.register_model(run_id, "best_model")
```
C.
```python
mlflow.register_model(f"runs:/{run_id}/best_model", "model")
```
D.
```python
mlflow.register_model(model_uri, "model")
```
E.
```python
mlflow.register_model(f"runs:/{run_id}/model")
```
A. mlflow.register_model(model_uri, "best_model")
The mlflow.register_model function requires the model_uri (location of the model) and the desired registered model name as arguments. Option A correctly passes the model_uri and the desired name "best_model".
Option B is incorrect because it uses the run_id instead of the model_uri.
Option C is incorrect because it constructs a URI using the run_id but incorrectly uses "best_model" within the URI when the model was logged as "model". It also uses the wrong registered name, "model" instead of "best_model".
Option D is incorrect because it uses the model_uri correctly, but it uses the wrong name, "model" instead of "best_model".
Option E is incorrect because it constructs a URI using the run_id but doesn't specify the registered model name.
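For reference, a minimal sketch of how the model_uri in this scenario relates to the run and the logged artifact name; the runs:/ URI construction shown here is an assumption about how model_uri was populated earlier in the pipeline:

```python
import mlflow

# The model was logged under the artifact path "model", so its runs:/ URI
# combines the run ID with that path.
model_uri = f"runs:/{run_id}/model"

# Registers the logged model as a registered model named "best_model"
# (or as a new version if "best_model" already exists).
registered = mlflow.register_model(model_uri, "best_model")
print(registered.name, registered.version)
```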
A machine learning engineer wants to move their model version model_version for the MLflow Model Registry model model from the Staging stage to the Production stage using the MLflow Client client.
Which of the following code blocks can they use to accomplish the task?
A.
B.
C.
D.
E.
C.
The transition_model_version_stage method is the correct method to promote a model version to a new stage. Option C correctly uses this method, passing the model name, model version, and the target stage "Production".
Options A, B, D, and E use incorrect methods or parameters and are therefore not the correct answer.
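As a hedged sketch of the call described for option C, where model and model_version stand in for the registered model name and version from the question:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote the given model version from Staging to Production.
client.transition_model_version_stage(
    name=model,             # registered model name
    version=model_version,  # model version to promote
    stage="Production",
)
```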
A machine learning engineer is manually refreshing a model in an existing machine learning pipeline. The pipeline uses the MLflow Model Registry model “project”. The machine learning engineer would like to add a new version of the model to “project”.
Which of the following MLflow operations can the machine learning engineer use to accomplish this task?
A. mlflow.register_model
B. MlflowClient.update_registered_model
C. mlflow.add_model_version
D. MlflowClient.get_model_version
E. The machine learning engineer needs to create an entirely new MLflow Model Registry model
A. mlflow.register_model
The question states that the engineer wants to add a new version of the model to "project". mlflow.register_model will create a new model version in the Model Registry for the model files specified by model_uri.
Option B is incorrect because MlflowClient.update_registered_model updates the metadata of the registered model (such as the description), not the model version.
Option C is incorrect because mlflow.add_model_version is not a valid MLflow function.
Option D is incorrect because MlflowClient.get_model_version retrieves information about a specific model version; it does not create a new one.
Option E is incorrect because the model already exists, so a new one does not need to be created.
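As a brief illustration, calling mlflow.register_model against an existing registered model name simply adds a new version under it; the artifact path "model" in the URI below is an assumption, since the question does not state it:

```python
import mlflow

# "project" already exists in the Model Registry, so this call creates a
# new version under it rather than a new registered model.
new_version = mlflow.register_model(f"runs:/{run_id}/model", "project")
print(new_version.version)
```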
A machine learning engineer has developed a random forest model using scikit-learn, logged the model using MLflow as random_forest_model, and stored its run ID in the run_id Python variable. They now want to deploy that model by performing batch inference on a Spark DataFrame spark_df.
Which of the following code blocks can they use to create a function called predict that they can use to complete the task?
A.
B. It is not possible to deploy a scikit-learn model on a Spark DataFrame.
C.
D.
E.
E.
Explanation:
Option E is correct because it demonstrates the proper usage of mlflow.pyfunc.spark_udf to deploy an MLflow model for batch inference on a Spark DataFrame. The mlflow.pyfunc.spark_udf function requires the SparkSession as its first argument and the model URI as the second. The result is a Spark UDF that can be applied to the Spark DataFrame to generate predictions.
Option A is incorrect because it passes spark_df (the Spark DataFrame) as the first argument to mlflow.pyfunc.spark_udf; the first argument must be the SparkSession object.
Option B is incorrect because it is possible to deploy scikit-learn models on Spark DataFrames using MLflow.
Option C is incorrect because it attempts to load the model using mlflow.spark.load_model, which is intended for models trained with Spark MLlib, not generic Python models. The loaded model is also not properly used as a UDF.
Option D is incorrect because it incorrectly assumes that mlflow.pyfunc.load_model can directly operate on Spark DataFrames, which is not the correct approach for batch inference. This method loads the model into the driver's memory but doesn't distribute the prediction workload across the Spark cluster.
Which of the following describes the purpose of the context parameter in the predict method of Python models for MLflow?
A. The context parameter allows the user to specify which version of the registered MLflow Model should be used based on the given application’s current scenario
B. The context parameter allows the user to document the performance of a model after it has been deployed
C. The context parameter allows the user to include relevant details of the business case to allow downstream users to understand the purpose of the model
D. The context parameter allows the user to provide the model with completely custom if-else logic for the given application’s current scenario
E. The context parameter allows the user to provide the model access to objects like preprocessing models or custom configuration files
E. The context parameter in the predict method of Python models for MLflow is used to provide the model with access to external objects like preprocessing models or custom configuration files. This allows the model to utilize necessary resources for making accurate predictions. Options A, B, C, and D describe functionalities that are not the primary purpose of the context parameter: it is not meant to determine the model version (A), document model performance (B), provide business case details (C), or inject custom if-else logic (D).
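A minimal, hypothetical custom pyfunc model illustrating what the context parameter provides; the artifact name "preprocessor" and the use of pickle are illustrative assumptions:

```python
import pickle
import mlflow.pyfunc


class WrappedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local file paths,
        # giving the model access to objects such as a preprocessing model.
        with open(context.artifacts["preprocessor"], "rb") as f:
            self.preprocessor = pickle.load(f)

    def predict(self, context, model_input):
        # The same context is available at prediction time.
        features = self.preprocessor.transform(model_input)
        return features.sum(axis=1)  # placeholder prediction logic
```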
A machine learning engineer has developed a model and registered it using the FeatureStoreClient fs. The model has model URI model_uri. The engineer now needs to perform batch inference on customer-level Spark DataFrame spark_df, but it is missing a few of the static features that were used when training the model. The customer_id column is the primary key of spark_df and the training set used when training and logging the model.
Which of the following code blocks can be used to compute predictions for spark_df when the missing feature values can be found in the Feature Store by searching for features by customer_id?
A.
```python
df = fs.get_missing_features(spark_df, model_uri)
fs.score_model(model_uri, df)
```
B.
```python
fs.score_model(model_uri, spark_df)
```
C.
```python
df = fs.get_missing_features(spark_df, model_uri)
fs.score_batch(model_uri, df)
```
D.
```python
df = fs.get_missing_features(spark_df)
fs.score_batch(model_uri, df)
```
E.
```python
fs.score_batch(model_uri, spark_df)
```
E.
The score_batch method of the FeatureStoreClient automatically retrieves missing features from the Feature Store during batch inference, given that the primary key is available in the input DataFrame. Therefore, it is sufficient to call fs.score_batch(model_uri, spark_df) to perform batch inference. Options A, C, and D include the method get_missing_features, which is not a valid method of the FeatureStoreClient. Option B uses the score_model method, which is not designed for batch inference.
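A short sketch of the call described in option E; the import path reflects the Databricks Feature Store client and may vary by runtime version:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# score_batch looks up the missing static features by the primary key
# (customer_id) and returns the input rows with a prediction column appended.
predictions_df = fs.score_batch(model_uri, spark_df)
```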
Which of the following describes the concept of MLflow Model flavors?
A. A convention that deployment tools can use to wrap preprocessing logic into a Model
B. A convention that MLflow Model Registry can use to version models
C. A convention that MLflow Experiments can use to organize their Runs by project
D. A convention that deployment tools can use to understand the model
E. A convention that MLflow Model Registry can use to organize its Models by project
D. A convention that deployment tools can use to understand the model
The MLflow Model flavor is a convention that allows deployment tools to understand the structure and requirements of a model, enabling efficient deployment across different platforms. Flavors provide a standardized way to package models, including necessary metadata and environment details, so they can be loaded and used consistently.
Options A, B, C, and E are incorrect because they do not accurately describe the purpose of MLflow Model flavors. Flavors are not primarily for wrapping preprocessing logic, versioning models in the Model Registry, organizing runs in Experiments, or organizing models in the Model Registry. Their primary function is to enable deployment tools to understand and deploy models effectively.
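One way to see flavors in practice: the same logged scikit-learn model can be loaded through its native sklearn flavor or the generic python_function flavor, which is what lets different deployment tools consume it. Here model_uri is assumed to point at an sklearn-flavored model:

```python
import mlflow.pyfunc
import mlflow.sklearn

# Native flavor: returns the original scikit-learn estimator object.
sk_model = mlflow.sklearn.load_model(model_uri)

# Generic pyfunc flavor: returns a uniform wrapper with a predict() method,
# which deployment tools can use without knowing the underlying library.
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
```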
In a continuous integration, continuous deployment (CI/CD) process for machine learning pipelines, which of the following events commonly triggers the execution of automated testing?
A. The launch of a new cost-efficient SQL endpoint
B. CI/CD pipelines are not needed for machine learning pipelines
C. The arrival of a new feature table in the Feature Store
D. The launch of a new cost-efficient job cluster
E. The arrival of a new model version in the MLflow Model Registry
E. The arrival of a new model version in the MLflow Model Registry
Automated testing in a CI/CD pipeline for ML is triggered when a new model version is registered. This ensures the new model performs as expected before deployment.
Option A is incorrect because SQL endpoint launches are related to data access, not model performance. Option B is incorrect because CI/CD pipelines are crucial for automating and managing ML model deployment. Option C is incorrect because while new feature tables are important, they don’t directly trigger model testing. Option D is incorrect because job cluster launches are related to resource allocation, not model validation.
A machine learning engineering team has written predictions computed in a batch job to a Delta table for querying. However, the team has noticed that the querying is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the query condition are sparsely located throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query by colocating similar records while considering values in multiple columns?
A. Z-Ordering
B. Bin-packing
C. Write as a Parquet file
D. Data skipping
E. Tuning the file size
A. Z-Ordering
Z-Ordering is a data locality technique used in Delta Lake to optimize query performance. It achieves this by clustering similar records together based on the values of multiple columns. This co-location of related data reduces the amount of data that needs to be scanned to satisfy a query, especially when filtering on multiple columns, thus speeding up queries.
Options B, C, D, and E are incorrect because:
* Bin-packing (file compaction) optimizes file sizes, which has already been addressed, and does not colocate related records by column values.
* Writing as a Parquet file is a general storage format and doesn’t, by itself, address the problem of sparse data distribution.
* Data skipping relies on metadata about data distribution within files, but doesn’t re-organize the data for co-location.
* Tuning the file size has already been done, as stated in the problem description.
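As a hedged example, Z-Ordering is applied with Delta Lake's OPTIMIZE command; the table and column names here are placeholders:

```python
# Rewrites the table's data files so that rows with similar values in the
# listed columns are colocated, improving data skipping for those filters.
spark.sql("""
    OPTIMIZE predictions_table
    ZORDER BY (customer_id, prediction_date)
""")
```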
A machine learning engineer needs to deliver predictions of a machine learning model in real-time. However, the feature values needed for computing the predictions are available one week before the query time.
Which of the following is a benefit of using a batch serving deployment in this scenario rather than a real-time serving deployment where predictions are computed at query time?
A. Batch serving has built-in capabilities in Databricks Machine Learning
B. There is no advantage to using batch serving deployments over real-time serving deployments
C. Computing predictions in real-time provides more up-to-date results
D. Testing is not possible in real-time serving deployments
E. Querying stored predictions can be faster than computing predictions in real-time
E. Querying stored predictions can be faster than computing predictions in real-time
Explanation:
Since the feature values are available a week in advance, predictions can be pre-computed and stored using batch serving. When a prediction is needed, it can be quickly retrieved from storage, which is faster than computing it on-demand in real-time.
- A is incorrect because while Databricks may offer capabilities for batch serving, it’s not the primary benefit in this specific scenario.
- B is incorrect because there is a clear advantage (speed) to batch serving in this scenario.
- C is incorrect because batch serving allows for pre-computation with data available in advance, negating the need for up-to-the-minute real-time computation.
- D is incorrect because testing is possible in both real-time and batch serving deployments.
Which of the following tools can assist in real-time deployments by packaging software with its own application, tools, and libraries?
A. Cloud-based compute
B. None of these tools
C. REST APIs
D. Containers
E. Autoscaling clusters
D. Containers
Containers, like Docker, package an application with all its dependencies (libraries, tools, etc.) into a single, portable unit. This ensures consistent execution across different environments, crucial for real-time deployments.
Option A is incorrect because while cloud-based compute provides the infrastructure, it doesn’t handle the packaging of software and its dependencies. Option B is incorrect as containers are a valid tool. Option C is incorrect because REST APIs are used for communication between software systems, not for packaging applications. Option E is incorrect because autoscaling clusters adjust resources based on demand, but they don’t package software with its dependencies.
A machine learning engineer has registered a sklearn model in the MLflow Model Registry using the sklearn model flavor with URI model_uri.
Which of the following operations can be used to load the model as an sklearn object for batch deployment?
A. mlflow.spark.load_model(model_uri)
B. mlflow.pyfunc.read_model(model_uri)
C. mlflow.sklearn.read_model(model_uri)
D. mlflow.pyfunc.load_model(model_uri)
E. mlflow.sklearn.load_model(model_uri)
E. mlflow.sklearn.load_model(model_uri)
The question specifies that the model was saved using the sklearn flavor. Therefore, the mlflow.sklearn.load_model function should be used to load it back as a scikit-learn object. Options A, B, C, and D use functions (mlflow.spark.load_model, mlflow.pyfunc.read_model, mlflow.sklearn.read_model, mlflow.pyfunc.load_model) that are not appropriate for loading the model back as a scikit-learn object.
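A brief sketch of option E in use for batch scoring; the pandas DataFrame batch_df is an assumed input:

```python
import mlflow.sklearn

# Load the model back as a native scikit-learn object.
model = mlflow.sklearn.load_model(model_uri)

# Standard scikit-learn batch prediction on an in-memory DataFrame.
predictions = model.predict(batch_df)
```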
A data scientist set up a machine learning pipeline to automatically log a data visualization with each run. They now want to view the visualizations in Databricks.
Which of the following locations in Databricks will show these data visualizations?
A. The MLflow Model Registry Model page
B. The Artifacts section of the MLflow Experiment page
C. Logged data visualizations cannot be viewed in Databricks
D. The Artifacts section of the MLflow Run page
E. The Figures section of the MLflow Run page
D. The Artifacts section of the MLflow Run page
Explanation: When data visualizations are logged in MLflow, they are stored as artifacts associated with a specific run. These artifacts can be accessed and viewed in the Artifacts section of the corresponding MLflow Run page within Databricks.
- A is incorrect because the Model Registry is for managing MLflow models, not run artifacts.
- B is incorrect because while the Experiment page provides an overview of runs, the artifacts are found within the individual Run pages.
- C is incorrect because logged data visualizations can be viewed in Databricks.
- E is incorrect because there is no dedicated Figures section on the MLflow Run page; logged visualizations appear under the Artifacts section.
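For context, a minimal sketch of how such a visualization ends up under the run's Artifacts section; matplotlib and the artifact file name are illustrative assumptions:

```python
import matplotlib.pyplot as plt
import mlflow

with mlflow.start_run():
    fig, ax = plt.subplots()
    ax.hist([1, 2, 2, 3, 3, 3])  # placeholder visualization

    # Stores the figure as a run artifact, viewable on the MLflow Run page.
    mlflow.log_figure(fig, "distribution.png")
```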
A data scientist has developed a scikit-learn model sklearn_model and they want to log the model using MLflow.
They write the following incomplete code block:
Which of the following lines of code can be used to fill in the blank so the code block can successfully complete the task?
A. mlflow.spark.track_model(sklearn_model, "model")
B. mlflow.sklearn.log_model(sklearn_model, "model")
C. mlflow.spark.log_model(sklearn_model, "model")
D. mlflow.sklearn.load_model("model")
E. mlflow.sklearn.track_model(sklearn_model, "model")
B. mlflow.sklearn.log_model(sklearn_model, "model")
The goal is to log a scikit-learn model using MLflow. mlflow.sklearn.log_model is the correct function for this purpose. It takes the model object and a path (artifact_path) as arguments.
Option A is incorrect because mlflow.spark is for logging Spark models, not scikit-learn models, and track_model is not a valid function in mlflow.spark.
Option C is incorrect because mlflow.spark is for logging Spark models, not scikit-learn models.
Option D is incorrect because mlflow.sklearn.load_model is used to load a previously logged model, not to log a model.
Option E is incorrect because track_model is not a valid function in mlflow.sklearn.
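A hedged sketch of the completed pattern option B produces; the surrounding mlflow.start_run() context is an assumption about the omitted portion of the code block:

```python
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # Logs the fitted scikit-learn model under the artifact path "model".
    mlflow.sklearn.log_model(sklearn_model, "model")
```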
A machine learning engineer has deployed a model recommender using MLflow Model Serving. They now want to query the version of that model that is in the Production stage of the MLflow Model Registry.
Which of the following model URIs can be used to query the described model version?
A. https://<databricks-instance>/model-serving/recommender/Production/invocations
B. The version number of the model version in Production is necessary to complete this task.
C. https://<databricks-instance>/model/recommender/stage-production/invocations
D. https://<databricks-instance>/model-serving/recommender/stage-production/invocations
E. https://<databricks-instance>/model/recommender/Production/invocations
E. The correct URI structure to query a model in a specific stage of the MLflow Model Registry using MLflow Model Serving is https://<databricks-instance>/model/<model_name>/<stage>/invocations. Therefore, option E is correct. Options A and D are incorrect because they use /model-serving/ instead of /model/. Option C is incorrect because it uses /stage-production/ instead of /Production/. Option B is incorrect because the version number is not necessary when querying by stage.
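A rough sketch of querying that URI over REST; the token variable, the example record, and the JSON payload orientation are assumptions and can differ across MLflow and Databricks versions:

```python
import requests

url = "https://<databricks-instance>/model/recommender/Production/invocations"
headers = {"Authorization": f"Bearer {databricks_token}"}  # assumed PAT variable

# Payload orientation varies by MLflow version; this assumes the
# pandas "records" style accepted by recent scoring servers.
payload = {"dataframe_records": [{"user_id": 123, "item_id": 456}]}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```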
Which of the following is a benefit of logging a model signature with an MLflow model?
A. The model will have a unique identifier in the MLflow experiment
B. The schema of input data can be validated when serving models
C. The model can be deployed using real-time serving tools
D. The model will be secured by the user that developed it
E. The schema of input data will be converted to match the signature
B. The schema of input data can be validated when serving models
Logging a model signature in MLflow defines the expected schema for the model’s inputs and outputs. This allows MLflow’s deployment tools to validate incoming data during serving, ensuring it matches the expected format. This validation helps prevent errors and unexpected behavior during inference.
Option A is incorrect because while MLflow tracks models within experiments, the signature itself doesn’t provide a unique identifier for the experiment. Option C is incorrect because while MLflow can be used with real-time serving tools, the signature doesn’t directly enable deployment. Option D is incorrect because model signatures don’t provide security features related to user access control. Option E is incorrect because the signature allows validation against a schema; it does not convert the input data to match the signature.
A machine learning engineer wants to deploy a model for real-time serving using MLflow Model Serving. For the model, the machine learning engineer currently has one model version in each of the stages in the MLflow Model Registry. The engineer wants to know which model versions can be queried once Model Serving is enabled for the model.
Which of the following lists all of the MLflow Model Registry stages whose model versions are automatically deployed with Model Serving?
A. Staging, Production, Archived
B. Production
C. None, Staging, Production, Archived
D. Staging, Production
E. None, Staging, Production
D. Staging, Production
MLflow Model Serving automatically deploys model versions that are in the Staging or Production stages. Model versions in the Archived or None stages are not automatically deployed. Therefore, option D is correct.
A machine learning engineering team wants to build a continuous pipeline for data preparation of a machine learning application. The team would like the data to be fully processed and made ready for inference in a series of equal-sized batches.
Which of the following tools can be used to provide this type of continuous processing?
A. Spark UDFs
B. Structured Streaming
C. MLflow
D. Delta Lake
E. AutoML
B. Structured Streaming
Structured Streaming in Spark allows for continuous processing of data streams, where data can be processed in real-time and output in equal-sized batches. This makes it suitable for building continuous pipelines for data preparation in machine learning applications.
Incorrect Options:
A. Spark UDFs (User Defined Functions) are for custom data transformations, not continuous processing.
C. MLflow is a platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
D. Delta Lake is a storage layer that brings reliability to data lakes.
E. AutoML automates the process of selecting, training, and tuning machine learning models.
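A hedged sketch of the kind of pipeline described, reading new records continuously from a Delta source and writing prepared micro-batches at a fixed trigger interval; the table names, checkpoint path, and trigger setting are assumptions:

```python
from pyspark.sql import functions as F

# Continuously read newly arriving records from an assumed Delta source table.
raw_stream = spark.readStream.table("raw_events")

# Example data-preparation step; a real pipeline would apply full feature logic here.
prepared = raw_stream.withColumn("amount", F.col("amount").cast("double"))

# Write the prepared data in regular micro-batches, ready for downstream inference.
query = (
    prepared.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/prepared_events")
    .trigger(processingTime="5 minutes")
    .toTable("prepared_events")
)
```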
A data scientist has written a function to track the runs of their random forest model. The data scientist is changing the number of trees in the forest across each run.
Which of the following MLflow operations is designed to log single values like the number of trees in a random forest?
A. mlflow.log_artifact
B. mlflow.log_model
C. mlflow.log_metric
D. mlflow.log_param
E. There is no way to store values like this.
D. mlflow.log_param
The question is asking which MLflow function is appropriate for logging a parameter, like the number of trees. mlflow.log_param is designed specifically for this purpose.
- A. mlflow.log_artifact: This is used for logging files or directories, not single values.
- B. mlflow.log_model: This is used for logging an entire machine learning model.
- C. mlflow.log_metric: This is used for logging metrics (evaluation results), which are typically numeric but represent performance, not hyperparameters.
- E. There is no way to store values like this: MLflow provides functionality for logging parameters.
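A small sketch contrasting the two logging calls discussed above; the parameter and metric values are placeholders:

```python
import mlflow

with mlflow.start_run():
    # Hyperparameter: a single configuration value such as the number of trees.
    mlflow.log_param("n_estimators", 200)

    # Metric: an evaluation result, which can be logged repeatedly over steps.
    mlflow.log_metric("rmse", 0.42)
```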
Which of the following deployment paradigms can centrally compute predictions for a single record with exceedingly fast results?
A. Streaming
B. Batch
C. Edge/on-device
D. None of these strategies will accomplish the task.
E. Real-time
E. Real-time deployment is designed for low-latency inference on individual records, making it suitable for scenarios where predictions must be computed centrally and returned quickly for a single record. Streaming and batch deployments compute predictions in groups rather than on demand, and edge/on-device inference is not centralized.
A machine learning engineer and data scientist are working together to convert a batch deployment to an always-on streaming deployment. The machine learning engineer has expressed that rigorous data tests must be put in place as a part of their conversion to account for potential changes in data formats.
Which of the following describes why these types of data type tests and checks are particularly important for streaming deployments?
A. Because the streaming deployment is always on, all types of data must be handled without producing an error
B. All of these statements
C. Because the streaming deployment is always on, there is no practitioner to debug poor model performance
D. Because the streaming deployment is always on, there is a need to confirm that the deployment can autoscale
E. None of these statements
The correct answer is B. Because streaming deployments are always on, all incoming data types must be handled without producing errors (A), there is less opportunity for a practitioner to debug poor model performance (C), and the deployment must be able to autoscale to varying data volumes (D). Statements A, C, and D are each individually correct but incomplete, making option B the correct answer. Option E is incorrect because options A, C, and D contain correct statements.
A data scientist has developed a scikit-learn random forest model model, but they have not yet logged model with MLflow. They want to obtain the input schema and the output schema of the model so they can document what type of data is expected as input.
Which of the following MLflow operations can be used to perform this task?
A. mlflow.models.schema.infer_schema
B. mlflow.models.signature.infer_signature
C. mlflow.models.Model.get_input_schema
D. mlflow.models.Model.signature
E. There is no way to obtain the input schema and the output schema of an unlogged model.
B. mlflow.models.signature.infer_signature
The mlflow.models.signature.infer_signature function can be used to infer the input and output schema of an unlogged model by inspecting example model inputs and the corresponding model outputs (predictions). This information is encapsulated in a ModelSignature object.
Option A is incorrect because mlflow.models.schema.infer_schema infers a schema from data, not a full input/output signature for a model.
Options C and D are incorrect because they operate on MLflow Model objects that have already been logged.
Option E is incorrect because mlflow.models.signature.infer_signature provides a way to obtain the schema of unlogged models.
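A brief sketch of inferring and logging a signature for the unlogged model; X_sample is an assumed slice of the training features:

```python
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

# Infer input/output schema from example inputs and the model's predictions.
signature = infer_signature(X_sample, model.predict(X_sample))

with mlflow.start_run():
    # The signature is stored alongside the model when it is logged.
    mlflow.sklearn.log_model(model, "model", signature=signature)
```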
Which of the following describes label drift?
A. Label drift is when there is a change in the distribution of the predicted target given by the model
B. None of these describe label drift
C. Label drift is when there is a change in the distribution of an input variable
D. Label drift is when there is a change in the relationship between input variables and target variables
E. Label drift is when there is a change in the distribution of a target variable
E. Label drift refers to a change in the distribution of the target variable itself. Options A, C, and D describe other types of drift, such as model prediction drift, feature drift, and concept drift, respectively.