Practice Questions - Certified Machine Learning Professional Exam Flashcards
A machine learning engineer is migrating a machine learning pipeline to use Databricks Machine Learning. They have programmatically identified the best run from an MLflow Experiment and stored its URI in the model_uri variable and its Run ID in the run_id variable. They have also determined that the model was logged with the name "model". Now, the machine learning engineer wants to register that model in the MLflow Model Registry with the name "best_model".
Which of the following lines of code can they use to register the model to the MLflow Model Registry?
A.
```python
mlflow.register_model(model_uri, "best_model")
```
B.
```python
mlflow.register_model(run_id, "best_model")
```
C.
```python
mlflow.register_model(f"runs:/{run_id}/best_model", "model")
```
D.
```python
mlflow.register_model(model_uri, "model")
```
E.
```python
mlflow.register_model(f"runs:/{run_id}/model")
```
A. mlflow.register_model(model_uri, "best_model")
The mlflow.register_model function requires the model_uri (location of the model) and the desired registered model name as arguments. Option A correctly passes the model_uri and the desired name "best_model".
Option B is incorrect because it uses the run_id instead of the model_uri.
Option C is incorrect because it constructs a URI using the run_id but incorrectly uses "best_model" within the URI when the model was logged as "model". It also uses the wrong registered name, "model" instead of "best_model".
Option D is incorrect because it uses the model_uri correctly, but it uses the wrong name, "model" instead of "best_model".
Option E is incorrect because it constructs a URI using the run_id but doesn't specify the registered model name.
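For reference, a minimal sketch of how the model_uri in this scenario relates to the run and the logged artifact name; the runs:/ URI construction shown here is an assumption about how model_uri was populated earlier in the pipeline:

```python
import mlflow

# The model was logged under the artifact path "model", so its runs:/ URI
# combines the run ID with that path.
model_uri = f"runs:/{run_id}/model"

# Registers the logged model as a registered model named "best_model"
# (or as a new version if "best_model" already exists).
registered = mlflow.register_model(model_uri, "best_model")
print(registered.name, registered.version)
```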
A machine learning engineer wants to move their model version model_version for the MLflow Model Registry model model from the Staging stage to the Production stage using the MLflow Client client.
Which of the following code blocks can they use to accomplish the task?
A.
B.
C.
D.
E.
C.
The transition_model_version_stage method is the correct method to promote a model version to a new stage. Option C correctly uses this method, passing the model name, model version, and the target stage "Production".
Options A, B, D, and E use incorrect methods or parameters and are therefore not the correct answer.
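As a hedged sketch of the call described for option C, where model and model_version stand in for the registered model name and version from the question:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote the given model version from Staging to Production.
client.transition_model_version_stage(
    name=model,             # registered model name
    version=model_version,  # model version to promote
    stage="Production",
)
```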
A machine learning engineer is manually refreshing a model in an existing machine learning pipeline. The pipeline uses the MLflow Model Registry model “project”. The machine learning engineer would like to add a new version of the model to “project”.
Which of the following MLflow operations can the machine learning engineer use to accomplish this task?
A. mlflow.register_model
B. MlflowClient.update_registered_model
C. mlflow.add_model_version
D. MlflowClient.get_model_version
E. The machine learning engineer needs to create an entirely new MLflow Model Registry model
A. mlflow.register_model
The question states that the engineer wants to add a new version of the model to "project". mlflow.register_model will create a new model version in the Model Registry for the model files specified by model_uri.
Option B is incorrect because MlflowClient.update_registered_model updates the metadata of the registered model (such as the description), not the model version.
Option C is incorrect because mlflow.add_model_version is not a valid MLflow function.
Option D is incorrect because MlflowClient.get_model_version retrieves information about a specific model version; it does not create a new one.
Option E is incorrect because the model already exists, so a new one does not need to be created.
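As a brief illustration, calling mlflow.register_model against an existing registered model name simply adds a new version under it; the artifact path "model" in the URI below is an assumption, since the question does not state it:

```python
import mlflow

# "project" already exists in the Model Registry, so this call creates a
# new version under it rather than a new registered model.
new_version = mlflow.register_model(f"runs:/{run_id}/model", "project")
print(new_version.version)
```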
A machine learning engineer has developed a random forest model using scikit-learn, logged the model using MLflow as random_forest_model, and stored its run ID in the run_id Python variable. They now want to deploy that model by performing batch inference on a Spark DataFrame spark_df.
Which of the following code blocks can they use to create a function called predict that they can use to complete the task?
A.
B. It is not possible to deploy a scikit-learn model on a Spark DataFrame.
C.
D.
E.
E.
Explanation:
Option E is correct because it demonstrates the proper usage of mlflow.pyfunc.spark_udf to deploy an MLflow model for batch inference on a Spark DataFrame. The mlflow.pyfunc.spark_udf function requires the SparkSession as its first argument and the model URI as the second. The result is a Spark UDF that can be applied to the Spark DataFrame to generate predictions.
Option A is incorrect because it passes spark_df (the Spark DataFrame) as the first argument to mlflow.pyfunc.spark_udf; the first argument must be the SparkSession object.
Option B is incorrect because it is possible to deploy scikit-learn models on Spark DataFrames using MLflow.
Option C is incorrect because it attempts to load the model using mlflow.spark.load_model, which is intended for models trained with Spark MLlib, not generic Python models. The loaded model is also not properly used as a UDF.
Option D is incorrect because it incorrectly assumes that mlflow.pyfunc.load_model can directly operate on Spark DataFrames, which is not the correct approach for batch inference. This method loads the model into the driver's memory but doesn't distribute the prediction workload across the Spark cluster.
Which of the following describes the purpose of the context parameter in the predict method of Python models for MLflow?
A. The context parameter allows the user to specify which version of the registered MLflow Model should be used based on the given application’s current scenario
B. The context parameter allows the user to document the performance of a model after it has been deployed
C. The context parameter allows the user to include relevant details of the business case to allow downstream users to understand the purpose of the model
D. The context parameter allows the user to provide the model with completely custom if-else logic for the given application’s current scenario
E. The context parameter allows the user to provide the model access to objects like preprocessing models or custom configuration files
E. The context parameter in the predict method of Python models for MLflow is used to provide the model with access to external objects like preprocessing models or custom configuration files. This allows the model to utilize necessary resources for making accurate predictions. Options A, B, C, and D describe functionalities that are not the primary purpose of the context parameter: it is not meant to determine the model version (A), document model performance (B), provide business case details (C), or inject custom if-else logic (D).
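A minimal, hypothetical custom pyfunc model illustrating what the context parameter provides; the artifact name "preprocessor" and the use of pickle are illustrative assumptions:

```python
import pickle
import mlflow.pyfunc


class WrappedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local file paths,
        # giving the model access to objects such as a preprocessing model.
        with open(context.artifacts["preprocessor"], "rb") as f:
            self.preprocessor = pickle.load(f)

    def predict(self, context, model_input):
        # The same context is available at prediction time.
        features = self.preprocessor.transform(model_input)
        return features.sum(axis=1)  # placeholder prediction logic
```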
A machine learning engineer has developed a model and registered it using the FeatureStoreClient fs. The model has model URI model_uri. The engineer now needs to perform batch inference on customer-level Spark DataFrame spark_df, but it is missing a few of the static features that were used when training the model. The customer_id column is the primary key of spark_df and the training set used when training and logging the model.
Which of the following code blocks can be used to compute predictions for spark_df when the missing feature values can be found in the Feature Store by searching for features by customer_id?
A.
```python
df = fs.get_missing_features(spark_df, model_uri)
fs.score_model(model_uri, df)
```
B.
```python
fs.score_model(model_uri, spark_df)
```
C.
```python
df = fs.get_missing_features(spark_df, model_uri)
fs.score_batch(model_uri, df)
```
D.
```python
df = fs.get_missing_features(spark_df)
fs.score_batch(model_uri, df)
```
E.
```python
fs.score_batch(model_uri, spark_df)
```
E.
The score_batch method of the FeatureStoreClient automatically retrieves missing features from the Feature Store during batch inference, given that the primary key is available in the input DataFrame. Therefore, it is sufficient to call fs.score_batch(model_uri, spark_df) to perform batch inference. Options A, C, and D include the method get_missing_features, which is not a valid method of the FeatureStoreClient. Option B uses the score_model method, which is not designed for batch inference.
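A short sketch of the call described in option E; the import path reflects the Databricks Feature Store client and may vary by runtime version:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# score_batch looks up the missing static features by the primary key
# (customer_id) and returns the input rows with a prediction column appended.
predictions_df = fs.score_batch(model_uri, spark_df)
```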
Which of the following describes the concept of MLflow Model flavors?
A. A convention that deployment tools can use to wrap preprocessing logic into a Model
B. A convention that MLflow Model Registry can use to version models
C. A convention that MLflow Experiments can use to organize their Runs by project
D. A convention that deployment tools can use to understand the model
E. A convention that MLflow Model Registry can use to organize its Models by project
D. A convention that deployment tools can use to understand the model
The MLflow Model flavor is a convention that allows deployment tools to understand the structure and requirements of a model, enabling efficient deployment across different platforms. Flavors provide a standardized way to package models, including necessary metadata and environment details, so they can be loaded and used consistently.
Options A, B, C, and E are incorrect because they do not accurately describe the purpose of MLflow Model flavors. Flavors are not primarily for wrapping preprocessing logic, versioning models in the Model Registry, organizing runs in Experiments, or organizing models in the Model Registry. Their primary function is to enable deployment tools to understand and deploy models effectively.
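One way to see flavors in practice: the same logged scikit-learn model can be loaded through its native sklearn flavor or the generic python_function flavor, which is what lets different deployment tools consume it. Here model_uri is assumed to point at an sklearn-flavored model:

```python
import mlflow.pyfunc
import mlflow.sklearn

# Native flavor: returns the original scikit-learn estimator object.
sk_model = mlflow.sklearn.load_model(model_uri)

# Generic pyfunc flavor: returns a uniform wrapper with a predict() method,
# which deployment tools can use without knowing the underlying library.
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
```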
In a continuous integration, continuous deployment (CI/CD) process for machine learning pipelines, which of the following events commonly triggers the execution of automated testing?
A. The launch of a new cost-efficient SQL endpoint
B. CI/CD pipelines are not needed for machine learning pipelines
C. The arrival of a new feature table in the Feature Store
D. The launch of a new cost-efficient job cluster
E. The arrival of a new model version in the MLflow Model Registry
E. The arrival of a new model version in the MLflow Model Registry
Automated testing in a CI/CD pipeline for ML is triggered when a new model version is registered. This ensures the new model performs as expected before deployment.
Option A is incorrect because SQL endpoint launches are related to data access, not model performance. Option B is incorrect because CI/CD pipelines are crucial for automating and managing ML model deployment. Option C is incorrect because while new feature tables are important, they don’t directly trigger model testing. Option D is incorrect because job cluster launches are related to resource allocation, not model validation.
A machine learning engineering team has written predictions computed in a batch job to a Delta table for querying. However, the team has noticed that the querying is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the query condition are sparsely located throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query by colocating similar records while considering values in multiple columns?
A. Z-Ordering
B. Bin-packing
C. Write as a Parquet file
D. Data skipping
E. Tuning the file size
A. Z-Ordering
Z-Ordering is a data locality technique used in Delta Lake to optimize query performance. It achieves this by clustering similar records together based on the values of multiple columns. This co-location of related data reduces the amount of data that needs to be scanned to satisfy a query, especially when filtering on multiple columns, thus speeding up queries.
Options B, C, D, and E are incorrect because:
* Bin-packing (file compaction) optimizes file sizes, which has already been addressed, and does not colocate related records by column values.
* Writing as a Parquet file is a general storage format and doesn’t, by itself, address the problem of sparse data distribution.
* Data skipping relies on metadata about data distribution within files, but doesn’t re-organize the data for co-location.
* Tuning the file size has already been done, as stated in the problem description.
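As a hedged example, Z-Ordering is applied with Delta Lake's OPTIMIZE command; the table and column names here are placeholders:

```python
# Rewrites the table's data files so that rows with similar values in the
# listed columns are colocated, improving data skipping for those filters.
spark.sql("""
    OPTIMIZE predictions_table
    ZORDER BY (customer_id, prediction_date)
""")
```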
A machine learning engineer needs to deliver predictions of a machine learning model in real-time. However, the feature values needed for computing the predictions are available one week before the query time.
Which of the following is a benefit of using a batch serving deployment in this scenario rather than a real-time serving deployment where predictions are computed at query time?
A. Batch serving has built-in capabilities in Databricks Machine Learning
B. There is no advantage to using batch serving deployments over real-time serving deployments
C. Computing predictions in real-time provides more up-to-date results
D. Testing is not possible in real-time serving deployments
E. Querying stored predictions can be faster than computing predictions in real-time
E. Querying stored predictions can be faster than computing predictions in real-time
Explanation:
Since the feature values are available a week in advance, predictions can be pre-computed and stored using batch serving. When a prediction is needed, it can be quickly retrieved from storage, which is faster than computing it on-demand in real-time.
- A is incorrect because while Databricks may offer capabilities for batch serving, it’s not the primary benefit in this specific scenario.
- B is incorrect because there is a clear advantage (speed) to batch serving in this scenario.
- C is incorrect because batch serving allows for pre-computation with data available in advance, negating the need for up-to-the-minute real-time computation.
- D is incorrect because testing is possible in both real-time and batch serving deployments.
Which of the following tools can assist in real-time deployments by packaging software with its own application, tools, and libraries?
A. Cloud-based compute
B. None of these tools
C. REST APIs
D. Containers
E. Autoscaling clusters
D. Containers
Containers, like Docker, package an application with all its dependencies (libraries, tools, etc.) into a single, portable unit. This ensures consistent execution across different environments, crucial for real-time deployments.
Option A is incorrect because while cloud-based compute provides the infrastructure, it doesn’t handle the packaging of software and its dependencies. Option B is incorrect as containers are a valid tool. Option C is incorrect because REST APIs are used for communication between software systems, not for packaging applications. Option E is incorrect because autoscaling clusters adjust resources based on demand, but they don’t package software with its dependencies.
A machine learning engineer has registered a sklearn model in the MLflow Model Registry using the sklearn model flavor with URI model_uri.
Which of the following operations can be used to load the model as an sklearn object for batch deployment?
A. mlflow.spark.load_model(model_uri)
B. mlflow.pyfunc.read_model(model_uri)
C. mlflow.sklearn.read_model(model_uri)
D. mlflow.pyfunc.load_model(model_uri)
E. mlflow.sklearn.load_model(model_uri)
E. mlflow.sklearn.load_model(model_uri)
The question specifies that the model was saved using the sklearn flavor. Therefore, the mlflow.sklearn.load_model function should be used to load it back as a scikit-learn object. Options A, B, C, and D use functions (mlflow.spark.load_model, mlflow.pyfunc.read_model, mlflow.sklearn.read_model, mlflow.pyfunc.load_model) that are not appropriate for loading the model back as a scikit-learn object.
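A brief sketch of option E in use for batch scoring; the pandas DataFrame batch_df is an assumed input:

```python
import mlflow.sklearn

# Load the model back as a native scikit-learn object.
model = mlflow.sklearn.load_model(model_uri)

# Standard scikit-learn batch prediction on an in-memory DataFrame.
predictions = model.predict(batch_df)
```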
A data scientist set up a machine learning pipeline to automatically log a data visualization with each run. They now want to view the visualizations in Databricks.
Which of the following locations in Databricks will show these data visualizations?
A. The MLflow Model Registry Model page
B. The Artifacts section of the MLflow Experiment page
C. Logged data visualizations cannot be viewed in Databricks
D. The Artifacts section of the MLflow Run page
E. The Figures section of the MLflow Run page
D. The Artifacts section of the MLflow Run page
Explanation: When data visualizations are logged in MLflow, they are stored as artifacts associated with a specific run. These artifacts can be accessed and viewed in the Artifacts section of the corresponding MLflow Run page within Databricks.
- A is incorrect because the Model Registry is for managing MLflow models, not run artifacts.
- B is incorrect because while the Experiment page provides an overview of runs, the artifacts are found within the individual Run pages.
- C is incorrect because logged data visualizations can be viewed in Databricks.
- E is incorrect because there is no dedicated Figures section on the MLflow Run page; logged visualizations appear under the Artifacts section.
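For context, a minimal sketch of how such a visualization ends up under the run's Artifacts section; matplotlib and the artifact file name are illustrative assumptions:

```python
import matplotlib.pyplot as plt
import mlflow

with mlflow.start_run():
    fig, ax = plt.subplots()
    ax.hist([1, 2, 2, 3, 3, 3])  # placeholder visualization

    # Stores the figure as a run artifact, viewable on the MLflow Run page.
    mlflow.log_figure(fig, "distribution.png")
```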
A data scientist has developed a scikit-learn model sklearn_model and they want to log the model using MLflow.
They write the following incomplete code block:
Which of the following lines of code can be used to fill in the blank so the code block can successfully complete the task?
A. mlflow.spark.track_model(sklearn_model, "model")
B. mlflow.sklearn.log_model(sklearn_model, "model")
C. mlflow.spark.log_model(sklearn_model, "model")
D. mlflow.sklearn.load_model("model")
E. mlflow.sklearn.track_model(sklearn_model, "model")
B. mlflow.sklearn.log_model(sklearn_model, "model")
The goal is to log a scikit-learn model using MLflow. mlflow.sklearn.log_model is the correct function for this purpose. It takes the model object and a path (artifact_path) as arguments.
Option A is incorrect because mlflow.spark is for logging Spark models, not scikit-learn models, and track_model is not a valid function in mlflow.spark.
Option C is incorrect because mlflow.spark is for logging Spark models, not scikit-learn models.
Option D is incorrect because mlflow.sklearn.load_model is used to load a previously logged model, not to log a model.
Option E is incorrect because track_model is not a valid function in mlflow.sklearn.
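A hedged sketch of the completed pattern option B produces; the surrounding mlflow.start_run() context is an assumption about the omitted portion of the code block:

```python
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # Logs the fitted scikit-learn model under the artifact path "model".
    mlflow.sklearn.log_model(sklearn_model, "model")
```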
A machine learning engineer has deployed a model recommender using MLflow Model Serving. They now want to query the version of that model that is in the Production stage of the MLflow Model Registry.
Which of the following model URIs can be used to query the described model version?
A. https://<databricks-instance>/model-serving/recommender/Production/invocations
B. The version number of the model version in Production is necessary to complete this task.
C. https://<databricks-instance>/model/recommender/stage-production/invocations
D. https://<databricks-instance>/model-serving/recommender/stage-production/invocations
E. https://<databricks-instance>/model/recommender/Production/invocations
E. The correct URI structure to query a model in a specific stage of the MLflow Model Registry using MLflow Model Serving is https://<databricks-instance>/model/<model_name>/<stage>/invocations. Therefore, option E is correct. Options A and D are incorrect because they use /model-serving/ instead of /model/. Option C is incorrect because it uses /stage-production/ instead of /Production/. Option B is incorrect because the version number is not necessary when querying by stage.
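A rough sketch of querying that URI over REST; the token variable, the example record, and the JSON payload orientation are assumptions and can differ across MLflow and Databricks versions:

```python
import requests

url = "https://<databricks-instance>/model/recommender/Production/invocations"
headers = {"Authorization": f"Bearer {databricks_token}"}  # assumed PAT variable

# Payload orientation varies by MLflow version; this assumes the
# pandas "records" style accepted by recent scoring servers.
payload = {"dataframe_records": [{"user_id": 123, "item_id": 456}]}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```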
Which of the following is a benefit of logging a model signature with an MLflow model?
A. The model will have a unique identifier in the MLflow experiment
B. The schema of input data can be validated when serving models
C. The model can be deployed using real-time serving tools
D. The model will be secured by the user that developed it
E. The schema of input data will be converted to match the signature
B. The schema of input data can be validated when serving models
Logging a model signature in MLflow defines the expected schema for the model’s inputs and outputs. This allows MLflow’s deployment tools to validate incoming data during serving, ensuring it matches the expected format. This validation helps prevent errors and unexpected behavior during inference.
Option A is incorrect because while MLflow tracks models within experiments, the signature itself doesn’t provide a unique identifier for the experiment. Option C is incorrect because while MLflow can be used with real-time serving tools, the signature doesn’t directly enable deployment. Option D is incorrect because model signatures don’t provide security features related to user access control. Option E is incorrect because the signature allows validation against a schema; it does not convert the input data to match the signature.
A machine learning engineer wants to deploy a model for real-time serving using MLflow Model Serving. For the model, the machine learning engineer currently has one model version in each of the stages in the MLflow Model Registry. The engineer wants to know which model versions can be queried once Model Serving is enabled for the model.
Which of the following lists all of the MLflow Model Registry stages whose model versions are automatically deployed with Model Serving?
A. Staging, Production, Archived
B. Production
C. None, Staging, Production, Archived
D. Staging, Production
E. None, Staging, Production
D. Staging, Production
MLflow Model Serving automatically deploys model versions that are in the Staging or Production stages. Model versions in the Archived or None stages are not automatically deployed. Therefore, option D is correct.
A machine learning engineering team wants to build a continuous pipeline for data preparation of a machine learning application. The team would like the data to be fully processed and made ready for inference in a series of equal-sized batches.
Which of the following tools can be used to provide this type of continuous processing?
A. Spark UDFs
B. Structured Streaming
C. MLflow
D. Delta Lake
E. AutoML
B. Structured Streaming
Structured Streaming in Spark allows for continuous processing of data streams, where data can be processed in real-time and output in equal-sized batches. This makes it suitable for building continuous pipelines for data preparation in machine learning applications.
Incorrect Options:
A. Spark UDFs (User Defined Functions) are for custom data transformations, not continuous processing.
C. MLflow is a platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
D. Delta Lake is a storage layer that brings reliability to data lakes.
E. AutoML automates the process of selecting, training, and tuning machine learning models.
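A hedged sketch of the kind of pipeline described, reading new records continuously from a Delta source and writing prepared micro-batches at a fixed trigger interval; the table names, checkpoint path, and trigger setting are assumptions:

```python
from pyspark.sql import functions as F

# Continuously read newly arriving records from an assumed Delta source table.
raw_stream = spark.readStream.table("raw_events")

# Example data-preparation step; a real pipeline would apply full feature logic here.
prepared = raw_stream.withColumn("amount", F.col("amount").cast("double"))

# Write the prepared data in regular micro-batches, ready for downstream inference.
query = (
    prepared.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/prepared_events")
    .trigger(processingTime="5 minutes")
    .toTable("prepared_events")
)
```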
A data scientist has written a function to track the runs of their random forest model. The data scientist is changing the number of trees in the forest across each run.
Which of the following MLflow operations is designed to log single values like the number of trees in a random forest?
A. mlflow.log_artifact
B. mlflow.log_model
C. mlflow.log_metric
D. mlflow.log_param
E. There is no way to store values like this.
D. mlflow.log_param
The question is asking which MLflow function is appropriate for logging a parameter, like the number of trees. mlflow.log_param is designed specifically for this purpose.
- A. mlflow.log_artifact: This is used for logging files or directories, not single values.
- B. mlflow.log_model: This is used for logging an entire machine learning model.
- C. mlflow.log_metric: This is used for logging metrics (evaluation results), which are typically numeric but represent performance, not hyperparameters.
- E. There is no way to store values like this: MLflow provides functionality for logging parameters.
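A small sketch contrasting the two logging calls discussed above; the parameter and metric values are placeholders:

```python
import mlflow

with mlflow.start_run():
    # Hyperparameter: a single configuration value such as the number of trees.
    mlflow.log_param("n_estimators", 200)

    # Metric: an evaluation result, which can be logged repeatedly over steps.
    mlflow.log_metric("rmse", 0.42)
```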
Which of the following deployment paradigms can centrally compute predictions for a single record with exceedingly fast results?
A. Streaming
B. Batch
C. Edge/on-device
D. None of these strategies will accomplish the task.
E. Real-time
E. Real-time deployment is designed for low-latency inference on individual records, making it suitable for scenarios where predictions must be computed centrally and returned quickly for a single record. Streaming and batch deployments compute predictions in groups rather than on demand, and edge/on-device inference is not centralized.
A machine learning engineer and data scientist are working together to convert a batch deployment to an always-on streaming deployment. The machine learning engineer has expressed that rigorous data tests must be put in place as a part of their conversion to account for potential changes in data formats.
Which of the following describes why these types of data type tests and checks are particularly important for streaming deployments?
A. Because the streaming deployment is always on, all types of data must be handled without producing an error
B. All of these statements
C. Because the streaming deployment is always on, there is no practitioner to debug poor model performance
D. Because the streaming deployment is always on, there is a need to confirm that the deployment can autoscale
E. None of these statements
The correct answer is B. Because streaming deployments are always on, all incoming data types must be handled without producing errors (A), there is less opportunity for a practitioner to debug poor model performance (C), and the deployment must be able to autoscale to varying data volumes (D). Statements A, C, and D are each individually correct but incomplete, making option B the correct answer. Option E is incorrect because options A, C, and D contain correct statements.
A data scientist has developed a scikit-learn random forest model model, but they have not yet logged model with MLflow. They want to obtain the input schema and the output schema of the model so they can document what type of data is expected as input.
Which of the following MLflow operations can be used to perform this task?
A. mlflow.models.schema.infer_schema
B. mlflow.models.signature.infer_signature
C. mlflow.models.Model.get_input_schema
D. mlflow.models.Model.signature
E. There is no way to obtain the input schema and the output schema of an unlogged model.
B. mlflow.models.signature.infer_signature
The mlflow.models.signature.infer_signature function can be used to infer the input and output schema of an unlogged model by inspecting example model inputs and the corresponding model outputs (predictions). This information is encapsulated in a ModelSignature object.
Option A is incorrect because mlflow.models.schema.infer_schema infers a schema from data, not a full input/output signature for a model.
Options C and D are incorrect because they operate on MLflow Model objects that have already been logged.
Option E is incorrect because mlflow.models.signature.infer_signature provides a way to obtain the schema of unlogged models.
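A brief sketch of inferring and logging a signature for the unlogged model; X_sample is an assumed slice of the training features:

```python
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

# Infer input/output schema from example inputs and the model's predictions.
signature = infer_signature(X_sample, model.predict(X_sample))

with mlflow.start_run():
    # The signature is stored alongside the model when it is logged.
    mlflow.sklearn.log_model(model, "model", signature=signature)
```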
Which of the following describes label drift?
A. Label drift is when there is a change in the distribution of the predicted target given by the model
B. None of these describe label drift
C. Label drift is when there is a change in the distribution of an input variable
D. Label drift is when there is a change in the relationship between input variables and target variables
E. Label drift is when there is a change in the distribution of a target variable
E. Label drift refers to a change in the distribution of the target variable itself. Options A, C, and D describe other types of drift, such as model prediction drift, feature drift, and concept drift, respectively.