Practice Questions - Amazon AWS Certified Machine Learning Engineer - Associate MLA-C01 Flashcards

(113 cards)

1
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3. Which AWS service or feature can aggregate the data from the various data sources?

A. Amazon EMR Spark jobs
B. Amazon Kinesis Data Streams
C. Amazon DynamoDB
D. AWS Lake Formation

A

A

Amazon EMR with Spark is the most suitable option for aggregating data from diverse sources such as Amazon S3 and an on-premises MySQL database. Spark’s ability to handle both structured and unstructured data makes it well suited for this task. While AWS Lake Formation manages data lakes, it doesn’t inherently provide the ETL (extract, transform, load) and data processing capabilities needed to aggregate and transform data from multiple sources. Amazon Kinesis Data Streams is designed for real-time data streaming, not batch processing of data for model training. Amazon DynamoDB is a NoSQL database, not designed for aggregating data from various sources.
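
For illustration, a minimal PySpark sketch of the kind of aggregation an EMR Spark job could run, joining the S3 data with a MySQL table read over JDBC. The bucket paths, JDBC endpoint, credentials, and join keys are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session on the EMR cluster (the MySQL JDBC driver must be on the classpath)
spark = SparkSession.builder.appName("fraud-data-aggregation").getOrCreate()

# Transaction logs and customer profiles stored in Amazon S3 (hypothetical paths)
transactions = spark.read.json("s3://example-bucket/transaction-logs/")
profiles = spark.read.parquet("s3://example-bucket/customer-profiles/")

# Tables from the on-premises MySQL database, read over JDBC (hypothetical endpoint)
accounts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://onprem-db.example.com:3306/fraud")
    .option("dbtable", "accounts")
    .option("user", "reader")
    .option("password", "example-password")
    .load()
)

# Join the three sources into one training dataset and write it back to S3
training_df = transactions.join(profiles, "customer_id").join(accounts, "account_id")
training_df.write.mode("overwrite").parquet("s3://example-bucket/training-dataset/")
```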

2
Q

A company with hundreds of data scientists uses Amazon SageMaker to create ML models stored in model groups within the SageMaker Model Registry. Data scientists are categorized into three groups: computer vision, natural language processing (NLP), and speech recognition. An ML engineer needs a solution to organize these existing models by category to improve discoverability at scale, without altering the model artifacts or their current groupings. Which solution best meets these requirements?

A. Create a custom tag for each of the three categories. Add the tags to the model packages in the SageMaker Model Registry.
B. Create a model group for each category. Move the existing models into these category model groups.
C. Use SageMaker ML Lineage Tracking to automatically identify and tag which model groups should contain the models.
D. Create a Model Registry collection for each of the three categories. Move the existing model groups into the collections.

A

D

D is correct because creating Model Registry collections allows for organizing existing model groups without modifying the underlying model artifacts in Amazon S3 and Amazon ECR. This maintains the integrity of the models and their existing structure while improving discoverability at scale by grouping them into relevant categories.

A is incorrect because while tags can provide metadata, they are not as effective for large-scale organization as collections, which are specifically designed for grouping model groups.

B is incorrect because moving models to new model groups would alter the existing model groupings, violating the requirement to not affect the integrity of the model artifacts and their existing groupings.

C is incorrect because ML Lineage Tracking focuses on tracking model lineage and not on the organization and grouping of models at a higher level.

3
Q

A company has trained and deployed an ML model using Amazon SageMaker. The company needs to implement a solution to record and monitor all the API call events for the SageMaker endpoint. The solution must also provide a notification when the number of API call events breaches a threshold. Which solution will meet these requirements?

A. Use SageMaker Debugger to track the inferences and to report metrics. Create a custom rule to provide a notification when the threshold is breached.
B. Use SageMaker Debugger to track the inferences and to report metrics. Use the tensor_variance built-in rule to provide a notification when the threshold is breached.
C. Log all the endpoint invocation API events by using AWS CloudTrail. Use an Amazon CloudWatch dashboard for monitoring. Set up a CloudWatch alarm to provide notification when the threshold is breached.
D. Add the Invocations metric to an Amazon CloudWatch dashboard for monitoring. Set up a CloudWatch alarm to provide notification when the threshold is breached.

A

C

The correct answer is C because it uses the most appropriate AWS services to meet all the stated requirements. CloudTrail logs all API calls, including SageMaker endpoint invocations, fulfilling the requirement to record all events. CloudWatch dashboards can then monitor these logs, and a CloudWatch alarm can provide notifications when a threshold is breached.

Option A is incorrect because SageMaker Debugger is primarily for debugging model training and inference quality, not for comprehensive API call event logging. Option B is also incorrect because the tensor_variance rule is not relevant to API call event monitoring. Option D is incorrect because while it uses CloudWatch to monitor invocations and set up alarms, it doesn’t provide a solution for recording all API call events; CloudWatch only monitors what’s already being tracked. CloudTrail provides the necessary comprehensive logging.
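
As a sketch of the notification piece, a boto3 call that creates a CloudWatch alarm on the endpoint's Invocations metric; the endpoint name, variant, threshold, and SNS topic ARN are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when endpoint invocations exceed a threshold within a 5-minute period
cloudwatch.put_metric_alarm(
    AlarmName="sagemaker-endpoint-invocation-threshold",
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-endpoint"},  # hypothetical endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # hypothetical topic
)
```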

4
Q

An ML engineer trained an ML model on Amazon SageMaker to detect automobile accidents from closed-circuit TV footage. The ML engineer used SageMaker Data Wrangler to create a training dataset of images of accidents and non-accidents. The model performed well during training and validation. However, the model is underperforming in production because of variations in the quality of the images from various cameras. Which solution will improve the model’s accuracy in the LEAST amount of time?
A. Collect more images from all the cameras. Use Data Wrangler to prepare a new training dataset.
B. Recreate the training dataset by using the Data Wrangler corrupt image transform. Specify the impulse noise option.
C. Recreate the training dataset by using the Data Wrangler enhance image contrast transform. Specify the Gamma contrast option.
D. Recreate the training dataset by using the Data Wrangler resize image transform. Crop all images to the same size.

A

B

The Data Wrangler corrupt image transform with the impulse noise option augments the existing training images to simulate the lower-quality footage coming from some cameras, making the model more robust to quality variations. Because it reuses the existing dataset, it is much faster than collecting and labeling new images (option A), and contrast enhancement (C) or resizing (D) does not address the noise and quality differences causing the underperformance.

5
Q

A company is using Amazon SageMaker to create ML models. The company’s data scientists need fine-grained control of the ML workflows that they orchestrate. The data scientists also need the ability to visualize SageMaker jobs and workflows as a directed acyclic graph (DAG). The data scientists must keep a running history of model discovery experiments and must establish model governance for auditing and compliance verifications. Which solution will meet these requirements?
A. Use AWS CodePipeline and its integration with SageMaker Studio to manage the entire ML workflows. Use SageMaker ML Lineage Tracking for the running history of experiments and for auditing and compliance verifications.
B. Use AWS CodePipeline and its integration with SageMaker Experiments to manage the entire ML workflows. Use SageMaker Experiments for the running history of experiments and for auditing and compliance verifications.
C. Use SageMaker Pipelines and its integration with SageMaker Studio to manage the entire ML workflows. Use SageMaker ML Lineage Tracking for the running history of experiments and for auditing and compliance verifications.
D. Use SageMaker Pipelines and its integration with SageMaker Experiments to manage the entire ML workflows. Use SageMaker Experiments for the running history of experiments and for auditing and compliance verifications.

A

C

SageMaker Pipelines integrates with SageMaker Studio to provide fine-grained workflow orchestration and to visualize pipeline executions as a DAG, while SageMaker ML Lineage Tracking keeps a running history of experiments and supports model governance for auditing and compliance. AWS CodePipeline (options A and B) is a CI/CD service that does not visualize SageMaker jobs as a DAG, and SageMaker Experiments alone (option D) does not establish governance for auditing and compliance verifications.

6
Q

A company needs to create a central catalog for all the company’s ML models. The models are in AWS accounts where the company developed the models initially. The models are hosted in Amazon Elastic Container Registry (Amazon ECR) repositories. Which solution will meet these requirements?
A. Configure ECR cross-account replication for each existing ECR repository. Ensure that each model is visible in each AWS account.
B. Create a new AWS account with a new ECR repository as the central catalog. Configure ECR cross-account replication between the initial ECR repositories and the central catalog.
C. Use the Amazon SageMaker Model Registry to create a model group for models hosted in Amazon ECR. Create a new AWS account. In the new account, use the SageMaker Model Registry as the central catalog. Attach a cross-account resource policy to each model group in the initial AWS accounts.
D. Use an AWS Glue Data Catalog to store the models. Run an AWS Glue crawler to migrate the models from the ECR repositories to the Data Catalog. Configure cross-account access to the Data Catalog.

A

C

The correct answer is C because SageMaker Model Registry is designed as a central repository for managing and tracking machine learning models, including those hosted in ECR. Creating a new AWS account for the central catalog improves security and organization. Cross-account resource policies allow controlled access to the models from the original accounts.

Option A is incorrect because ECR is a container registry, not a catalog designed for managing model metadata and lineage. Simple replication doesn’t provide the centralized management features needed.

Option B is incorrect because it still relies on ECR as the central catalog, which lacks the model management capabilities of SageMaker Model Registry.

Option D is incorrect because AWS Glue Data Catalog is for managing data assets, not specifically ML models. While it could potentially store metadata about the models, it’s not the ideal solution for managing the models themselves and their lifecycle. Moreover, migrating the models themselves to the Glue Data Catalog is not straightforward or practical.

7
Q

A company is building a web-based AI application using Amazon SageMaker. The application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. Training data is stored in Amazon S3, and the application requires secure and isolated use of this data throughout the ML lifecycle. The company must use the central model registry to manage different versions of models. Which action will meet this requirement with the LEAST operational overhead?

A. Create a separate Amazon Elastic Container Registry (Amazon ECR) repository for each model.
B. Use Amazon Elastic Container Registry (Amazon ECR) and unique tags for each model version.
C. Use the SageMaker Model Registry and model groups to catalog the models.
D. Use the SageMaker Model Registry and unique tags for each model version.

A

C

The best answer is C because it leverages the built-in features of SageMaker, specifically designed for managing ML models and their versions. Using SageMaker Model Registry and model groups minimizes operational overhead compared to managing models and versions externally using Amazon ECR. Option A requires creating and managing multiple ECR repositories, increasing overhead. Option B adds complexity by managing tags within ECR. Option D, while using the SageMaker Model Registry, lacks the organizational structure provided by model groups, potentially leading to less efficient management of model versions in the long run.
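
A minimal boto3 sketch of the model group approach, assuming hypothetical names, image URI, and artifact path: each new version is registered as a model package inside its group.

```python
import boto3

sm = boto3.client("sagemaker")

# One model group holds every version of a given model (name is hypothetical)
sm.create_model_package_group(
    ModelPackageGroupName="churn-classifier",
    ModelPackageGroupDescription="All versions of the churn classifier",
)

# Register a new version into the group; image and artifact locations are hypothetical
sm.create_model_package(
    ModelPackageGroupName="churn-classifier",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/churn:latest",
                "ModelDataUrl": "s3://example-bucket/models/churn/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)
```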

8
Q

A company uses AWS Glue jobs orchestrated by an AWS Glue workflow for data processing. These jobs can run on a schedule or be launched manually. They are integrating these jobs into Amazon SageMaker Pipelines for ML model development, where the Glue job outputs are needed during the data processing phase. Which solution integrates the AWS Glue jobs with the SageMaker pipelines while minimizing operational overhead?

A. Use AWS Step Functions to orchestrate the pipelines and the AWS Glue jobs.
B. Use processing steps in SageMaker Pipelines. Configure inputs that point to the Amazon Resource Names (ARNs) of the AWS Glue jobs.
C. Use Callback steps in SageMaker Pipelines to start the AWS Glue workflow and to stop the pipelines until the AWS Glue jobs finish running.
D. Use Amazon EventBridge to invoke the pipelines and the AWS Glue jobs in the desired order.

A

C

The correct answer is C because it directly addresses the need to wait for Glue jobs to complete before proceeding in the SageMaker pipeline, minimizing operational overhead by keeping the integration within the SageMaker pipeline framework. Option A introduces an additional orchestration layer (Step Functions), increasing complexity. Option B doesn’t guarantee that the Glue jobs finish before the pipeline proceeds, potentially leading to errors. Option D, while possible, requires more complex setup and monitoring compared to using callback steps within SageMaker Pipelines.
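
A sketch of the Callback step using the SageMaker Python SDK; the SQS queue URL, Glue workflow name, and output name are hypothetical. A worker (for example, a Lambda function subscribed to the queue) starts the Glue workflow and reports success or failure back to the pipeline.

```python
from sagemaker.workflow.callback_step import (
    CallbackOutput,
    CallbackOutputTypeEnum,
    CallbackStep,
)

# Output that downstream pipeline steps can consume once the Glue jobs finish
glue_output_uri = CallbackOutput(
    output_name="processed_data_s3_uri",
    output_type=CallbackOutputTypeEnum.String,
)

# The step publishes a token to SQS and pauses the pipeline; the worker calls
# SendPipelineExecutionStepSuccess/Failure when the AWS Glue workflow completes.
run_glue_workflow = CallbackStep(
    name="RunGlueWorkflow",
    sqs_queue_url="https://sqs.us-east-1.amazonaws.com/111122223333/glue-callback-queue",
    inputs={"glue_workflow_name": "daily-feature-engineering"},  # hypothetical workflow
    outputs=[glue_output_uri],
)
```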

9
Q

A company is building an AI application on Amazon SageMaker that involves frequent consecutive training jobs using data stored in Amazon S3. The application requires secure and isolated data usage throughout the ML lifecycle. Which approach will MINIMIZE infrastructure startup times for these consecutive training jobs?

A. Use Managed Spot Training.
B. Use SageMaker managed warm pools.
C. Use SageMaker Training Compiler.
D. Use the SageMaker distributed data parallelism (SMDDP) library.

A

B

The correct answer is B because SageMaker managed warm pools keep instances ready between training jobs, eliminating the time needed for provisioning new infrastructure each time. Option A (Managed Spot Training) reduces cost, not startup time. Option C (SageMaker Training Compiler) optimizes code, not infrastructure. Option D (SMDDP) parallelizes training across instances, which improves training speed but doesn’t reduce startup time.
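
A sketch of how a warm pool is requested with the SageMaker Python SDK; the image, role, instance type, and S3 paths are hypothetical.

```python
from sagemaker.estimator import Estimator

# keep_alive_period_in_seconds keeps the provisioned instances warm after the job,
# so the next consecutive training job skips infrastructure startup.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/training:latest",
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",
    instance_count=1,
    instance_type="ml.g5.xlarge",
    output_path="s3://example-bucket/model-artifacts/",
    keep_alive_period_in_seconds=1800,  # SageMaker managed warm pool (30 minutes)
)

estimator.fit({"train": "s3://example-bucket/training-data/"})
```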

10
Q

A company is building a web-based AI application using Amazon SageMaker. This application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. The training data is stored in Amazon S3, and the application requires a manual approval-based workflow to ensure only approved models are deployed to production endpoints. Which solution best meets this requirement?

A. Use SageMaker Experiments to facilitate the approval process during model registration.
B. Use SageMaker ML Lineage Tracking on the central model registry. Create tracking entities for the approval process.
C. Use SageMaker Model Monitor to evaluate the performance of the model and to manage the approval.
D. Use SageMaker Pipelines. When a model version is registered, use the AWS SDK to change the approval status to “Approved.”

A

D

The correct answer is D because SageMaker Pipelines orchestrates ML workflows and supports manual approval gates through the model registry. A model version is registered with a pending approval status, its performance is reviewed, and only after manual approval is its status changed to “Approved” via the AWS SDK, which allows deployment to the production endpoint.

Option A is incorrect because SageMaker Experiments is for tracking and organizing experiments, not managing model approvals. Option B is incorrect because SageMaker ML Lineage Tracking tracks model lineage but doesn’t provide an approval mechanism. Option C is incorrect because SageMaker Model Monitor focuses on model performance monitoring, not approval workflows.
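
A minimal boto3 sketch of the approval call described above, run after a human reviewer signs off; the model package ARN is hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

# Flip the registered model version to Approved so the deployment step can proceed
sm.update_model_package(
    ModelPackageArn=(
        "arn:aws:sagemaker:us-east-1:111122223333:model-package/churn-classifier/3"
    ),
    ModelApprovalStatus="Approved",
    ApprovalDescription="Evaluation metrics reviewed and approved for production",
)
```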

11
Q

A company is building a web-based AI application using Amazon SageMaker. This application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. Training data, stored in Amazon S3, must be used securely and in isolation throughout the ML lifecycle. The company needs an on-demand workflow to monitor bias drift for models deployed to real-time endpoints from the application. Which action will meet this requirement?

A. Configure the application to invoke an AWS Lambda function that runs a SageMaker Clarify job.
B. Invoke an AWS Lambda function to pull the sagemaker-model-monitor-analyzer built-in SageMaker image.
C. Use AWS Glue Data Quality to monitor bias.
D. Use SageMaker notebooks to compare the bias.

A

A

A is correct because SageMaker Clarify is specifically designed for bias detection and monitoring. Integrating it with an AWS Lambda function allows for on-demand execution, triggering the bias analysis whenever needed by the application.

B is incorrect because the sagemaker-model-monitor-analyzer image handles general model monitoring tasks but not specifically bias detection.

C is incorrect because AWS Glue Data Quality focuses on data quality checks, not bias analysis.

D is incorrect because SageMaker notebooks are for interactive development and experimentation, not for implementing production-ready, on-demand monitoring workflows.

12
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles stored in Amazon S3, and tables from an on-premises MySQL database. The dataset has a class imbalance and features with interdependencies, hindering the algorithm’s ability to capture all underlying patterns. After data aggregation, the engineer needs a solution to automatically detect anomalies and visualize the results. Which solution best meets these requirements?

A. Use Amazon Athena to automatically detect the anomalies and to visualize the result.
B. Use Amazon Redshift Spectrum to automatically detect the anomalies. Use Amazon QuickSight to visualize the result.
C. Use Amazon SageMaker Data Wrangler to automatically detect the anomalies and to visualize the result.
D. Use AWS Batch to automatically detect the anomalies. Use Amazon QuickSight to visualize the result.

A

C

The correct answer is C because Amazon SageMaker Data Wrangler provides tools for data quality analysis, including anomaly detection, and offers visualization capabilities. Option A is incorrect because Athena is primarily a query service and doesn’t inherently offer anomaly detection. Option B is incorrect because while Redshift Spectrum can handle the data and QuickSight can visualize, neither individually offers automatic anomaly detection. Option D is incorrect because AWS Batch is a batch processing service; it doesn’t provide anomaly detection or visualization features directly. SageMaker Data Wrangler best fits the requirement of automatically detecting anomalies and visualizing the results within a single, integrated platform.

13
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset, containing transaction logs, customer profiles (stored in Amazon S3), and tables from an on-premises MySQL database, exhibits class imbalance and feature interdependencies, hindering the algorithm’s pattern recognition. The dataset includes both categorical and numerical data. To maximize model accuracy with the LEAST operational overhead, which action should the ML engineer take?

A. Use AWS Glue to transform the categorical data into numerical data.
B. Use AWS Glue to transform the numerical data into categorical data.
C. Use Amazon SageMaker Data Wrangler to transform the categorical data into numerical data.
D. Use Amazon SageMaker Data Wrangler to transform the numerical data into categorical data.

A

C

Data Wrangler provides built-in transformations for encoding categorical data into numerical representations (such as one-hot encoding or ordinal encoding), making it more user-friendly and efficient than using AWS Glue for this task. Transforming numerical data into categorical data is unnecessary and would likely reduce model accuracy. AWS Glue can handle data transformations, but lacks the user-friendly interface and built-in categorical encoding capabilities of SageMaker Data Wrangler, resulting in higher operational overhead. Therefore, option C offers the best balance of effectiveness and minimal operational overhead.
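
Data Wrangler's encode-categorical transform is configured in the UI, but its effect matches standard one-hot encoding; a small pandas illustration with a hypothetical color column:

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encode: each category becomes its own 0/1 column, which is the kind of
# numerical representation a model can consume
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           1            0          0
```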

14
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset, including transaction logs and customer profiles from Amazon S3 and tables from an on-premises MySQL database, suffers from class imbalance affecting the model’s learning. Which solution requires the LEAST operational effort to address this imbalanced data before model training?

A. Use Amazon Athena to identify patterns that contribute to the imbalance. Adjust the dataset accordingly.
B. Use Amazon SageMaker Studio Classic built-in algorithms to process the imbalanced dataset.
C. Use AWS Glue DataBrew built-in features to oversample the minority class.
D. Use the Amazon SageMaker Data Wrangler balance data operation to oversample the minority class.

A

D

The correct answer is D because Amazon SageMaker Data Wrangler provides a built-in “balance data” operation specifically designed to handle class imbalance through techniques like oversampling and undersampling. This offers a low-code/no-code solution requiring minimal operational effort compared to other options.

Option A requires manual dataset adjustment after identifying patterns with Athena, increasing operational effort. Option B is less efficient as it uses algorithms within SageMaker Studio, whereas Data Wrangler is more directly focused on data preprocessing for imbalanced datasets. Option C is less efficient than D because DataBrew does not have a built-in recipe for balancing datasets, requiring more custom work compared to Data Wrangler’s direct functionality.
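
For intuition, random oversampling of the minority class, one of the techniques the Data Wrangler balance-data operation applies; the label column and data here are hypothetical.

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Resample every class (with replacement) up to the majority-class size."""
    counts = df[label_col].value_counts()
    majority_size = counts.max()
    parts = [
        df[df[label_col] == label].sample(majority_size, replace=True, random_state=42)
        for label in counts.index
    ]
    return pd.concat(parts).sample(frac=1, random_state=42)  # shuffle rows

# Heavily imbalanced hypothetical fraud labels: 5 positives, 95 negatives
data = pd.DataFrame({"amount": range(100), "is_fraud": [1] * 5 + [0] * 95})
balanced = random_oversample(data, "is_fraud")
print(balanced["is_fraud"].value_counts())  # both classes now have 95 rows
```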

15
Q

A company has deployed an XGBoost prediction model in production to predict if a customer is likely to cancel a subscription. The company uses Amazon SageMaker Model Monitor to detect deviations in the F1 score. During a baseline analysis of model quality, the company recorded a threshold for the F1 score. After several months of no change, the model’s F1 score decreases significantly. What could be the reason for the reduced F1 score?
A. Concept drift occurred in the underlying customer data that was used for predictions.
B. The model was not sufficiently complex to capture all the patterns in the original baseline data.
C. The original baseline data had a data quality issue of missing values.
D. Incorrect ground truth labels were provided to Model Monitor during the calculation of the baseline.

A

A
Concept drift is the correct answer because it explains a decrease in F1 score after a period of stability. The statistical properties of the data used to train the model have changed over time, leading to the model’s reduced performance. Options B and C would have resulted in a consistently low F1 score from the beginning, not a sudden drop after months of acceptable performance. Option D would have affected the baseline F1 score itself, not caused a significant drop after the initial baseline was established.

16
Q

A company has a team of data scientists who use Amazon SageMaker notebook instances to test ML models. When the data scientists need new permissions, the company attaches the permissions to each individual role that was created during the creation of the SageMaker notebook instance. The company needs to centralize management of the team’s permissions. Which solution will meet this requirement?

A. Create a single IAM role that has the necessary permissions. Attach the role to each notebook instance that the team uses.
B. Create a single IAM group. Add the data scientists to the group. Associate the group with each notebook instance that the team uses.
C. Create a single IAM user. Attach the AdministratorAccess AWS managed IAM policy to the user. Configure each notebook instance to use the IAM user.
D. Create a single IAM group. Add the data scientists to the group. Create an IAM role. Attach the AdministratorAccess AWS managed IAM policy to the role. Associate the role with the group. Associate the group with each notebook instance that the team uses.

A

A

A is correct because it leverages the recommended approach of using IAM roles for AWS services like SageMaker. Centralizing permissions in a single IAM role simplifies management; updates to the role automatically propagate to all associated notebook instances.

B is incorrect because you cannot directly associate an IAM group with a SageMaker notebook instance.

C is incorrect because using the AdministratorAccess policy violates the principle of least privilege and it’s not possible to directly associate an IAM user with a notebook instance.

D is incorrect for several reasons: it uses the overly permissive AdministratorAccess policy, IAM roles cannot be attached to IAM groups, and, again, an IAM group cannot be associated directly with a notebook instance.
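
A sketch of the shared-role approach with boto3; the role ARN and notebook instance names are hypothetical. Permission changes are then made once, on the role.

```python
import boto3

sm = boto3.client("sagemaker")

# One centrally managed execution role shared by every notebook instance
shared_role_arn = "arn:aws:iam::111122223333:role/DataScienceNotebookRole"

for name in ["ds-notebook-alice", "ds-notebook-bob"]:
    sm.create_notebook_instance(
        NotebookInstanceName=name,
        InstanceType="ml.t3.medium",
        RoleArn=shared_role_arn,  # updates to this role apply to all instances
    )
```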

17
Q

An ML engineer needs to use an ML model to predict the price of apartments in a specific location. Which metric should the ML engineer use to evaluate the model’s performance?
A. Accuracy
B. Area Under the ROC Curve (AUC)
C. F1 score
D. Mean absolute error (MAE)

A

D. Mean absolute error (MAE)

The correct answer is D because predicting apartment prices is a regression problem, not a classification problem. MAE is a suitable metric for evaluating the performance of regression models. Accuracy, AUC-ROC, and F1 score are all metrics used for classification problems, where the model predicts a categorical outcome (e.g., “high price,” “medium price,” “low price”). Since the model is predicting a continuous value (price), these metrics are inappropriate.
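
MAE averages the absolute difference between predicted and actual prices; a quick check with scikit-learn and hypothetical values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted apartment prices
actual = [250_000, 310_000, 185_000, 420_000]
predicted = [240_000, 330_000, 190_000, 400_000]

# MAE = (|10,000| + |20,000| + |5,000| + |20,000|) / 4 = 13,750
print(mean_absolute_error(actual, predicted))  # 13750.0
```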

18
Q

An ML engineer has trained a neural network using stochastic gradient descent (SGD). The neural network performs poorly on the test set. The training loss and validation loss values remain high and show an oscillating pattern; they decrease for a few epochs and then increase for a few epochs before repeating this cycle. What should the ML engineer do to improve the training process?

A. Introduce early stopping.
B. Increase the size of the test set.
C. Increase the learning rate.
D. Decrease the learning rate.

A

D

The oscillating pattern of the training and validation loss indicates that the learning rate is too high. A high learning rate causes the model to overshoot the optimal point in the loss landscape, leading to oscillations instead of convergence. Decreasing the learning rate allows the model to make smaller, more precise updates to the weights, leading to improved convergence and potentially better performance on the test set.

Option A (Introduce early stopping) is not the primary solution here. While early stopping can prevent overfitting, the main issue is the unstable training process caused by the high learning rate.

Option B (Increase the size of the test set) would not directly address the issue of the oscillating loss and unstable training process. A larger test set would only improve the accuracy of the test set evaluation, but not the model’s training.

Option C (Increase the learning rate) would exacerbate the problem, leading to even more significant oscillations and preventing convergence.
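
A minimal PyTorch sketch of the fix, with a hypothetical stand-in model and batch; only the lr value changes relative to the unstable run.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)          # stand-in for the real network
loss_fn = nn.MSELoss()

# Oscillating loss suggests SGD is overshooting; a smaller learning rate takes
# finer steps toward the minimum (reduced from a hypothetical 0.5).
optimizer = optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)           # hypothetical training batch
y = torch.randn(32, 1)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```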

19
Q

An ML engineer needs to process thousands of existing CSV objects and new CSV objects that are uploaded. The CSV objects are stored in a central Amazon S3 bucket and have the same number of columns. One of the columns is a transaction date. The ML engineer must query the data based on the transaction date. Which solution will meet these requirements with the LEAST operational overhead?

A. Use an Amazon Athena CREATE TABLE AS SELECT (CTAS) statement to create a table based on the transaction date from data in the central S3 bucket. Query the objects from the table.
B. Create a new S3 bucket for processed data. Set up S3 replication from the central S3 bucket to the new S3 bucket. Use S3 Object Lambda to query the objects based on transaction date.
C. Create a new S3 bucket for processed data. Use AWS Glue for Apache Spark to create a job to query the CSV objects based on transaction date. Configure the job to store the results in the new S3 bucket. Query the objects from the new S3 bucket.
D. Create a new S3 bucket for processed data. Use Amazon Data Firehose to transfer the data from the central S3 bucket to the new S3 bucket. Configure Firehose to run an AWS Lambda function to query the data based on transaction date.

A

A

A is correct because Athena allows querying data directly from S3 using SQL, minimizing operational overhead. CTAS creates a table based on the filtered data (by transaction date), making subsequent queries efficient.

B is incorrect because S3 Object Lambda is designed for data transformation, not efficient querying. Adding replication increases complexity unnecessarily.

C is incorrect because while SparkSQL can query S3 data, it involves more setup and operational overhead than Athena, and creating a new S3 bucket is unnecessary.

D is incorrect because Firehose cannot directly consume from S3 for querying purposes. Using Lambda for querying adds significant complexity.
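
A sketch of the CTAS approach via boto3, with hypothetical database, table, and bucket names. Partitioning by the transaction date lets Athena prune data when querying by date; note that the partition column must appear last in the SELECT list.

```python
import boto3

athena = boto3.client("athena")

ctas_query = """
CREATE TABLE sales.transactions_by_date
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/processed/transactions/',
    partitioned_by = ARRAY['transaction_date']
) AS
SELECT *  -- assumes transaction_date is the last column; otherwise list columns explicitly
FROM sales.raw_transactions_csv
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```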

20
Q

A company has a large, unstructured dataset containing many duplicate records across several key attributes. Which AWS solution requires the LEAST amount of code development to detect these duplicates?

A. Use Amazon Mechanical Turk jobs to detect duplicates.
B. Use Amazon QuickSight ML Insights to build a custom deduplication model.
C. Use Amazon SageMaker Data Wrangler to pre-process and detect duplicates.
D. Use the AWS Glue FindMatches transform to detect duplicates.

A

D

The correct answer is D because AWS Glue FindMatches is specifically designed to identify duplicate or matching records in datasets with minimal code development. It uses machine learning to find fuzzy matches and allows customization without requiring the creation of a complex custom deduplication model.

Option A (Amazon Mechanical Turk) would require significant effort to define tasks, manage workers, and review results, making it far less efficient. Option B (Amazon QuickSight ML Insights) necessitates building a custom model, which requires substantial coding. Option C (Amazon SageMaker Data Wrangler) focuses on data preparation and transformation, not directly on duplicate detection. While it might be used as part of a deduplication workflow, it’s not the primary solution for the task.

21
Q

A company needs to run a batch data-processing job on Amazon EC2 instances. The job will run during the weekend and will take 90 minutes to finish running. The processing can handle interruptions. The company will run the job every weekend for the next 6 months. Which EC2 instance purchasing option will meet these requirements MOST cost-effectively?
A. Spot Instances
B. Reserved Instances
C. On-Demand Instances
D. Dedicated Instances

A

A. Spot Instances

Spot Instances are the most cost-effective option because they provide spare EC2 capacity at a significantly reduced price compared to On-Demand Instances. The fact that the job can handle interruptions is crucial; Spot Instances can be interrupted with short notice if AWS needs the capacity for other tasks. Since the job only runs for 90 minutes on weekends, the risk of interruption is manageable, and the cost savings outweigh the potential inconvenience.

Reserved Instances are more cost-effective for long-running, consistent workloads. Their upfront cost or commitment is not suitable for a job that runs only for 90 minutes each weekend.

On-Demand Instances offer flexibility but are the most expensive option and not cost-effective for this scenario.

Dedicated Instances provide dedicated physical hardware, which is unnecessary and more expensive than Spot Instances for this batch processing job.

22
Q

An ML engineer has an Amazon Comprehend custom model in Account A in the us-east-1 Region. The ML engineer needs to copy the model to Account B in the same Region. Which solution will meet this requirement with the LEAST development effort?
A. Use Amazon S3 to make a copy of the model. Transfer the copy to Account B.
B. Create a resource-based IAM policy. Use the Amazon Comprehend ImportModel API operation to copy the model to Account B.
C. Use AWS DataSync to replicate the model from Account A to Account B.
D. Create an AWS Site-to-Site VPN connection between Account A and Account B to transfer the model.

A

B

Amazon Comprehend supports copying a custom model to another AWS account in the same Region: the source account attaches a resource-based IAM policy to the model to authorize the target account, and the target account calls the ImportModel API operation with the model's ARN. This requires no data transfer, replication, or network setup, so options A, C, and D involve unnecessary effort.

23
Q

An ML engineer is training a simple neural network model. The ML engineer tracks the performance of the model over time on a validation dataset. The model’s performance improves substantially at first and then degrades after a specific number of epochs. Which solutions will mitigate this problem? (Choose two.)
A. Enable early stopping on the model.
B. Increase dropout in the layers.
C. Increase the number of layers.
D. Increase the number of neurons.
E. Investigate and reduce the sources of model bias.

A

A, B

The problem described is overfitting: the model performs well on the training data but poorly on unseen validation data, indicating it has learned the training data too well, including noise. Options A and B directly address overfitting:

A. Enable early stopping: This prevents the model from training past the point where its performance on the validation set begins to degrade. It stops training at the point of best validation performance, thus mitigating overfitting.

B. Increase dropout: Dropout randomly deactivates neurons during training, forcing the network to learn more robust features and preventing it from relying too heavily on any single neuron or set of neurons, reducing overfitting.

Options C and D would likely worsen the overfitting. Increasing the number of layers or neurons increases the model’s capacity, making it more prone to overfitting. Option E, while important for model quality in general, doesn’t directly address the observed overfitting problem of degrading validation performance after a certain number of epochs.
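
A Keras sketch combining both mitigations, using a hypothetical dataset shape and random data for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical tabular data: 1,000 rows, 20 features, binary label
x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                      # B: dropout regularization
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A: stop when validation loss stops improving and keep the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```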

24
Q

A company has a Retrieval Augmented Generation (RAG) application that uses a vector database to store embeddings of documents. The company must migrate the application to AWS and must implement a solution that provides semantic search of text files. The company has already migrated the text repository to an Amazon S3 bucket. Which solution will meet these requirements?

A. Use an AWS Batch job to process the files and generate embeddings. Use AWS Glue to store the embeddings. Use SQL queries to perform the semantic searches.
B. Use a custom Amazon SageMaker notebook to run a custom script to generate embeddings. Use SageMaker Feature Store to store the embeddings. Use SQL queries to perform the semantic searches.
C. Use the Amazon Kendra S3 connector to ingest the documents from the S3 bucket into Amazon Kendra. Query Amazon Kendra to perform the semantic searches.
D. Use an Amazon Textract asynchronous job to ingest the documents from the S3 bucket. Query Amazon Textract to perform the semantic searches.

A

C

Amazon Kendra is a service specifically designed for semantic search. Options A and B would require custom development to implement semantic search capabilities, making them less efficient and more complex than using a purpose-built service like Kendra. Option D, using Amazon Textract, is incorrect because Textract is primarily for extracting text and data from documents, not for performing semantic searches. Therefore, only option C directly addresses the requirement for semantic search using an existing AWS service already integrated with S3.
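
Once the S3 connector has ingested the documents, a semantic query is a single API call; the index ID and question are hypothetical.

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="12345678-1234-1234-1234-123456789012",  # hypothetical index
    QueryText="What is our refund policy for damaged items?",
)

# Print the type and title of each returned result
for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))
```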

25
A company uses Amazon Athena to query a dataset in Amazon S3. This dataset contains a target variable the company wants to predict. They need to determine if a model can predict this target variable with the least development effort. Which solution best achieves this? A. Create a new model by using Amazon SageMaker Autopilot. Report the model's achieved performance. B. Implement custom scripts to perform data pre-processing, multiple linear regression, and performance evaluation. Run the scripts on Amazon EC2 instances. C. Configure Amazon Macie to analyze the dataset and to create a model. Report the model's achieved performance. D. Select a model from Amazon Bedrock. Tune the model with the data. Report the model's achieved performance.
A A is correct because Amazon SageMaker Autopilot automates the process of building, training, and tuning machine learning models, requiring minimal development effort compared to the other options. B is incorrect because it requires significant development effort to write, test, and deploy custom scripts for data preprocessing, model training, and evaluation on EC2 instances. C is incorrect because Amazon Macie is a data security and privacy service; it is not designed for building predictive models. D is incorrect because Amazon Bedrock focuses on foundation models for tasks like text generation, not structured data prediction tasks, requiring significant effort to adapt it for this purpose.
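
A sketch of launching the Autopilot job with boto3; the job name, S3 paths, target column, and role are hypothetical. Autopilot handles preprocessing, algorithm selection, and tuning, then reports the best candidate's metrics.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="target-prediction-autopilot",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/dataset/",
                }
            },
            "TargetAttributeName": "target",  # the column the company wants to predict
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/autopilot-output/"},
    RoleArn="arn:aws:iam::111122223333:role/SageMakerAutopilotRole",
)
```
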
26
A company wants to predict the success of advertising campaigns by considering the color scheme of each advertisement. An ML engineer is preparing data for a neural network model. The dataset includes color information as categorical data. Which technique for feature engineering should the ML engineer use for the model? A. Apply label encoding to the color categories. Automatically assign each color a unique integer. B. Implement padding to ensure that all color feature vectors have the same length. C. Perform dimensionality reduction on the color categories. D. One-hot encode the color categories to transform the color scheme feature into a binary matrix.
D One-hot encoding represents each color as its own binary feature, which lets the neural network treat the categories as unordered. Label encoding (A) would impose an artificial ordinal relationship between colors, while padding (B) and dimensionality reduction (C) do not address how to represent categorical data for the model.
27
A company uses a hybrid cloud environment. A model deployed on-premises uses data in Amazon S3 to provide customers with a live conversational engine. The model uses sensitive data. An ML engineer needs to implement a solution to identify and remove this sensitive data with the LEAST operational overhead. Which solution best meets these requirements? A. Deploy the model on Amazon SageMaker. Create a set of AWS Lambda functions to identify and remove the sensitive data. B. Deploy the model on an Amazon Elastic Container Service (Amazon ECS) cluster that uses AWS Fargate. Create an AWS Batch job to identify and remove the sensitive data. C. Use Amazon Macie to identify the sensitive data. Create a set of AWS Lambda functions to remove the sensitive data. D. Use Amazon Comprehend to identify the sensitive data. Launch Amazon EC2 instances to remove the sensitive data.
C The best solution is C because it leverages managed services to minimize operational overhead. Amazon Macie is specifically designed for identifying sensitive data in S3, automating the identification process. Using Lambda functions to remove the data keeps the solution serverless, further reducing operational overhead compared to managing EC2 instances (D) or an ECS cluster (B). While option A involves migrating the model to SageMaker, this adds significant operational overhead compared to using existing on-premises infrastructure and utilizing a managed service like Macie. Option D also adds significant operational overhead by requiring the management of EC2 instances.
28
An ML engineer needs to create data ingestion pipelines and ML model deployment pipelines on AWS. All the raw data is stored in Amazon S3 buckets. Which solution will meet these requirements? A. Use Amazon Data Firehose to create the data ingestion pipelines. Use Amazon SageMaker Studio Classic to create the model deployment pipelines. B. Use AWS Glue to create the data ingestion pipelines. Use Amazon SageMaker Studio Classic to create the model deployment pipelines. C. Use Amazon Redshift ML to create the data ingestion pipelines. Use Amazon SageMaker Studio Classic to create the model deployment pipelines. D. Use Amazon Athena to create the data ingestion pipelines. Use an Amazon SageMaker notebook to create the model deployment pipelines.
B AWS Glue is the most appropriate service for creating data ingestion pipelines from Amazon S3. It's designed for batch processing and ETL (Extract, Transform, Load) jobs, making it suitable for handling raw data in S3 buckets. Amazon SageMaker Studio Classic is a well-suited environment for building and deploying ML models. Option A is incorrect because Amazon Kinesis Data Firehose is optimized for real-time data streaming, not batch processing from S3. Option C is incorrect because Amazon Redshift ML is primarily a database service for running machine learning models, not for data ingestion. Option D is incorrect because while Amazon Athena can query data in S3, it is not designed for building data ingestion pipelines, and using a SageMaker notebook for the deployment pipeline is less efficient and organized than SageMaker Studio Classic.
29
A company runs an Amazon SageMaker domain in a public subnet of a newly created VPC. The network is configured properly, and ML engineers can access the SageMaker domain. Recently, the company discovered suspicious traffic to the domain from a specific IP address. The company needs to block traffic from the specific IP address. Which update to the network configuration will meet this requirement? A. Create a security group inbound rule to deny traffic from the specific IP address. Assign the security group to the domain. B. Create a network ACL inbound rule to deny traffic from the specific IP address. Assign the rule to the default network ACL for the subnet where the domain is located. C. Create a shadow variant for the domain. Configure SageMaker Inference Recommender to send traffic from the specific IP address to the shadow endpoint. D. Create a VPC route table to deny inbound traffic from the specific IP address. Assign the route table to the domain.
B Security groups operate at the instance level and support only allow rules; they cannot explicitly deny traffic, so option A is incorrect. Network ACLs (NACLs) operate at the subnet level and support both allow and deny rules, so they can explicitly deny traffic from a specific IP address, making option B the correct choice. Option C involves creating a shadow variant and using SageMaker Inference Recommender, which addresses traffic routing for model testing, not access control, and is not relevant to blocking an IP address. VPC route tables control how traffic is routed between subnets and to the internet; they cannot block inbound traffic from a specific IP address, so option D is incorrect.
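
A sketch of the deny rule with boto3; the network ACL ID, rule number, and IP address are hypothetical. Lower rule numbers are evaluated first, so the deny takes effect before the default allow rules.

```python
import boto3

ec2 = boto3.client("ec2")

ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # ACL of the subnet hosting the domain
    RuleNumber=50,
    Protocol="-1",                         # all protocols
    RuleAction="deny",
    Egress=False,                          # inbound rule
    CidrBlock="203.0.113.25/32",           # the suspicious IP address
)
```
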
30
A company is gathering audio, video, and text data in various languages. The company needs to use a large language model (LLM) to summarize the gathered data that is in Spanish. Which solution will meet these requirements in the LEAST amount of time? A. Train and deploy a model in Amazon SageMaker to convert the data into English text. Train and deploy an LLM in SageMaker to summarize the text. B. Use Amazon Transcribe and Amazon Translate to convert the data into English text. Use Amazon Bedrock with the Jurassic model to summarize the text. C. Use Amazon Rekognition and Amazon Translate to convert the data into English text. Use Amazon Bedrock with the Anthropic Claude model to summarize the text. D. Use Amazon Comprehend and Amazon Translate to convert the data into English text. Use Amazon Bedrock with the Stable Diffusion model to summarize the text.
B The best answer is B because it leverages pre-trained services for transcription, translation, and summarization. Option A requires training and deploying new models, which is significantly more time-consuming than using pre-trained services. Option C is incorrect because Amazon Rekognition is an image and video analysis service and cannot transcribe audio into text. Option D is incorrect because Stable Diffusion is an image generation model, not suitable for text summarization. Option B uses Amazon Transcribe (audio to text), Amazon Translate (Spanish to English), and Amazon Bedrock with the Jurassic model (a text-generation LLM that can summarize), all pre-trained and readily available, making it the fastest solution.
31
A financial company receives a high volume of real-time market data streams from an external provider. The streams consist of thousands of JSON records every second. The company needs to implement a scalable solution on AWS to identify anomalous data points. Which solution will meet these requirements with the LEAST operational overhead? A. Ingest real-time data into Amazon Kinesis data streams. Use the built-in RANDOM_CUT_FOREST function in Amazon Managed Service for Apache Flink to process the data streams and to detect data anomalies. B. Ingest real-time data into Amazon Kinesis data streams. Deploy an Amazon SageMaker endpoint for real-time outlier detection. Create an AWS Lambda function to detect anomalies. Use the data streams to invoke the Lambda function. C. Ingest real-time data into Apache Kafka on Amazon EC2 instances. Deploy an Amazon SageMaker endpoint for real-time outlier detection. Create an AWS Lambda function to detect anomalies. Use the data streams to invoke the Lambda function. D. Send real-time data to an Amazon Simple Queue Service (Amazon SQS) FIFO queue. Create an AWS Lambda function to consume the queue messages. Program the Lambda function to start an AWS Glue extract, transform, and load (ETL) job for batch processing and anomaly detection.
A The best answer is A because it leverages fully managed AWS services designed for real-time processing and anomaly detection. Amazon Kinesis Data Streams is well-suited for handling high-volume data streams, and Amazon Managed Service for Apache Flink (with its built-in RANDOM_CUT_FOREST function) provides a scalable and managed solution for anomaly detection, minimizing operational overhead. Option B and C are less optimal because they require managing additional services (SageMaker endpoint, Lambda function) leading to increased operational complexity. Option D is incorrect because it uses a batch processing approach (AWS Glue) which is unsuitable for real-time anomaly detection. While the RANDOM_CUT_FOREST function might not be available directly as described, the overall approach of using Kinesis and a managed service for processing remains the most efficient for minimal operational overhead.
32
A company has a large collection of chat recordings from customer interactions after a product release. An ML engineer needs to create an ML model to analyze the chat data and determine the success of the product by reviewing customer sentiments about the product. Which action should the ML engineer take to complete the evaluation in the LEAST amount of time? A. Use Amazon Rekognition to analyze sentiments of the chat conversations. B. Train a Naive Bayes classifier to analyze sentiments of the chat conversations. C. Use Amazon Comprehend to analyze sentiments of the chat conversations. D. Use random forests to classify sentiments of the chat conversations.
C Amazon Comprehend is the correct answer because it's a pre-built service specifically designed for natural language processing (NLP) tasks, including sentiment analysis. This means it requires minimal setup and training compared to building and training a model from scratch (options B and D). Option A, Amazon Rekognition, is designed for image and video analysis, not text, making it unsuitable for this task. Therefore, using Amazon Comprehend offers the fastest solution for analyzing the large volume of chat data.
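
A sketch of the sentiment call with boto3; the chat text is hypothetical and no model training is required.

```python
import boto3

comprehend = boto3.client("comprehend")

chat_message = "The new release is fantastic, setup took two minutes!"

result = comprehend.detect_sentiment(Text=chat_message, LanguageCode="en")
print(result["Sentiment"])        # e.g. POSITIVE
print(result["SentimentScore"])   # confidence scores per sentiment class
```
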
33
A company has a conversational AI assistant that sends requests through Amazon Bedrock to an Anthropic Claude large language model (LLM). Users report that when they ask similar questions multiple times, they sometimes receive different answers. An ML engineer needs to improve the responses to be more consistent and less random. Which solution will meet these requirements? A. Increase the temperature parameter and the top_k parameter. B. Increase the temperature parameter. Decrease the top_k parameter. C. Decrease the temperature parameter. Increase the top_k parameter. D. Decrease the temperature parameter and the top_k parameter.
D The correct answer is D because decreasing both the temperature and top_k parameters will make the LLM's output more deterministic and less random. A lower temperature parameter leads to higher probability outputs (more focused, less creative/random responses), and a lower top_k parameter focuses the model on the most likely outputs, further reducing randomness. Options A, B, and C all involve increasing either the temperature or top_k parameter (or both), which would increase randomness and variability in the responses, thus worsening the problem.
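
A sketch of passing the lowered parameters to a Claude model through the Bedrock runtime; the model ID and parameter values are illustrative.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Lower temperature and top_k make sampling more deterministic, so repeated
# questions receive more consistent answers.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "temperature": 0.1,   # decreased
    "top_k": 10,          # decreased
    "messages": [
        {"role": "user", "content": "Summarize our return policy in two sentences."}
    ],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))
```
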
34
A company is using Amazon SageMaker's linear learner built-in algorithm with `multiclass_classifier` set for the `predictor_type` hyperparameter to predict the presence of a specific weed in a farmer's field. What should the company do to MINIMIZE false positives? A. Set the value of the weight decay hyperparameter to zero. B. Increase the number of training epochs. C. Increase the value of the `target_precision` hyperparameter. D. Change the value of the `predictor_type` hyperparameter to `regressor`.
C The correct answer is C because increasing the `target_precision` hyperparameter directly addresses the problem of minimizing false positives. Precision is the ratio of true positives to the sum of true positives and false positives. By increasing the target precision, the model is trained to prioritize correct positive predictions, thereby reducing the number of false positives. Option A is incorrect because setting weight decay to zero removes regularization, which can lead to overfitting and potentially increase false positives. Option B is incorrect because increasing the number of epochs might improve model accuracy but doesn't directly target false positives; it could even lead to overfitting and increased false positives. Option D is incorrect because changing `predictor_type` to `regressor` is inappropriate for this classification problem; a regressor predicts a continuous value, not a class label (weed present/absent).
35
A company has implemented a data ingestion pipeline for sales transactions from its ecommerce website. The company uses Amazon Data Firehose to ingest data into Amazon OpenSearch Service. The buffer interval of the Firehose stream is set for 60 seconds. An OpenSearch linear model generates real-time sales forecasts based on the data and presents the data in an OpenSearch dashboard. The company needs to optimize the data ingestion pipeline to support sub-second latency for the real-time dashboard. Which change to the architecture will meet these requirements? A. Use zero buffering in the Firehose stream. Tune the batch size that is used in the PutRecordBatch operation. B. Replace the Firehose stream with an AWS DataSync task. Configure the task with enhanced fan-out consumers. C. Increase the buffer interval of the Firehose stream from 60 seconds to 120 seconds. D. Replace the Firehose stream with an Amazon Simple Queue Service (Amazon SQS) queue.
A A is correct because using zero buffering in Firehose eliminates the 60-second delay caused by the buffer. Tuning the batch size further optimizes throughput for sub-second delivery, crucial for real-time dashboards. B is incorrect because AWS DataSync is designed for large-scale data transfers and is not optimized for the sub-second latency required for real-time dashboards. C is incorrect because increasing the buffer interval would *increase* latency, making the dashboard even slower. D is incorrect because introducing an SQS queue adds another layer of processing and queuing, increasing latency rather than reducing it. SQS is not ideal for the low-latency requirements of real-time dashboards.
36
A company has trained a machine learning (ML) model in Amazon SageMaker and needs to host it for production inferences. The model requires high availability, minimal latency, and must handle request sizes between 1 KB and 3 MB. The model will experience unpredictable request bursts throughout the day, demanding proportional scaling of inferences to match fluctuating demand. Which deployment strategy best meets these requirements? A. Create a SageMaker real-time inference endpoint. Configure auto-scaling. Configure the endpoint to present the existing model. B. Deploy the model on an Amazon Elastic Container Service (Amazon ECS) cluster. Use ECS scheduled scaling based on the CPU of the ECS cluster. C. Install SageMaker Operator on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. Deploy the model in Amazon EKS. Set horizontal pod auto-scaling to scale replicas based on the memory metric. D. Use Spot Instances with a Spot Fleet behind an Application Load Balancer (ALB) for inferences. Use the ALBRequestCountPerTarget metric for auto-scaling.
A A is correct because SageMaker real-time endpoints are specifically designed for low-latency, high-availability, and auto-scaling, making them ideal for handling unpredictable request bursts. The built-in auto-scaling feature directly addresses the need for proportional scaling to meet fluctuating demand. B is incorrect because while ECS allows for scaling, relying solely on CPU-based scheduled scaling may not be responsive enough to handle unpredictable bursts of requests. It lacks the fine-grained control and immediate responsiveness of SageMaker's auto-scaling. C is incorrect because while EKS with horizontal pod auto-scaling offers scalability, it adds complexity and overhead compared to the purpose-built SageMaker solution. Using memory as the scaling metric might not accurately reflect the inference workload. D is incorrect because using Spot Instances introduces the risk of interruptions due to instance termination. While ALB can handle load balancing, relying on ALBRequestCountPerTarget for auto-scaling might not be as efficient or responsive as SageMaker's integrated auto-scaling mechanism designed for ML inference.
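
A sketch of the auto-scaling configuration for the real-time endpoint's production variant; the endpoint name, capacities, and target value are hypothetical.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/fraud-endpoint/variant/AllTraffic"  # hypothetical endpoint

# Register the production variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,    # keeps the endpoint highly available
    MaxCapacity=10,
)

# Track invocations per instance so capacity follows request bursts
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```
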
37
An ML engineer needs to use an Amazon EMR cluster to process large volumes of data in batches. Any data loss is unacceptable. Which instance purchasing option will meet these requirements MOST cost-effectively? A. Run the primary node, core nodes, and task nodes on On-Demand Instances. B. Run the primary node, core nodes, and task nodes on Spot Instances. C. Run the primary node on an On-Demand Instance. Run the core nodes and task nodes on Spot Instances. D. Run the primary node and core nodes on On-Demand Instances. Run the task nodes on Spot Instances.
D The most cost-effective option that guarantees no data loss is to use On-Demand Instances for the primary and core nodes and Spot Instances for the task nodes. The primary node is critical for cluster operation: using a Spot Instance here risks cluster instability and potential data loss if the instance is interrupted, while an On-Demand Instance guarantees availability. Core nodes are part of HDFS (Hadoop Distributed File System), and losing them can lead to partial data loss, so On-Demand Instances ensure continuous operation and data safety. Task nodes process data but don't persistently store it in HDFS; if a Spot Instance task node is interrupted due to price increases, no data is lost, and using Spot Instances for task nodes offers significant cost savings. Option A is too expensive. Option B risks significant data loss. Option C still risks data loss from core node interruptions. Only option D balances cost savings with the requirement of zero data loss.
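
A sketch of the purchasing mix in a boto3 run_job_flow call; the release label, instance types, and counts are hypothetical.

```python
import boto3

emr = boto3.client("emr")

# Primary and core nodes on On-Demand capacity (cluster stability, HDFS durability);
# task nodes on Spot capacity (interruptible compute only).
emr.run_job_flow(
    Name="weekend-batch-processing",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": False,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.2xlarge", "InstanceCount": 3},
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.2xlarge", "InstanceCount": 6},
        ],
    },
)
```
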
38
A company wants to improve the sustainability of its ML operations. Which actions will reduce the energy usage and computational resources associated with the company's training jobs? (Choose two.) A. Use Amazon SageMaker Debugger to stop training jobs when non-converging conditions are detected. B. Use Amazon SageMaker Ground Truth for data labeling. C. Deploy models by using AWS Lambda functions. D. Use AWS Trainium instances for training. E. Use PyTorch or TensorFlow with the distributed training option.
A and D A is correct because Amazon SageMaker Debugger can detect issues, such as non-converging training, that waste resources. By stopping non-converging jobs early, it reduces energy consumption and compute usage. B is incorrect because data labeling is a preprocessing step and doesn't directly impact the energy consumption of training jobs; efficient labeling matters for overall ML efficiency, but it doesn't reduce the energy used *during training*. C is incorrect because deploying models with AWS Lambda functions affects inference, not training. D is correct because AWS Trainium instances are purpose-built for deep learning training and deliver better performance per watt than comparable GPU-based instances, making training more energy efficient and cost effective. E is incorrect because while distributed training can improve training *speed*, it doesn't necessarily reduce overall energy consumption or compute unless the job finishes significantly faster (which isn't guaranteed), and it may even increase total resource usage by running more instances than a comparable single-node job.
39
A company is planning to create several ML prediction models. The training data is stored in Amazon S3. The entire dataset is more than 5 TB in size and consists of CSV, JSON, Apache Parquet, and simple text files. The data must be processed in several consecutive steps. The steps include complex manipulations that can take hours to finish running. Some of the processing involves natural language processing (NLP) transformations. The entire process must be automated. Which solution will meet these requirements? A. Process data at each step by using Amazon SageMaker Data Wrangler. Automate the process by using Data Wrangler jobs. B. Use Amazon SageMaker notebooks for each data processing step. Automate the process by using Amazon EventBridge. C. Process data at each step by using AWS Lambda functions. Automate the process by using AWS Step Functions and Amazon EventBridge. D. Use Amazon SageMaker Pipelines to create a pipeline of data processing steps. Automate the pipeline by using Amazon EventBridge.
D The correct answer is D because it best addresses all the requirements: The large dataset size (5TB+) and complex, hours-long processing steps involving NLP rule out solutions A, B, and C. SageMaker Pipelines are designed for building and managing complex ML workflows, including data processing. Option A (SageMaker Data Wrangler) is suitable for data preparation but isn't ideal for the extensive, multi-stage processing involved. Option B (SageMaker notebooks) is interactive and not designed for automated, large-scale processing. Option C (Lambda functions and Step Functions) could handle automation, but Lambda's execution time limits and cost implications for this scale of processing make it less efficient than SageMaker Pipelines. Amazon EventBridge can be used in conjunction with SageMaker Pipelines to trigger the pipeline automatically, thus fulfilling the automation requirement.
40
An ML engineer needs to use AWS CloudFormation to create an ML model that an Amazon SageMaker endpoint will host. Which resource should the ML engineer declare in the CloudFormation template to meet this requirement? A. AWS::SageMaker::Model B. AWS::SageMaker::Endpoint C. AWS::SageMaker::NotebookInstance D. AWS::SageMaker::Pipeline
A The correct answer is A, AWS::SageMaker::Model. This resource is specifically designed to define the ML model, including its location (S3 artifacts), inference container/image, and IAM role. A SageMaker Model is a prerequisite for deploying a model to a SageMaker Endpoint. Option B, AWS::SageMaker::Endpoint, is incorrect because it represents the endpoint itself, which hosts the model but doesn't define the model's properties. Option C, AWS::SageMaker::NotebookInstance, is incorrect as it relates to notebook instances used for model development, not the model itself. Option D, AWS::SageMaker::Pipeline, is incorrect because it's for creating and managing SageMaker pipelines, a process for building and deploying models, not the model itself.
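For reference, a minimal sketch of declaring this resource and creating the stack with boto3; the role ARN, image URI, and model data location are hypothetical placeholders:

```python
import boto3

template_body = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  FraudDetectionModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: arn:aws:iam::111122223333:role/SageMakerExecutionRole
      PrimaryContainer:
        Image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/inference-image:latest
        ModelDataUrl: s3://my-bucket/model/model.tar.gz
"""

# AWS::SageMaker::Model defines the model; separate AWS::SageMaker::EndpointConfig
# and AWS::SageMaker::Endpoint resources would then host it.
boto3.client("cloudformation").create_stack(
    StackName="sagemaker-model-stack",
    TemplateBody=template_body,
)
```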
41
An advertising company uses AWS Lake Formation to manage a data lake containing structured and unstructured data. ML engineers are assigned to specific advertisement campaigns and must access data via Amazon Athena and by browsing it directly in an Amazon S3 bucket. They must only access resources specific to their assigned campaigns. Which solution offers the MOST operationally efficient way to achieve this? A. Configure IAM policies on an AWS Glue Data Catalog to restrict access to Athena based on the ML engineers' campaigns. B. Store users and campaign information in an Amazon DynamoDB table. Configure DynamoDB Streams to invoke an AWS Lambda function to update S3 bucket policies. C. Use Lake Formation to authorize AWS Glue to access the S3 bucket. Configure Lake Formation tags to map ML engineers to their campaigns. D. Configure S3 bucket policies to restrict access to the S3 bucket based on the ML engineers' campaigns.
C Lake Formation is the most operationally efficient solution because it's designed for fine-grained access control within a data lake. By tagging resources with campaign information and mapping engineers to those campaigns, Lake Formation provides a centralized and automated way to manage access. Options A and D are less efficient because they require managing policies individually for each engineer and campaign, leading to complexity and potential errors. Option B is overly complex and inefficient, involving multiple services and custom code to manage access, whereas Lake Formation's built-in features offer a simpler and more streamlined approach.
42
An ML engineer needs to use data with Amazon SageMaker Canvas to train an ML model. The data is stored in Amazon S3 and is complex in structure. The ML engineer must use a file format that minimizes processing time for the data. Which file format will meet these requirements? A. CSV files compressed with Snappy B. JSON objects in JSONL format C. JSON files compressed with gzip D. Apache Parquet files
D Parquet is the correct answer because it is a columnar storage format optimized for performance and efficiency, especially with complex data. Its columnar structure allows for faster query processing as only the necessary columns are read, unlike row-oriented formats like CSV or JSON. Parquet also incorporates built-in compression, further enhancing performance. SageMaker Canvas is compatible with Parquet files. Option A (CSV with Snappy) is less efficient than Parquet due to its row-oriented nature. While compression helps, it doesn't address the fundamental performance limitations of CSV. Option B (JSONL) is also row-oriented and doesn't offer the same level of optimized performance as Parquet, especially for complex data. Option C (JSON with gzip) suffers from the same drawbacks as JSONL; it's row-oriented and less efficient for complex data, despite using compression.
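If the source data is currently in row-oriented CSV, a quick sketch of converting it to Parquet with pandas before importing into Canvas (the S3 paths are hypothetical, and the pyarrow and s3fs packages are assumed to be installed):

```python
import pandas as pd

# Read the row-oriented CSV export and rewrite it as columnar Parquet.
df = pd.read_csv("s3://my-bucket/raw/transactions.csv")
df.to_parquet("s3://my-bucket/prepared/transactions.parquet", index=False)
```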
43
An ML engineer is evaluating several ML models and must choose one model to use in production. The cost of false negative predictions by the models is much higher than the cost of false positive predictions. Which metric finding should the ML engineer prioritize MOST when choosing the model? A. Low precision B. High precision C. Low recall D. High recall
D. High recall The correct answer is D because the problem states that false negatives are far more costly than false positives. Recall is the ratio of correctly predicted positive observations to all actual positive observations. High recall minimizes false negatives, aligning with the engineer's priority to reduce the cost of these errors. Option A (Low precision) is incorrect because low precision increases false positives, which are less costly according to the problem. Option B (High precision) is incorrect because while it reduces false positives, it doesn't directly address the more critical issue of minimizing the more expensive false negatives. Option C (Low recall) is incorrect because low recall directly increases false negatives, which should be avoided due to their high cost.
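A small scikit-learn illustration of why recall is the right lens when false negatives are expensive (the labels below are made up):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = fraud, 0 = legitimate
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Recall = TP / (TP + FN): the share of actual fraud cases the model catches.
print("recall:", recall_score(y_true, y_pred))        # 0.75 -> one fraud case was missed
# Precision = TP / (TP + FP): the share of fraud alerts that were real.
print("precision:", precision_score(y_true, y_pred))  # 0.75 -> one false alarm
```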
44
A company uses an Amazon Redshift database as its sole data source, containing sensitive data. A data scientist needs access to this sensitive data. An ML engineer must grant this access without modifying the source data or storing anonymized data within the database. Which solution requires the LEAST implementation effort? A. Configure dynamic data masking policies to control how sensitive data is shared with the data scientist at query time. B. Create a materialized view with masking logic on top of the database. Grant the necessary read permissions to the data scientist. C. Unload the Amazon Redshift data to Amazon S3. Use Amazon Athena to create schema-on-read with masking logic. Share the view with the data scientist. D. Unload the Amazon Redshift data to Amazon S3. Create an AWS Glue job to anonymize the data. Share the dataset with the data scientist.
A Dynamic data masking in Amazon Redshift applies masking policies at query time, so the data scientist sees obfuscated values without any changes to the stored data and without creating or storing anonymized copies. Options B, C, and D all require building and maintaining additional objects or pipelines (a materialized view, an Athena schema-on-read layer over unloaded data, or an AWS Glue anonymization job), which is more implementation effort.
45
An ML engineer is using a training job to fine-tune a deep learning model in Amazon SageMaker Studio. The ML engineer previously used the same pre-trained model with a similar dataset. The ML engineer expects vanishing gradient, underutilized GPU, and overfitting problems. The ML engineer needs to implement a solution to detect these issues and to react in predefined ways when the issues occur. The solution also must provide comprehensive real-time metrics during the training. Which solution will meet these requirements with the LEAST operational overhead? A. Use TensorBoard to monitor the training job. Publish the findings to an Amazon Simple Notification Service (Amazon SNS) topic. Create an AWS Lambda function to consume the findings and to initiate the predefined actions. B. Use Amazon CloudWatch default metrics to gain insights about the training job. Use the metrics to invoke an AWS Lambda function to initiate the predefined actions. C. Expand the metrics in Amazon CloudWatch to include the gradients in each training step. Use the metrics to invoke an AWS Lambda function to initiate the predefined actions. D. Use SageMaker Debugger built-in rules to monitor the training job. Configure the rules to initiate the predefined actions.
D SageMaker Debugger built-in rules can detect vanishing gradients, low GPU utilization, and overfitting during training, emit comprehensive real-time metrics, and trigger predefined actions (such as stopping the job or sending a notification) when a rule fires, all without building custom monitoring. Options A, B, and C require assembling and maintaining extra components (TensorBoard publishing, SNS topics, Lambda functions, or custom CloudWatch metrics), which adds operational overhead, and CloudWatch default metrics do not capture gradient-level signals.
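A minimal sketch of attaching built-in Debugger rules with a stop-training action, assuming the SageMaker Python SDK; the image URI, role, and S3 path are hypothetical placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, rule_configs

# Predefined reaction: stop the training job whenever a rule fires.
stop_job = rule_configs.ActionList(rule_configs.StopTraining())

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient(), actions=stop_job),
    Rule.sagemaker(rule_configs.low_gpu_utilization(), actions=stop_job),
    Rule.sagemaker(rule_configs.overfit(), actions=stop_job),
]

estimator = Estimator(
    image_uri="<training-image-uri>",              # hypothetical
    role="<execution-role-arn>",                   # hypothetical
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=rules,                                   # Debugger emits real-time metrics per rule
)
estimator.fit("s3://my-bucket/fine-tuning-data/")  # hypothetical S3 path
```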
46
A credit card company has a fraud detection model in production on an Amazon SageMaker endpoint. The company develops a new version of this model. They need to assess the new model's performance using live data without impacting production end-users. Which solution best meets these requirements? A. Set up SageMaker Debugger and create a custom rule. B. Set up blue/green deployments with all-at-once traffic shifting. C. Set up blue/green deployments with canary traffic shifting. D. Set up shadow testing with a shadow variant of the new model.
D Shadow testing allows the new model to process live data alongside the production model without affecting the production system's output. This enables a direct performance comparison against the existing model using real-world data. A is incorrect because SageMaker Debugger is used for model debugging and anomaly detection during training, not for evaluating a deployed model's performance in a production-like environment. B and C are incorrect because blue/green deployments involve switching traffic entirely to the new model (all-at-once) or gradually (canary), either of which would impact production end-users during the transition. These methods don't allow parallel evaluation against the production model using the same live data stream.
47
A company stores time-series data about user clicks in an Amazon S3 bucket. The raw data consists of millions of rows of user activity every day. ML engineers access this data to develop their ML models and need to generate daily reports and analyze click trends over the past 3 days using Amazon Athena. The company must retain the data for 30 days before archiving it. Which solution will provide the HIGHEST performance for data retrieval? A. Keep all the time-series data without partitioning in the S3 bucket. Manually move data that is older than 30 days to separate S3 buckets. B. Create AWS Lambda functions to copy the time-series data into separate S3 buckets. Apply S3 Lifecycle policies to archive data that is older than 30 days to S3 Glacier Flexible Retrieval. C. Organize the time-series data into partitions by date prefix in the S3 bucket. Apply S3 Lifecycle policies to archive partitions that are older than 30 days to S3 Glacier Flexible Retrieval. D. Put each day's time-series data into its own S3 bucket. Use S3 Lifecycle policies to archive S3 buckets that hold data that is older than 30 days to S3 Glacier Flexible Retrieval.
C The correct answer is C because partitioning the data by date allows Athena to quickly scan only the relevant partitions when querying for the past 3 days. This significantly improves query performance compared to scanning the entire dataset (A) or dealing with the overhead of Lambda functions (B) or managing numerous S3 buckets (D). Option A leads to slow queries due to scanning large datasets. Options B and D introduce unnecessary overhead, slowing down data retrieval. Option C efficiently manages data and utilizes S3 lifecycle policies for cost-effective archiving.
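A sketch of how the date partitions pay off at query time, assuming the table is partitioned on a dt column that mirrors the S3 date prefixes (the table, database, and bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Athena prunes to the last three dt partitions instead of scanning 30 days of clicks.
query = """
SELECT dt, count(*) AS clicks
FROM clickstream.user_clicks
WHERE dt >= date_format(current_date - interval '3' day, '%Y-%m-%d')
GROUP BY dt
ORDER BY dt
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "clickstream"},
    ResultConfiguration={"OutputLocation": "s3://clickstream-bucket/athena-results/"},
)
```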
48
A company has deployed an ML model that detects fraudulent credit card transactions in real time in a banking application. The model uses Amazon SageMaker Asynchronous Inference. Consumers are reporting delays in receiving the inference results. An ML engineer needs to implement a solution to improve the inference performance. The solution also must provide a notification when a deviation in model quality occurs. Which solution will meet these requirements? A. Use SageMaker real-time inference for inference. Use SageMaker Model Monitor for notifications about model quality. B. Use SageMaker batch transform for inference. Use SageMaker Model Monitor for notifications about model quality. C. Use SageMaker Serverless Inference for inference. Use SageMaker Inference Recommender for notifications about model quality. D. Keep using SageMaker Asynchronous Inference for inference. Use SageMaker Inference Recommender for notifications about model quality.
A A is correct because SageMaker real-time inference provides faster predictions than asynchronous inference, addressing the delay issue. SageMaker Model Monitor effectively tracks model quality and sends alerts for deviations. B is incorrect because SageMaker batch transform is not suitable for real-time applications; it's designed for batch processing. C is incorrect because while SageMaker Serverless Inference can improve performance, SageMaker Inference Recommender is not designed to provide notifications about model quality deviations; Model Monitor is better suited for this purpose. D is incorrect because it doesn't address the delay problem; it continues using the slow asynchronous inference method.
49
An ML engineer needs to implement a solution to host a trained ML model. The rate of requests to the model will be inconsistent throughout the day. The ML engineer needs a scalable solution that minimizes costs when the model is not in use. The solution also must maintain the model's capacity to respond to requests during times of peak usage. Which solution will meet these requirements? A. Create AWS Lambda functions that have fixed concurrency to host the model. Configure the Lambda functions to automatically scale based on the number of requests to the model. B. Deploy the model on an Amazon Elastic Container Service (Amazon ECS) cluster that uses AWS Fargate. Set a static number of tasks to handle requests during times of peak usage. C. Deploy the model to an Amazon SageMaker endpoint. Deploy multiple copies of the model to the endpoint. Create an Application Load Balancer to route traffic between the different copies of the model at the endpoint. D. Deploy the model to an Amazon SageMaker endpoint. Create SageMaker endpoint auto scaling policies that are based on Amazon CloudWatch metrics to adjust the number of instances dynamically.
D The correct answer is D because it provides a solution that directly addresses all the requirements: scalability, cost minimization during low usage, and capacity during peak usage. SageMaker endpoint autoscaling, driven by CloudWatch metrics, dynamically adjusts the number of instances based on demand. This ensures that resources are efficiently used only when needed, minimizing costs during low-usage periods while maintaining sufficient capacity during peak demand. Option A is incorrect because while Lambda offers autoscaling, fixed concurrency contradicts the need for cost minimization during low usage. Option B is incorrect because setting a static number of tasks doesn't adapt to fluctuating demand, potentially leading to overspending or insufficient capacity. Option C is incorrect because while it offers a scalable approach using multiple copies and a load balancer, it lacks the dynamic scaling capabilities of SageMaker's autoscaling feature. It could potentially overspend by keeping many instances running even when unnecessary.
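A minimal sketch of registering that auto scaling policy through Application Auto Scaling (the endpoint and variant names, capacity limits, and target value are hypothetical):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)

# Scale on the CloudWatch invocations-per-instance metric.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```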
50
A company uses Amazon SageMaker Studio to develop an ML model within a single SageMaker Studio domain. An ML engineer needs to implement a solution that provides an automated alert when SageMaker compute costs reach a specific threshold. Which solution will meet these requirements? A. Add resource tagging by editing the SageMaker user profile in the SageMaker domain. Configure AWS Cost Explorer to send an alert when the threshold is reached. B. Add resource tagging by editing the SageMaker user profile in the SageMaker domain. Configure AWS Budgets to send an alert when the threshold is reached. C. Add resource tagging by editing each user's IAM profile. Configure AWS Cost Explorer to send an alert when the threshold is reached. D. Add resource tagging by editing each user's IAM profile. Configure AWS Budgets to send an alert when the threshold is reached.
B The correct answer is B because AWS Budgets is specifically designed for setting cost thresholds and sending alerts when those thresholds are reached. While Cost Explorer can show cost data, it doesn't have the built-in functionality to automatically send alerts based on predefined thresholds. Furthermore, tagging resources at the SageMaker user profile level (as opposed to individual IAM profiles) is more efficient and aligned with the single SageMaker domain context. Options A, C, and D are incorrect because they either use the wrong service for alerting (Cost Explorer instead of Budgets) or incorrectly suggest tagging at the individual IAM user level instead of the more efficient SageMaker user profile level.
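A sketch of the Budgets side, assuming a cost allocation tag has already been applied to the SageMaker user profiles and activated (the account ID, tag key/value, amounts, and email address are hypothetical, and the cost filter format is an assumption):

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",
    Budget={
        "BudgetName": "sagemaker-domain-monthly-cost",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {"TagKeyValue": ["user:team$ml-research"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ml-team@example.com"}],
        }
    ],
)
```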
51
A company uses Amazon SageMaker for its ML workloads. The company's ML engineer receives a 50 MB Apache Parquet data file to build a fraud detection model. The file includes several correlated columns that are not required for model training. What should the ML engineer do to drop the unnecessary columns in the file with the LEAST effort? A. Download the file to a local workstation. Perform one-hot encoding by using a custom Python script. B. Create an Apache Spark job that uses a custom processing script on Amazon EMR. C. Create a SageMaker processing job by calling the SageMaker Python SDK. D. Create a data flow in SageMaker Data Wrangler. Configure a transform step.
D The best answer is D because SageMaker Data Wrangler is designed for data exploration, cleaning, and preprocessing directly within the SageMaker ecosystem. Dropping unnecessary columns is a simple transformation easily accomplished within Data Wrangler's visual interface, requiring minimal coding and effort. Option A is inefficient because it involves downloading the data, writing custom code, and then uploading it again. Option B introduces unnecessary complexity by using EMR and Spark for a task easily handled within SageMaker. Option C, while feasible, is also more complex than using the purpose-built Data Wrangler tool and requires coding.
52
A company is creating an application that will recommend products for customers to purchase. The application will make API calls to Amazon Q Business. The company must ensure that responses from Amazon Q Business do not include the name of the company's main competitor. Which solution will meet this requirement? A. Configure the competitor's name as a blocked phrase in Amazon Q Business. B. Configure an Amazon Q Business retriever to exclude the competitor’s name. C. Configure an Amazon Kendra retriever for Amazon Q Business to build indexes that exclude the competitor's name. D. Configure document attribute boosting in Amazon Q Business to deprioritize the competitor's name.
A A is correct because Amazon Q Business allows for the configuration of blocked phrases, directly addressing the need to prevent specific terms (like the competitor's name) from appearing in responses. This is a precise and efficient solution. B is incorrect because while retrievers manage data access, they don't directly control the content of the final response generated by Amazon Q Business. They select relevant data, but don't filter out specific words within that data. C is incorrect because Amazon Kendra is a separate service for indexing documents. While indirectly related to Q Business, configuring Kendra doesn't directly filter Q Business's output. D is incorrect because document attribute boosting prioritizes or deprioritizes certain attributes within documents. It does not filter out or block specific terms from being included in the response entirely.
53
An ML engineer needs to use Amazon SageMaker to fine-tune a large language model (LLM) for text summarization using a low-code no-code (LCNC) approach. Which solution will meet these requirements? A. Use SageMaker Studio to fine-tune an LLM that is deployed on Amazon EC2 instances. B. Use SageMaker Autopilot to fine-tune an LLM that is deployed by a custom API endpoint. C. Use SageMaker Autopilot to fine-tune an LLM that is deployed on Amazon EC2 instances. D. Use SageMaker Autopilot to fine-tune an LLM that is deployed by SageMaker JumpStart.
D SageMaker Autopilot and SageMaker JumpStart are designed for low-code/no-code machine learning workflows. SageMaker JumpStart provides pre-trained models, reducing the need for extensive coding. Option D leverages both to fine-tune a pre-trained LLM for text summarization, fulfilling the LCNC requirement. Option A uses SageMaker Studio, which is a more code-intensive environment, thus not meeting the LCNC requirement. Options B and C, while using SageMaker Autopilot, involve deploying the LLM via a custom API or EC2 instances, both of which require more coding than using the pre-trained models from SageMaker JumpStart.
54
A company has a machine learning (ML) model that requires nightly execution to predict stock values. The model input is 3 MB of data collected daily, and the prediction process takes less than one minute. Which Amazon SageMaker deployment option best suits these requirements? A. Use a multi-model serverless endpoint. Enable caching. B. Use an asynchronous inference endpoint. Set the InitialInstanceCount parameter to 0. C. Use a real-time endpoint. Configure an auto scaling policy to scale the model to 0 when the model is not in use. D. Use a serverless inference endpoint. Set the MaxConcurrency parameter to 1.
D The correct answer is D because it leverages the cost-effectiveness of serverless inference for a short nightly run: the endpoint incurs compute charges only while processing requests and costs nothing to keep idle the rest of the day. Setting `MaxConcurrency` to 1 limits the endpoint to one request at a time, which matches the once-nightly prediction workload. Options A, B, and C are less suitable: A adds multi-model hosting and caching that a single nightly run doesn't need; B uses asynchronous inference, which is unnecessary for a 3 MB input that completes in under a minute; and C keeps a real-time endpoint plus an auto scaling policy to manage for a workload that doesn't require continuous availability, adding complexity and cost.
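A minimal sketch of the serverless endpoint configuration (the model, config, and endpoint names plus the memory size are hypothetical):

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="nightly-forecast-serverless",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "stock-forecast-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,   # one request at a time is enough for the nightly run
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="nightly-forecast",
    EndpointConfigName="nightly-forecast-serverless",
)
```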
55
A company has an application that uses different APIs to generate embeddings for input text. The company needs to implement a solution to automatically rotate the API tokens every 3 months. Which solution will meet this requirement? A. Store the tokens in AWS Secrets Manager. Create an AWS Lambda function to perform the rotation. B. Store the tokens in AWS Systems Manager Parameter Store. Create an AWS Lambda function to perform the rotation. C. Store the tokens in AWS Key Management Service (AWS KMS). Use an AWS managed key to perform the rotation. D. Store the tokens in AWS Key Management Service (AWS KMS). Use an AWS owned key to perform the rotation.
A AWS Secrets Manager is designed for securely storing and managing sensitive data like API tokens, and it offers built-in automatic rotation capabilities. A Lambda function can be scheduled to trigger the rotation process every 3 months. Option B is incorrect because while AWS Systems Manager Parameter Store can store secrets, it does not have built-in automatic rotation features. Manual intervention would be required to rotate the tokens. Options C and D are incorrect because AWS KMS is primarily for managing encryption keys, not directly for API tokens. While you could potentially use KMS to encrypt the tokens stored elsewhere, it doesn't provide the automatic rotation functionality needed.
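A sketch of enabling the rotation schedule with boto3; the secret name and rotation Lambda ARN are hypothetical, and the Lambda itself is assumed to implement the standard Secrets Manager rotation steps:

```python
import boto3

secrets = boto3.client("secretsmanager")

secrets.rotate_secret(
    SecretId="embedding-api-token",
    RotationLambdaARN="arn:aws:lambda:us-east-1:111122223333:function:rotate-embedding-token",
    RotationRules={"AutomaticallyAfterDays": 90},  # roughly every 3 months
)
```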
56
An ML engineer receives datasets containing missing values, duplicates, and extreme outliers. These datasets need to be consolidated into a single data frame and prepared for machine learning. Which solution best meets these requirements? A. Use Amazon SageMaker Data Wrangler to import the datasets and consolidate them into a single data frame. Use the cleansing and enrichment functionalities to prepare the data. B. Use Amazon SageMaker Ground Truth to import the datasets and consolidate them into a single data frame. Use the human-in-the-loop capability to prepare the data. C. Manually import and merge the datasets. Consolidate the datasets into a single data frame. Use Amazon Q Developer to generate code snippets that will prepare the data. D. Manually import and merge the datasets. Consolidate the datasets into a single data frame. Use Amazon SageMaker data labeling to prepare the data.
A SageMaker Data Wrangler can import multiple datasets, join them into a single data frame, and apply built-in cleansing and enrichment transforms for missing values, duplicates, and outliers through a visual interface. B and D are incorrect because Ground Truth and SageMaker data labeling are for labeling data, not cleansing it. C is incorrect because manually merging the datasets and relying on generated code snippets requires more effort and is less repeatable than Data Wrangler's purpose-built workflow.
57
An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3. The dataset has a class imbalance that affects the learning of the model's algorithm. Additionally, many of the features have interdependencies. The algorithm is not capturing all the desired underlying patterns in the data. The ML engineer needs to use an Amazon SageMaker built-in algorithm to train the model. Which algorithm should the ML engineer use to meet this requirement? A. LightGBM B. Linear learner C. K-means clustering D. Neural Topic Model (NTM)
A. LightGBM LightGBM is the best choice because, as a gradient-boosted tree algorithm, it can compensate for class imbalance through techniques such as class weighting, and its tree-based structure captures the non-linear relationships and feature interdependencies that a linear learner cannot. K-means clustering is an unsupervised algorithm and is not suitable for this supervised fraud detection task. The Neural Topic Model is designed for topic modeling of text, not fraud detection.
58
A company has historical data indicating whether customers required long-term support. They need an ML model to predict if *new* customers will need long-term support. Which modeling approach is most appropriate? A. Anomaly detection B. Linear regression C. Logistic regression D. Semantic segmentation
C Logistic regression is the correct answer because it's designed for binary classification problems – predicting one of two outcomes (yes/no, in this case, whether a customer needs long-term support or not). A is incorrect because anomaly detection identifies unusual data points, not a binary classification prediction. B is incorrect because linear regression predicts continuous values, not categorical outcomes like "yes" or "no." D is incorrect because semantic segmentation is used for image analysis, not customer data prediction.
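A toy scikit-learn illustration of the binary classification framing (the features and labels are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two historical features per customer and a binary label indicating
# whether long-term support was required.
X = np.array([[12, 3], [45, 1], [7, 8], [60, 2], [15, 6], [52, 1]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

# Probability that a new customer will need long-term support.
new_customer = np.array([[40, 2]])
print(clf.predict_proba(new_customer)[0, 1])
```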
59
An ML engineer has developed a binary classification model outside of Amazon SageMaker. The model artifacts are stored in an Amazon S3 bucket. The ML engineer and the Canvas user are part of the same SageMaker domain. The ML engineer needs to make the model accessible to a SageMaker Canvas user for additional tuning. Which combination of requirements must be met so that the ML engineer can share the model with the Canvas user? (Choose two.) A. The ML engineer and the Canvas user must be in separate SageMaker domains. B. The Canvas user must have permissions to access the S3 bucket where the model artifacts are stored. C. The model must be registered in the SageMaker Model Registry. D. The ML engineer must host the model on AWS Marketplace. E. The ML engineer must deploy the model to a SageMaker endpoint.
B and C

The correct answers are B and C:

* **B. The Canvas user must have permissions to access the S3 bucket where the model artifacts are stored:** SageMaker Canvas needs access to the model artifacts to load and use the model. Without permission to the S3 bucket that contains them, the Canvas user cannot work with the model.
* **C. The model must be registered in the SageMaker Model Registry:** Although the artifacts live in S3, SageMaker Canvas requires the model to be registered in the Model Registry so it can be discovered, shared, and managed within the SageMaker environment.

A is incorrect: the ML engineer and the Canvas user are already in the same SageMaker domain, and separate domains would only complicate sharing. D is incorrect: AWS Marketplace is for selling and distributing models publicly, which is unnecessary for sharing a model internally within the same domain. E is incorrect: deploying to a SageMaker endpoint is for serving real-time inference and is not required to share a model with a Canvas user for additional tuning.
60
A company is building a deep learning model on Amazon SageMaker using a large training dataset. They need to optimize the model's hyperparameters to minimize the loss function on the validation dataset while minimizing computation time. Which hyperparameter tuning strategy will accomplish this goal with the LEAST computation time? A. Hyperband B. Grid search C. Bayesian optimization D. Random search
A Hyperband is the correct answer because it is designed for efficient hyperparameter optimization: it allocates small resource budgets to many configurations and aggressively stops underperforming trials early, which minimizes total computation time. Grid search is the most expensive because it exhaustively evaluates every combination. Bayesian optimization is more efficient than grid search and random search but typically trains candidate configurations to completion, so it generally uses more compute than Hyperband. Random search is also less efficient because it neither learns from previous trials nor stops weak trials early.
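A sketch of requesting the Hyperband strategy through the SageMaker Python SDK; the estimator settings, metric regex, hyperparameter ranges, and S3 paths are hypothetical:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

estimator = Estimator(
    image_uri="<training-image-uri>",   # hypothetical
    role="<execution-role-arn>",        # hypothetical
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    metric_definitions=[{"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(32, 256),
    },
    strategy="Hyperband",   # early-stops underperforming trials to save compute
    max_jobs=50,
    max_parallel_jobs=5,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```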
61
A company is planning to use Amazon Redshift ML in its primary AWS account. The source data is in an Amazon S3 bucket in a secondary account. An ML engineer needs to set up an ML pipeline in the primary account to access the S3 bucket in the secondary account. The solution must not require public IPv4 addresses. Which solution will meet these requirements? A. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC with no public access enabled in the primary account. Create a VPC peering connection between the accounts. Update the VPC route tables to remove the route to 0.0.0.0/0. B. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC with no public access enabled in the primary account. Create an AWS Direct Connect connection and a transit gateway. Associate the VPCs from both accounts with the transit gateway. Update the VPC route tables to remove the route to 0.0.0.0/0. C. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC in the primary account. Create an AWS Site-to-Site VPN connection with two encrypted IPsec tunnels between the accounts. Set up interface VPC endpoints for Amazon S3. D. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC in the primary account. Create an S3 gateway endpoint. Update the S3 bucket policy to allow IAM principals from the primary account. Set up interface VPC endpoints for SageMaker and Amazon Redshift.
D The correct answer is D because it keeps all traffic on the AWS network by using VPC endpoints: an S3 gateway endpoint provides private connectivity to the S3 bucket, and interface VPC endpoints (AWS PrivateLink) for SageMaker and Amazon Redshift give private access to those services, so no public IPv4 addresses are required. Updating the S3 bucket policy to allow IAM principals from the primary account grants the necessary cross-account permissions. Option A is incorrect because VPC peering only connects the two VPCs; it does not by itself provide private access to Amazon S3. Option B is unnecessarily complex; AWS Direct Connect and a transit gateway are not required for this scenario. Option C adds a Site-to-Site VPN that must be configured and maintained for all traffic between the accounts, which is less efficient than using VPC endpoints directly.
62
A company is using an AWS Lambda function to monitor the metrics from an ML model. An ML engineer needs to implement a solution to send an email message when the metrics breach a threshold. Which solution will meet this requirement? A. Log the metrics from the Lambda function to AWS CloudTrail. Configure a CloudTrail trail to send the email message. B. Log the metrics from the Lambda function to Amazon CloudFront. Configure an Amazon CloudWatch alarm to send the email message. C. Log the metrics from the Lambda function to Amazon CloudWatch. Configure a CloudWatch alarm to send the email message. D. Log the metrics from the Lambda function to Amazon CloudWatch. Configure an Amazon CloudFront rule to send the email message.
C CloudWatch is the correct service for monitoring metrics and setting up alarms. A CloudWatch alarm can be configured to trigger actions, such as sending an email, when a defined threshold is breached. Option A is incorrect because CloudTrail is for logging API calls, not for monitoring metrics. Option B is incorrect because CloudFront is a content delivery network and not designed for metric monitoring or alerting. Option D is incorrect because CloudFront rules manage content delivery, not alerts.
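A sketch of the two halves of this pattern (the namespace, metric name, threshold, and SNS topic ARN are hypothetical; the topic is assumed to have an email subscription):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Inside the Lambda function: publish the model metric to a custom namespace.
cloudwatch.put_metric_data(
    Namespace="FraudModel",
    MetricData=[{"MetricName": "PredictionDrift", "Value": 0.12, "Unit": "None"}],
)

# One-time setup: alarm that notifies the SNS topic (and its email subscribers)
# when the metric breaches the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="fraud-model-drift-breach",
    Namespace="FraudModel",
    MetricName="PredictionDrift",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:model-alerts"],
)
```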
63
A company has used Amazon SageMaker to deploy a predictive ML model in production. The company is using SageMaker Model Monitor on the model. After a model update, an ML engineer notices data quality issues in the Model Monitor checks. What should the ML engineer do to mitigate the data quality issues that Model Monitor has identified? A. Adjust the model's parameters and hyperparameters. B. Initiate a manual Model Monitor job that uses the most recent production data. C. Create a new baseline from the latest dataset. Update Model Monitor to use the new baseline for evaluations. D. Include additional data in the existing training set for the model. Retrain and redeploy the model.
C A model update can invalidate the baseline that SageMaker Model Monitor compares incoming data against, causing false positives in the data quality checks. Creating a new baseline from the latest dataset (option C) directly addresses this by giving Model Monitor a relevant comparison point for the post-update data. Option A is incorrect because adjusting model parameters and hyperparameters addresses model performance, not the data quality checks that Model Monitor runs. Option B is incorrect because running a manual monitoring job with the latest data doesn't resolve the underlying issue of an outdated baseline. Option D is a more drastic measure that should only be taken if the data quality issues persist after the baseline is refreshed and point to a genuine problem with the training data.
64
A company has an ML model that generates text descriptions based on images that customers upload to the company's website. The images can be up to 50 MB in total size. An ML engineer decides to store the images in an Amazon S3 bucket. The ML engineer must implement a processing solution that can scale to accommodate changes in demand. Which solution will meet these requirements with the LEAST operational overhead? A. Create an Amazon SageMaker batch transform job to process all the images in the S3 bucket. B. Create an Amazon SageMaker Asynchronous Inference endpoint and a scaling policy. Run a script to make an inference request for each image. C. Create an Amazon Elastic Kubernetes Service (Amazon EKS) cluster that uses Karpenter for auto scaling. Host the model on the EKS cluster. Run a script to make an inference request for each image. D. Create an AWS Batch job that uses an Amazon Elastic Container Service (Amazon ECS) cluster. Specify a list of images to process for each AWS Batch job.
B The best answer is B because it leverages the built-in scaling capabilities of Amazon SageMaker's asynchronous inference endpoints. This requires minimal operational overhead as the scaling is managed by AWS. Options A, C, and D require significantly more configuration and management to achieve similar scaling capabilities, resulting in greater operational overhead. Option A (batch transform) isn't designed for real-time or continuously varying demand. Option C (EKS with Karpenter) and D (AWS Batch with ECS) require managing the Kubernetes cluster or ECS cluster, respectively, adding considerable complexity.
65
An ML engineer needs to use AWS services to identify and extract meaningful unique keywords from documents. Which solution will meet these requirements with the LEAST operational overhead? A. Use the Natural Language Toolkit (NLTK) library on Amazon EC2 instances for text pre-processing. Use the Latent Dirichlet Allocation (LDA) algorithm to identify and extract relevant keywords. B. Use Amazon SageMaker and the BlazingText algorithm. Apply custom pre-processing steps for stemming and removal of stop words. Calculate term frequency-inverse document frequency (TF-IDF) scores to identify and extract relevant keywords. C. Store the documents in an Amazon S3 bucket. Create AWS Lambda functions to process the documents and to run Python scripts for stemming and removal of stop words. Use bigram and trigram techniques to identify and extract relevant keywords. D. Use Amazon Comprehend custom entity recognition and key phrase extraction to identify and extract relevant keywords.
D Amazon Comprehend is a managed service, meaning AWS handles the underlying infrastructure and operational overhead. Options A, B, and C require managing EC2 instances, SageMaker endpoints, Lambda functions, and potentially other infrastructure components, increasing operational complexity and overhead. Therefore, Amazon Comprehend (option D) offers the least operational overhead for keyword extraction.
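A minimal sketch of key phrase extraction with the Comprehend API (the sample text is arbitrary):

```python
import boto3

comprehend = boto3.client("comprehend")

text = "Amazon SageMaker simplifies building, training, and deploying machine learning models."

response = comprehend.detect_key_phrases(Text=text, LanguageCode="en")

# Deduplicate the detected phrases to get unique keywords.
keywords = {phrase["Text"] for phrase in response["KeyPhrases"]}
print(keywords)
```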
66
A company needs to give its ML engineers appropriate access to training data. The ML engineers must access training data from only their own business group. The ML engineers must not be allowed to access training data from other business groups. The company uses a single AWS account and stores all the training data in Amazon S3 buckets. All ML model training occurs in Amazon SageMaker. Which solution will provide the ML engineers with the appropriate access? A. Enable S3 bucket versioning. B. Configure S3 Object Lock settings for each user. C. Add cross-origin resource sharing (CORS) policies to the S3 buckets. D. Create IAM policies. Attach the policies to IAM users or IAM roles.
D The correct answer is D because IAM policies offer granular control over access to AWS resources. By creating IAM policies that specifically grant access only to the S3 buckets containing training data for a given business group and attaching these policies to the appropriate IAM users or roles for those ML engineers, the company ensures that engineers only have access to the data they need. Option A (S3 bucket versioning) is incorrect because it manages data versioning, not access control. Option B (S3 Object Lock) is incorrect because it prevents deletion or modification of objects, not access control to them. Option C (CORS policies) is incorrect because it deals with cross-origin requests, which is irrelevant to this access control scenario within a single AWS account.
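A sketch of one such policy, assuming each business group keeps its training data under its own prefix in a shared bucket (the bucket, prefix, role, and policy names are hypothetical):

```python
import json
import boto3

# Scope one business group's role to its own prefix in the shared training bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::training-data-bucket/nlp-group/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::training-data-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["nlp-group/*"]}},
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="nlp-group-sagemaker-role",
    PolicyName="nlp-group-training-data-access",
    PolicyDocument=json.dumps(policy),
)
```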
67
A company needs to host a custom ML model to perform forecast analysis. The forecast analysis will occur with predictable and sustained load during the same 2-hour period every day. Multiple invocations during the analysis period will require quick responses. The company needs AWS to manage the underlying infrastructure and any auto-scaling activities. Which solution will meet these requirements? A. Schedule an Amazon SageMaker batch transform job by using AWS Lambda. B. Configure an Auto Scaling group of Amazon EC2 instances to use scheduled scaling. C. Use Amazon SageMaker Serverless Inference with provisioned concurrency. D. Run the model on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster on Amazon EC2 with pod auto scaling.
C The correct answer is C because Amazon SageMaker Serverless Inference with provisioned concurrency best fits the described needs. The predictable and sustained 2-hour load daily makes serverless a cost-effective choice, eliminating the need to pay for idle resources for the remaining 22 hours. Provisioned concurrency ensures quick responses by pre-warming the necessary resources, meeting the requirement for quick response times during the active period. AWS manages the underlying infrastructure and auto-scaling. Option A is incorrect because batch transform jobs are not ideal for real-time, low-latency predictions needed for quick responses. Option B is inefficient for a predictable 2-hour window; it involves managing and paying for resources that sit idle for the majority of the day. Option D, while capable of handling the load, requires more management overhead than the serverless approach in option C and is less cost-effective for this specific use case.
68
A company's ML engineer has deployed an ML model for sentiment analysis to an Amazon SageMaker endpoint. The ML engineer needs to explain to company stakeholders how the model makes predictions. Which solution will provide an explanation for the model's predictions? A. Use SageMaker Model Monitor on the deployed model. B. Use SageMaker Clarify on the deployed model. C. Show the distribution of inferences from A/B testing in Amazon CloudWatch. D. Add a shadow endpoint. Analyze prediction differences on samples.
B SageMaker Clarify is the correct answer because it's specifically designed to provide explanations for model predictions, including feature importance and bias detection. This directly addresses the need to explain the model's predictions to stakeholders. Option A is incorrect because SageMaker Model Monitor is for monitoring model performance over time, not for explaining individual predictions. Option C is incorrect because A/B testing results show overall performance differences, not explanations of individual predictions. Option D, while potentially useful for comparing models, doesn't directly provide explanations for how a *specific* model makes its predictions.
69
An ML engineer is using Amazon SageMaker to train a deep learning model that requires distributed training. After some training attempts, the ML engineer observes that the instances are not performing as expected due to communication overhead between the training instances. What should the ML engineer do to MINIMIZE the communication overhead between the instances? A. Place the instances in the same VPC subnet. Store the data in a different AWS Region from where the instances are deployed. B. Place the instances in the same VPC subnet but in different Availability Zones. Store the data in a different AWS Region from where the instances are deployed. C. Place the instances in the same VPC subnet. Store the data in the same AWS Region and Availability Zone where the instances are deployed. D. Place the instances in the same VPC subnet. Store the data in the same AWS Region but in a different Availability Zone from where the instances are deployed.
C The correct answer is C because placing the instances and the data in the same Availability Zone minimizes network latency and therefore communication overhead. Options A, B, and D all involve storing data in a different location than the instances, significantly increasing the network distance and thus the communication overhead. Option B also introduces the overhead of inter-AZ communication.
70
A company is running ML models on premises using custom Python scripts, proprietary datasets, and PyTorch. They need to move these models to AWS with the LEAST amount of effort. Which solution best meets these requirements? A. Use SageMaker built-in algorithms to train the proprietary datasets. B. Use SageMaker script mode and premade images for ML frameworks. C. Build a container on AWS that includes custom packages and a choice of ML frameworks. D. Purchase similar production models through AWS Marketplace.
B The best solution is B because it leverages SageMaker's script mode, allowing the company to use their existing custom Python scripts with minimal code changes. The pre-built PyTorch images provided by SageMaker eliminate the need to build and manage a custom container, significantly reducing the effort required for migration. Option A is incorrect because it requires retraining the models using SageMaker's built-in algorithms, which would likely involve substantial modifications to the existing code and potentially lead to performance differences. Option C is incorrect because building a custom container requires significant effort in packaging dependencies, configuring the environment, and testing the deployment. This is more complex than using SageMaker's pre-built images. Option D is incorrect because it involves purchasing entirely new models, rather than migrating the existing ones. This doesn't meet the requirement of moving the *company's* models to AWS.
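A sketch of script mode with the prebuilt PyTorch image, assuming the existing training script can read hyperparameters and data paths from the SageMaker environment (the entry point, role, framework versions, and S3 paths are hypothetical):

```python
from sagemaker.pytorch import PyTorch

# Script mode: reuse the existing train.py largely unchanged on a managed PyTorch image.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",                      # directory containing the existing custom scripts
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    framework_version="2.1",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.xlarge",
    hyperparameters={"epochs": 10, "lr": 1e-3},
)

estimator.fit({"training": "s3://my-bucket/proprietary-dataset/"})
```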
71
A company is using Amazon SageMaker and millions of files (several megabytes each) stored in an Amazon S3 bucket to train an ML model. They need to improve training performance as quickly as possible. Which solution will meet these requirements in the LEAST amount of time? A. Transfer the data to a new S3 bucket that provides S3 Express One Zone storage. Adjust the training job to use the new S3 bucket. B. Create an Amazon FSx for Lustre file system. Link the file system to the existing S3 bucket. Adjust the training job to read from the file system. C. Create an Amazon Elastic File System (Amazon EFS) file system. Transfer the existing data to the file system. Adjust the training job to read from the file system. D. Create an Amazon ElastiCache (Redis OSS) cluster. Link the Redis OSS cluster to the existing S3 bucket. Stream the data from the Redis OSS cluster directly to the training job.
B The best solution is B because it offers the fastest improvement in training performance with minimal data movement. FSx for Lustre is designed for high-performance computing and provides low-latency, high-throughput access to data, significantly speeding up training. Linking the file system to the existing S3 bucket lets it lazy-load the objects, avoiding an up-front transfer of millions of files. Option A requires copying millions of files to a new S3 Express One Zone bucket before any benefit is realized, so it is not the quickest path to better performance. Option C is slower than B because it requires transferring all the data into the EFS file system first, a time-consuming task at this scale, and EFS offers lower throughput than FSx for Lustre. Option D is incorrect because ElastiCache is an in-memory caching service, not a high-performance file system, and it is not suited to streaming training data at this volume.
72
A company wants to develop an ML model using tabular customer data containing ordered features and sensitive information that must not be discarded. Which solution best masks this sensitive data before model development begins? A. Use Amazon Macie to categorize the sensitive data. B. Prepare the data by using AWS Glue DataBrew. C. Run an AWS Batch job to change the sensitive data to random values. D. Run an Amazon EMR job to change the sensitive data to random values.
B AWS Glue DataBrew is the best solution because it's designed for data preparation, specifically handling tabular data. It offers data masking capabilities while preserving the order and structure of the features, unlike options C and D which would likely disrupt the data's integrity. Option A, Amazon Macie, is not a data masking tool; it's a data security service.
73
An ML engineer needs to deploy ML models to get inferences from large datasets in an asynchronous manner. The ML engineer also needs to implement scheduled monitoring of the data quality of the models and receive alerts when changes in data quality occur. Which solution will meet these requirements? A. Deploy the models by using scheduled AWS Glue jobs. Use Amazon CloudWatch alarms to monitor the data quality and to send alerts. B. Deploy the models by using scheduled AWS Batch jobs. Use AWS CloudTrail to monitor the data quality and to send alerts. C. Deploy the models by using Amazon Elastic Container Service (Amazon ECS) on AWS Fargate. Use Amazon EventBridge to monitor the data quality and to send alerts. D. Deploy the models by using Amazon SageMaker batch transform. Use SageMaker Model Monitor to monitor the data quality and to send alerts.
D Amazon SageMaker batch transform is designed for asynchronous inference on large datasets, and SageMaker Model Monitor is purpose-built for scheduled monitoring of the data quality of deployed models, sending alerts when data quality changes. Options A, B, and C are incorrect because they rely on services that are not designed for model data quality monitoring or lack the asynchronous batch processing needed for large datasets: CloudWatch monitors operational metrics rather than data quality, CloudTrail is an API audit trail, and EventBridge is an event bus, not a model monitoring service.
74
An ML engineer normalized training data by using min-max normalization in AWS Glue DataBrew. The ML engineer must normalize the production inference data in the same way as the training data before passing the production inference data to the model for predictions. Which solution will meet this requirement? A. Apply statistics from a well-known dataset to normalize the production samples. B. Keep the min-max normalization statistics from the training set. Use these values to normalize the production samples. C. Calculate a new set of min-max normalization statistics from a batch of production samples. Use these values to normalize all the production samples. D. Calculate a new set of min-max normalization statistics from each production sample. Use these values to normalize all the production samples.
B The correct answer is B because using the same min-max normalization statistics from the training set ensures consistency in data preprocessing between training and inference. This consistency is crucial for accurate model predictions as models are sensitive to data distribution. Options A, C, and D introduce inconsistencies by using different normalization parameters, potentially leading to inaccurate or unreliable predictions. Option A uses external statistics irrelevant to the model's training data. Options C and D recalculate statistics for each batch or sample, introducing variability and affecting model performance.
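A small scikit-learn sketch of the principle: fit the min-max statistics once on the training data, persist them, and only transform production samples (the arrays and file name are made up):

```python
import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [5.0, 400.0], [9.0, 600.0]])
X_prod = np.array([[3.0, 500.0]])

# Training time: learn the per-feature min/max and persist them.
scaler = MinMaxScaler().fit(X_train)
joblib.dump(scaler, "minmax_scaler.joblib")

# Inference time: reuse the saved statistics; never re-fit on production data.
scaler = joblib.load("minmax_scaler.joblib")
print(scaler.transform(X_prod))
```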
75
A company is planning to use Amazon SageMaker to create image-based classification ratings. They have 6 TB of training data stored on an Amazon FSx for NetApp ONTAP system virtual machine (SVM) within the same VPC as SageMaker. An ML engineer needs to make this training data accessible to ML models in the SageMaker environment. Which solution best meets these requirements? A. Mount the FSx for ONTAP file system as a volume to the SageMaker instance. B. Create an Amazon S3 bucket. Use Mountpoint for Amazon S3 to link the S3 bucket to the FSx for NetApp ONTAP file system. C. Create a catalog connection from SageMaker Data Wrangler to the FSx for ONTAP file system. D. Create a direct connection from SageMaker Data Wrangler to the FSx for NetApp ONTAP file system.
A A is correct because mounting the FSx for ONTAP file system directly to the SageMaker instance provides the fastest and most direct access to the training data, eliminating the need for data transfer or intermediary services. This is especially beneficial given the 6TB size of the dataset. B is incorrect because introducing an S3 bucket adds unnecessary complexity and latency. Transferring 6TB of data to S3 and then accessing it would be significantly slower than direct mounting. C and D are incorrect because they involve SageMaker Data Wrangler, a tool primarily for data preparation and transformation, not for directly accessing and mounting large datasets for model training. Using Data Wrangler would add unnecessary steps and potentially increase processing time.
76
A company regularly receives new training data from the vendor of an ML model. The vendor delivers cleaned and prepared data to the company's Amazon S3 bucket every 3-4 days. The company has an Amazon SageMaker pipeline to retrain the model. An ML engineer needs to implement a solution to run the pipeline when new data is uploaded to the S3 bucket. Which solution will meet these requirements with the LEAST operational effort? A. Create an S3 Lifecycle rule to transfer the data to the SageMaker training instance and to initiate training. B. Create an AWS Lambda function that scans the S3 bucket. Program the Lambda function to initiate the pipeline when new data is uploaded. C. Create an Amazon EventBridge rule that has an event pattern that matches the S3 upload. Configure the pipeline as the target of the rule. D. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the pipeline when new data is uploaded.
C The correct answer is C because Amazon EventBridge offers a serverless, event-driven architecture that directly integrates with Amazon S3 and SageMaker pipelines. When new data is uploaded to S3, EventBridge detects the event and automatically triggers the SageMaker pipeline, minimizing operational overhead. Option A is incorrect because S3 Lifecycle rules are primarily designed for data management tasks like archiving or deleting objects, not for triggering workflows. It would require additional components to initiate the SageMaker pipeline. Option B is incorrect because while a Lambda function could monitor the S3 bucket and trigger the pipeline, it involves more operational effort in terms of coding, testing, and maintaining the Lambda function. EventBridge provides a more streamlined solution. Option D is incorrect because Amazon MWAA is a more complex orchestration tool better suited for more intricate and demanding workflows. For simply triggering a pipeline based on S3 uploads, it's an overkill compared to the simplicity and efficiency of EventBridge.
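A sketch of the EventBridge rule and SageMaker pipeline target, assuming EventBridge notifications are enabled on the bucket (the bucket, rule, pipeline, and role names are hypothetical):

```python
import boto3

events = boto3.client("events")

# Match "Object Created" events from the vendor's bucket.
events.put_rule(
    Name="retrain-on-new-training-data",
    EventPattern="""{
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["vendor-training-data"]}}
    }""",
)

# Start the SageMaker pipeline whenever the rule matches.
events.put_targets(
    Rule="retrain-on-new-training-data",
    Targets=[
        {
            "Id": "sagemaker-retraining-pipeline",
            "Arn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/retraining-pipeline",
            "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeSageMakerRole",
            "SageMakerPipelineParameters": {"PipelineParameterList": []},
        }
    ],
)
```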
77
An ML engineer is developing a fraud detection model using the Amazon SageMaker XGBoost algorithm. The model classifies transactions as either fraudulent or legitimate. During testing, the model excels at identifying fraud in the training dataset, but performs poorly when identifying fraud in new, unseen transactions. What should the ML engineer do to improve the fraud detection for new transactions? A. Increase the learning rate. B. Remove some irrelevant features from the training dataset. C. Increase the value of the max_depth hyperparameter. D. Decrease the value of the max_depth hyperparameter.
D The correct answer is D because the problem described is overfitting. The model is too complex and has memorized the training data, leading to poor generalization to unseen data. Decreasing the `max_depth` hyperparameter reduces the complexity of the XGBoost model, preventing it from overfitting and improving its ability to generalize to new transactions. Option A is incorrect because increasing the learning rate can actually worsen overfitting. Option B is incorrect because while feature selection is important for model performance, it doesn't directly address the overfitting problem presented. Option C is incorrect because increasing `max_depth` would further increase the model's complexity and exacerbate the overfitting.
78
A company has a binary classification model in production. An ML engineer needs to develop a new version of the model that maximizes correct predictions of both positive and negative labels. Which metric should the ML engineer use for model recalibration? A. Accuracy B. Precision C. Recall D. Specificity
A. Accuracy Accuracy is the correct answer because it measures the overall proportion of correctly classified instances, encompassing both positive and negative predictions. The problem statement explicitly requires maximizing correct predictions for both labels, which is precisely what accuracy measures. Precision focuses solely on the positive predictions, while recall focuses only on correctly identifying all actual positives. Specificity focuses only on correctly identifying all actual negatives. Therefore, none of these are as suitable as accuracy for balancing the need to correctly classify both positive and negative instances.
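For reference, the four candidate metrics expressed with confusion-matrix counts (true positives $TP$, true negatives $TN$, false positives $FP$, false negatives $FN$); only accuracy rewards correct predictions of both labels:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP}$$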
79
A company wants to reduce the cost of its containerized ML applications. The applications use ML models that run on Amazon EC2 instances, AWS Lambda functions, and an Amazon Elastic Container Service (Amazon ECS) cluster. The EC2 workloads and ECS workloads use Amazon Elastic Block Store (Amazon EBS) volumes to save predictions and artifacts. An ML engineer must identify resources that are being used inefficiently and generate recommendations to reduce the cost of these resources. Which solution will meet these requirements with the LEAST development effort? A. Create code to evaluate each instance's memory and compute usage. B. Add cost allocation tags to the resources. Activate the tags in AWS Billing and Cost Management. C. Check AWS CloudTrail event history for the creation of the resources. D. Run AWS Compute Optimizer.
D AWS Compute Optimizer analyzes CloudWatch utilization metrics for EC2 instances, EBS volumes, Lambda functions, and ECS workloads and automatically generates rightsizing recommendations, so no custom code is required. Writing custom evaluation code (A) adds development effort, cost allocation tags (B) only help attribute spend rather than produce recommendations, and CloudTrail (C) records API activity, not resource utilization.
80
A company has developed a new ML model and needs to perform online validation on 10% of the traffic before full production release. The company uses an Amazon SageMaker endpoint behind an Application Load Balancer (ALB). Which solution provides the required online validation with the LEAST operational overhead? A. Use production variants to add the new model to the existing SageMaker endpoint. Set the variant weight to 0.1 for the new model. Monitor the number of invocations using Amazon CloudWatch. B. Use production variants to add the new model to the existing SageMaker endpoint. Set the variant weight to 1 for the new model. Monitor the number of invocations using Amazon CloudWatch. C. Create a new SageMaker endpoint. Use production variants to add the new model to the new endpoint. Monitor the number of invocations using Amazon CloudWatch. D. Configure the ALB to route 10% of the traffic to the new model at the existing SageMaker endpoint. Monitor the number of invocations using AWS CloudTrail.
A A is correct because SageMaker production variants offer built-in traffic splitting, allowing for easy A/B testing and online model validation with minimal operational overhead. Setting the variant weight to 0.1 directs 10% of traffic to the new model, fulfilling the requirement. CloudWatch is the appropriate service for monitoring invocations. B is incorrect because setting the variant weight to 1 sends all traffic to the new model, bypassing the validation phase. C is incorrect because creating a new SageMaker endpoint increases operational complexity and cost unnecessarily. D is incorrect because configuring ALB routing for traffic splitting is more complex than using SageMaker's built-in functionality; it adds unnecessary operational overhead. Furthermore, CloudTrail is not the ideal service for monitoring model invocations; CloudWatch is better suited for this purpose.
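A rough sketch of the variant-weight setup with boto3 (endpoint, config, and model names are placeholders; both models are assumed to already exist via CreateModel):

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="fraud-endpoint-config-v2",
    ProductionVariants=[
        {
            "VariantName": "current-model",
            "ModelName": "fraud-model-v1",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "InitialVariantWeight": 0.9,
        },
        {
            "VariantName": "candidate-model",
            "ModelName": "fraud-model-v2",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "InitialVariantWeight": 0.1,  # ~10% of traffic for online validation
        },
    ],
)

# Applying the new config to the existing endpoint keeps the same endpoint name.
sm.update_endpoint(
    EndpointName="fraud-endpoint",
    EndpointConfigName="fraud-endpoint-config-v2",
)
```

Because the existing endpoint is updated in place, clients behind the ALB are unaffected while roughly 10% of invocations flow to the candidate variant, whose Invocations metrics can be watched per variant in CloudWatch.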
81
A company wants to host an ML model on Amazon SageMaker. An ML engineer is configuring a continuous integration and continuous delivery (CI/CD) pipeline in AWS CodePipeline to deploy the model. The pipeline must run automatically when new training data for the model is uploaded to an Amazon S3 bucket. Select and order the pipeline's correct steps from the following list. Each step should be selected one time or not at all. (Select and order three.) • An S3 event notification invokes the pipeline when new data is uploaded. • S3 Lifecycle rule invokes the pipeline when new data is uploaded. • SageMaker retrains the model by using the data in the S3 bucket. • The pipeline deploys the model to a SageMaker endpoint. • The pipeline deploys the model to SageMaker Model Registry.
1. An S3 event notification invokes the pipeline when new data is uploaded. 2. SageMaker retrains the model by using the data in the S3 bucket. 3. The pipeline deploys the model to a SageMaker endpoint.
82
An ML engineer is working on an ML model to predict the prices of similarly sized homes. The model will base predictions on several features. The ML engineer will use the following feature engineering techniques to estimate the prices of the homes: • Feature splitting • Logarithmic transformation • One-hot encoding • Standardized distribution Select the correct feature engineering techniques for the following list of features. Each feature engineering technique should be selected one time or not at all. (Select three.) [Image](https://img.examtopics.com/aws-certified-machine-learning-engineer-associate-mla-c01/image9.png)
City: One-hot encoding
Type_year: Feature splitting
Size of the building: Standardized distribution (or logarithmic transformation if the size values are heavily right-skewed; standardization is the better fit when the distribution is roughly normal)
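A small illustrative sketch of the three techniques in pandas/scikit-learn (the column names and sample values are made up to mirror the features in the image):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample of the housing features.
df = pd.DataFrame({
    "City": ["Seattle", "Austin", "Seattle"],
    "Type_year": ["Condo_2015", "House_1998", "House_2020"],
    "Size": [850.0, 2400.0, 12000.0],
})

# One-hot encoding for the categorical City feature.
df = pd.get_dummies(df, columns=["City"], prefix="City")

# Feature splitting: break the combined Type_year field into two columns.
df[["Type", "Year_built"]] = df["Type_year"].str.split("_", expand=True)
df["Year_built"] = df["Year_built"].astype(int)
df = df.drop(columns=["Type_year"])

# Logarithmic transformation compresses a right-skewed size distribution;
# standardization rescales to zero mean and unit variance.
df["Size_log"] = np.log1p(df["Size"])
df["Size_std"] = StandardScaler().fit_transform(df[["Size"]]).ravel()

print(df)
```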
83
A company stores historical data in .csv files in Amazon S3. Only some of the rows and columns in the .csv files are populated. The columns are not labeled. An ML engineer needs to prepare and store the data so that the company can use the data to train ML models. Select and order the correct steps from the following list to perform this task. Each step should be selected one time or not at all. (Select and order three.) • Create an Amazon SageMaker batch transform job for data cleaning and feature engineering. • Store the resulting data back in Amazon S3. • Use Amazon Athena to infer the schemas and available columns. • Use AWS Glue crawlers to infer the schemas and available columns. • Use AWS Glue DataBrew for data cleaning and feature engineering.
1. Use AWS Glue crawlers to infer the schemas and available columns.
2. Use AWS Glue DataBrew for data cleaning and feature engineering.
3. Store the resulting data back in Amazon S3.

This sequence reflects a logical and efficient workflow for preparing the data for ML model training. First, AWS Glue crawlers discover the schema of the unlabeled, partially populated .csv files in S3, producing metadata about the data's structure that the later steps rely on. Next, AWS Glue DataBrew handles the data cleaning and feature engineering; its transformations are well suited to sparsely populated, unlabeled columns. Finally, the cleaned and engineered data is stored back in S3, where it is readily accessible for model training.

The other options are incorrect:

* **Create an Amazon SageMaker batch transform job for data cleaning and feature engineering:** Batch transform applies an already trained model to data; it is not a data preparation tool, and cleaning must happen before training.
* **Use Amazon Athena to infer the schemas and available columns:** Athena can query data, but schema inference over raw S3 files is the job of Glue crawlers, which populate the Data Catalog that other services consume.
84
An ML engineer needs to use Amazon SageMaker Feature Store to create and manage features to train a model. Select and order the three steps from the following list to create and use the features in Feature Store. Each step should be selected one time. • Access the store to build datasets for training. • Create a feature group. • Ingest the records.
1. Create a feature group; 2. Ingest the records; 3. Access the store to build datasets for training.
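A sketch of the three steps with the SageMaker Python SDK, assuming a hypothetical customer feature group, placeholder role, and S3 locations (string columns are cast to the pandas `string` dtype, which the SDK expects when inferring feature definitions):

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

df = pd.DataFrame({
    "customer_id": pd.Series(["c1", "c2"], dtype="string"),
    "spend_30d": [120.5, 87.0],
    "event_time": [time.time(), time.time()],
})

# Step 1: create a feature group from the DataFrame's schema.
fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri="s3://my-bucket/feature-store",          # offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)
while fg.describe().get("FeatureGroupStatus") == "Creating":
    time.sleep(15)

# Step 2: ingest the records.
fg.ingest(data_frame=df, max_workers=1, wait=True)

# Step 3: access the (offline) store to build a training dataset via Athena.
# Offline-store records can take a few minutes to appear after ingestion.
query = fg.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}"',
    output_location="s3://my-bucket/feature-store/query-results",
)
query.wait()
training_df = query.as_dataframe()
```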
85
An ML engineer is building a generative AI application on Amazon Bedrock by using large language models (LLMs). Select the correct generative AI term from the following list for each description. Each term should be selected one time or not at all. (Select three.) • Embedding • Retrieval Augmented Generation (RAG) • Temperature • Token [Image](https://img.examtopics.com/aws-certified-machine-learning-engineer-associate-mla-c01/image7.png)
The correct answer is to match the following descriptions with the corresponding terms:

* **"Represents a unit of text used in processing and generating responses by the model."** — Token
* **"Converts text into vector representations to capture semantic meaning, enhancing the model's ability to understand and generate coherent content."** — Embedding
* **"Combines generated content with retrieved external information to enrich the output."** — Retrieval Augmented Generation (RAG)

Tokens are the basic units of text that LLMs process, embeddings convert text into vectors that capture semantic meaning, and RAG augments generation with retrieved external information. Temperature, which controls output randomness, does not match any of the descriptions in the image.
86
A company has an Amazon S3 bucket that contains 1 TB of files from different sources. The S3 bucket contains the following file types in the same S3 folder: CSV, JSON, XLSX, and Apache Parquet. An ML engineer must implement a solution that uses AWS Glue DataBrew to process the data. The ML engineer also must store the final output in Amazon S3 so that AWS Glue can consume the output in the future. Which solution will meet these requirements? A. Use DataBrew to process the existing S3 folder. Store the output in Apache Parquet format. B. Use DataBrew to process the existing S3 folder. Store the output in AWS Glue Parquet format. C. Separate the data into a different folder for each file type. Use DataBrew to process each folder individually. Store the output in Apache Parquet format. D. Separate the data into a different folder for each file type. Use DataBrew to process each folder individually. Store the output in AWS Glue Parquet format.
A. Use DataBrew to process the existing S3 folder. Store the output in Apache Parquet format. AWS Glue performs best with Parquet files because they are optimized for analytical queries. DataBrew can handle mixed file types within a single folder; therefore, separating the files into different folders is unnecessary and adds extra work. Option B is incorrect because "AWS Glue Parquet format" is not a valid term; Apache Parquet is the correct format. Options C and D are incorrect because they introduce unnecessary complexity by requiring the data to be reorganized into separate folders before processing.
87
A manufacturing company uses an ML model to determine whether products meet a standard for quality. The model produces an output of "Passed" or "Failed." Robots separate the products into the two categories by using the model to analyze photos on the assembly line. Which metrics should the company use to evaluate the model's performance? (Choose two.) A. Precision and recall B. Root mean square error (RMSE) and mean absolute percentage error (MAPE) C. Accuracy and F1 score D. Bilingual Evaluation Understudy (BLEU) score E. Perplexity
A and C

The correct answers are A (Precision and Recall) and C (Accuracy and F1 score). These metrics are appropriate for a binary classification problem (Passed/Failed) where the goal is to assess the model's ability to correctly identify positive and negative instances.

* **A. Precision and Recall:** Fundamental metrics for evaluating a binary classification model. Precision measures the accuracy of positive predictions, while recall measures the model's ability to find all positive instances. Both are crucial for assessing the quality control process.
* **C. Accuracy and F1 score:** Accuracy represents the overall correctness of the model's predictions. The F1 score provides a balanced measure considering both precision and recall, which is valuable when dealing with imbalanced datasets (e.g., if many more products pass than fail).
* **B. RMSE and MAPE:** These metrics are used for regression problems, not classification. They measure the difference between predicted and actual *continuous* values, which isn't applicable here.
* **D. BLEU score:** Used for evaluating machine translation and other natural language processing tasks, not relevant to this quality control scenario.
* **E. Perplexity:** Assesses the performance of language models, again not relevant to this quality control application.
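A tiny worked example with scikit-learn on made-up pass/fail labels (1 = Failed, 0 = Passed), showing how the two chosen metric pairs are computed:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and model predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```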
88
A company shares Amazon SageMaker Studio notebooks accessible through a VPN. The company must prevent malicious actors from exploiting presigned URLs to access these notebooks. Which solution best meets these requirements? A. Set up Studio client IP validation using the `aws:sourceIp` IAM policy condition. B. Set up Studio client VPC validation using the `aws:sourceVpc` IAM policy condition. C. Set up Studio client role endpoint validation using the `aws:PrimaryTag` IAM policy condition. D. Set up Studio client user endpoint validation using the `aws:PrincipalTag` IAM policy condition.
A A is correct because using the `aws:sourceIp` IAM policy condition allows restricting access based on the client's IP address. This is ideal for VPN environments where the IP range of authorized users is known and controlled, thus preventing access from outside the VPN even with a pre-signed URL. B is incorrect because VPC validation checks the source VPC, not the IP address. Pre-signed URLs can still be used outside the VPC. C and D are incorrect because `aws:PrimaryTag` and `aws:PrincipalTag` are not used for IP address validation and are irrelevant to this scenario. They relate to tagging resources and principals, not source IP addresses.
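A sketch of such a policy statement, expressed as a Python dict (the CIDR range is a placeholder for the VPN's egress range; `aws:SourceIp` is the global condition key the answer refers to):

```python
import json

# Hypothetical deny statement: block generation of Studio presigned URLs
# from any address outside the corporate VPN range.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyPresignedUrlOutsideVpn",
            "Effect": "Deny",
            "Action": "sagemaker:CreatePresignedDomainUrl",
            "Resource": "*",
            "Condition": {
                "NotIpAddress": {"aws:SourceIp": ["192.0.2.0/24"]}
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```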
89
A company needs to develop a machine learning model that can identify an item within an image and provide the location of that item. Which Amazon SageMaker algorithm best meets these requirements? A. Image classification B. XGBoost C. Object detection D. K-nearest neighbors (k-NN)
C. Object detection Object detection is the correct answer because it specifically addresses the need to both identify an object *and* locate it within an image using bounding boxes. Image classification only identifies the object, not its location. XGBoost is a gradient boosting algorithm unsuitable for image data. K-nearest neighbors is a classification/regression algorithm, also not designed for object localization within images.
90
An ML engineer needs to encrypt all data in transit when an ML training job runs in Amazon SageMaker. The engineer must ensure that encryption in transit is applied to all processes used during the training job. Which solution will meet these requirements? A. Encrypt communication between nodes for batch processing. B. Encrypt communication between nodes in a training cluster. C. Specify an AWS Key Management Service (AWS KMS) key during creation of the training job request. D. Specify an AWS Key Management Service (AWS KMS) key during creation of the SageMaker domain.
B Encrypting communication between nodes in a training cluster (inter-container traffic encryption) protects data in transit between the instances that run the training job, which covers all processes used during training. AWS KMS keys (C and D) encrypt data at rest, such as volumes and artifacts, not data in transit, and option A applies to batch processing rather than the training cluster.
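In the SageMaker Python SDK this maps to a single estimator flag (shown as a sketch; the container, role, and S3 path are placeholders), which sets EnableInterContainerTrafficEncryption on the underlying CreateTrainingJob call:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=2,                        # distributed training cluster
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
    # Encrypts traffic between the training cluster's nodes while the job runs.
    encrypt_inter_container_traffic=True,
    sagemaker_session=session,
)
```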
91
A company runs training jobs on Amazon SageMaker using a compute-optimized instance. Demand for training runs will remain constant for the next 55 weeks. The instance needs to run for 35 hours each week. The company needs to reduce its model training costs. Which solution will meet these requirements? A. Use a serverless endpoint with a provisioned concurrency of 35 hours for each week. Run the training on the endpoint. B. Use SageMaker Edge Manager for the training. Specify the instance requirement in the edge device configuration. Run the training. C. Use the heterogeneous cluster feature of SageMaker Training. Configure the instance_type, instance_count, and instance_groups arguments to run training jobs. D. Opt in to a SageMaker Savings Plan with a 1-year term and an All Upfront payment. Run a SageMaker Training job on the instance.
D The correct answer is D because SageMaker Savings Plans offer significant discounts (up to 64%) for consistent, long-term usage of SageMaker instances. Given the company's predictable workload of 35 hours per week for 55 weeks, a 1-year Savings Plan with an All Upfront payment provides the most cost-effective solution. Option A is incorrect because serverless endpoints are designed for inference, not training. Option B is incorrect because SageMaker Edge Manager is for deploying models to edge devices, not for running training jobs in the cloud. Option C is incorrect because while it can optimize training jobs, it doesn't address the cost reduction requirement as effectively as a Savings Plan.
92
A company deployed an ML model using the XGBoost algorithm to predict product failures. The model is hosted on an Amazon SageMaker endpoint and trained on normal operating data. An AWS Lambda function provides predictions to the company's application. An ML engineer must implement a solution to detect decreased model accuracy over time using incoming live data. Which solution will meet these requirements? A. Use Amazon CloudWatch to create a dashboard that monitors real-time inference data and model predictions. Use the dashboard to detect drift. B. Modify the Lambda function to calculate model drift by using real-time inference data and model predictions. Program the Lambda function to send alerts. C. Schedule a monitoring job in SageMaker Model Monitor. Use the job to detect drift by analyzing the live data against a baseline of the training data statistics and constraints. D. Schedule a monitoring job in SageMaker Debugger. Use the job to detect drift by analyzing the live data against a baseline of the training data statistics and constraints.
C SageMaker Model Monitor is the best solution because it's designed specifically for monitoring model performance in production and detecting concept drift. It automatically compares live data against the baseline established during training, providing alerts when significant deviations occur. Option A is less suitable because manual detection of drift from a dashboard is less efficient and prone to human error. Option B puts unnecessary load on the Lambda function, which is better suited for prediction delivery. Option D is incorrect because SageMaker Debugger is used for debugging model training, not for ongoing monitoring of model performance in production.
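A sketch of the Model Monitor setup with the SageMaker Python SDK, assuming data capture is already enabled on the endpoint and using placeholder names, paths, and role ARN:

```python
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# 1. Baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
    wait=True,
)

# 2. Scheduled job that compares captured live traffic to the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="xgboost-data-drift",
    endpoint_input="failure-prediction-endpoint",
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```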
93
A company has an ML model using historical transaction data to predict customer behavior. An ML engineer is optimizing this model in Amazon SageMaker to improve its predictive accuracy. The engineer needs to analyze input data and predictions to identify trends that might skew model performance across different demographics. Which solution best provides this analysis? A. Use Amazon CloudWatch to monitor network metrics and CPU metrics for resource optimization during model training. B. Create AWS Glue DataBrew recipes to correct the data based on statistics from the model output. C. Use SageMaker Clarify to evaluate the model and training data for underlying patterns that might affect accuracy. D. Create AWS Lambda functions to automate data pre-processing and to ensure consistent quality of input data for the model.
C SageMaker Clarify is the correct answer because it's designed for bias detection and model explainability. It analyzes both training data and model predictions to pinpoint potential biases and understand how the model impacts different demographic groups. Option A focuses on resource monitoring, not model accuracy or bias. Option B involves data correction after model output, which is not a proactive approach to identifying demographic bias. Option D addresses data preprocessing but doesn't offer the analysis needed to detect demographic skews in the model's performance.
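A sketch of a SageMaker Clarify bias analysis (dataset path, facet/label column names, model name, and role are all hypothetical placeholders):

```python
from sagemaker import Session, clarify

session = Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/report",
    label="churned",
    headers=["age_group", "region", "tenure_months", "churned"],
    dataset_type="text/csv",
)

# Check whether outcomes differ across a demographic facet (here, age_group).
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="age_group",
)

model_config = clarify.ModelConfig(
    model_name="customer-behavior-model",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.5)

# Runs both pre-training (data) and post-training (prediction) bias metrics.
processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
)
```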
94
A company uses 10 Reserved Instances of accelerated instance types to serve the current version of an ML model. An ML engineer needs to deploy a new version of the model to an Amazon SageMaker real-time inference endpoint. The solution must use the original 10 instances to serve both versions of the model. The solution also must include one additional Reserved Instance that is available to use in the deployment process. The transition between versions must occur with no downtime or service interruptions. Which solution will meet these requirements? A. Configure a blue/green deployment with all-at-once traffic shifting. B. Configure a blue/green deployment with canary traffic shifting and a size of 10%. C. Configure a shadow test with a traffic sampling percentage of 10%. D. Configure a rolling deployment with a rolling batch size of 1.
B. Configure a blue/green deployment with canary traffic shifting and a size of 10%. A blue/green deployment with canary traffic shifting lets the new version (green) come up alongside the existing version (blue) using the one additional Reserved Instance, so there is no downtime. Canary shifting moves traffic to the new version gradually; starting with 10% allows monitoring and automatic rollback if issues arise. The other options are incorrect: A shifts all traffic at once, which provides no validation window and more risk; C (shadow testing) only mirrors traffic for comparison and never serves production responses from the new version; and D (a rolling deployment with a batch size of 1) replaces instances one at a time, which is slower and does not make use of the single spare instance as cleanly as the canary approach. Option B uses the 11 instances efficiently (10 serving traffic plus 1 for the deployment) while minimizing the risk of interruption.
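A sketch of the deployment call (endpoint, config, and alarm names are placeholders); the canary size of 10% corresponds to the one spare instance out of the 11 available:

```python
import boto3

sm = boto3.client("sagemaker")

# Shift 10% of capacity to the new endpoint config first, wait, then shift the rest.
sm.update_endpoint(
    EndpointName="prod-endpoint",
    EndpointConfigName="model-v2-config",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
        },
        # Roll back automatically if this CloudWatch alarm fires during the shift.
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "model-v2-error-rate"}]
        },
    },
)
```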
95
An IoT company uses Amazon SageMaker to train and test an XGBoost model for object detection. ML engineers need to monitor performance metrics when they train the model with variants in hyperparameters. The ML engineers also need to send Short Message Service (SMS) text messages after training is complete. Which solution will meet these requirements? A. Use Amazon CloudWatch to monitor performance metrics. Use Amazon Simple Queue Service (Amazon SQS) for message delivery. B. Use Amazon CloudWatch to monitor performance metrics. Use Amazon Simple Notification Service (Amazon SNS) for message delivery. C. Use AWS CloudTrail to monitor performance metrics. Use Amazon Simple Queue Service (Amazon SQS) for message delivery. D. Use AWS CloudTrail to monitor performance metrics. Use Amazon Simple Notification Service (Amazon SNS) for message delivery.
B CloudWatch is the appropriate service for monitoring the performance metrics of the XGBoost model during training in SageMaker. Amazon SNS is designed for message delivery, including SMS, making it suitable for sending notifications upon training completion. Options A, C, and D are incorrect because: A and C incorrectly use SQS, which is a message queuing service, not a message delivery service capable of directly sending SMS messages. D incorrectly uses CloudTrail, which is a logging service, not a performance monitoring service. CloudTrail logs API calls, not model performance metrics.
96
A company is working on an ML project that will include Amazon SageMaker notebook instances. An ML engineer must ensure that the SageMaker notebook instances do not allow root access. Which solution will prevent the deployment of notebook instances that allow root access? A. Use IAM condition keys to stop deployments of SageMaker notebook instances that allow root access. B. Use AWS Key Management Service (AWS KMS) keys to stop deployments of SageMaker notebook instances that allow root access. C. Monitor resource creation by using Amazon EventBridge events. Create an AWS Lambda function that deletes all deployed SageMaker notebook instances that allow root access. D. Monitor resource creation by using AWS CloudFormation events. Create an AWS Lambda function that deletes all deployed SageMaker notebook instances that allow root access.
A A is correct because IAM condition keys provide a preventative control. By using the `sagemaker:RootAccess` condition key in an IAM policy, you can prevent the creation of SageMaker notebook instances with root access enabled. This stops the problem before it occurs. B is incorrect because AWS KMS is for managing encryption keys, not controlling access to SageMaker resources. C and D are incorrect because they are reactive solutions. While they would eventually delete non-compliant instances, they allow root access instances to be created first, creating a security vulnerability during the time between creation and deletion. A preventative solution is far more secure.
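A sketch of such a deny statement as a Python dict, using the `sagemaker:RootAccess` service-specific condition key (statement ID and scope are illustrative):

```python
import json

# Hypothetical deny statement: refuse notebook instance creation or update
# unless root access is disabled.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyRootAccessNotebooks",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateNotebookInstance",
                "sagemaker:UpdateNotebookInstance",
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {"sagemaker:RootAccess": "Enabled"}
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```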
97
A company is using Amazon SageMaker to develop ML models. The company stores sensitive training data in an Amazon S3 bucket. The model training must have network isolation from the internet. Which solution will meet this requirement? A. Run the SageMaker training jobs in private subnets. Create a NAT gateway. Route traffic for training through the NAT gateway. B. Run the SageMaker training jobs in private subnets. Create an S3 gateway VPC endpoint. Route traffic for training through the S3 gateway VPC endpoint. C. Run the SageMaker training jobs in public subnets that have an attached security group. In the security group, use inbound rules to limit traffic from the internet. Encrypt SageMaker instance storage by using server-side encryption with AWS KMS keys (SSE-KMS). D. Encrypt traffic to Amazon S3 by using a bucket policy that includes a value of True for the aws:SecureTransport condition key. Use default at-rest encryption for Amazon S3. Encrypt SageMaker instance storage by using server-side encryption with AWS KMS keys (SSE-KMS).
B The correct answer is B because it uses private subnets and an S3 gateway VPC endpoint. Private subnets prevent direct internet access. The S3 gateway endpoint allows communication with S3 without traversing the public internet, ensuring network isolation. Option A is incorrect because while using private subnets is a good start, a NAT gateway still requires internet access to route traffic, defeating the purpose of network isolation. Option C is incorrect because it uses public subnets, which directly contradicts the requirement for network isolation. Even though inbound rules limit traffic, the instances are still accessible from the internet. Option D is incorrect because it focuses only on encryption in transit and at rest, addressing data security but not network isolation. While encryption is important, it doesn't prevent network access.
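A rough sketch of the two pieces (all IDs, ARNs, and bucket names below are placeholders): a gateway endpoint for S3 attached to the VPC's route tables, and a training job launched into the private subnets via the estimator's VPC settings:

```python
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# 1. Gateway endpoint keeps S3 traffic on the AWS network.
ec2 = boto3.client("ec2")
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# 2. Launch the training job inside the private subnets (VpcConfig).
session = sagemaker.Session()
estimator = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://sensitive-training-bucket/output",
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
    sagemaker_session=session,
)
```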
98
A company needs to use Retrieval Augmented Generation (RAG) to supplement an open source large language model (LLM) that runs on Amazon Bedrock. The company's data for RAG is a set of documents in an Amazon S3 bucket. The documents consist of .csv files and .docx files. Which solution will meet these requirements with the LEAST operational overhead? A. Create a pipeline in Amazon SageMaker Pipelines to generate a new model. Call the new model from Amazon Bedrock to perform RAG queries. B. Convert the data into vectors. Store the data in an Amazon Neptune database. Connect the database to Amazon Bedrock. Call the Amazon Bedrock API to perform RAG queries. C. Fine-tune an existing LLM by using an AutoML job in Amazon SageMaker. Configure the S3 bucket as a data source for the AutoML job. Deploy the LLM to a SageMaker endpoint. Use the endpoint to perform RAG queries. D. Create a knowledge base for Amazon Bedrock. Configure a data source that references the S3 bucket. Use the Amazon Bedrock API to perform RAG queries.
D The correct answer is D because it directly leverages Amazon Bedrock's built-in RAG capabilities. Options A, B, and C require significant additional steps and infrastructure setup, increasing operational overhead. A requires creating and managing a new model pipeline. B necessitates vectorizing data and managing a Neptune database. C involves fine-tuning a model, deploying it to a SageMaker endpoint, and managing that infrastructure. Option D is the simplest and most efficient approach for integrating the S3 data with the Bedrock LLM for RAG.
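Assuming a knowledge base has already been created with the S3 bucket as its data source, querying it is a single API call. The knowledge base ID and model ARN below are hypothetical placeholders, and the exact request shape should be checked against the current Bedrock Agent Runtime documentation:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.retrieve_and_generate(
    input={"text": "Summarize last quarter's customer escalations."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

# Generated answer plus the retrieved source passages it was grounded on.
print(response["output"]["text"])
for citation in response.get("citations", []):
    print(citation)
```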
99
A company plans to deploy an ML model for production inference on an Amazon SageMaker endpoint. The average inference payload size will vary from 100 MB to 300 MB. Inference requests must be processed in 60 minutes or less. Which SageMaker inference option will meet these requirements? A. Serverless inference B. Asynchronous inference C. Real-time inference D. Batch transform
B. Asynchronous inference Asynchronous inference is the best option because it supports large payloads (up to 1 GB) and long processing times (up to one hour per request), which covers both the 100-300 MB payloads and the 60-minute processing requirement. The other options are unsuitable: real-time inference limits payloads to about 6 MB with a 60-second timeout, serverless inference limits payloads to about 4 MB, and batch transform is designed for offline processing of entire datasets rather than individual inference requests that need a response within a bounded time.
100
A company receives daily .csv files about customer interactions with its ML model. The company stores the files in Amazon S3 and uses the files to retrain the model. An ML engineer needs to implement a solution to mask credit card numbers in the files before the model is retrained. Which solution will meet this requirement with the LEAST development effort? A. Create a discovery job in Amazon Macie. Configure the job to find and mask sensitive data. B. Create Apache Spark code to run on an AWS Glue job. Use the Sensitive Data Detection functionality in AWS Glue to find and mask sensitive data. C. Create Apache Spark code to run on an AWS Glue job. Program the code to perform a regex operation to find and mask sensitive data. D. Create Apache Spark code to run on an Amazon EC2 instance. Program the code to perform an operation to find and mask sensitive data.
B The best answer is B because AWS Glue's built-in Sensitive Data Detection functionality directly addresses the need to identify and mask PII, including credit card numbers, minimizing development effort. Option A is incorrect because Amazon Macie is primarily a discovery service, not a transformation service; it identifies sensitive data but doesn't automatically mask it. Options C and D require writing custom code to implement the masking logic, increasing development time and effort compared to using Glue's built-in functionality. Option D also introduces the overhead of managing an EC2 instance.
101
A medical company is using AWS to build a tool to recommend treatments for patients. The company has obtained health records and self-reported textual information in English from patients. The company needs to use this information to gain insight about the patients. Which solution will meet this requirement with the LEAST development effort? A. Use Amazon SageMaker to build a recurrent neural network (RNN) to summarize the data. B. Use Amazon Comprehend Medical to summarize the data. C. Use Amazon Kendra to create a quick-search tool to query the data. D. Use the Amazon SageMaker Sequence-to-Sequence (seq2seq) algorithm to create a text summary from the data.
B Amazon Comprehend Medical is a fully managed NLP service that extracts insights (entities, relationships, and traits) from unstructured English clinical text, so it meets the requirement with no model development. Building an RNN in SageMaker (A) or training a seq2seq model (D) requires substantial development effort, and Amazon Kendra (C) provides intelligent search rather than extracting insights from the text.
102
A company needs to extract entities from a PDF document to build a classifier model. Which solution will extract and store the entities in the LEAST amount of time? A. Use Amazon Comprehend to extract the entities. Store the output in Amazon S3. B. Use an open source AI optical character recognition (OCR) tool on Amazon SageMaker to extract the entities. Store the output in Amazon S3. C. Use Amazon Textract to extract the entities. Use Amazon Comprehend to convert the entities to text. Store the output in Amazon S3. D. Use Amazon Textract integrated with Amazon Augmented AI (Amazon A2I) to extract the entities. Store the output in Amazon S3.
A A is correct because Amazon Comprehend can directly extract entities from PDF documents, making it the fastest solution. Options B and C introduce extra steps (using an open-source OCR tool or a two-step process with Textract and Comprehend respectively) which increase processing time. Option D adds the human-in-the-loop element of Amazon A2I, significantly increasing processing time.
103
An ML engineer has deployed an Amazon SageMaker model to a serverless endpoint in production. The model is invoked by the InvokeEndpoint API operation. The model's latency in production is higher than the baseline latency in the test environment. The ML engineer thinks that the increase in latency is because of model startup time. What should the ML engineer do to confirm or deny this hypothesis? A. Schedule a SageMaker Model Monitor job. Observe metrics about model quality. B. Schedule a SageMaker Model Monitor job with Amazon CloudWatch metrics enabled. C. Enable Amazon CloudWatch metrics. Observe the ModelSetupTime metric in the SageMaker namespace. D. Enable Amazon CloudWatch metrics. Observe the ModelLoadingWaitTime metric in the SageMaker namespace.
C The correct answer is C because ModelSetupTime directly measures the time it takes to launch the compute resources for a serverless endpoint, which is the key factor contributing to increased latency due to model startup time (cold starts). Options A and B are incorrect because they don't directly address model startup time; they focus on model quality and don't pinpoint the source of latency. Option D is incorrect because ModelLoadingWaitTime is relevant for multi-model endpoints, not single-model serverless endpoints as described in the question.
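A sketch of pulling the metric with boto3 (the endpoint name is a placeholder, and the dimension names assume the standard SageMaker endpoint invocation dimensions):

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelSetupTime",
    Dimensions=[
        {"Name": "EndpointName", "Value": "serverless-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)

# High or frequent ModelSetupTime values point to cold-start overhead.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```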
104
A company is building a real-time data processing pipeline for an ecommerce application. The application generates a high volume of clickstream data that must be ingested, processed, and visualized in near real time. The company needs a solution that supports SQL for data processing and Jupyter notebooks for interactive analysis. Which solution will meet these requirements? A. Use Amazon Data Firehose to ingest the data. Create an AWS Lambda function to process the data. Store the processed data in Amazon S3. Use Amazon QuickSight to visualize the data. B. Use Amazon Kinesis Data Streams to ingest the data. Use Amazon Data Firehose to transform the data. Use Amazon Athena to process the data. Use Amazon QuickSight to visualize the data. C. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to ingest the data. Use AWS Glue with PySpark to process the data. Store the processed data in Amazon S3. Use Amazon QuickSight to visualize the data. D. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to ingest the data. Use Amazon Managed Service for Apache Flink to process the data. Use the built-in Flink dashboard to visualize the data.
D
105
An ML engineer needs to use metrics to assess the quality of a time-series forecasting model. Which metrics apply to this model? (Choose two.) A. Recall B. LogLoss C. Root mean square error (RMSE) D. Inference Latency E. Average weighted quantile loss (wQL)
C and E Root mean square error (RMSE) and average weighted quantile loss (wQL) are standard accuracy metrics for time-series forecasting models (Amazon Forecast, for example, reports both). Recall (A) and LogLoss (B) are classification metrics, and inference latency (D) measures serving performance rather than forecast quality.
106
A company runs Amazon SageMaker ML models that use accelerated instances. The models require real-time responses. Each model has different scaling requirements. The company must not allow a cold start for the models. Which solution will meet these requirements? A. Create a SageMaker Serverless Inference endpoint for each model. Use provisioned concurrency for the endpoints. B. Create a SageMaker Asynchronous Inference endpoint for each model. Create an auto scaling policy for each endpoint. C. Create a SageMaker endpoint. Create an inference component for each model. In the inference component settings, specify the newly created endpoint. Create an auto scaling policy for each inference component. Set the parameter for the minimum number of copies to at least 1. D. Create an Amazon S3 bucket. Store all the model artifacts in the S3 bucket. Create a SageMaker multi-model endpoint. Point the endpoint to the S3 bucket. Create an auto scaling policy for the endpoint. Set the parameter for the minimum number of copies to at least 1.
C Explanation: Option C is correct because it leverages SageMaker inference components, allowing independent scaling for each model hosted on a single endpoint. Setting the minimum number of copies to at least 1 ensures no cold starts. Option A (Serverless Inference) is unsuitable for real-time requirements due to potential latency. Option B (Asynchronous Inference) is not suitable for real-time responses. Option D (Multi-model endpoint with S3) doesn't offer independent scaling for each model and may introduce cold starts.
107
A company uses Amazon SageMaker for its ML process. A compliance audit discovers that an Amazon S3 bucket for training data uses server-side encryption with S3 managed keys (SSE-S3). The company requires customer managed keys. An ML engineer changes the S3 bucket to use server-side encryption with AWS KMS keys (SSE-KMS). The ML engineer makes no other configuration changes. After the change to the encryption settings, SageMaker training jobs start to fail with AccessDenied errors. What should the ML engineer do to resolve this problem? A. Update the IAM policy that is attached to the execution role for the training jobs. Include the s3:ListBucket and s3:GetObject permissions. B. Update the S3 bucket policy that is attached to the S3 bucket. Set the value of the aws:SecureTransport condition key to True. C. Update the IAM policy that is attached to the execution role for the training jobs. Include the kms:Encrypt and kms:Decrypt permissions. D. Update the IAM policy that is attached to the user that created the training jobs. Include the kms:CreateGrant permission.
C After the bucket switches to SSE-KMS with a customer managed key, the training job's execution role must also be allowed to use that KMS key; adding kms:Encrypt and kms:Decrypt (and typically kms:GenerateDataKey) to the role's policy resolves the AccessDenied errors. The S3 permissions in option A were already in place before the change, option B's aws:SecureTransport condition relates to HTTPS rather than KMS, and option D targets the wrong principal, since the jobs run under the execution role, not the user. A sketch of the added policy statement follows.
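The statement below is a sketch to attach to the execution role's policy; the key ARN is a placeholder, and kms:GenerateDataKey and kms:DescribeKey are commonly included alongside the two permissions named in the answer when the job also writes SSE-KMS objects back to S3:

```python
import json

kms_statement = {
    "Effect": "Allow",
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey",
    ],
    # Customer managed key used for SSE-KMS on the training data bucket.
    "Resource": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
}

print(json.dumps(kms_statement, indent=2))
```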
108
A company needs an AWS solution that will automatically create versions of ML models as the models are created. Which solution will meet this requirement? A. Amazon Elastic Container Registry (Amazon ECR) B. Model packages from Amazon SageMaker Marketplace C. Amazon SageMaker ML Lineage Tracking D. Amazon SageMaker Model Registry
D. Amazon SageMaker Model Registry Amazon SageMaker Model Registry provides automatic versioning of ML models, which directly addresses the company's requirement. Option A, Amazon ECR, is for storing and managing container images, not ML models. Option B, Model packages from Amazon SageMaker Marketplace, offers pre-trained models but doesn't inherently provide automatic versioning. Option C, Amazon SageMaker ML Lineage Tracking, tracks model lineage and dependencies, but doesn't automatically create versions of the models themselves.
109
An ML engineer notices class imbalance in an image classification training job. What should the ML engineer do to resolve this issue? A. Reduce the size of the dataset. B. Transform some of the images in the dataset. C. Apply random oversampling on the dataset. D. Apply random data splitting on the dataset.
C. Apply random oversampling on the dataset. This is the correct answer because class imbalance in a dataset means that some classes have significantly more examples than others. Oversampling artificially increases the number of instances in the minority classes, balancing the dataset and improving the model's ability to learn from underrepresented classes. A is incorrect because reducing the dataset size would likely exacerbate the class imbalance problem, not solve it. B is incorrect because transforming images might help with other issues (e.g., improving image quality), but it doesn't directly address the class imbalance. D is incorrect because random data splitting is for creating training, validation, and testing sets, and doesn't modify class distribution within the dataset.
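A small illustrative sketch of random oversampling on an image-metadata table (the file names and labels are made up); libraries such as imbalanced-learn offer the same idea as a ready-made RandomOverSampler:

```python
import pandas as pd

# Hypothetical image-metadata frame with an imbalanced "label" column.
df = pd.DataFrame({
    "image_path": [f"img_{i}.jpg" for i in range(10)],
    "label": ["cat"] * 8 + ["dog"] * 2,
})

# Randomly oversample every minority class up to the majority class count.
max_count = df["label"].value_counts().max()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(max_count, replace=True, random_state=42))
      .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```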
110
An ML engineer needs to merge and transform data from two sources to retrain an existing ML model. One data source consists of .csv files stored in an Amazon S3 bucket (millions of records per file). The other data source is an Amazon Aurora DB cluster. The merged and transformed data must be written to a second S3 bucket weekly. Which solution offers the LEAST operational overhead? A. Create a transient Amazon EMR cluster every week. Use the cluster to run an Apache Spark job to merge and transform the data. B. Create a weekly AWS Glue job that uses the Apache Spark engine. Use DynamicFrame native operations to merge and transform the data. C. Create an AWS Lambda function that runs Apache Spark code every week to merge and transform the data. Configure the Lambda function to connect to the initial S3 bucket and the DB cluster. D. Create an AWS Batch job that runs Apache Spark code on Amazon EC2 instances every week. Configure the Spark code to save the data from the EC2 instances to the second S3 bucket.
B The correct answer is B because AWS Glue is a fully managed ETL service. It handles scheduling, resource management, and integration with S3 and Aurora, minimizing operational overhead compared to managing EMR clusters (A), Lambda functions (C), or AWS Batch jobs (D). Options A, C, and D require more manual configuration and management of infrastructure, increasing operational overhead. Lambda (C) especially has limitations on execution time and memory that might be problematic for large datasets.
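A sketch of what the Glue job script might look like (database, table, key, and bucket names are placeholders; the Aurora table is assumed to be cataloged through a JDBC connection, and the weekly cadence would come from a Glue schedule trigger):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source 1: the .csv files in S3, cataloged by a crawler.
csv_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="s3_transactions_csv"
)

# Source 2: the Aurora table, cataloged through a JDBC connection.
aurora_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="aurora_customers"
)

# Merge on the shared key and write the result to the second bucket as Parquet.
merged = Join.apply(csv_dyf, aurora_dyf, "customer_id", "customer_id")
glue_context.write_dynamic_frame.from_options(
    frame=merged,
    connection_type="s3",
    connection_options={"path": "s3://merged-output-bucket/weekly/"},
    format="parquet",
)

job.commit()
```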
111
An ML engineer needs to ensure that a dataset complies with regulations for personally identifiable information (PII). The ML engineer will use the data to train an ML model on Amazon SageMaker instances. SageMaker must not use any of the PII. Which solution will meet these requirements in the MOST operationally efficient way? A. Use the Amazon Comprehend DetectPiiEntities API call to redact the PII from the data. Store the data in an Amazon S3 bucket. Access the S3 bucket from the SageMaker instances for model training. B. Use the Amazon Comprehend DetectPiiEntities API call to redact the PII from the data. Store the data in an Amazon Elastic File System (Amazon EFS) file system. Mount the EFS file system to the SageMaker instances for model training. C. Use AWS Glue DataBrew to cleanse the dataset of PII. Store the data in an Amazon Elastic File System (Amazon EFS) file system. Mount the EFS file system to the SageMaker instances for model training. D. Use Amazon Macie for automatic discovery of PII in the data. Remove the PII. Store the data in an Amazon S3 bucket. Mount the S3 bucket to the SageMaker instances for model training.
A The most operationally efficient solution is A. Amazon S3 is designed for scalability and performance when accessing large datasets, making it ideal for serving data to SageMaker instances. Directly accessing S3 from SageMaker is more efficient than mounting an EFS file system (options B and C), which adds network latency and complexity. Option D is less efficient because while Amazon Macie can discover PII, it doesn't inherently remove it, so an additional step is required before storage and access. Option A uses Comprehend, which is designed for PII detection, to redact the data before it is stored in S3, creating a streamlined process. Options B and C introduce the overhead of EFS, which is unnecessary for this task.
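A minimal sketch of offset-based redaction with the synchronous DetectPiiEntities API (for whole files, the asynchronous PII detection job would be the closer fit; the sample text is made up):

```python
import boto3

comprehend = boto3.client("comprehend")

def redact_pii(text: str) -> str:
    """Replace each detected PII span with its entity type, e.g. [NAME]."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Redact from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (
            text[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + text[entity["EndOffset"] :]
        )
    return text

print(redact_pii("Contact Jane Doe at jane@example.com or 555-0100."))
```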
112
A company must install a custom script on any newly created Amazon SageMaker notebook instance. Which solution will meet this requirement with the LEAST operational overhead? A. Create a lifecycle configuration script to install the custom script when a new SageMaker notebook is created. Attach the lifecycle configuration to every new SageMaker notebook as part of the creation steps. B. Create a custom Amazon Elastic Container Registry (Amazon ECR) image that contains the custom script. Push the ECR image to a Docker registry. Attach the Docker image to a SageMaker Studio domain. Select the kernel to run as part of the SageMaker notebook. C. Create a custom package index repository. Use AWS CodeArtifact to manage the installation of the custom script. Set up AWS PrivateLink endpoints to connect CodeArtifact to the SageMaker instance. Install the script. D. Store the custom script in Amazon S3. Create an AWS Lambda function to install the custom script on new SageMaker notebooks. Configure Amazon EventBridge to invoke the Lambda function when a new SageMaker notebook is initialized.
A A is the correct answer because lifecycle configurations are designed specifically for automating tasks during the creation and modification of SageMaker notebook instances. This method directly addresses the requirement with minimal additional infrastructure or management. B is incorrect because using ECR and Docker images introduces unnecessary complexity for a simple script installation. It adds the overhead of managing container images and potentially increases startup time. C is incorrect because using CodeArtifact and PrivateLink introduces significant complexity and operational overhead for managing a simple script installation. This solution is overkill for the problem. D is incorrect because using Lambda and EventBridge adds more moving parts and introduces latency compared to the direct approach of lifecycle configurations. It creates a more complex system to manage and increases potential points of failure.
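A sketch of creating and attaching a lifecycle configuration with boto3 (the script contents, names, and role ARN are placeholders; the script content must be base64-encoded):

```python
import base64
import boto3

sm = boto3.client("sagemaker")

# Placeholder for the company's custom installation script.
on_create_script = """#!/bin/bash
set -e
echo "installing custom tooling" >> /home/ec2-user/setup.log
"""

sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-custom-script",
    OnCreate=[{"Content": base64.b64encode(on_create_script.encode()).decode()}],
)

# Reference the lifecycle configuration when creating each notebook instance.
sm.create_notebook_instance(
    NotebookInstanceName="team-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    LifecycleConfigName="install-custom-script",
)
```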
113
A medical company needs to store clinical data that includes personally identifiable information (PII) and protected health information (PHI). An ML engineer needs to implement a solution to ensure that the PII and PHI are not used to train ML models. Which solution will meet these requirements? A. Store the clinical data in Amazon S3 buckets. Use AWS Glue DataBrew to mask the PII and PHI before the data is used for model training. B. Upload the clinical data to an Amazon Redshift database. Use built-in SQL stored procedures to automatically classify and mask the PII and PHI before the data is used for model training. C. Use Amazon Comprehend to detect and mask the PII before the data is used for model training. Use Amazon Comprehend Medical to detect and mask the PHI before the data is used for model training. D. Create an AWS Lambda function to encrypt the PII and PHI. Program the Lambda function to save the encrypted data to an Amazon S3 bucket for model training.
C C is correct because Amazon Comprehend and Amazon Comprehend Medical are specifically designed to identify and mask PII and PHI respectively. This directly addresses the requirement of preventing the use of this sensitive data in model training. A is incorrect because while DataBrew can perform data masking, it's not specifically designed for identifying PII and PHI with the same accuracy and precision as Comprehend and Comprehend Medical. It relies on user-defined rules and might miss some instances. B is incorrect because while Redshift can store data and potentially have stored procedures for masking, it doesn't inherently possess the capabilities of specialized services like Comprehend and Comprehend Medical for reliably identifying and masking PII and PHI. D is incorrect because encrypting the data prevents direct access to PII and PHI, but it doesn't prevent metadata leakage or the possibility of unintended data exposure during the model training process if the encrypted data is still used. The question specifies that PII and PHI should *not* be used for training.