Practice Questions - Amazon AWS Certified Machine Learning Engineer - Associate MLA-C01 Flashcards
(113 cards)
An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3. Which AWS service or feature can aggregate the data from the various data sources?
A. Amazon EMR Spark jobs
B. Amazon Kinesis Data Streams
C. Amazon DynamoDB
D. AWS Lake Formation
A
Amazon EMR with Spark is the most suitable option for aggregating data from diverse sources like Amazon S3 and an on-premises MySQL database. Spark’s ability to handle both structured and unstructured data makes it well-suited for this task. While AWS Lake Formation manages data lakes, it doesn’t inherently provide the ETL (extract, transform, load) and data processing capabilities needed to aggregate and transform data from multiple sources. Amazon Kinesis Data Streams is designed for real-time data streaming, not batch processing of data for model training. Amazon DynamoDB is a NoSQL database and is not a tool for aggregating data from multiple sources.
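A minimal PySpark sketch of the kind of aggregation an EMR Spark job could run, assuming a JDBC connection to the on-premises MySQL database. The bucket names, hostnames, table names, and credentials below are placeholders, not values from the question.

```python
# Illustrative PySpark sketch: join S3 data with an on-premises MySQL table over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fraud-data-aggregation").getOrCreate()

# Transaction logs and customer profiles already stored in Amazon S3
transactions = spark.read.json("s3://example-bucket/transaction-logs/")
profiles = spark.read.parquet("s3://example-bucket/customer-profiles/")

# Tables from the on-premises MySQL database, read over JDBC
# (requires the MySQL JDBC driver on the EMR cluster and network
# connectivity to the data center, for example over Direct Connect or VPN).
accounts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://onprem-mysql.example.internal:3306/frauddb")
    .option("dbtable", "accounts")
    .option("user", "etl_user")
    .option("password", "REDACTED")
    .load()
)

# Aggregate the three sources into a single training dataset
training_df = (
    transactions.join(profiles, "customer_id", "left")
    .join(accounts, "customer_id", "left")
)
training_df.write.parquet("s3://example-bucket/training-dataset/")
```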
A company with hundreds of data scientists uses Amazon SageMaker to create ML models stored in model groups within the SageMaker Model Registry. Data scientists are categorized into three groups: computer vision, natural language processing (NLP), and speech recognition. An ML engineer needs a solution to organize these existing models by category to improve discoverability at scale, without altering the model artifacts or their current groupings. Which solution best meets these requirements?
A. Create a custom tag for each of the three categories. Add the tags to the model packages in the SageMaker Model Registry.
B. Create a model group for each category. Move the existing models into these category model groups.
C. Use SageMaker ML Lineage Tracking to automatically identify and tag which model groups should contain the models.
D. Create a Model Registry collection for each of the three categories. Move the existing model groups into the collections.
D
D is correct because creating Model Registry collections allows for organizing existing model groups without modifying the underlying model artifacts in Amazon S3 and Amazon ECR. This maintains the integrity of the models and their existing structure while improving discoverability at scale by grouping them into relevant categories.
A is incorrect because while tags can provide metadata, they are not as effective for large-scale organization as collections, which are specifically designed for grouping model groups.
B is incorrect because moving models to new model groups would alter the existing model groupings, violating the requirement to not affect the integrity of the model artifacts and their existing groupings.
C is incorrect because ML Lineage Tracking focuses on tracking model lineage and not on the organization and grouping of models at a higher level.
A company has trained and deployed an ML model using Amazon SageMaker. The company needs to implement a solution to record and monitor all the API call events for the SageMaker endpoint. The solution must also provide a notification when the number of API call events breaches a threshold. Which solution will meet these requirements?
A. Use SageMaker Debugger to track the inferences and to report metrics. Create a custom rule to provide a notification when the threshold is breached.
B. Use SageMaker Debugger to track the inferences and to report metrics. Use the tensor_variance built-in rule to provide a notification when the threshold is breached.
C. Log all the endpoint invocation API events by using AWS CloudTrail. Use an Amazon CloudWatch dashboard for monitoring. Set up a CloudWatch alarm to provide notification when the threshold is breached.
D. Add the Invocations metric to an Amazon CloudWatch dashboard for monitoring. Set up a CloudWatch alarm to provide notification when the threshold is breached.
C
The correct answer is C because it uses the most appropriate AWS services to meet all the stated requirements. CloudTrail logs all API calls, including SageMaker endpoint invocations, fulfilling the requirement to record all events. CloudWatch dashboards can then monitor these logs, and a CloudWatch alarm can provide notifications when a threshold is breached.
Option A is incorrect because SageMaker Debugger is primarily for debugging model training and inference quality, not for comprehensive API call event logging. Option B is also incorrect because the tensor_variance rule is not relevant to API call event monitoring. Option D is incorrect because while it uses CloudWatch to monitor invocations and set up alarms, it doesn’t provide a solution for recording all API call events; CloudWatch only monitors what’s already being tracked. CloudTrail provides the necessary comprehensive logging.
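A hedged boto3 sketch of the alarm side of option C, assuming CloudTrail is already configured to deliver the relevant SageMaker events to a CloudWatch Logs group. The log group name, filter pattern, namespace, threshold, and SNS topic ARN are placeholders.

```python
# Illustrative sketch: count SageMaker endpoint invocation events recorded by
# CloudTrail in a CloudWatch Logs group, then alarm when the count breaches a threshold.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Metric filter over the CloudTrail log group (assumes the trail sends
# the relevant SageMaker events to this log group).
logs.put_metric_filter(
    logGroupName="CloudTrail/DefaultLogGroup",
    filterName="SageMakerInvokeEndpointCount",
    filterPattern='{ $.eventName = "InvokeEndpoint" }',
    metricTransformations=[
        {
            "metricName": "InvokeEndpointEventCount",
            "metricNamespace": "Custom/SageMaker",
            "metricValue": "1",
        }
    ],
)

# Alarm that notifies an SNS topic when the event count breaches the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="SageMakerInvokeEndpointThreshold",
    Namespace="Custom/SageMaker",
    MetricName="InvokeEndpointEventCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-notifications"],
)
```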
An ML engineer trained an ML model on Amazon SageMaker to detect automobile accidents from closed-circuit TV footage. The ML engineer used SageMaker Data Wrangler to create a training dataset of images of accidents and non-accidents. The model performed well during training and validation. However, the model is underperforming in production because of variations in the quality of the images from various cameras. Which solution will improve the model’s accuracy in the LEAST amount of time?
A. Collect more images from all the cameras. Use Data Wrangler to prepare a new training dataset.
B. Recreate the training dataset by using the Data Wrangler corrupt image transform. Specify the impulse noise option.
C. Recreate the training dataset by using the Data Wrangler enhance image contrast transform. Specify the Gamma contrast option.
D. Recreate the training dataset by using the Data Wrangler resize image transform. Crop all images to the same size.
B
The corrupt image transform with the impulse noise option augments the existing training images to simulate the noise and quality degradation seen across the different production cameras, so the model learns to tolerate those variations. Because it reuses the dataset that already exists, it takes far less time than collecting new images from every camera (A). Enhancing contrast (C) or cropping to a uniform size (D) does not address the quality variations causing the underperformance.
A company is using Amazon SageMaker to create ML models. The company’s data scientists need fine-grained control of the ML workflows that they orchestrate. The data scientists also need the ability to visualize SageMaker jobs and workflows as a directed acyclic graph (DAG). The data scientists must keep a running history of model discovery experiments and must establish model governance for auditing and compliance verifications. Which solution will meet these requirements?
A. Use AWS CodePipeline and its integration with SageMaker Studio to manage the entire ML workflows. Use SageMaker ML Lineage Tracking for the running history of experiments and for auditing and compliance verifications.
B. Use AWS CodePipeline and its integration with SageMaker Experiments to manage the entire ML workflows. Use SageMaker Experiments for the running history of experiments and for auditing and compliance verifications.
C. Use SageMaker Pipelines and its integration with SageMaker Studio to manage the entire ML workflows. Use SageMaker ML Lineage Tracking for the running history of experiments and for auditing and compliance verifications.
D. Use SageMaker Pipelines and its integration with SageMaker Experiments to manage the entire ML workflows. Use SageMaker Experiments for the running history of experiments and for auditing and compliance verifications.
C
SageMaker Pipelines gives data scientists fine-grained control of their ML workflows and, through its SageMaker Studio integration, renders each pipeline as a directed acyclic graph (DAG). SageMaker ML Lineage Tracking keeps the running history of experiments and the lineage information needed for model governance, auditing, and compliance verification. AWS CodePipeline (A, B) is a CI/CD service that does not natively visualize SageMaker jobs as a DAG, and SageMaker Experiments (B, D) tracks experiment runs but does not provide the lineage-based governance required for auditing.
A company needs to create a central catalog for all the company’s ML models. The models are in AWS accounts where the company developed the models initially. The models are hosted in Amazon Elastic Container Registry (Amazon ECR) repositories. Which solution will meet these requirements?
A. Configure ECR cross-account replication for each existing ECR repository. Ensure that each model is visible in each AWS account.
B. Create a new AWS account with a new ECR repository as the central catalog. Configure ECR cross-account replication between the initial ECR repositories and the central catalog.
C. Use the Amazon SageMaker Model Registry to create a model group for models hosted in Amazon ECR. Create a new AWS account. In the new account, use the SageMaker Model Registry as the central catalog. Attach a cross-account resource policy to each model group in the initial AWS accounts.
D. Use an AWS Glue Data Catalog to store the models. Run an AWS Glue crawler to migrate the models from the ECR repositories to the Data Catalog. Configure cross-account access to the Data Catalog.
C
The correct answer is C because SageMaker Model Registry is designed as a central repository for managing and tracking machine learning models, including those hosted in ECR. Creating a new AWS account for the central catalog improves security and organization. Cross-account resource policies allow controlled access to the models from the original accounts.
Option A is incorrect because ECR is a container registry, not a catalog designed for managing model metadata and lineage. Simple replication doesn’t provide the centralized management features needed.
Option B is incorrect because it still relies on ECR as the central catalog, which lacks the model management capabilities of SageMaker Model Registry.
Option D is incorrect because AWS Glue Data Catalog is for managing data assets, not specifically ML models. While it could potentially store metadata about the models, it’s not the ideal solution for managing the models themselves and their lifecycle. Moreover, migrating the models themselves to the Glue Data Catalog is not straightforward or practical.
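A hedged boto3 sketch of the cross-account piece of option C: attaching a resource-based policy to a model package group so the central-catalog account can read it. The account IDs, Region, group name, and actions shown are placeholders and would need to match the catalog account's actual access pattern.

```python
# Illustrative sketch: grant a central-catalog account read access to a model package group.
import json
import boto3

sm = boto3.client("sagemaker")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CentralCatalogAccess",
            "Effect": "Allow",
            # 444455556666 stands in for the new central-catalog account
            "Principal": {"AWS": "arn:aws:iam::444455556666:root"},
            "Action": [
                "sagemaker:DescribeModelPackage",
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:ListModelPackages",
            ],
            "Resource": [
                "arn:aws:sagemaker:us-east-1:111122223333:model-package-group/fraud-models",
                "arn:aws:sagemaker:us-east-1:111122223333:model-package/fraud-models/*",
            ],
        }
    ],
}

sm.put_model_package_group_policy(
    ModelPackageGroupName="fraud-models",
    ResourcePolicy=json.dumps(policy),
)
```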
A company is building a web-based AI application using Amazon SageMaker. The application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. Training data is stored in Amazon S3, and the application requires secure and isolated use of this data throughout the ML lifecycle. The company must use the central model registry to manage different versions of models. Which action will meet this requirement with the LEAST operational overhead?
A. Create a separate Amazon Elastic Container Registry (Amazon ECR) repository for each model.
B. Use Amazon Elastic Container Registry (Amazon ECR) and unique tags for each model version.
C. Use the SageMaker Model Registry and model groups to catalog the models.
D. Use the SageMaker Model Registry and unique tags for each model version.
C
The best answer is C because it leverages the built-in features of SageMaker, specifically designed for managing ML models and their versions. Using SageMaker Model Registry and model groups minimizes operational overhead compared to managing models and versions externally using Amazon ECR. Option A requires creating and managing multiple ECR repositories, increasing overhead. Option B adds complexity by managing tags within ECR. Option D, while using the SageMaker Model Registry, lacks the organizational structure provided by model groups, potentially leading to less efficient management of model versions in the long run.
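A minimal boto3 sketch of cataloging models with a model package group, as option C describes. The group name, container image URI, and S3 paths are placeholders.

```python
# Illustrative sketch: create a model group and register a model version in it.
import boto3

sm = boto3.client("sagemaker")

sm.create_model_package_group(
    ModelPackageGroupName="churn-classifier",
    ModelPackageGroupDescription="All versions of the churn classification model",
)

# Each create_model_package call against the same group registers a new,
# automatically numbered model version in the SageMaker Model Registry.
sm.create_model_package(
    ModelPackageGroupName="churn-classifier",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/churn:latest",
                "ModelDataUrl": "s3://example-bucket/models/churn/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)
```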
A company uses AWS Glue jobs orchestrated by an AWS Glue workflow for data processing. These jobs can run on a schedule or be launched manually. They are integrating these jobs into Amazon SageMaker Pipelines for ML model development, where the Glue job outputs are needed during the data processing phase. Which solution integrates the AWS Glue jobs with the SageMaker pipelines while minimizing operational overhead?
A. Use AWS Step Functions to orchestrate the pipelines and the AWS Glue jobs.
B. Use processing steps in SageMaker Pipelines. Configure inputs that point to the Amazon Resource Names (ARNs) of the AWS Glue jobs.
C. Use Callback steps in SageMaker Pipelines to start the AWS Glue workflow and to stop the pipelines until the AWS Glue jobs finish running.
D. Use Amazon EventBridge to invoke the pipelines and the AWS Glue jobs in the desired order.
C
The correct answer is C because it directly addresses the need to wait for Glue jobs to complete before proceeding in the SageMaker pipeline, minimizing operational overhead by keeping the integration within the SageMaker pipeline framework. Option A introduces an additional orchestration layer (Step Functions), increasing complexity. Option B doesn’t guarantee that the Glue jobs finish before the pipeline proceeds, potentially leading to errors. Option D, while possible, requires more complex setup and monitoring compared to using callback steps within SageMaker Pipelines.
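A hedged SageMaker Python SDK sketch of a Callback step. The SQS queue URL, Glue workflow name, and output name are placeholders; a consumer of the queue (for example, a Lambda function) is expected to start the Glue workflow and later return the callback token via SendPipelineExecutionStepSuccess (or ...Failure) so the pipeline resumes.

```python
# Illustrative sketch of a SageMaker Pipelines Callback step that waits on AWS Glue.
from sagemaker.workflow.callback_step import (
    CallbackOutput,
    CallbackOutputTypeEnum,
    CallbackStep,
)

glue_callback = CallbackStep(
    name="RunGlueDataProcessing",
    sqs_queue_url="https://sqs.us-east-1.amazonaws.com/111122223333/glue-callback-queue",
    inputs={"glue_workflow_name": "daily-data-processing"},
    outputs=[
        CallbackOutput(
            output_name="processed_data_s3_uri",
            output_type=CallbackOutputTypeEnum.String,
        )
    ],
)

# Downstream pipeline steps can consume
# glue_callback.properties.Outputs["processed_data_s3_uri"]
# once the Glue jobs have finished and the callback token is returned.
```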
A company is building an AI application on Amazon SageMaker that involves frequent consecutive training jobs using data stored in Amazon S3. The application requires secure and isolated data usage throughout the ML lifecycle. Which approach will MINIMIZE infrastructure startup times for these consecutive training jobs?
A. Use Managed Spot Training.
B. Use SageMaker managed warm pools.
C. Use SageMaker Training Compiler.
D. Use the SageMaker distributed data parallelism (SMDDP) library.
B
The correct answer is B because SageMaker managed warm pools keep instances ready between training jobs, eliminating the time needed for provisioning new infrastructure each time. Option A (Managed Spot Training) reduces cost, not startup time. Option C (SageMaker Training Compiler) optimizes code, not infrastructure. Option D (SMDDP) parallelizes training across instances, which improves training speed but doesn’t reduce startup time.
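A hedged SageMaker Python SDK sketch of enabling warm pools by setting a keep-alive period on the training estimator. The image URI, role ARN, instance type, and S3 paths are placeholders; consecutive jobs with matching configurations can reuse the retained instances instead of provisioning new ones.

```python
# Illustrative sketch: keep training infrastructure warm between consecutive jobs.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/training-image:latest",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.g5.xlarge",
    output_path="s3://example-bucket/model-artifacts/",
    keep_alive_period_in_seconds=1800,  # retain the warm pool for 30 minutes after the job
)

estimator.fit({"train": "s3://example-bucket/training-data/"})
```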
A company is building a web-based AI application using Amazon SageMaker. This application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. The training data is stored in Amazon S3, and the application requires a manual approval-based workflow to ensure only approved models are deployed to production endpoints. Which solution best meets this requirement?
A. Use SageMaker Experiments to facilitate the approval process during model registration.
B. Use SageMaker ML Lineage Tracking on the central model registry. Create tracking entities for the approval process.
C. Use SageMaker Model Monitor to evaluate the performance of the model and to manage the approval.
D. Use SageMaker Pipelines. When a model version is registered, use the AWS SDK to change the approval status to “Approved.”
D
The correct answer is D because SageMaker Pipelines orchestrates machine learning workflows and supports manual approval gates. A model version is registered with a pending approval status and evaluated; only after manual review is its status changed to “Approved” through the AWS SDK, which permits deployment to the production endpoint.
Option A is incorrect because SageMaker Experiments is for tracking and organizing experiments, not managing model approvals. Option B is incorrect because SageMaker ML Lineage Tracking tracks model lineage but doesn’t provide an approval mechanism. Option C is incorrect because SageMaker Model Monitor focuses on model performance monitoring, not approval workflows.
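A minimal boto3 sketch of the approval step described in option D: after manual review, the approval status of a registered model version is flipped with the AWS SDK. The model package ARN and description are placeholders.

```python
# Illustrative sketch: mark a registered model version as approved after manual review.
import boto3

sm = boto3.client("sagemaker")

sm.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:111122223333:model-package/churn-classifier/3",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Validated offline metrics; approved for production deployment.",
)
```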
A company is building a web-based AI application using Amazon SageMaker. This application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. Training data, stored in Amazon S3, must be used securely and in isolation throughout the ML lifecycle. The company needs an on-demand workflow to monitor bias drift for models deployed to real-time endpoints from the application. Which action will meet this requirement?
A. Configure the application to invoke an AWS Lambda function that runs a SageMaker Clarify job.
B. Invoke an AWS Lambda function to pull the sagemaker-model-monitor-analyzer built-in SageMaker image.
C. Use AWS Glue Data Quality to monitor bias.
D. Use SageMaker notebooks to compare the bias.
A
A is correct because SageMaker Clarify is specifically designed for bias detection and monitoring. Integrating it with an AWS Lambda function allows for on-demand execution, triggering the bias analysis whenever needed by the application.
B is incorrect because the sagemaker-model-monitor-analyzer image handles general model monitoring tasks but not specifically bias detection.
C is incorrect because AWS Glue Data Quality focuses on data quality checks, not bias analysis.
D is incorrect because SageMaker notebooks are for interactive development and experimentation, not for implementing production-ready, on-demand monitoring workflows.
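A heavily hedged sketch of the Lambda handler described in option A, launching an on-demand SageMaker Clarify bias analysis as a processing job. The Clarify container image URI, role ARN, S3 locations, instance type, and the analysis_config.json (which would define the bias metrics and facets to evaluate) are all placeholders and assumptions, not details from the question.

```python
# Illustrative Lambda handler: start an on-demand Clarify bias analysis job.
import time

import boto3

sm = boto3.client("sagemaker")


def lambda_handler(event, context):
    job_name = f"on-demand-bias-check-{int(time.time())}"
    sm.create_processing_job(
        ProcessingJobName=job_name,
        RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        AppSpecification={
            # Region- and account-specific SageMaker Clarify container image (placeholder)
            "ImageUri": "205585389593.dkr.ecr.us-east-1.amazonaws.com/sagemaker-clarify-processing:1.0",
        },
        ProcessingInputs=[
            {
                "InputName": "analysis_config",
                "S3Input": {
                    "S3Uri": "s3://example-bucket/clarify/analysis_config.json",
                    "LocalPath": "/opt/ml/processing/input/config",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            },
            {
                "InputName": "dataset",
                "S3Input": {
                    "S3Uri": "s3://example-bucket/endpoint-data-capture/",
                    "LocalPath": "/opt/ml/processing/input/data",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            },
        ],
        ProcessingOutputConfig={
            "Outputs": [
                {
                    "OutputName": "analysis_result",
                    "S3Output": {
                        "S3Uri": "s3://example-bucket/clarify/results/",
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
    )
    return {"processing_job_name": job_name}
```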
An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles stored in Amazon S3, and tables from an on-premises MySQL database. The dataset has a class imbalance and features with interdependencies, hindering the algorithm’s ability to capture all underlying patterns. After data aggregation, the engineer needs a solution to automatically detect anomalies and visualize the results. Which solution best meets these requirements?
A. Use Amazon Athena to automatically detect the anomalies and to visualize the result.
B. Use Amazon Redshift Spectrum to automatically detect the anomalies. Use Amazon QuickSight to visualize the result.
C. Use Amazon SageMaker Data Wrangler to automatically detect the anomalies and to visualize the result.
D. Use AWS Batch to automatically detect the anomalies. Use Amazon QuickSight to visualize the result.
C
The correct answer is C because Amazon SageMaker Data Wrangler provides tools for data quality analysis, including anomaly detection, and offers visualization capabilities. Option A is incorrect because Athena is primarily a query service and doesn’t inherently offer anomaly detection. Option B is incorrect because while Redshift Spectrum can handle the data and QuickSight can visualize, neither individually offers automatic anomaly detection. Option D is incorrect because AWS Batch is a batch processing service; it doesn’t provide anomaly detection or visualization features directly. SageMaker Data Wrangler best fits the requirement of automatically detecting anomalies and visualizing the results within a single, integrated platform.
An ML engineer is developing a fraud detection model on AWS. The training dataset, containing transaction logs, customer profiles (stored in Amazon S3), and tables from an on-premises MySQL database, exhibits class imbalance and feature interdependencies, hindering the algorithm’s pattern recognition. The dataset includes both categorical and numerical data. To maximize model accuracy with the LEAST operational overhead, which action should the ML engineer take?
A. Use AWS Glue to transform the categorical data into numerical data.
B. Use AWS Glue to transform the numerical data into categorical data.
C. Use Amazon SageMaker Data Wrangler to transform the categorical data into numerical data.
D. Use Amazon SageMaker Data Wrangler to transform the numerical data into categorical data.
C
Data Wrangler provides built-in transformations for encoding categorical data into numerical representations (such as one-hot encoding or ordinal encoding), making it more user-friendly and efficient than using AWS Glue for this task. Transforming numerical data into categorical data is unnecessary and would likely reduce model accuracy. AWS Glue can handle data transformations, but lacks the user-friendly interface and built-in categorical encoding capabilities of SageMaker Data Wrangler, resulting in higher operational overhead. Therefore, option C offers the best balance of effectiveness and minimal operational overhead.
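For illustration only, a pandas equivalent of what Data Wrangler's categorical encoding transforms do: converting categorical columns into numerical ones via one-hot encoding. The column names and values are made up for the example.

```python
# Illustrative one-hot encoding of categorical features.
import pandas as pd

df = pd.DataFrame(
    {
        "transaction_amount": [120.0, 45.5, 980.0],
        "merchant_category": ["grocery", "travel", "electronics"],
        "card_type": ["debit", "credit", "credit"],
    }
)

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["merchant_category", "card_type"])
print(encoded.head())
```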
An ML engineer is developing a fraud detection model on AWS. The training dataset, including transaction logs and customer profiles from Amazon S3 and tables from an on-premises MySQL database, suffers from class imbalance affecting the model’s learning. Which solution requires the LEAST operational effort to address this imbalanced data before model training?
A. Use Amazon Athena to identify patterns that contribute to the imbalance. Adjust the dataset accordingly.
B. Use Amazon SageMaker Studio Classic built-in algorithms to process the imbalanced dataset.
C. Use AWS Glue DataBrew built-in features to oversample the minority class.
D. Use the Amazon SageMaker Data Wrangler balance data operation to oversample the minority class.
D
The correct answer is D because Amazon SageMaker Data Wrangler provides a built-in “balance data” operation specifically designed to handle class imbalance through techniques like oversampling and undersampling. This offers a low-code/no-code solution requiring minimal operational effort compared to other options.
Option A requires manual dataset adjustment after identifying patterns with Athena, increasing operational effort. Option B is less efficient as it uses algorithms within SageMaker Studio, whereas Data Wrangler is more directly focused on data preprocessing for imbalanced datasets. Option C is less efficient than D because DataBrew does not have a built-in recipe for balancing datasets, requiring more custom work compared to Data Wrangler’s direct functionality.
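For illustration only, a pandas sketch of random oversampling of the minority class, the kind of rebalancing the Data Wrangler balance data operation applies. The file, column, and label names are made up for the example.

```python
# Illustrative random oversampling of the minority (fraud) class.
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical aggregated dataset

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Sample the minority class with replacement until it matches the majority class size,
# then shuffle the combined dataset.
oversampled_minority = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled_minority]).sample(frac=1.0, random_state=42)

print(balanced["is_fraud"].value_counts())
```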
A company has deployed an XGBoost prediction model in production to predict if a customer is likely to cancel a subscription. The company uses Amazon SageMaker Model Monitor to detect deviations in the F1 score. During a baseline analysis of model quality, the company recorded a threshold for the F1 score. After several months of no change, the model’s F1 score decreases significantly. What could be the reason for the reduced F1 score?
A. Concept drift occurred in the underlying customer data that was used for predictions.
B. The model was not sufficiently complex to capture all the patterns in the original baseline data.
C. The original baseline data had a data quality issue of missing values.
D. Incorrect ground truth labels were provided to Model Monitor during the calculation of the baseline.
A
Concept drift is the correct answer because it explains a decrease in F1 score after a period of stability. The statistical properties of the data used to train the model have changed over time, leading to the model’s reduced performance. Options B and C would have resulted in a consistently low F1 score from the beginning, not a sudden drop after months of acceptable performance. Option D would have affected the baseline F1 score itself, not caused a significant drop after the initial baseline was established.
A company has a team of data scientists who use Amazon SageMaker notebook instances to test ML models. When the data scientists need new permissions, the company attaches the permissions to each individual role that was created during the creation of the SageMaker notebook instance. The company needs to centralize management of the team’s permissions. Which solution will meet this requirement?
A. Create a single IAM role that has the necessary permissions. Attach the role to each notebook instance that the team uses.
B. Create a single IAM group. Add the data scientists to the group. Associate the group with each notebook instance that the team uses.
C. Create a single IAM user. Attach the AdministratorAccess AWS managed IAM policy to the user. Configure each notebook instance to use the IAM user.
D. Create a single IAM group. Add the data scientists to the group. Create an IAM role. Attach the AdministratorAccess AWS managed IAM policy to the role. Associate the role with the group. Associate the group with each notebook instance that the team uses.
A
A is correct because it leverages the recommended approach of using IAM roles for AWS services like SageMaker. Centralizing permissions in a single IAM role simplifies management; updates to the role automatically propagate to all associated notebook instances.
B is incorrect because you cannot directly associate an IAM group with a SageMaker notebook instance.
C is incorrect because using the AdministratorAccess policy violates the principle of least privilege and it’s not possible to directly associate an IAM user with a notebook instance.
D is incorrect for several reasons: It uses the overly permissive AdministratorAccess policy; it’s unclear how associating a role with a group would function in this context, and, again, you cannot directly associate an IAM group with a notebook instance.
An ML engineer needs to use an ML model to predict the price of apartments in a specific location. Which metric should the ML engineer use to evaluate the model’s performance?
A. Accuracy
B. Area Under the ROC Curve (AUC)
C. F1 score
D. Mean absolute error (MAE)
D. Mean absolute error (MAE)
The correct answer is D because predicting apartment prices is a regression problem, not a classification problem. MAE is a suitable metric for evaluating the performance of regression models. Accuracy, AUC-ROC, and F1 score are all metrics used for classification problems, where the model predicts a categorical outcome (e.g., “high price,” “medium price,” “low price”). Since the model is predicting a continuous value (price), these metrics are inappropriate.
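A quick worked illustration of MAE for a price-prediction model; the prices below are made-up values.

```python
# MAE: the average absolute difference between predicted and actual prices,
# expressed in the same units as the target (here, a currency amount).
from sklearn.metrics import mean_absolute_error

actual_prices = [250_000, 310_000, 189_000, 420_000]
predicted_prices = [262_000, 298_000, 200_000, 401_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
print(f"MAE: {mae:.0f}")
```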
An ML engineer has trained a neural network using stochastic gradient descent (SGD). The neural network performs poorly on the test set. The training loss and validation loss values remain high and show an oscillating pattern; they decrease for a few epochs and then increase for a few epochs before repeating this cycle. What should the ML engineer do to improve the training process?
A. Introduce early stopping.
B. Increase the size of the test set.
C. Increase the learning rate.
D. Decrease the learning rate.
D
The oscillating pattern of the training and validation loss indicates that the learning rate is too high. A high learning rate causes the model to overshoot the optimal point in the loss landscape, leading to oscillations instead of convergence. Decreasing the learning rate allows the model to make smaller, more precise updates to the weights, leading to improved convergence and potentially better performance on the test set.
Option A (Introduce early stopping) is not the primary solution here. While early stopping can prevent overfitting, the main issue is the unstable training process caused by the high learning rate.
Option B (Increase the size of the test set) would not directly address the issue of the oscillating loss and unstable training process. A larger test set would only improve the accuracy of the test set evaluation, but not the model’s training.
Option C (Increase the learning rate) would exacerbate the problem, leading to even more significant oscillations and preventing convergence.
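A toy NumPy illustration of the overshoot mechanism behind answer D, using gradient descent on the simple function f(w) = w². It is not the question's neural network, just a sketch of why an oversized learning rate prevents stable convergence while a smaller one converges.

```python
# Illustrative sketch: effect of learning rate on gradient descent for f(w) = w**2.
import numpy as np


def gradient_descent(learning_rate, steps=10, w=5.0):
    losses = []
    for _ in range(steps):
        grad = 2 * w                  # derivative of w**2
        w = w - learning_rate * grad  # gradient descent update
        losses.append(w ** 2)
    return np.round(losses, 3)


print("lr=0.10:", gradient_descent(0.10))  # loss shrinks smoothly toward 0
print("lr=1.05:", gradient_descent(1.05))  # each update overshoots the minimum;
                                           # the weight flips sign every step and the
                                           # loss fails to converge
```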
An ML engineer needs to process thousands of existing CSV objects and new CSV objects that are uploaded. The CSV objects are stored in a central Amazon S3 bucket and have the same number of columns. One of the columns is a transaction date. The ML engineer must query the data based on the transaction date. Which solution will meet these requirements with the LEAST operational overhead?
A. Use an Amazon Athena CREATE TABLE AS SELECT (CTAS) statement to create a table based on the transaction date from data in the central S3 bucket. Query the objects from the table.
B. Create a new S3 bucket for processed data. Set up S3 replication from the central S3 bucket to the new S3 bucket. Use S3 Object Lambda to query the objects based on transaction date.
C. Create a new S3 bucket for processed data. Use AWS Glue for Apache Spark to create a job to query the CSV objects based on transaction date. Configure the job to store the results in the new S3 bucket. Query the objects from the new S3 bucket.
D. Create a new S3 bucket for processed data. Use Amazon Data Firehose to transfer the data from the central S3 bucket to the new S3 bucket. Configure Firehose to run an AWS Lambda function to query the data based on transaction date.
A
A is correct because Athena queries data directly in S3 using SQL, minimizing operational overhead. A CTAS statement can create a new table partitioned by transaction date, so subsequent queries that filter on that column are efficient and scan less data.
B is incorrect because S3 Object Lambda is designed for data transformation, not efficient querying. Adding replication increases complexity unnecessarily.
C is incorrect because while SparkSQL can query S3 data, it involves more setup and operational overhead than Athena, and creating a new S3 bucket is unnecessary.
D is incorrect because Firehose cannot directly consume from S3 for querying purposes. Using Lambda for querying adds significant complexity.
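A hedged sketch of running an Athena CTAS statement through boto3 that partitions the data by transaction date. The database, table, column, and bucket names are placeholders.

```python
# Illustrative sketch: CTAS over the CSV data, partitioned by transaction date.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE analytics.transactions_by_date
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/processed/transactions/',
    partitioned_by = ARRAY['transaction_date']
) AS
SELECT customer_id, amount, merchant, transaction_date
FROM raw.transactions_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "raw"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```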
A company has a large, unstructured dataset containing many duplicate records across several key attributes. Which AWS solution requires the LEAST amount of code development to detect these duplicates?
A. Use Amazon Mechanical Turk jobs to detect duplicates.
B. Use Amazon QuickSight ML Insights to build a custom deduplication model.
C. Use Amazon SageMaker Data Wrangler to pre-process and detect duplicates.
D. Use the AWS Glue FindMatches transform to detect duplicates.
D
The correct answer is D because AWS Glue FindMatches is specifically designed to identify duplicate or matching records in datasets with minimal code development. It uses machine learning to find fuzzy matches and allows customization without requiring the creation of a complex custom deduplication model.
Option A (Amazon Mechanical Turk) would require significant effort to define tasks, manage workers, and review results, making it far less efficient. Option B (Amazon QuickSight ML Insights) necessitates building a custom model, which requires substantial coding. Option C (Amazon SageMaker Data Wrangler) focuses on data preparation and transformation, not directly on duplicate detection. While it might be used as part of a deduplication workflow, it’s not the primary solution for the task.
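A hedged sketch of applying an existing FindMatches ML transform inside a Glue ETL job. It assumes the transform has already been created and trained with labeled examples in the Glue console; the catalog database, table, transform ID, and output bucket are placeholders.

```python
# Illustrative Glue ETL sketch: apply a pre-trained FindMatches transform.
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

records = glue_context.create_dynamic_frame.from_catalog(
    database="customer_data", table_name="raw_records"
)

# FindMatches assigns a match_id to rows it considers duplicates of one another.
matched = FindMatches.apply(frame=records, transformId="tfm-0123456789abcdef")

glue_context.write_dynamic_frame.from_options(
    frame=matched,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/deduplicated/"},
    format="parquet",
)
```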
A company needs to run a batch data-processing job on Amazon EC2 instances. The job will run during the weekend and will take 90 minutes to finish running. The processing can handle interruptions. The company will run the job every weekend for the next 6 months. Which EC2 instance purchasing option will meet these requirements MOST cost-effectively?
A. Spot Instances
B. Reserved Instances
C. On-Demand Instances
D. Dedicated Instances
A. Spot Instances
Spot Instances are the most cost-effective option because they provide spare EC2 capacity at a significantly reduced price compared to On-Demand Instances. The fact that the job can handle interruptions is crucial; Spot Instances can be interrupted with short notice if AWS needs the capacity for other tasks. Since the job only runs for 90 minutes on weekends, the risk of interruption is manageable, and the cost savings outweigh the potential inconvenience.
Reserved Instances are more cost-effective for long-running, consistent workloads. Their upfront cost or commitment is not suitable for a job that runs only for 90 minutes each weekend.
On-Demand Instances offer flexibility but are the most expensive option and not cost-effective for this scenario.
Dedicated Instances provide dedicated physical hardware, which is unnecessary and more expensive than Spot Instances for this batch processing job.
An ML engineer has an Amazon Comprehend custom model in Account A in the us-east-1 Region. The ML engineer needs to copy the model to Account B in the same Region. Which solution will meet this requirement with the LEAST development effort?
A. Use Amazon S3 to make a copy of the model. Transfer the copy to Account B.
B. Create a resource-based IAM policy. Use the Amazon Comprehend ImportModel API operation to copy the model to Account B.
C. Use AWS DataSync to replicate the model from Account A to Account B.
D. Create an AWS Site-to-Site VPN connection between Account A and Account B to transfer the model.
B
The Amazon Comprehend ImportModel API copies a custom model into another account in the same Region: Account A attaches a resource-based IAM policy to the model that authorizes Account B, and Account B calls ImportModel with the source model's ARN. No manual artifact export to S3 (A), file replication with DataSync (C), or network connectivity such as a Site-to-Site VPN (D) is required, so this is the solution with the least development effort.
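A hedged boto3 sketch of the two-account flow. Each call must run with credentials from the respective account; the ARNs, account IDs, and model names are placeholders.

```python
# Illustrative sketch: share a Comprehend custom model (Account A) and import it (Account B).
import json

import boto3

# --- Run in Account A (111122223333): authorize Account B to import the model version.
comprehend_a = boto3.client("comprehend", region_name="us-east-1")
model_arn = (
    "arn:aws:comprehend:us-east-1:111122223333:"
    "document-classifier/fraud-classifier/version/1"
)
comprehend_a.put_resource_policy(
    ResourceArn=model_arn,
    ResourcePolicy=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {"AWS": "arn:aws:iam::444455556666:root"},
                    "Action": "comprehend:ImportModel",
                    "Resource": model_arn,
                }
            ],
        }
    ),
)

# --- Run in Account B (444455556666): import the shared model into this account.
comprehend_b = boto3.client("comprehend", region_name="us-east-1")
comprehend_b.import_model(
    SourceModelArn=model_arn,
    ModelName="fraud-classifier-copy",
)
```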
An ML engineer is training a simple neural network model. The ML engineer tracks the performance of the model over time on a validation dataset. The model’s performance improves substantially at first and then degrades after a specific number of epochs. Which solutions will mitigate this problem? (Choose two.)
A. Enable early stopping on the model.
B. Increase dropout in the layers.
C. Increase the number of layers.
D. Increase the number of neurons.
E. Investigate and reduce the sources of model bias.
A, B
The problem described is overfitting: the model performs well on the training data but poorly on unseen validation data, indicating it has learned the training data too well, including noise. Options A and B directly address overfitting:
A. Enable early stopping: This prevents the model from training past the point where its performance on the validation set begins to degrade. It stops training at the point of best validation performance, thus mitigating overfitting.
B. Increase dropout: Dropout randomly deactivates neurons during training, forcing the network to learn more robust features and preventing it from relying too heavily on any single neuron or set of neurons, reducing overfitting.
Options C and D would likely worsen the overfitting. Increasing the number of layers or neurons increases the model’s capacity, making it more prone to overfitting. Option E, while important for model quality in general, doesn’t directly address the observed overfitting problem of degrading validation performance after a certain number of epochs.
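A hedged Keras sketch combining the two mitigations: dropout layers in the network and an EarlyStopping callback that restores the best weights once validation loss stops improving. The input shape, layer sizes, dropout rate, and synthetic data are all made up for the example.

```python
# Illustrative sketch: dropout plus early stopping to curb overfitting.
import tensorflow as tf

# Synthetic data so the sketch runs end to end.
x_train = tf.random.normal((512, 20))
y_train = tf.cast(tf.random.uniform((512, 1)) > 0.5, tf.float32)
x_val = tf.random.normal((128, 20))
y_val = tf.cast(tf.random.uniform((128, 1)) > 0.5, tf.float32)

model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),  # randomly drop 30% of activations during training
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,                 # stop after 5 epochs without validation improvement
    restore_best_weights=True,  # roll back to the best validation epoch
)

model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=100,
    callbacks=[early_stopping],
    verbose=0,
)
```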
A company has a Retrieval Augmented Generation (RAG) application that uses a vector database to store embeddings of documents. The company must migrate the application to AWS and must implement a solution that provides semantic search of text files. The company has already migrated the text repository to an Amazon S3 bucket. Which solution will meet these requirements?
A. Use an AWS Batch job to process the files and generate embeddings. Use AWS Glue to store the embeddings. Use SQL queries to perform the semantic searches.
B. Use a custom Amazon SageMaker notebook to run a custom script to generate embeddings. Use SageMaker Feature Store to store the embeddings. Use SQL queries to perform the semantic searches.
C. Use the Amazon Kendra S3 connector to ingest the documents from the S3 bucket into Amazon Kendra. Query Amazon Kendra to perform the semantic searches.
D. Use an Amazon Textract asynchronous job to ingest the documents from the S3 bucket. Query Amazon Textract to perform the semantic searches.
C
Amazon Kendra is a service specifically designed for semantic search. Options A and B would require custom development to implement semantic search capabilities, making them less efficient and more complex than using a purpose-built service like Kendra. Option D, using Amazon Textract, is incorrect because Textract is primarily for extracting text and data from documents, not for performing semantic searches. Therefore, only option C directly addresses the requirement for semantic search using an existing AWS service already integrated with S3.
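A hedged boto3 sketch of option C: ingesting the S3 text repository into an existing Amazon Kendra index with the S3 connector and then running a natural-language query. The index ID, role ARN, bucket name, and query text are placeholders.

```python
# Illustrative sketch: Kendra S3 connector plus a semantic query.
import boto3

kendra = boto3.client("kendra")
index_id = "0123abcd-0123-abcd-0123-0123456789ab"

kendra.create_data_source(
    IndexId=index_id,
    Name="text-repository",
    Type="S3",
    RoleArn="arn:aws:iam::111122223333:role/KendraS3DataSourceRole",
    Configuration={"S3Configuration": {"BucketName": "example-text-repository"}},
)

# After the data source has synced, query the index in natural language.
response = kendra.query(
    IndexId=index_id,
    QueryText="What is our refund policy for damaged items?",
)
for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))
```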