Practice Questions - Amazon AWS Certified Machine Learning Engineer - Associate MLA-C01 Flashcards

(113 cards)

1
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3. Which AWS service or feature can aggregate the data from the various data sources?

A. Amazon EMR Spark jobs
B. Amazon Kinesis Data Streams
C. Amazon DynamoDB
D. AWS Lake Formation

A

A

Amazon EMR with Spark is the most suitable option for aggregating data from diverse sources such as Amazon S3 and an on-premises MySQL database. Spark’s ability to handle both structured and unstructured data makes it well suited for this task. While AWS Lake Formation manages data lakes, it doesn’t inherently provide the ETL (extract, transform, load) and data processing capabilities needed to aggregate and transform data from multiple sources. Amazon Kinesis Data Streams is designed for real-time data streaming, not batch processing of data for model training. Amazon DynamoDB is a NoSQL database, not designed for aggregating data from various sources.
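
For illustration, a minimal PySpark sketch of the kind of aggregation an EMR Spark job could run, joining the S3 data with a MySQL table read over JDBC. The bucket paths, JDBC endpoint, credentials, and join keys are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session on the EMR cluster (the MySQL JDBC driver must be on the classpath)
spark = SparkSession.builder.appName("fraud-data-aggregation").getOrCreate()

# Transaction logs and customer profiles stored in Amazon S3 (hypothetical paths)
transactions = spark.read.json("s3://example-bucket/transaction-logs/")
profiles = spark.read.parquet("s3://example-bucket/customer-profiles/")

# Tables from the on-premises MySQL database, read over JDBC (hypothetical endpoint)
accounts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://onprem-db.example.com:3306/fraud")
    .option("dbtable", "accounts")
    .option("user", "reader")
    .option("password", "example-password")
    .load()
)

# Join the three sources into one training dataset and write it back to S3
training_df = transactions.join(profiles, "customer_id").join(accounts, "account_id")
training_df.write.mode("overwrite").parquet("s3://example-bucket/training-dataset/")
```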

2
Q

A company with hundreds of data scientists uses Amazon SageMaker to create ML models stored in model groups within the SageMaker Model Registry. Data scientists are categorized into three groups: computer vision, natural language processing (NLP), and speech recognition. An ML engineer needs a solution to organize these existing models by category to improve discoverability at scale, without altering the model artifacts or their current groupings. Which solution best meets these requirements?

A. Create a custom tag for each of the three categories. Add the tags to the model packages in the SageMaker Model Registry.
B. Create a model group for each category. Move the existing models into these category model groups.
C. Use SageMaker ML Lineage Tracking to automatically identify and tag which model groups should contain the models.
D. Create a Model Registry collection for each of the three categories. Move the existing model groups into the collections.

A

D

D is correct because creating Model Registry collections allows for organizing existing model groups without modifying the underlying model artifacts in Amazon S3 and Amazon ECR. This maintains the integrity of the models and their existing structure while improving discoverability at scale by grouping them into relevant categories.

A is incorrect because while tags can provide metadata, they are not as effective for large-scale organization as collections, which are specifically designed for grouping model groups.

B is incorrect because moving models to new model groups would alter the existing model groupings, violating the requirement to not affect the integrity of the model artifacts and their existing groupings.

C is incorrect because ML Lineage Tracking focuses on tracking model lineage and not on the organization and grouping of models at a higher level.

3
Q

A company has trained and deployed an ML model using Amazon SageMaker. The company needs to implement a solution to record and monitor all the API call events for the SageMaker endpoint. The solution must also provide a notification when the number of API call events breaches a threshold. Which solution will meet these requirements?

A. Use SageMaker Debugger to track the inferences and to report metrics. Create a custom rule to provide a notification when the threshold is breached.
B. Use SageMaker Debugger to track the inferences and to report metrics. Use the tensor_variance built-in rule to provide a notification when the threshold is breached.
C. Log all the endpoint invocation API events by using AWS CloudTrail. Use an Amazon CloudWatch dashboard for monitoring. Set up a CloudWatch alarm to provide notification when the threshold is breached.
D. Add the Invocations metric to an Amazon CloudWatch dashboard for monitoring. Set up a CloudWatch alarm to provide notification when the threshold is breached.

A

C

The correct answer is C because it uses the most appropriate AWS services to meet all the stated requirements. CloudTrail logs all API calls, including SageMaker endpoint invocations, fulfilling the requirement to record all events. CloudWatch dashboards can then monitor these logs, and a CloudWatch alarm can provide notifications when a threshold is breached.

Option A is incorrect because SageMaker Debugger is primarily for debugging model training and inference quality, not for comprehensive API call event logging. Option B is also incorrect because the tensor_variance rule is not relevant to API call event monitoring. Option D is incorrect because while it uses CloudWatch to monitor invocations and set up alarms, it doesn’t provide a solution for recording all API call events; CloudWatch only monitors what’s already being tracked. CloudTrail provides the necessary comprehensive logging.
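
As a sketch of the notification piece, a boto3 call that creates a CloudWatch alarm on the endpoint's Invocations metric; the endpoint name, variant, threshold, and SNS topic ARN are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when endpoint invocations exceed a threshold within a 5-minute period
cloudwatch.put_metric_alarm(
    AlarmName="sagemaker-endpoint-invocation-threshold",
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-endpoint"},  # hypothetical endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # hypothetical topic
)
```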

4
Q

An ML engineer trained an ML model on Amazon SageMaker to detect automobile accidents from closed-circuit TV footage. The ML engineer used SageMaker Data Wrangler to create a training dataset of images of accidents and non-accidents. The model performed well during training and validation. However, the model is underperforming in production because of variations in the quality of the images from various cameras. Which solution will improve the model’s accuracy in the LEAST amount of time?
A. Collect more images from all the cameras. Use Data Wrangler to prepare a new training dataset.
B. Recreate the training dataset by using the Data Wrangler corrupt image transform. Specify the impulse noise option.
C. Recreate the training dataset by using the Data Wrangler enhance image contrast transform. Specify the Gamma contrast option.
D. Recreate the training dataset by using the Data Wrangler resize image transform. Crop all images to the same size.

A

B

The Data Wrangler corrupt image transform with the impulse noise option augments the existing training images to simulate the lower-quality footage coming from some cameras, making the model more robust to quality variations. Because it reuses the existing dataset, it is much faster than collecting and labeling new images (option A), and contrast enhancement (C) or resizing (D) does not address the noise and quality differences causing the underperformance.

5
Q

A company is using Amazon SageMaker to create ML models. The company’s data scientists need fine-grained control of the ML workflows that they orchestrate. The data scientists also need the ability to visualize SageMaker jobs and workflows as a directed acyclic graph (DAG). The data scientists must keep a running history of model discovery experiments and must establish model governance for auditing and compliance verifications. Which solution will meet these requirements?
A. Use AWS CodePipeline and its integration with SageMaker Studio to manage the entire ML workflows. Use SageMaker ML Lineage Tracking for the running history of experiments and for auditing and compliance verifications.
B. Use AWS CodePipeline and its integration with SageMaker Experiments to manage the entire ML workflows. Use SageMaker Experiments for the running history of experiments and for auditing and compliance verifications.
C. Use SageMaker Pipelines and its integration with SageMaker Studio to manage the entire ML workflows. Use SageMaker ML Lineage Tracking for the running history of experiments and for auditing and compliance verifications.
D. Use SageMaker Pipelines and its integration with SageMaker Experiments to manage the entire ML workflows. Use SageMaker Experiments for the running history of experiments and for auditing and compliance verifications.

A

C

SageMaker Pipelines integrates with SageMaker Studio to provide fine-grained workflow orchestration and to visualize pipeline executions as a DAG, while SageMaker ML Lineage Tracking keeps a running history of experiments and supports model governance for auditing and compliance. AWS CodePipeline (options A and B) is a CI/CD service that does not visualize SageMaker jobs as a DAG, and SageMaker Experiments alone (option D) does not establish governance for auditing and compliance verifications.

6
Q

A company needs to create a central catalog for all the company’s ML models. The models are in AWS accounts where the company developed the models initially. The models are hosted in Amazon Elastic Container Registry (Amazon ECR) repositories. Which solution will meet these requirements?
A. Configure ECR cross-account replication for each existing ECR repository. Ensure that each model is visible in each AWS account.
B. Create a new AWS account with a new ECR repository as the central catalog. Configure ECR cross-account replication between the initial ECR repositories and the central catalog.
C. Use the Amazon SageMaker Model Registry to create a model group for models hosted in Amazon ECR. Create a new AWS account. In the new account, use the SageMaker Model Registry as the central catalog. Attach a cross-account resource policy to each model group in the initial AWS accounts.
D. Use an AWS Glue Data Catalog to store the models. Run an AWS Glue crawler to migrate the models from the ECR repositories to the Data Catalog. Configure cross-account access to the Data Catalog.

A

C

The correct answer is C because SageMaker Model Registry is designed as a central repository for managing and tracking machine learning models, including those hosted in ECR. Creating a new AWS account for the central catalog improves security and organization. Cross-account resource policies allow controlled access to the models from the original accounts.

Option A is incorrect because ECR is a container registry, not a catalog designed for managing model metadata and lineage. Simple replication doesn’t provide the centralized management features needed.

Option B is incorrect because it still relies on ECR as the central catalog, which lacks the model management capabilities of SageMaker Model Registry.

Option D is incorrect because AWS Glue Data Catalog is for managing data assets, not specifically ML models. While it could potentially store metadata about the models, it’s not the ideal solution for managing the models themselves and their lifecycle. Moreover, migrating the models themselves to the Glue Data Catalog is not straightforward or practical.

7
Q

A company is building a web-based AI application using Amazon SageMaker. The application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. Training data is stored in Amazon S3, and the application requires secure and isolated use of this data throughout the ML lifecycle. The company must use the central model registry to manage different versions of models. Which action will meet this requirement with the LEAST operational overhead?

A. Create a separate Amazon Elastic Container Registry (Amazon ECR) repository for each model.
B. Use Amazon Elastic Container Registry (Amazon ECR) and unique tags for each model version.
C. Use the SageMaker Model Registry and model groups to catalog the models.
D. Use the SageMaker Model Registry and unique tags for each model version.

A

C

The best answer is C because it leverages the built-in features of SageMaker, specifically designed for managing ML models and their versions. Using SageMaker Model Registry and model groups minimizes operational overhead compared to managing models and versions externally using Amazon ECR. Option A requires creating and managing multiple ECR repositories, increasing overhead. Option B adds complexity by managing tags within ECR. Option D, while using the SageMaker Model Registry, lacks the organizational structure provided by model groups, potentially leading to less efficient management of model versions in the long run.
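
A minimal boto3 sketch of the model group approach, assuming hypothetical names, image URI, and artifact path: each new version is registered as a model package inside its group.

```python
import boto3

sm = boto3.client("sagemaker")

# One model group holds every version of a given model (name is hypothetical)
sm.create_model_package_group(
    ModelPackageGroupName="churn-classifier",
    ModelPackageGroupDescription="All versions of the churn classifier",
)

# Register a new version into the group; image and artifact locations are hypothetical
sm.create_model_package(
    ModelPackageGroupName="churn-classifier",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/churn:latest",
                "ModelDataUrl": "s3://example-bucket/models/churn/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)
```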

8
Q

A company uses AWS Glue jobs orchestrated by an AWS Glue workflow for data processing. These jobs can run on a schedule or be launched manually. They are integrating these jobs into Amazon SageMaker Pipelines for ML model development, where the Glue job outputs are needed during the data processing phase. Which solution integrates the AWS Glue jobs with the SageMaker pipelines while minimizing operational overhead?

A. Use AWS Step Functions to orchestrate the pipelines and the AWS Glue jobs.
B. Use processing steps in SageMaker Pipelines. Configure inputs that point to the Amazon Resource Names (ARNs) of the AWS Glue jobs.
C. Use Callback steps in SageMaker Pipelines to start the AWS Glue workflow and to stop the pipelines until the AWS Glue jobs finish running.
D. Use Amazon EventBridge to invoke the pipelines and the AWS Glue jobs in the desired order.

A

C

The correct answer is C because it directly addresses the need to wait for Glue jobs to complete before proceeding in the SageMaker pipeline, minimizing operational overhead by keeping the integration within the SageMaker pipeline framework. Option A introduces an additional orchestration layer (Step Functions), increasing complexity. Option B doesn’t guarantee that the Glue jobs finish before the pipeline proceeds, potentially leading to errors. Option D, while possible, requires more complex setup and monitoring compared to using callback steps within SageMaker Pipelines.
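
A sketch of the Callback step using the SageMaker Python SDK; the SQS queue URL, Glue workflow name, and output name are hypothetical. A worker (for example, a Lambda function subscribed to the queue) starts the Glue workflow and reports success or failure back to the pipeline.

```python
from sagemaker.workflow.callback_step import (
    CallbackOutput,
    CallbackOutputTypeEnum,
    CallbackStep,
)

# Output that downstream pipeline steps can consume once the Glue jobs finish
glue_output_uri = CallbackOutput(
    output_name="processed_data_s3_uri",
    output_type=CallbackOutputTypeEnum.String,
)

# The step publishes a token to SQS and pauses the pipeline; the worker calls
# SendPipelineExecutionStepSuccess/Failure when the AWS Glue workflow completes.
run_glue_workflow = CallbackStep(
    name="RunGlueWorkflow",
    sqs_queue_url="https://sqs.us-east-1.amazonaws.com/111122223333/glue-callback-queue",
    inputs={"glue_workflow_name": "daily-feature-engineering"},  # hypothetical workflow
    outputs=[glue_output_uri],
)
```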

9
Q

A company is building an AI application on Amazon SageMaker that involves frequent consecutive training jobs using data stored in Amazon S3. The application requires secure and isolated data usage throughout the ML lifecycle. Which approach will MINIMIZE infrastructure startup times for these consecutive training jobs?

A. Use Managed Spot Training.
B. Use SageMaker managed warm pools.
C. Use SageMaker Training Compiler.
D. Use the SageMaker distributed data parallelism (SMDDP) library.

A

B

The correct answer is B because SageMaker managed warm pools keep instances ready between training jobs, eliminating the time needed for provisioning new infrastructure each time. Option A (Managed Spot Training) reduces cost, not startup time. Option C (SageMaker Training Compiler) optimizes code, not infrastructure. Option D (SMDDP) parallelizes training across instances, which improves training speed but doesn’t reduce startup time.
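
A sketch of how a warm pool is requested with the SageMaker Python SDK; the image, role, instance type, and S3 paths are hypothetical.

```python
from sagemaker.estimator import Estimator

# keep_alive_period_in_seconds keeps the provisioned instances warm after the job,
# so the next consecutive training job skips infrastructure startup.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/training:latest",
    role="arn:aws:iam::111122223333:role/SageMakerTrainingRole",
    instance_count=1,
    instance_type="ml.g5.xlarge",
    output_path="s3://example-bucket/model-artifacts/",
    keep_alive_period_in_seconds=1800,  # SageMaker managed warm pool (30 minutes)
)

estimator.fit({"train": "s3://example-bucket/training-data/"})
```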

10
Q

A company is building a web-based AI application using Amazon SageMaker. This application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. The training data is stored in Amazon S3, and the application requires a manual approval-based workflow to ensure only approved models are deployed to production endpoints. Which solution best meets this requirement?

A. Use SageMaker Experiments to facilitate the approval process during model registration.
B. Use SageMaker ML Lineage Tracking on the central model registry. Create tracking entities for the approval process.
C. Use SageMaker Model Monitor to evaluate the performance of the model and to manage the approval.
D. Use SageMaker Pipelines. When a model version is registered, use the AWS SDK to change the approval status to “Approved.”

A

D

The correct answer is D because SageMaker Pipelines orchestrates ML workflows and supports manual approval gates through the model registry. A model version is registered with a pending approval status, its performance is reviewed, and only after manual approval is its status changed to “Approved” via the AWS SDK, which allows deployment to the production endpoint.

Option A is incorrect because SageMaker Experiments is for tracking and organizing experiments, not managing model approvals. Option B is incorrect because SageMaker ML Lineage Tracking tracks model lineage but doesn’t provide an approval mechanism. Option C is incorrect because SageMaker Model Monitor focuses on model performance monitoring, not approval workflows.
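
A minimal boto3 sketch of the approval call described above, run after a human reviewer signs off; the model package ARN is hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

# Flip the registered model version to Approved so the deployment step can proceed
sm.update_model_package(
    ModelPackageArn=(
        "arn:aws:sagemaker:us-east-1:111122223333:model-package/churn-classifier/3"
    ),
    ModelApprovalStatus="Approved",
    ApprovalDescription="Evaluation metrics reviewed and approved for production",
)
```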

11
Q

A company is building a web-based AI application using Amazon SageMaker. This application will include ML experimentation, training, a central model registry, model deployment, and model monitoring. Training data, stored in Amazon S3, must be used securely and in isolation throughout the ML lifecycle. The company needs an on-demand workflow to monitor bias drift for models deployed to real-time endpoints from the application. Which action will meet this requirement?

A. Configure the application to invoke an AWS Lambda function that runs a SageMaker Clarify job.
B. Invoke an AWS Lambda function to pull the sagemaker-model-monitor-analyzer built-in SageMaker image.
C. Use AWS Glue Data Quality to monitor bias.
D. Use SageMaker notebooks to compare the bias.

A

A

A is correct because SageMaker Clarify is specifically designed for bias detection and monitoring. Integrating it with an AWS Lambda function allows for on-demand execution, triggering the bias analysis whenever needed by the application.

B is incorrect because the sagemaker-model-monitor-analyzer image handles general model monitoring tasks but not specifically bias detection.

C is incorrect because AWS Glue Data Quality focuses on data quality checks, not bias analysis.

D is incorrect because SageMaker notebooks are for interactive development and experimentation, not for implementing production-ready, on-demand monitoring workflows.

12
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles stored in Amazon S3, and tables from an on-premises MySQL database. The dataset has a class imbalance and features with interdependencies, hindering the algorithm’s ability to capture all underlying patterns. After data aggregation, the engineer needs a solution to automatically detect anomalies and visualize the results. Which solution best meets these requirements?

A. Use Amazon Athena to automatically detect the anomalies and to visualize the result.
B. Use Amazon Redshift Spectrum to automatically detect the anomalies. Use Amazon QuickSight to visualize the result.
C. Use Amazon SageMaker Data Wrangler to automatically detect the anomalies and to visualize the result.
D. Use AWS Batch to automatically detect the anomalies. Use Amazon QuickSight to visualize the result.

A

C

The correct answer is C because Amazon SageMaker Data Wrangler provides tools for data quality analysis, including anomaly detection, and offers visualization capabilities. Option A is incorrect because Athena is primarily a query service and doesn’t inherently offer anomaly detection. Option B is incorrect because while Redshift Spectrum can handle the data and QuickSight can visualize, neither individually offers automatic anomaly detection. Option D is incorrect because AWS Batch is a batch processing service; it doesn’t provide anomaly detection or visualization features directly. SageMaker Data Wrangler best fits the requirement of automatically detecting anomalies and visualizing the results within a single, integrated platform.

13
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset, containing transaction logs, customer profiles (stored in Amazon S3), and tables from an on-premises MySQL database, exhibits class imbalance and feature interdependencies, hindering the algorithm’s pattern recognition. The dataset includes both categorical and numerical data. To maximize model accuracy with the LEAST operational overhead, which action should the ML engineer take?

A. Use AWS Glue to transform the categorical data into numerical data.
B. Use AWS Glue to transform the numerical data into categorical data.
C. Use Amazon SageMaker Data Wrangler to transform the categorical data into numerical data.
D. Use Amazon SageMaker Data Wrangler to transform the numerical data into categorical data.

A

C

Data Wrangler provides built-in transformations for encoding categorical data into numerical representations (such as one-hot encoding or ordinal encoding), making it more user-friendly and efficient than using AWS Glue for this task. Transforming numerical data into categorical data is unnecessary and would likely reduce model accuracy. AWS Glue can handle data transformations, but lacks the user-friendly interface and built-in categorical encoding capabilities of SageMaker Data Wrangler, resulting in higher operational overhead. Therefore, option C offers the best balance of effectiveness and minimal operational overhead.
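
Data Wrangler's encode-categorical transform is configured in the UI, but its effect matches standard one-hot encoding; a small pandas illustration with a hypothetical color column:

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encode: each category becomes its own 0/1 column, which is the kind of
# numerical representation a model can consume
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           1            0          0
```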

14
Q

An ML engineer is developing a fraud detection model on AWS. The training dataset, including transaction logs and customer profiles from Amazon S3 and tables from an on-premises MySQL database, suffers from class imbalance affecting the model’s learning. Which solution requires the LEAST operational effort to address this imbalanced data before model training?

A. Use Amazon Athena to identify patterns that contribute to the imbalance. Adjust the dataset accordingly.
B. Use Amazon SageMaker Studio Classic built-in algorithms to process the imbalanced dataset.
C. Use AWS Glue DataBrew built-in features to oversample the minority class.
D. Use the Amazon SageMaker Data Wrangler balance data operation to oversample the minority class.

A

D

The correct answer is D because Amazon SageMaker Data Wrangler provides a built-in “balance data” operation specifically designed to handle class imbalance through techniques like oversampling and undersampling. This offers a low-code/no-code solution requiring minimal operational effort compared to other options.

Option A requires manual dataset adjustment after identifying patterns with Athena, increasing operational effort. Option B is less efficient as it uses algorithms within SageMaker Studio, whereas Data Wrangler is more directly focused on data preprocessing for imbalanced datasets. Option C is less efficient than D because DataBrew does not have a built-in recipe for balancing datasets, requiring more custom work compared to Data Wrangler’s direct functionality.
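
For intuition, random oversampling of the minority class, one of the techniques the Data Wrangler balance-data operation applies; the label column and data here are hypothetical.

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Resample every class (with replacement) up to the majority-class size."""
    counts = df[label_col].value_counts()
    majority_size = counts.max()
    parts = [
        df[df[label_col] == label].sample(majority_size, replace=True, random_state=42)
        for label in counts.index
    ]
    return pd.concat(parts).sample(frac=1, random_state=42)  # shuffle rows

# Heavily imbalanced hypothetical fraud labels: 5 positives, 95 negatives
data = pd.DataFrame({"amount": range(100), "is_fraud": [1] * 5 + [0] * 95})
balanced = random_oversample(data, "is_fraud")
print(balanced["is_fraud"].value_counts())  # both classes now have 95 rows
```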

15
Q

A company has deployed an XGBoost prediction model in production to predict if a customer is likely to cancel a subscription. The company uses Amazon SageMaker Model Monitor to detect deviations in the F1 score. During a baseline analysis of model quality, the company recorded a threshold for the F1 score. After several months of no change, the model’s F1 score decreases significantly. What could be the reason for the reduced F1 score?
A. Concept drift occurred in the underlying customer data that was used for predictions.
B. The model was not sufficiently complex to capture all the patterns in the original baseline data.
C. The original baseline data had a data quality issue of missing values.
D. Incorrect ground truth labels were provided to Model Monitor during the calculation of the baseline.

A

A
Concept drift is the correct answer because it explains a decrease in F1 score after a period of stability. The statistical properties of the data used to train the model have changed over time, leading to the model’s reduced performance. Options B and C would have resulted in a consistently low F1 score from the beginning, not a sudden drop after months of acceptable performance. Option D would have affected the baseline F1 score itself, not caused a significant drop after the initial baseline was established.

16
Q

A company has a team of data scientists who use Amazon SageMaker notebook instances to test ML models. When the data scientists need new permissions, the company attaches the permissions to each individual role that was created during the creation of the SageMaker notebook instance. The company needs to centralize management of the team’s permissions. Which solution will meet this requirement?

A. Create a single IAM role that has the necessary permissions. Attach the role to each notebook instance that the team uses.
B. Create a single IAM group. Add the data scientists to the group. Associate the group with each notebook instance that the team uses.
C. Create a single IAM user. Attach the AdministratorAccess AWS managed IAM policy to the user. Configure each notebook instance to use the IAM user.
D. Create a single IAM group. Add the data scientists to the group. Create an IAM role. Attach the AdministratorAccess AWS managed IAM policy to the role. Associate the role with the group. Associate the group with each notebook instance that the team uses.

A

A

A is correct because it leverages the recommended approach of using IAM roles for AWS services like SageMaker. Centralizing permissions in a single IAM role simplifies management; updates to the role automatically propagate to all associated notebook instances.

B is incorrect because you cannot directly associate an IAM group with a SageMaker notebook instance.

C is incorrect because using the AdministratorAccess policy violates the principle of least privilege and it’s not possible to directly associate an IAM user with a notebook instance.

D is incorrect for several reasons: it uses the overly permissive AdministratorAccess policy, IAM roles cannot be attached to IAM groups, and, again, an IAM group cannot be associated directly with a notebook instance.
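
A sketch of the shared-role approach with boto3; the role ARN and notebook instance names are hypothetical. Permission changes are then made once, on the role.

```python
import boto3

sm = boto3.client("sagemaker")

# One centrally managed execution role shared by every notebook instance
shared_role_arn = "arn:aws:iam::111122223333:role/DataScienceNotebookRole"

for name in ["ds-notebook-alice", "ds-notebook-bob"]:
    sm.create_notebook_instance(
        NotebookInstanceName=name,
        InstanceType="ml.t3.medium",
        RoleArn=shared_role_arn,  # updates to this role apply to all instances
    )
```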

17
Q

An ML engineer needs to use an ML model to predict the price of apartments in a specific location. Which metric should the ML engineer use to evaluate the model’s performance?
A. Accuracy
B. Area Under the ROC Curve (AUC)
C. F1 score
D. Mean absolute error (MAE)

A

D. Mean absolute error (MAE)

The correct answer is D because predicting apartment prices is a regression problem, not a classification problem. MAE is a suitable metric for evaluating the performance of regression models. Accuracy, AUC-ROC, and F1 score are all metrics used for classification problems, where the model predicts a categorical outcome (e.g., “high price,” “medium price,” “low price”). Since the model is predicting a continuous value (price), these metrics are inappropriate.
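
MAE averages the absolute difference between predicted and actual prices; a quick check with scikit-learn and hypothetical values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted apartment prices
actual = [250_000, 310_000, 185_000, 420_000]
predicted = [240_000, 330_000, 190_000, 400_000]

# MAE = (|10,000| + |20,000| + |5,000| + |20,000|) / 4 = 13,750
print(mean_absolute_error(actual, predicted))  # 13750.0
```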

18
Q

An ML engineer has trained a neural network using stochastic gradient descent (SGD). The neural network performs poorly on the test set. The training loss and validation loss values remain high and show an oscillating pattern; they decrease for a few epochs and then increase for a few epochs before repeating this cycle. What should the ML engineer do to improve the training process?

A. Introduce early stopping.
B. Increase the size of the test set.
C. Increase the learning rate.
D. Decrease the learning rate.

A

D

The oscillating pattern of the training and validation loss indicates that the learning rate is too high. A high learning rate causes the model to overshoot the optimal point in the loss landscape, leading to oscillations instead of convergence. Decreasing the learning rate allows the model to make smaller, more precise updates to the weights, leading to improved convergence and potentially better performance on the test set.

Option A (Introduce early stopping) is not the primary solution here. While early stopping can prevent overfitting, the main issue is the unstable training process caused by the high learning rate.

Option B (Increase the size of the test set) would not directly address the issue of the oscillating loss and unstable training process. A larger test set would only improve the accuracy of the test set evaluation, but not the model’s training.

Option C (Increase the learning rate) would exacerbate the problem, leading to even more significant oscillations and preventing convergence.
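
A minimal PyTorch sketch of the fix, with a hypothetical stand-in model and batch; only the lr value changes relative to the unstable run.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)          # stand-in for the real network
loss_fn = nn.MSELoss()

# Oscillating loss suggests SGD is overshooting; a smaller learning rate takes
# finer steps toward the minimum (reduced from a hypothetical 0.5).
optimizer = optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)           # hypothetical training batch
y = torch.randn(32, 1)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```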

19
Q

An ML engineer needs to process thousands of existing CSV objects and new CSV objects that are uploaded. The CSV objects are stored in a central Amazon S3 bucket and have the same number of columns. One of the columns is a transaction date. The ML engineer must query the data based on the transaction date. Which solution will meet these requirements with the LEAST operational overhead?

A. Use an Amazon Athena CREATE TABLE AS SELECT (CTAS) statement to create a table based on the transaction date from data in the central S3 bucket. Query the objects from the table.
B. Create a new S3 bucket for processed data. Set up S3 replication from the central S3 bucket to the new S3 bucket. Use S3 Object Lambda to query the objects based on transaction date.
C. Create a new S3 bucket for processed data. Use AWS Glue for Apache Spark to create a job to query the CSV objects based on transaction date. Configure the job to store the results in the new S3 bucket. Query the objects from the new S3 bucket.
D. Create a new S3 bucket for processed data. Use Amazon Data Firehose to transfer the data from the central S3 bucket to the new S3 bucket. Configure Firehose to run an AWS Lambda function to query the data based on transaction date.

A

A

A is correct because Athena allows querying data directly from S3 using SQL, minimizing operational overhead. CTAS creates a table based on the filtered data (by transaction date), making subsequent queries efficient.

B is incorrect because S3 Object Lambda is designed for data transformation, not efficient querying. Adding replication increases complexity unnecessarily.

C is incorrect because while SparkSQL can query S3 data, it involves more setup and operational overhead than Athena, and creating a new S3 bucket is unnecessary.

D is incorrect because Firehose cannot directly consume from S3 for querying purposes. Using Lambda for querying adds significant complexity.
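
A sketch of the CTAS approach via boto3, with hypothetical database, table, and bucket names. Partitioning by the transaction date lets Athena prune data when querying by date; note that the partition column must appear last in the SELECT list.

```python
import boto3

athena = boto3.client("athena")

ctas_query = """
CREATE TABLE sales.transactions_by_date
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/processed/transactions/',
    partitioned_by = ARRAY['transaction_date']
) AS
SELECT *  -- assumes transaction_date is the last column; otherwise list columns explicitly
FROM sales.raw_transactions_csv
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```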

20
Q

A company has a large, unstructured dataset containing many duplicate records across several key attributes. Which AWS solution requires the LEAST amount of code development to detect these duplicates?

A. Use Amazon Mechanical Turk jobs to detect duplicates.
B. Use Amazon QuickSight ML Insights to build a custom deduplication model.
C. Use Amazon SageMaker Data Wrangler to pre-process and detect duplicates.
D. Use the AWS Glue FindMatches transform to detect duplicates.

A

D

The correct answer is D because AWS Glue FindMatches is specifically designed to identify duplicate or matching records in datasets with minimal code development. It uses machine learning to find fuzzy matches and allows customization without requiring the creation of a complex custom deduplication model.

Option A (Amazon Mechanical Turk) would require significant effort to define tasks, manage workers, and review results, making it far less efficient. Option B (Amazon QuickSight ML Insights) necessitates building a custom model, which requires substantial coding. Option C (Amazon SageMaker Data Wrangler) focuses on data preparation and transformation, not directly on duplicate detection. While it might be used as part of a deduplication workflow, it’s not the primary solution for the task.

21
Q

A company needs to run a batch data-processing job on Amazon EC2 instances. The job will run during the weekend and will take 90 minutes to finish running. The processing can handle interruptions. The company will run the job every weekend for the next 6 months. Which EC2 instance purchasing option will meet these requirements MOST cost-effectively?
A. Spot Instances
B. Reserved Instances
C. On-Demand Instances
D. Dedicated Instances

A

A. Spot Instances

Spot Instances are the most cost-effective option because they provide spare EC2 capacity at a significantly reduced price compared to On-Demand Instances. The fact that the job can handle interruptions is crucial; Spot Instances can be interrupted with short notice if AWS needs the capacity for other tasks. Since the job only runs for 90 minutes on weekends, the risk of interruption is manageable, and the cost savings outweigh the potential inconvenience.

Reserved Instances are more cost-effective for long-running, consistent workloads. Their upfront cost or commitment is not suitable for a job that runs only for 90 minutes each weekend.

On-Demand Instances offer flexibility but are the most expensive option and not cost-effective for this scenario.

Dedicated Instances provide dedicated physical hardware, which is unnecessary and more expensive than Spot Instances for this batch processing job.

22
Q

An ML engineer has an Amazon Comprehend custom model in Account A in the us-east-1 Region. The ML engineer needs to copy the model to Account B in the same Region. Which solution will meet this requirement with the LEAST development effort?
A. Use Amazon S3 to make a copy of the model. Transfer the copy to Account B.
B. Create a resource-based IAM policy. Use the Amazon Comprehend ImportModel API operation to copy the model to Account B.
C. Use AWS DataSync to replicate the model from Account A to Account B.
D. Create an AWS Site-to-Site VPN connection between Account A and Account B to transfer the model.

A

B

Amazon Comprehend supports copying a custom model to another AWS account in the same Region: the source account attaches a resource-based IAM policy to the model to authorize the target account, and the target account calls the ImportModel API operation with the model's ARN. This requires no data transfer, replication, or network setup, so options A, C, and D involve unnecessary effort.

23
Q

An ML engineer is training a simple neural network model. The ML engineer tracks the performance of the model over time on a validation dataset. The model’s performance improves substantially at first and then degrades after a specific number of epochs. Which solutions will mitigate this problem? (Choose two.)
A. Enable early stopping on the model.
B. Increase dropout in the layers.
C. Increase the number of layers.
D. Increase the number of neurons.
E. Investigate and reduce the sources of model bias.

A

A, B

The problem described is overfitting: the model performs well on the training data but poorly on unseen validation data, indicating it has learned the training data too well, including noise. Options A and B directly address overfitting:

A. Enable early stopping: This prevents the model from training past the point where its performance on the validation set begins to degrade. It stops training at the point of best validation performance, thus mitigating overfitting.

B. Increase dropout: Dropout randomly deactivates neurons during training, forcing the network to learn more robust features and preventing it from relying too heavily on any single neuron or set of neurons, reducing overfitting.

Options C and D would likely worsen the overfitting. Increasing the number of layers or neurons increases the model’s capacity, making it more prone to overfitting. Option E, while important for model quality in general, doesn’t directly address the observed overfitting problem of degrading validation performance after a certain number of epochs.
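
A Keras sketch combining both mitigations, using a hypothetical dataset shape and random data for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical tabular data: 1,000 rows, 20 features, binary label
x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                      # B: dropout regularization
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A: stop when validation loss stops improving and keep the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```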

24
Q

A company has a Retrieval Augmented Generation (RAG) application that uses a vector database to store embeddings of documents. The company must migrate the application to AWS and must implement a solution that provides semantic search of text files. The company has already migrated the text repository to an Amazon S3 bucket. Which solution will meet these requirements?

A. Use an AWS Batch job to process the files and generate embeddings. Use AWS Glue to store the embeddings. Use SQL queries to perform the semantic searches.
B. Use a custom Amazon SageMaker notebook to run a custom script to generate embeddings. Use SageMaker Feature Store to store the embeddings. Use SQL queries to perform the semantic searches.
C. Use the Amazon Kendra S3 connector to ingest the documents from the S3 bucket into Amazon Kendra. Query Amazon Kendra to perform the semantic searches.
D. Use an Amazon Textract asynchronous job to ingest the documents from the S3 bucket. Query Amazon Textract to perform the semantic searches.

A

C

Amazon Kendra is a service specifically designed for semantic search. Options A and B would require custom development to implement semantic search capabilities, making them less efficient and more complex than using a purpose-built service like Kendra. Option D, using Amazon Textract, is incorrect because Textract is primarily for extracting text and data from documents, not for performing semantic searches. Therefore, only option C directly addresses the requirement for semantic search using an existing AWS service already integrated with S3.
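
Once the S3 connector has ingested the documents, a semantic query is a single API call; the index ID and question are hypothetical.

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="12345678-1234-1234-1234-123456789012",  # hypothetical index
    QueryText="What is our refund policy for damaged items?",
)

# Print the type and title of each returned result
for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))
```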

25
A company uses Amazon Athena to query a dataset in Amazon S3. This dataset contains a target variable the company wants to predict. They need to determine if a model can predict this target variable with the least development effort. Which solution best achieves this? A. Create a new model by using Amazon SageMaker Autopilot. Report the model's achieved performance. B. Implement custom scripts to perform data pre-processing, multiple linear regression, and performance evaluation. Run the scripts on Amazon EC2 instances. C. Configure Amazon Macie to analyze the dataset and to create a model. Report the model's achieved performance. D. Select a model from Amazon Bedrock. Tune the model with the data. Report the model's achieved performance.
A A is correct because Amazon SageMaker Autopilot automates the process of building, training, and tuning machine learning models, requiring minimal development effort compared to the other options. B is incorrect because it requires significant development effort to write, test, and deploy custom scripts for data preprocessing, model training, and evaluation on EC2 instances. C is incorrect because Amazon Macie is a data security and privacy service; it is not designed for building predictive models. D is incorrect because Amazon Bedrock focuses on foundation models for tasks like text generation, not structured data prediction tasks, requiring significant effort to adapt it for this purpose.
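
A sketch of launching the Autopilot job with boto3; the job name, S3 paths, target column, and role are hypothetical. Autopilot handles preprocessing, algorithm selection, and tuning, then reports the best candidate's metrics.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="target-prediction-autopilot",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/dataset/",
                }
            },
            "TargetAttributeName": "target",  # the column the company wants to predict
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/autopilot-output/"},
    RoleArn="arn:aws:iam::111122223333:role/SageMakerAutopilotRole",
)
```
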
26
A company wants to predict the success of advertising campaigns by considering the color scheme of each advertisement. An ML engineer is preparing data for a neural network model. The dataset includes color information as categorical data. Which technique for feature engineering should the ML engineer use for the model? A. Apply label encoding to the color categories. Automatically assign each color a unique integer. B. Implement padding to ensure that all color feature vectors have the same length. C. Perform dimensionality reduction on the color categories. D. One-hot encode the color categories to transform the color scheme feature into a binary matrix.
D One-hot encoding represents each color as its own binary feature, which lets the neural network treat the categories as unordered. Label encoding (A) would impose an artificial ordinal relationship between colors, while padding (B) and dimensionality reduction (C) do not address how to represent categorical data for the model.
27
A company uses a hybrid cloud environment. A model deployed on-premises uses data in Amazon S3 to provide customers with a live conversational engine. The model uses sensitive data. An ML engineer needs to implement a solution to identify and remove this sensitive data with the LEAST operational overhead. Which solution best meets these requirements? A. Deploy the model on Amazon SageMaker. Create a set of AWS Lambda functions to identify and remove the sensitive data. B. Deploy the model on an Amazon Elastic Container Service (Amazon ECS) cluster that uses AWS Fargate. Create an AWS Batch job to identify and remove the sensitive data. C. Use Amazon Macie to identify the sensitive data. Create a set of AWS Lambda functions to remove the sensitive data. D. Use Amazon Comprehend to identify the sensitive data. Launch Amazon EC2 instances to remove the sensitive data.
C The best solution is C because it leverages managed services to minimize operational overhead. Amazon Macie is specifically designed for identifying sensitive data in S3, automating the identification process. Using Lambda functions to remove the data keeps the solution serverless, further reducing operational overhead compared to managing EC2 instances (D) or an ECS cluster (B). While option A involves migrating the model to SageMaker, this adds significant operational overhead compared to using existing on-premises infrastructure and utilizing a managed service like Macie. Option D also adds significant operational overhead by requiring the management of EC2 instances.
28
An ML engineer needs to create data ingestion pipelines and ML model deployment pipelines on AWS. All the raw data is stored in Amazon S3 buckets. Which solution will meet these requirements? A. Use Amazon Data Firehose to create the data ingestion pipelines. Use Amazon SageMaker Studio Classic to create the model deployment pipelines. B. Use AWS Glue to create the data ingestion pipelines. Use Amazon SageMaker Studio Classic to create the model deployment pipelines. C. Use Amazon Redshift ML to create the data ingestion pipelines. Use Amazon SageMaker Studio Classic to create the model deployment pipelines. D. Use Amazon Athena to create the data ingestion pipelines. Use an Amazon SageMaker notebook to create the model deployment pipelines.
B AWS Glue is the most appropriate service for creating data ingestion pipelines from Amazon S3. It's designed for batch processing and ETL (Extract, Transform, Load) jobs, making it suitable for handling raw data in S3 buckets. Amazon SageMaker Studio Classic is a well-suited environment for building and deploying ML models. Option A is incorrect because Amazon Kinesis Data Firehose is optimized for real-time data streaming, not batch processing from S3. Option C is incorrect because Amazon Redshift ML is primarily a database service for running machine learning models, not for data ingestion. Option D is incorrect because while Amazon Athena can query data in S3, it is not designed for building data ingestion pipelines, and using a SageMaker notebook for the deployment pipeline is less efficient and organized than SageMaker Studio Classic.
29
A company runs an Amazon SageMaker domain in a public subnet of a newly created VPC. The network is configured properly, and ML engineers can access the SageMaker domain. Recently, the company discovered suspicious traffic to the domain from a specific IP address. The company needs to block traffic from the specific IP address. Which update to the network configuration will meet this requirement? A. Create a security group inbound rule to deny traffic from the specific IP address. Assign the security group to the domain. B. Create a network ACL inbound rule to deny traffic from the specific IP address. Assign the rule to the default network ACL for the subnet where the domain is located. C. Create a shadow variant for the domain. Configure SageMaker Inference Recommender to send traffic from the specific IP address to the shadow endpoint. D. Create a VPC route table to deny inbound traffic from the specific IP address. Assign the route table to the domain.
B Security groups operate at the instance level and support only allow rules; they cannot explicitly deny traffic, so option A is incorrect. Network ACLs (NACLs) operate at the subnet level and support both allow and deny rules, so they can explicitly deny traffic from a specific IP address, making option B the correct choice. Option C involves creating a shadow variant and using SageMaker Inference Recommender, which addresses traffic routing for model testing, not access control, and is not relevant to blocking an IP address. VPC route tables control how traffic is routed between subnets and to the internet; they cannot block inbound traffic from a specific IP address, so option D is incorrect.
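
A sketch of the deny rule with boto3; the network ACL ID, rule number, and IP address are hypothetical. Lower rule numbers are evaluated first, so the deny takes effect before the default allow rules.

```python
import boto3

ec2 = boto3.client("ec2")

ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # ACL of the subnet hosting the domain
    RuleNumber=50,
    Protocol="-1",                         # all protocols
    RuleAction="deny",
    Egress=False,                          # inbound rule
    CidrBlock="203.0.113.25/32",           # the suspicious IP address
)
```
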
30
A company is gathering audio, video, and text data in various languages. The company needs to use a large language model (LLM) to summarize the gathered data that is in Spanish. Which solution will meet these requirements in the LEAST amount of time? A. Train and deploy a model in Amazon SageMaker to convert the data into English text. Train and deploy an LLM in SageMaker to summarize the text. B. Use Amazon Transcribe and Amazon Translate to convert the data into English text. Use Amazon Bedrock with the Jurassic model to summarize the text. C. Use Amazon Rekognition and Amazon Translate to convert the data into English text. Use Amazon Bedrock with the Anthropic Claude model to summarize the text. D. Use Amazon Comprehend and Amazon Translate to convert the data into English text. Use Amazon Bedrock with the Stable Diffusion model to summarize the text.
B The best answer is B because it leverages pre-trained services for transcription, translation, and summarization. Option A requires training and deploying new models, which is significantly more time-consuming than using pre-trained services. Option C is incorrect because Amazon Rekognition is an image and video analysis service and cannot transcribe audio into text. Option D is incorrect because Stable Diffusion is an image generation model, not suitable for text summarization. Option B uses Amazon Transcribe (audio to text), Amazon Translate (Spanish to English), and Amazon Bedrock with the Jurassic model (a text-generation LLM that can summarize), all pre-trained and readily available, making it the fastest solution.
31
A financial company receives a high volume of real-time market data streams from an external provider. The streams consist of thousands of JSON records every second. The company needs to implement a scalable solution on AWS to identify anomalous data points. Which solution will meet these requirements with the LEAST operational overhead? A. Ingest real-time data into Amazon Kinesis data streams. Use the built-in RANDOM_CUT_FOREST function in Amazon Managed Service for Apache Flink to process the data streams and to detect data anomalies. B. Ingest real-time data into Amazon Kinesis data streams. Deploy an Amazon SageMaker endpoint for real-time outlier detection. Create an AWS Lambda function to detect anomalies. Use the data streams to invoke the Lambda function. C. Ingest real-time data into Apache Kafka on Amazon EC2 instances. Deploy an Amazon SageMaker endpoint for real-time outlier detection. Create an AWS Lambda function to detect anomalies. Use the data streams to invoke the Lambda function. D. Send real-time data to an Amazon Simple Queue Service (Amazon SQS) FIFO queue. Create an AWS Lambda function to consume the queue messages. Program the Lambda function to start an AWS Glue extract, transform, and load (ETL) job for batch processing and anomaly detection.
A The best answer is A because it leverages fully managed AWS services designed for real-time processing and anomaly detection. Amazon Kinesis Data Streams is well-suited for handling high-volume data streams, and Amazon Managed Service for Apache Flink (with its built-in RANDOM_CUT_FOREST function) provides a scalable and managed solution for anomaly detection, minimizing operational overhead. Option B and C are less optimal because they require managing additional services (SageMaker endpoint, Lambda function) leading to increased operational complexity. Option D is incorrect because it uses a batch processing approach (AWS Glue) which is unsuitable for real-time anomaly detection. While the RANDOM_CUT_FOREST function might not be available directly as described, the overall approach of using Kinesis and a managed service for processing remains the most efficient for minimal operational overhead.
32
A company has a large collection of chat recordings from customer interactions after a product release. An ML engineer needs to create an ML model to analyze the chat data and determine the success of the product by reviewing customer sentiments about the product. Which action should the ML engineer take to complete the evaluation in the LEAST amount of time? A. Use Amazon Rekognition to analyze sentiments of the chat conversations. B. Train a Naive Bayes classifier to analyze sentiments of the chat conversations. C. Use Amazon Comprehend to analyze sentiments of the chat conversations. D. Use random forests to classify sentiments of the chat conversations.
C Amazon Comprehend is the correct answer because it's a pre-built service specifically designed for natural language processing (NLP) tasks, including sentiment analysis. This means it requires minimal setup and training compared to building and training a model from scratch (options B and D). Option A, Amazon Rekognition, is designed for image and video analysis, not text, making it unsuitable for this task. Therefore, using Amazon Comprehend offers the fastest solution for analyzing the large volume of chat data.
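
A sketch of the sentiment call with boto3; the chat text is hypothetical and no model training is required.

```python
import boto3

comprehend = boto3.client("comprehend")

chat_message = "The new release is fantastic, setup took two minutes!"

result = comprehend.detect_sentiment(Text=chat_message, LanguageCode="en")
print(result["Sentiment"])        # e.g. POSITIVE
print(result["SentimentScore"])   # confidence scores per sentiment class
```
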
33
A company has a conversational AI assistant that sends requests through Amazon Bedrock to an Anthropic Claude large language model (LLM). Users report that when they ask similar questions multiple times, they sometimes receive different answers. An ML engineer needs to improve the responses to be more consistent and less random. Which solution will meet these requirements? A. Increase the temperature parameter and the top_k parameter. B. Increase the temperature parameter. Decrease the top_k parameter. C. Decrease the temperature parameter. Increase the top_k parameter. D. Decrease the temperature parameter and the top_k parameter.
D The correct answer is D because decreasing both the temperature and top_k parameters will make the LLM's output more deterministic and less random. A lower temperature parameter leads to higher probability outputs (more focused, less creative/random responses), and a lower top_k parameter focuses the model on the most likely outputs, further reducing randomness. Options A, B, and C all involve increasing either the temperature or top_k parameter (or both), which would increase randomness and variability in the responses, thus worsening the problem.
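
A sketch of passing the lowered parameters to a Claude model through the Bedrock runtime; the model ID and parameter values are illustrative.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Lower temperature and top_k make sampling more deterministic, so repeated
# questions receive more consistent answers.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "temperature": 0.1,   # decreased
    "top_k": 10,          # decreased
    "messages": [
        {"role": "user", "content": "Summarize our return policy in two sentences."}
    ],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))
```
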
34
A company is using Amazon SageMaker's linear learner built-in algorithm with `multiclass_classifier` set for the `predictor_type` hyperparameter to predict the presence of a specific weed in a farmer's field. What should the company do to MINIMIZE false positives? A. Set the value of the weight decay hyperparameter to zero. B. Increase the number of training epochs. C. Increase the value of the `target_precision` hyperparameter. D. Change the value of the `predictor_type` hyperparameter to `regressor`.
C The correct answer is C because increasing the `target_precision` hyperparameter directly addresses the problem of minimizing false positives. Precision is the ratio of true positives to the sum of true positives and false positives. By increasing the target precision, the model is trained to prioritize correct positive predictions, thereby reducing the number of false positives. Option A is incorrect because setting weight decay to zero removes regularization, which can lead to overfitting and potentially increase false positives. Option B is incorrect because increasing the number of epochs might improve model accuracy but doesn't directly target false positives; it could even lead to overfitting and increased false positives. Option D is incorrect because changing `predictor_type` to `regressor` is inappropriate for this classification problem; a regressor predicts a continuous value, not a class label (weed present/absent).
35
A company has implemented a data ingestion pipeline for sales transactions from its ecommerce website. The company uses Amazon Data Firehose to ingest data into Amazon OpenSearch Service. The buffer interval of the Firehose stream is set for 60 seconds. An OpenSearch linear model generates real-time sales forecasts based on the data and presents the data in an OpenSearch dashboard. The company needs to optimize the data ingestion pipeline to support sub-second latency for the real-time dashboard. Which change to the architecture will meet these requirements? A. Use zero buffering in the Firehose stream. Tune the batch size that is used in the PutRecordBatch operation. B. Replace the Firehose stream with an AWS DataSync task. Configure the task with enhanced fan-out consumers. C. Increase the buffer interval of the Firehose stream from 60 seconds to 120 seconds. D. Replace the Firehose stream with an Amazon Simple Queue Service (Amazon SQS) queue.
A A is correct because using zero buffering in Firehose eliminates the 60-second delay caused by the buffer. Tuning the batch size further optimizes throughput for sub-second delivery, crucial for real-time dashboards. B is incorrect because AWS DataSync is designed for large-scale data transfers and is not optimized for the sub-second latency required for real-time dashboards. C is incorrect because increasing the buffer interval would *increase* latency, making the dashboard even slower. D is incorrect because introducing an SQS queue adds another layer of processing and queuing, increasing latency rather than reducing it. SQS is not ideal for the low-latency requirements of real-time dashboards.
36
A company has trained a machine learning (ML) model in Amazon SageMaker and needs to host it for production inferences. The model requires high availability, minimal latency, and must handle request sizes between 1 KB and 3 MB. The model will experience unpredictable request bursts throughout the day, demanding proportional scaling of inferences to match fluctuating demand. Which deployment strategy best meets these requirements? A. Create a SageMaker real-time inference endpoint. Configure auto-scaling. Configure the endpoint to present the existing model. B. Deploy the model on an Amazon Elastic Container Service (Amazon ECS) cluster. Use ECS scheduled scaling based on the CPU of the ECS cluster. C. Install SageMaker Operator on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. Deploy the model in Amazon EKS. Set horizontal pod auto-scaling to scale replicas based on the memory metric. D. Use Spot Instances with a Spot Fleet behind an Application Load Balancer (ALB) for inferences. Use the ALBRequestCountPerTarget metric for auto-scaling.
A A is correct because SageMaker real-time endpoints are specifically designed for low-latency, high-availability, and auto-scaling, making them ideal for handling unpredictable request bursts. The built-in auto-scaling feature directly addresses the need for proportional scaling to meet fluctuating demand. B is incorrect because while ECS allows for scaling, relying solely on CPU-based scheduled scaling may not be responsive enough to handle unpredictable bursts of requests. It lacks the fine-grained control and immediate responsiveness of SageMaker's auto-scaling. C is incorrect because while EKS with horizontal pod auto-scaling offers scalability, it adds complexity and overhead compared to the purpose-built SageMaker solution. Using memory as the scaling metric might not accurately reflect the inference workload. D is incorrect because using Spot Instances introduces the risk of interruptions due to instance termination. While ALB can handle load balancing, relying on ALBRequestCountPerTarget for auto-scaling might not be as efficient or responsive as SageMaker's integrated auto-scaling mechanism designed for ML inference.
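
A sketch of the auto-scaling configuration for the real-time endpoint's production variant; the endpoint name, capacities, and target value are hypothetical.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/fraud-endpoint/variant/AllTraffic"  # hypothetical endpoint

# Register the production variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,    # keeps the endpoint highly available
    MaxCapacity=10,
)

# Track invocations per instance so capacity follows request bursts
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```
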
37
An ML engineer needs to use an Amazon EMR cluster to process large volumes of data in batches. Any data loss is unacceptable. Which instance purchasing option will meet these requirements MOST cost-effectively? A. Run the primary node, core nodes, and task nodes on On-Demand Instances. B. Run the primary node, core nodes, and task nodes on Spot Instances. C. Run the primary node on an On-Demand Instance. Run the core nodes and task nodes on Spot Instances. D. Run the primary node and core nodes on On-Demand Instances. Run the task nodes on Spot Instances.
D The most cost-effective option that guarantees no data loss is to use On-Demand Instances for the primary and core nodes and Spot Instances for the task nodes. The primary node is critical for cluster operation: using a Spot Instance here risks cluster instability and potential data loss if the instance is interrupted, while an On-Demand Instance guarantees availability. Core nodes are part of HDFS (Hadoop Distributed File System), and losing them can lead to partial data loss, so On-Demand Instances ensure continuous operation and data safety. Task nodes process data but don't persistently store it in HDFS; if a Spot Instance task node is interrupted due to price increases, no data is lost, and using Spot Instances for task nodes offers significant cost savings. Option A is too expensive. Option B risks significant data loss. Option C still risks data loss from core node interruptions. Only option D balances cost savings with the requirement of zero data loss.
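
A sketch of the purchasing mix in a boto3 run_job_flow call; the release label, instance types, and counts are hypothetical.

```python
import boto3

emr = boto3.client("emr")

# Primary and core nodes on On-Demand capacity (cluster stability, HDFS durability);
# task nodes on Spot capacity (interruptible compute only).
emr.run_job_flow(
    Name="weekend-batch-processing",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": False,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.2xlarge", "InstanceCount": 3},
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.2xlarge", "InstanceCount": 6},
        ],
    },
)
```
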
38
A company wants to improve the sustainability of its ML operations. Which actions will reduce the energy usage and computational resources associated with the company's training jobs? (Choose two.) A. Use Amazon SageMaker Debugger to stop training jobs when non-converging conditions are detected. B. Use Amazon SageMaker Ground Truth for data labeling. C. Deploy models by using AWS Lambda functions. D. Use AWS Trainium instances for training. E. Use PyTorch or TensorFlow with the distributed training option.
A and D A is correct because Amazon SageMaker Debugger can detect issues, such as non-converging training, that waste resources. By stopping non-converging jobs early, it reduces energy consumption and compute usage. B is incorrect because data labeling is a preprocessing step and doesn't directly impact the energy consumption of training jobs; efficient labeling matters for overall ML efficiency, but it doesn't reduce the energy used *during training*. C is incorrect because deploying models with AWS Lambda functions affects inference, not training. D is correct because AWS Trainium instances are purpose-built for deep learning training and deliver better performance per watt than comparable GPU-based instances, making training more energy efficient and cost effective. E is incorrect because while distributed training can improve training *speed*, it doesn't necessarily reduce overall energy consumption or compute unless the job finishes significantly faster (which isn't guaranteed), and it may even increase total resource usage by running more instances than a comparable single-node job.
39
A company is planning to create several ML prediction models. The training data is stored in Amazon S3. The entire dataset is more than 5 TB in size and consists of CSV, JSON, Apache Parquet, and simple text files. The data must be processed in several consecutive steps. The steps include complex manipulations that can take hours to finish running. Some of the processing involves natural language processing (NLP) transformations. The entire process must be automated. Which solution will meet these requirements? A. Process data at each step by using Amazon SageMaker Data Wrangler. Automate the process by using Data Wrangler jobs. B. Use Amazon SageMaker notebooks for each data processing step. Automate the process by using Amazon EventBridge. C. Process data at each step by using AWS Lambda functions. Automate the process by using AWS Step Functions and Amazon EventBridge. D. Use Amazon SageMaker Pipelines to create a pipeline of data processing steps. Automate the pipeline by using Amazon EventBridge.
D The correct answer is D because it best addresses all the requirements: The large dataset size (5TB+) and complex, hours-long processing steps involving NLP rule out solutions A, B, and C. SageMaker Pipelines are designed for building and managing complex ML workflows, including data processing. Option A (SageMaker Data Wrangler) is suitable for data preparation but isn't ideal for the extensive, multi-stage processing involved. Option B (SageMaker notebooks) is interactive and not designed for automated, large-scale processing. Option C (Lambda functions and Step Functions) could handle automation, but Lambda's execution time limits and cost implications for this scale of processing make it less efficient than SageMaker Pipelines. Amazon EventBridge can be used in conjunction with SageMaker Pipelines to trigger the pipeline automatically, thus fulfilling the automation requirement.
40
An ML engineer needs to use AWS CloudFormation to create an ML model that an Amazon SageMaker endpoint will host. Which resource should the ML engineer declare in the CloudFormation template to meet this requirement? A. AWS::SageMaker::Model B. AWS::SageMaker::Endpoint C. AWS::SageMaker::NotebookInstance D. AWS::SageMaker::Pipeline
A The correct answer is A, AWS::SageMaker::Model. This resource is specifically designed to define the ML model, including its location (S3 artifacts), inference container/image, and IAM role. A SageMaker Model is a prerequisite for deploying a model to a SageMaker Endpoint. Option B, AWS::SageMaker::Endpoint, is incorrect because it represents the endpoint itself, which hosts the model but doesn't define the model's properties. Option C, AWS::SageMaker::NotebookInstance, is incorrect as it relates to notebook instances used for model development, not the model itself. Option D, AWS::SageMaker::Pipeline, is incorrect because it's for creating and managing SageMaker pipelines, a process for building and deploying models, not the model itself.
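For reference, a minimal sketch of declaring this resource and creating the stack with boto3; the role ARN, image URI, and model data location are hypothetical placeholders:

```python
import boto3

template_body = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  FraudDetectionModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: arn:aws:iam::111122223333:role/SageMakerExecutionRole
      PrimaryContainer:
        Image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/inference-image:latest
        ModelDataUrl: s3://my-bucket/model/model.tar.gz
"""

# AWS::SageMaker::Model defines the model; separate AWS::SageMaker::EndpointConfig
# and AWS::SageMaker::Endpoint resources would then host it.
boto3.client("cloudformation").create_stack(
    StackName="sagemaker-model-stack",
    TemplateBody=template_body,
)
```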
41
An advertising company uses AWS Lake Formation to manage a data lake containing structured and unstructured data. ML engineers are assigned to specific advertisement campaigns and must access data via Amazon Athena and by browsing it directly in an Amazon S3 bucket. They must only access resources specific to their assigned campaigns. Which solution offers the MOST operationally efficient way to achieve this? A. Configure IAM policies on an AWS Glue Data Catalog to restrict access to Athena based on the ML engineers' campaigns. B. Store users and campaign information in an Amazon DynamoDB table. Configure DynamoDB Streams to invoke an AWS Lambda function to update S3 bucket policies. C. Use Lake Formation to authorize AWS Glue to access the S3 bucket. Configure Lake Formation tags to map ML engineers to their campaigns. D. Configure S3 bucket policies to restrict access to the S3 bucket based on the ML engineers' campaigns.
C Lake Formation is the most operationally efficient solution because it's designed for fine-grained access control within a data lake. By tagging resources with campaign information and mapping engineers to those campaigns, Lake Formation provides a centralized and automated way to manage access. Options A and D are less efficient because they require managing policies individually for each engineer and campaign, leading to complexity and potential errors. Option B is overly complex and inefficient, involving multiple services and custom code to manage access, whereas Lake Formation's built-in features offer a simpler and more streamlined approach.
42
An ML engineer needs to use data with Amazon SageMaker Canvas to train an ML model. The data is stored in Amazon S3 and is complex in structure. The ML engineer must use a file format that minimizes processing time for the data. Which file format will meet these requirements? A. CSV files compressed with Snappy B. JSON objects in JSONL format C. JSON files compressed with gzip D. Apache Parquet files
D Parquet is the correct answer because it is a columnar storage format optimized for performance and efficiency, especially with complex data. Its columnar structure allows for faster query processing as only the necessary columns are read, unlike row-oriented formats like CSV or JSON. Parquet also incorporates built-in compression, further enhancing performance. SageMaker Canvas is compatible with Parquet files. Option A (CSV with Snappy) is less efficient than Parquet due to its row-oriented nature. While compression helps, it doesn't address the fundamental performance limitations of CSV. Option B (JSONL) is also row-oriented and doesn't offer the same level of optimized performance as Parquet, especially for complex data. Option C (JSON with gzip) suffers from the same drawbacks as JSONL; it's row-oriented and less efficient for complex data, despite using compression.
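If the source data is currently in row-oriented CSV, a quick sketch of converting it to Parquet with pandas before importing into Canvas (the S3 paths are hypothetical, and the pyarrow and s3fs packages are assumed to be installed):

```python
import pandas as pd

# Read the row-oriented CSV export and rewrite it as columnar Parquet.
df = pd.read_csv("s3://my-bucket/raw/transactions.csv")
df.to_parquet("s3://my-bucket/prepared/transactions.parquet", index=False)
```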
43
An ML engineer is evaluating several ML models and must choose one model to use in production. The cost of false negative predictions by the models is much higher than the cost of false positive predictions. Which metric finding should the ML engineer prioritize MOST when choosing the model? A. Low precision B. High precision C. Low recall D. High recall
D. High recall The correct answer is D because the problem states that false negatives are far more costly than false positives. Recall is the ratio of correctly predicted positive observations to all actual positive observations. High recall minimizes false negatives, aligning with the engineer's priority to reduce the cost of these errors. Option A (Low precision) is incorrect because low precision increases false positives, which are less costly according to the problem. Option B (High precision) is incorrect because while it reduces false positives, it doesn't directly address the more critical issue of minimizing the more expensive false negatives. Option C (Low recall) is incorrect because low recall directly increases false negatives, which should be avoided due to their high cost.
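A small scikit-learn illustration of why recall is the right lens when false negatives are expensive (the labels below are made up):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = fraud, 0 = legitimate
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Recall = TP / (TP + FN): the share of actual fraud cases the model catches.
print("recall:", recall_score(y_true, y_pred))        # 0.75 -> one fraud case was missed
# Precision = TP / (TP + FP): the share of fraud alerts that were real.
print("precision:", precision_score(y_true, y_pred))  # 0.75 -> one false alarm
```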
44
A company uses an Amazon Redshift database as its sole data source, containing sensitive data. A data scientist needs access to this sensitive data. An ML engineer must grant this access without modifying the source data or storing anonymized data within the database. Which solution requires the LEAST implementation effort? A. Configure dynamic data masking policies to control how sensitive data is shared with the data scientist at query time. B. Create a materialized view with masking logic on top of the database. Grant the necessary read permissions to the data scientist. C. Unload the Amazon Redshift data to Amazon S3. Use Amazon Athena to create schema-on-read with masking logic. Share the view with the data scientist. D. Unload the Amazon Redshift data to Amazon S3. Create an AWS Glue job to anonymize the data. Share the dataset with the data scientist.
A Dynamic data masking in Amazon Redshift applies masking policies at query time, so the data scientist sees obfuscated values without any changes to the stored data and without creating or storing anonymized copies. Options B, C, and D all require building and maintaining additional objects or pipelines (a materialized view, an Athena schema-on-read layer over unloaded data, or an AWS Glue anonymization job), which is more implementation effort.
45
An ML engineer is using a training job to fine-tune a deep learning model in Amazon SageMaker Studio. The ML engineer previously used the same pre-trained model with a similar dataset. The ML engineer expects vanishing gradient, underutilized GPU, and overfitting problems. The ML engineer needs to implement a solution to detect these issues and to react in predefined ways when the issues occur. The solution also must provide comprehensive real-time metrics during the training. Which solution will meet these requirements with the LEAST operational overhead? A. Use TensorBoard to monitor the training job. Publish the findings to an Amazon Simple Notification Service (Amazon SNS) topic. Create an AWS Lambda function to consume the findings and to initiate the predefined actions. B. Use Amazon CloudWatch default metrics to gain insights about the training job. Use the metrics to invoke an AWS Lambda function to initiate the predefined actions. C. Expand the metrics in Amazon CloudWatch to include the gradients in each training step. Use the metrics to invoke an AWS Lambda function to initiate the predefined actions. D. Use SageMaker Debugger built-in rules to monitor the training job. Configure the rules to initiate the predefined actions.
D SageMaker Debugger built-in rules can detect vanishing gradients, low GPU utilization, and overfitting during training, emit comprehensive real-time metrics, and trigger predefined actions (such as stopping the job or sending a notification) when a rule fires, all without building custom monitoring. Options A, B, and C require assembling and maintaining extra components (TensorBoard publishing, SNS topics, Lambda functions, or custom CloudWatch metrics), which adds operational overhead, and CloudWatch default metrics do not capture gradient-level signals.
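A minimal sketch of attaching built-in Debugger rules with a stop-training action, assuming the SageMaker Python SDK; the image URI, role, and S3 path are hypothetical placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, rule_configs

# Predefined reaction: stop the training job whenever a rule fires.
stop_job = rule_configs.ActionList(rule_configs.StopTraining())

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient(), actions=stop_job),
    Rule.sagemaker(rule_configs.low_gpu_utilization(), actions=stop_job),
    Rule.sagemaker(rule_configs.overfit(), actions=stop_job),
]

estimator = Estimator(
    image_uri="<training-image-uri>",              # hypothetical
    role="<execution-role-arn>",                   # hypothetical
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=rules,                                   # Debugger emits real-time metrics per rule
)
estimator.fit("s3://my-bucket/fine-tuning-data/")  # hypothetical S3 path
```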
46
A credit card company has a fraud detection model in production on an Amazon SageMaker endpoint. The company develops a new version of this model. They need to assess the new model's performance using live data without impacting production end-users. Which solution best meets these requirements? A. Set up SageMaker Debugger and create a custom rule. B. Set up blue/green deployments with all-at-once traffic shifting. C. Set up blue/green deployments with canary traffic shifting. D. Set up shadow testing with a shadow variant of the new model.
D Shadow testing allows the new model to process live data alongside the production model without affecting the production system's output. This enables a direct performance comparison against the existing model using real-world data. A is incorrect because SageMaker Debugger is used for model debugging and anomaly detection during training, not for evaluating a deployed model's performance in a production-like environment. B and C are incorrect because blue/green deployments involve switching traffic entirely to the new model (all-at-once) or gradually (canary), either of which would impact production end-users during the transition. These methods don't allow parallel evaluation against the production model using the same live data stream.
47
A company stores time-series data about user clicks in an Amazon S3 bucket. The raw data consists of millions of rows of user activity every day. ML engineers access this data to develop their ML models and need to generate daily reports and analyze click trends over the past 3 days using Amazon Athena. The company must retain the data for 30 days before archiving it. Which solution will provide the HIGHEST performance for data retrieval? A. Keep all the time-series data without partitioning in the S3 bucket. Manually move data that is older than 30 days to separate S3 buckets. B. Create AWS Lambda functions to copy the time-series data into separate S3 buckets. Apply S3 Lifecycle policies to archive data that is older than 30 days to S3 Glacier Flexible Retrieval. C. Organize the time-series data into partitions by date prefix in the S3 bucket. Apply S3 Lifecycle policies to archive partitions that are older than 30 days to S3 Glacier Flexible Retrieval. D. Put each day's time-series data into its own S3 bucket. Use S3 Lifecycle policies to archive S3 buckets that hold data that is older than 30 days to S3 Glacier Flexible Retrieval.
C The correct answer is C because partitioning the data by date allows Athena to quickly scan only the relevant partitions when querying for the past 3 days. This significantly improves query performance compared to scanning the entire dataset (A) or dealing with the overhead of Lambda functions (B) or managing numerous S3 buckets (D). Option A leads to slow queries due to scanning large datasets. Options B and D introduce unnecessary overhead, slowing down data retrieval. Option C efficiently manages data and utilizes S3 lifecycle policies for cost-effective archiving.
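A sketch of how the date partitions pay off at query time, assuming the table is partitioned on a dt column that mirrors the S3 date prefixes (the table, database, and bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Athena prunes to the last three dt partitions instead of scanning 30 days of clicks.
query = """
SELECT dt, count(*) AS clicks
FROM clickstream.user_clicks
WHERE dt >= date_format(current_date - interval '3' day, '%Y-%m-%d')
GROUP BY dt
ORDER BY dt
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "clickstream"},
    ResultConfiguration={"OutputLocation": "s3://clickstream-bucket/athena-results/"},
)
```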
48
A company has deployed an ML model that detects fraudulent credit card transactions in real time in a banking application. The model uses Amazon SageMaker Asynchronous Inference. Consumers are reporting delays in receiving the inference results. An ML engineer needs to implement a solution to improve the inference performance. The solution also must provide a notification when a deviation in model quality occurs. Which solution will meet these requirements? A. Use SageMaker real-time inference for inference. Use SageMaker Model Monitor for notifications about model quality. B. Use SageMaker batch transform for inference. Use SageMaker Model Monitor for notifications about model quality. C. Use SageMaker Serverless Inference for inference. Use SageMaker Inference Recommender for notifications about model quality. D. Keep using SageMaker Asynchronous Inference for inference. Use SageMaker Inference Recommender for notifications about model quality.
A A is correct because SageMaker real-time inference provides faster predictions than asynchronous inference, addressing the delay issue. SageMaker Model Monitor effectively tracks model quality and sends alerts for deviations. B is incorrect because SageMaker batch transform is not suitable for real-time applications; it's designed for batch processing. C is incorrect because while SageMaker Serverless Inference can improve performance, SageMaker Inference Recommender is not designed to provide notifications about model quality deviations; Model Monitor is better suited for this purpose. D is incorrect because it doesn't address the delay problem; it continues using the slow asynchronous inference method.
49
An ML engineer needs to implement a solution to host a trained ML model. The rate of requests to the model will be inconsistent throughout the day. The ML engineer needs a scalable solution that minimizes costs when the model is not in use. The solution also must maintain the model's capacity to respond to requests during times of peak usage. Which solution will meet these requirements? A. Create AWS Lambda functions that have fixed concurrency to host the model. Configure the Lambda functions to automatically scale based on the number of requests to the model. B. Deploy the model on an Amazon Elastic Container Service (Amazon ECS) cluster that uses AWS Fargate. Set a static number of tasks to handle requests during times of peak usage. C. Deploy the model to an Amazon SageMaker endpoint. Deploy multiple copies of the model to the endpoint. Create an Application Load Balancer to route traffic between the different copies of the model at the endpoint. D. Deploy the model to an Amazon SageMaker endpoint. Create SageMaker endpoint auto scaling policies that are based on Amazon CloudWatch metrics to adjust the number of instances dynamically.
D The correct answer is D because it provides a solution that directly addresses all the requirements: scalability, cost minimization during low usage, and capacity during peak usage. SageMaker endpoint autoscaling, driven by CloudWatch metrics, dynamically adjusts the number of instances based on demand. This ensures that resources are efficiently used only when needed, minimizing costs during low-usage periods while maintaining sufficient capacity during peak demand. Option A is incorrect because while Lambda offers autoscaling, fixed concurrency contradicts the need for cost minimization during low usage. Option B is incorrect because setting a static number of tasks doesn't adapt to fluctuating demand, potentially leading to overspending or insufficient capacity. Option C is incorrect because while it offers a scalable approach using multiple copies and a load balancer, it lacks the dynamic scaling capabilities of SageMaker's autoscaling feature. It could potentially overspend by keeping many instances running even when unnecessary.
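A minimal sketch of registering that auto scaling policy through Application Auto Scaling (the endpoint and variant names, capacity limits, and target value are hypothetical):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)

# Scale on the CloudWatch invocations-per-instance metric.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```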
50
A company uses Amazon SageMaker Studio to develop an ML model within a single SageMaker Studio domain. An ML engineer needs to implement a solution that provides an automated alert when SageMaker compute costs reach a specific threshold. Which solution will meet these requirements? A. Add resource tagging by editing the SageMaker user profile in the SageMaker domain. Configure AWS Cost Explorer to send an alert when the threshold is reached. B. Add resource tagging by editing the SageMaker user profile in the SageMaker domain. Configure AWS Budgets to send an alert when the threshold is reached. C. Add resource tagging by editing each user's IAM profile. Configure AWS Cost Explorer to send an alert when the threshold is reached. D. Add resource tagging by editing each user's IAM profile. Configure AWS Budgets to send an alert when the threshold is reached.
B The correct answer is B because AWS Budgets is specifically designed for setting cost thresholds and sending alerts when those thresholds are reached. While Cost Explorer can show cost data, it doesn't have the built-in functionality to automatically send alerts based on predefined thresholds. Furthermore, tagging resources at the SageMaker user profile level (as opposed to individual IAM profiles) is more efficient and aligned with the single SageMaker domain context. Options A, C, and D are incorrect because they either use the wrong service for alerting (Cost Explorer instead of Budgets) or incorrectly suggest tagging at the individual IAM user level instead of the more efficient SageMaker user profile level.
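A sketch of the Budgets side, assuming a cost allocation tag has already been applied to the SageMaker user profiles and activated (the account ID, tag key/value, amounts, and email address are hypothetical, and the cost filter format is an assumption):

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",
    Budget={
        "BudgetName": "sagemaker-domain-monthly-cost",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {"TagKeyValue": ["user:team$ml-research"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ml-team@example.com"}],
        }
    ],
)
```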
51
A company uses Amazon SageMaker for its ML workloads. The company's ML engineer receives a 50 MB Apache Parquet data file to build a fraud detection model. The file includes several correlated columns that are not required for model training. What should the ML engineer do to drop the unnecessary columns in the file with the LEAST effort? A. Download the file to a local workstation. Perform one-hot encoding by using a custom Python script. B. Create an Apache Spark job that uses a custom processing script on Amazon EMR. C. Create a SageMaker processing job by calling the SageMaker Python SDK. D. Create a data flow in SageMaker Data Wrangler. Configure a transform step.
D The best answer is D because SageMaker Data Wrangler is designed for data exploration, cleaning, and preprocessing directly within the SageMaker ecosystem. Dropping unnecessary columns is a simple transformation easily accomplished within Data Wrangler's visual interface, requiring minimal coding and effort. Option A is inefficient because it involves downloading the data, writing custom code, and then uploading it again. Option B introduces unnecessary complexity by using EMR and Spark for a task easily handled within SageMaker. Option C, while feasible, is also more complex than using the purpose-built Data Wrangler tool and requires coding.
52
A company is creating an application that will recommend products for customers to purchase. The application will make API calls to Amazon Q Business. The company must ensure that responses from Amazon Q Business do not include the name of the company's main competitor. Which solution will meet this requirement? A. Configure the competitor's name as a blocked phrase in Amazon Q Business. B. Configure an Amazon Q Business retriever to exclude the competitor’s name. C. Configure an Amazon Kendra retriever for Amazon Q Business to build indexes that exclude the competitor's name. D. Configure document attribute boosting in Amazon Q Business to deprioritize the competitor's name.
A A is correct because Amazon Q Business allows for the configuration of blocked phrases, directly addressing the need to prevent specific terms (like the competitor's name) from appearing in responses. This is a precise and efficient solution. B is incorrect because while retrievers manage data access, they don't directly control the content of the final response generated by Amazon Q Business. They select relevant data, but don't filter out specific words within that data. C is incorrect because Amazon Kendra is a separate service for indexing documents. While indirectly related to Q Business, configuring Kendra doesn't directly filter Q Business's output. D is incorrect because document attribute boosting prioritizes or deprioritizes certain attributes within documents. It does not filter out or block specific terms from being included in the response entirely.
53
An ML engineer needs to use Amazon SageMaker to fine-tune a large language model (LLM) for text summarization using a low-code no-code (LCNC) approach. Which solution will meet these requirements? A. Use SageMaker Studio to fine-tune an LLM that is deployed on Amazon EC2 instances. B. Use SageMaker Autopilot to fine-tune an LLM that is deployed by a custom API endpoint. C. Use SageMaker Autopilot to fine-tune an LLM that is deployed on Amazon EC2 instances. D. Use SageMaker Autopilot to fine-tune an LLM that is deployed by SageMaker JumpStart.
D SageMaker Autopilot and SageMaker JumpStart are designed for low-code/no-code machine learning workflows. SageMaker JumpStart provides pre-trained models, reducing the need for extensive coding. Option D leverages both to fine-tune a pre-trained LLM for text summarization, fulfilling the LCNC requirement. Option A uses SageMaker Studio, which is a more code-intensive environment, thus not meeting the LCNC requirement. Options B and C, while using SageMaker Autopilot, involve deploying the LLM via a custom API or EC2 instances, both of which require more coding than using the pre-trained models from SageMaker JumpStart.
54
A company has a machine learning (ML) model that requires nightly execution to predict stock values. The model input is 3 MB of data collected daily, and the prediction process takes less than one minute. Which Amazon SageMaker deployment option best suits these requirements? A. Use a multi-model serverless endpoint. Enable caching. B. Use an asynchronous inference endpoint. Set the InitialInstanceCount parameter to 0. C. Use a real-time endpoint. Configure an auto scaling policy to scale the model to 0 when the model is not in use. D. Use a serverless inference endpoint. Set the MaxConcurrency parameter to 1.
D The correct answer is D because it leverages the cost-effectiveness of serverless inference for a short nightly run: the endpoint incurs compute charges only while processing requests and costs nothing to keep idle the rest of the day. Setting `MaxConcurrency` to 1 limits the endpoint to one request at a time, which matches the once-nightly prediction workload. Options A, B, and C are less suitable: A adds multi-model hosting and caching that a single nightly run doesn't need; B uses asynchronous inference, which is unnecessary for a 3 MB input that completes in under a minute; and C keeps a real-time endpoint plus an auto scaling policy to manage for a workload that doesn't require continuous availability, adding complexity and cost.
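A minimal sketch of the serverless endpoint configuration (the model, config, and endpoint names plus the memory size are hypothetical):

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="nightly-forecast-serverless",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "stock-forecast-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,   # one request at a time is enough for the nightly run
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="nightly-forecast",
    EndpointConfigName="nightly-forecast-serverless",
)
```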
55
A company has an application that uses different APIs to generate embeddings for input text. The company needs to implement a solution to automatically rotate the API tokens every 3 months. Which solution will meet this requirement? A. Store the tokens in AWS Secrets Manager. Create an AWS Lambda function to perform the rotation. B. Store the tokens in AWS Systems Manager Parameter Store. Create an AWS Lambda function to perform the rotation. C. Store the tokens in AWS Key Management Service (AWS KMS). Use an AWS managed key to perform the rotation. D. Store the tokens in AWS Key Management Service (AWS KMS). Use an AWS owned key to perform the rotation.
A AWS Secrets Manager is designed for securely storing and managing sensitive data like API tokens, and it offers built-in automatic rotation capabilities. A Lambda function can be scheduled to trigger the rotation process every 3 months. Option B is incorrect because while AWS Systems Manager Parameter Store can store secrets, it does not have built-in automatic rotation features. Manual intervention would be required to rotate the tokens. Options C and D are incorrect because AWS KMS is primarily for managing encryption keys, not directly for API tokens. While you could potentially use KMS to encrypt the tokens stored elsewhere, it doesn't provide the automatic rotation functionality needed.
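A sketch of enabling the rotation schedule with boto3; the secret name and rotation Lambda ARN are hypothetical, and the Lambda itself is assumed to implement the standard Secrets Manager rotation steps:

```python
import boto3

secrets = boto3.client("secretsmanager")

secrets.rotate_secret(
    SecretId="embedding-api-token",
    RotationLambdaARN="arn:aws:lambda:us-east-1:111122223333:function:rotate-embedding-token",
    RotationRules={"AutomaticallyAfterDays": 90},  # roughly every 3 months
)
```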
56
An ML engineer receives datasets containing missing values, duplicates, and extreme outliers. These datasets need to be consolidated into a single data frame and prepared for machine learning. Which solution best meets these requirements? A. Use Amazon SageMaker Data Wrangler to import the datasets and consolidate them into a single data frame. Use the cleansing and enrichment functionalities to prepare the data. B. Use Amazon SageMaker Ground Truth to import the datasets and consolidate them into a single data frame. Use the human-in-the-loop capability to prepare the data. C. Manually import and merge the datasets. Consolidate the datasets into a single data frame. Use Amazon Q Developer to generate code snippets that will prepare the data. D. Manually import and merge the datasets. Consolidate the datasets into a single data frame. Use Amazon SageMaker data labeling to prepare the data.
A SageMaker Data Wrangler can import multiple datasets, join them into a single data frame, and apply built-in cleansing and enrichment transforms for missing values, duplicates, and outliers through a visual interface. B and D are incorrect because Ground Truth and SageMaker data labeling are for labeling data, not cleansing it. C is incorrect because manually merging the datasets and relying on generated code snippets requires more effort and is less repeatable than Data Wrangler's purpose-built workflow.
57
An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3. The dataset has a class imbalance that affects the learning of the model's algorithm. Additionally, many of the features have interdependencies. The algorithm is not capturing all the desired underlying patterns in the data. The ML engineer needs to use an Amazon SageMaker built-in algorithm to train the model. Which algorithm should the ML engineer use to meet this requirement? A. LightGBM B. Linear learner C. K-means clustering D. Neural Topic Model (NTM)
A. LightGBM LightGBM is the best choice because, as a gradient-boosted tree algorithm, it can compensate for class imbalance through techniques such as class weighting, and its tree-based structure captures the non-linear relationships and feature interdependencies that a linear learner cannot. K-means clustering is an unsupervised algorithm and is not suitable for this supervised fraud detection task. The Neural Topic Model is designed for topic modeling of text, not fraud detection.
58
A company has historical data indicating whether customers required long-term support. They need an ML model to predict if *new* customers will need long-term support. Which modeling approach is most appropriate? A. Anomaly detection B. Linear regression C. Logistic regression D. Semantic segmentation
C Logistic regression is the correct answer because it's designed for binary classification problems – predicting one of two outcomes (yes/no, in this case, whether a customer needs long-term support or not). A is incorrect because anomaly detection identifies unusual data points, not a binary classification prediction. B is incorrect because linear regression predicts continuous values, not categorical outcomes like "yes" or "no." D is incorrect because semantic segmentation is used for image analysis, not customer data prediction.
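A toy scikit-learn illustration of the binary classification framing (the features and labels are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two historical features per customer and a binary label indicating
# whether long-term support was required.
X = np.array([[12, 3], [45, 1], [7, 8], [60, 2], [15, 6], [52, 1]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

# Probability that a new customer will need long-term support.
new_customer = np.array([[40, 2]])
print(clf.predict_proba(new_customer)[0, 1])
```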
59
An ML engineer has developed a binary classification model outside of Amazon SageMaker. The model artifacts are stored in an Amazon S3 bucket. The ML engineer and the Canvas user are part of the same SageMaker domain. The ML engineer needs to make the model accessible to a SageMaker Canvas user for additional tuning. Which combination of requirements must be met so that the ML engineer can share the model with the Canvas user? (Choose two.) A. The ML engineer and the Canvas user must be in separate SageMaker domains. B. The Canvas user must have permissions to access the S3 bucket where the model artifacts are stored. C. The model must be registered in the SageMaker Model Registry. D. The ML engineer must host the model on AWS Marketplace. E. The ML engineer must deploy the model to a SageMaker endpoint.
B and C

The correct answers are B and C:

* **B. The Canvas user must have permissions to access the S3 bucket where the model artifacts are stored:** SageMaker Canvas needs access to the model artifacts to load and use the model. Without permission to the S3 bucket that contains them, the Canvas user cannot work with the model.
* **C. The model must be registered in the SageMaker Model Registry:** Although the artifacts live in S3, SageMaker Canvas requires the model to be registered in the Model Registry so it can be discovered, shared, and managed within the SageMaker environment.

A is incorrect: the ML engineer and the Canvas user are already in the same SageMaker domain, and separate domains would only complicate sharing. D is incorrect: AWS Marketplace is for selling and distributing models publicly, which is unnecessary for sharing a model internally within the same domain. E is incorrect: deploying to a SageMaker endpoint is for serving real-time inference and is not required to share a model with a Canvas user for additional tuning.
60
A company is building a deep learning model on Amazon SageMaker using a large training dataset. They need to optimize the model's hyperparameters to minimize the loss function on the validation dataset while minimizing computation time. Which hyperparameter tuning strategy will accomplish this goal with the LEAST computation time? A. Hyperband B. Grid search C. Bayesian optimization D. Random search
A Hyperband is the correct answer because it is designed for efficient hyperparameter optimization: it allocates small resource budgets to many configurations and aggressively stops underperforming trials early, which minimizes total computation time. Grid search is the most expensive because it exhaustively evaluates every combination. Bayesian optimization is more efficient than grid search and random search but typically trains candidate configurations to completion, so it generally uses more compute than Hyperband. Random search is also less efficient because it neither learns from previous trials nor stops weak trials early.
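A sketch of requesting the Hyperband strategy through the SageMaker Python SDK; the estimator settings, metric regex, hyperparameter ranges, and S3 paths are hypothetical:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

estimator = Estimator(
    image_uri="<training-image-uri>",   # hypothetical
    role="<execution-role-arn>",        # hypothetical
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    metric_definitions=[{"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(32, 256),
    },
    strategy="Hyperband",   # early-stops underperforming trials to save compute
    max_jobs=50,
    max_parallel_jobs=5,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```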
61
A company is planning to use Amazon Redshift ML in its primary AWS account. The source data is in an Amazon S3 bucket in a secondary account. An ML engineer needs to set up an ML pipeline in the primary account to access the S3 bucket in the secondary account. The solution must not require public IPv4 addresses. Which solution will meet these requirements? A. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC with no public access enabled in the primary account. Create a VPC peering connection between the accounts. Update the VPC route tables to remove the route to 0.0.0.0/0. B. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC with no public access enabled in the primary account. Create an AWS Direct Connect connection and a transit gateway. Associate the VPCs from both accounts with the transit gateway. Update the VPC route tables to remove the route to 0.0.0.0/0. C. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC in the primary account. Create an AWS Site-to-Site VPN connection with two encrypted IPsec tunnels between the accounts. Set up interface VPC endpoints for Amazon S3. D. Provision a Redshift cluster and Amazon SageMaker Studio in a VPC in the primary account. Create an S3 gateway endpoint. Update the S3 bucket policy to allow IAM principals from the primary account. Set up interface VPC endpoints for SageMaker and Amazon Redshift.
D The correct answer is D because it keeps all traffic on the AWS network by using VPC endpoints: an S3 gateway endpoint provides private connectivity to the S3 bucket, and interface VPC endpoints (AWS PrivateLink) for SageMaker and Amazon Redshift give private access to those services, so no public IPv4 addresses are required. Updating the S3 bucket policy to allow IAM principals from the primary account grants the necessary cross-account permissions. Option A is incorrect because VPC peering only connects the two VPCs; it does not by itself provide private access to Amazon S3. Option B is unnecessarily complex; AWS Direct Connect and a transit gateway are not required for this scenario. Option C adds a Site-to-Site VPN that must be configured and maintained for all traffic between the accounts, which is less efficient than using VPC endpoints directly.
62
A company is using an AWS Lambda function to monitor the metrics from an ML model. An ML engineer needs to implement a solution to send an email message when the metrics breach a threshold. Which solution will meet this requirement? A. Log the metrics from the Lambda function to AWS CloudTrail. Configure a CloudTrail trail to send the email message. B. Log the metrics from the Lambda function to Amazon CloudFront. Configure an Amazon CloudWatch alarm to send the email message. C. Log the metrics from the Lambda function to Amazon CloudWatch. Configure a CloudWatch alarm to send the email message. D. Log the metrics from the Lambda function to Amazon CloudWatch. Configure an Amazon CloudFront rule to send the email message.
C CloudWatch is the correct service for monitoring metrics and setting up alarms. A CloudWatch alarm can be configured to trigger actions, such as sending an email, when a defined threshold is breached. Option A is incorrect because CloudTrail is for logging API calls, not for monitoring metrics. Option B is incorrect because CloudFront is a content delivery network and not designed for metric monitoring or alerting. Option D is incorrect because CloudFront rules manage content delivery, not alerts.
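A sketch of the two halves of this pattern (the namespace, metric name, threshold, and SNS topic ARN are hypothetical; the topic is assumed to have an email subscription):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Inside the Lambda function: publish the model metric to a custom namespace.
cloudwatch.put_metric_data(
    Namespace="FraudModel",
    MetricData=[{"MetricName": "PredictionDrift", "Value": 0.12, "Unit": "None"}],
)

# One-time setup: alarm that notifies the SNS topic (and its email subscribers)
# when the metric breaches the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="fraud-model-drift-breach",
    Namespace="FraudModel",
    MetricName="PredictionDrift",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:model-alerts"],
)
```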
63
A company has used Amazon SageMaker to deploy a predictive ML model in production. The company is using SageMaker Model Monitor on the model. After a model update, an ML engineer notices data quality issues in the Model Monitor checks. What should the ML engineer do to mitigate the data quality issues that Model Monitor has identified? A. Adjust the model's parameters and hyperparameters. B. Initiate a manual Model Monitor job that uses the most recent production data. C. Create a new baseline from the latest dataset. Update Model Monitor to use the new baseline for evaluations. D. Include additional data in the existing training set for the model. Retrain and redeploy the model.
C A model update can invalidate the baseline that SageMaker Model Monitor compares incoming data against, causing false positives in the data quality checks. Creating a new baseline from the latest dataset (option C) directly addresses this by giving Model Monitor a relevant comparison point for the post-update data. Option A is incorrect because adjusting model parameters and hyperparameters addresses model performance, not the data quality checks that Model Monitor runs. Option B is incorrect because running a manual monitoring job with the latest data doesn't resolve the underlying issue of an outdated baseline. Option D is a more drastic measure that should only be taken if the data quality issues persist after the baseline is refreshed and point to a genuine problem with the training data.
64
A company has an ML model that generates text descriptions based on images that customers upload to the company's website. The images can be up to 50 MB in total size. An ML engineer decides to store the images in an Amazon S3 bucket. The ML engineer must implement a processing solution that can scale to accommodate changes in demand. Which solution will meet these requirements with the LEAST operational overhead? A. Create an Amazon SageMaker batch transform job to process all the images in the S3 bucket. B. Create an Amazon SageMaker Asynchronous Inference endpoint and a scaling policy. Run a script to make an inference request for each image. C. Create an Amazon Elastic Kubernetes Service (Amazon EKS) cluster that uses Karpenter for auto scaling. Host the model on the EKS cluster. Run a script to make an inference request for each image. D. Create an AWS Batch job that uses an Amazon Elastic Container Service (Amazon ECS) cluster. Specify a list of images to process for each AWS Batch job.
B The best answer is B because it leverages the built-in scaling capabilities of Amazon SageMaker's asynchronous inference endpoints. This requires minimal operational overhead as the scaling is managed by AWS. Options A, C, and D require significantly more configuration and management to achieve similar scaling capabilities, resulting in greater operational overhead. Option A (batch transform) isn't designed for real-time or continuously varying demand. Option C (EKS with Karpenter) and D (AWS Batch with ECS) require managing the Kubernetes cluster or ECS cluster, respectively, adding considerable complexity.
65
An ML engineer needs to use AWS services to identify and extract meaningful unique keywords from documents. Which solution will meet these requirements with the LEAST operational overhead? A. Use the Natural Language Toolkit (NLTK) library on Amazon EC2 instances for text pre-processing. Use the Latent Dirichlet Allocation (LDA) algorithm to identify and extract relevant keywords. B. Use Amazon SageMaker and the BlazingText algorithm. Apply custom pre-processing steps for stemming and removal of stop words. Calculate term frequency-inverse document frequency (TF-IDF) scores to identify and extract relevant keywords. C. Store the documents in an Amazon S3 bucket. Create AWS Lambda functions to process the documents and to run Python scripts for stemming and removal of stop words. Use bigram and trigram techniques to identify and extract relevant keywords. D. Use Amazon Comprehend custom entity recognition and key phrase extraction to identify and extract relevant keywords.
D Amazon Comprehend is a managed service, meaning AWS handles the underlying infrastructure and operational overhead. Options A, B, and C require managing EC2 instances, SageMaker endpoints, Lambda functions, and potentially other infrastructure components, increasing operational complexity and overhead. Therefore, Amazon Comprehend (option D) offers the least operational overhead for keyword extraction.
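A minimal sketch of key phrase extraction with the Comprehend API (the sample text is arbitrary):

```python
import boto3

comprehend = boto3.client("comprehend")

text = "Amazon SageMaker simplifies building, training, and deploying machine learning models."

response = comprehend.detect_key_phrases(Text=text, LanguageCode="en")

# Deduplicate the detected phrases to get unique keywords.
keywords = {phrase["Text"] for phrase in response["KeyPhrases"]}
print(keywords)
```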
66
A company needs to give its ML engineers appropriate access to training data. The ML engineers must access training data from only their own business group. The ML engineers must not be allowed to access training data from other business groups. The company uses a single AWS account and stores all the training data in Amazon S3 buckets. All ML model training occurs in Amazon SageMaker. Which solution will provide the ML engineers with the appropriate access? A. Enable S3 bucket versioning. B. Configure S3 Object Lock settings for each user. C. Add cross-origin resource sharing (CORS) policies to the S3 buckets. D. Create IAM policies. Attach the policies to IAM users or IAM roles.
D The correct answer is D because IAM policies offer granular control over access to AWS resources. By creating IAM policies that specifically grant access only to the S3 buckets containing training data for a given business group and attaching these policies to the appropriate IAM users or roles for those ML engineers, the company ensures that engineers only have access to the data they need. Option A (S3 bucket versioning) is incorrect because it manages data versioning, not access control. Option B (S3 Object Lock) is incorrect because it prevents deletion or modification of objects, not access control to them. Option C (CORS policies) is incorrect because it deals with cross-origin requests, which is irrelevant to this access control scenario within a single AWS account.
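A sketch of one such policy, assuming each business group keeps its training data under its own prefix in a shared bucket (the bucket, prefix, role, and policy names are hypothetical):

```python
import json
import boto3

# Scope one business group's role to its own prefix in the shared training bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::training-data-bucket/nlp-group/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::training-data-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["nlp-group/*"]}},
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="nlp-group-sagemaker-role",
    PolicyName="nlp-group-training-data-access",
    PolicyDocument=json.dumps(policy),
)
```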
67
A company needs to host a custom ML model to perform forecast analysis. The forecast analysis will occur with predictable and sustained load during the same 2-hour period every day. Multiple invocations during the analysis period will require quick responses. The company needs AWS to manage the underlying infrastructure and any auto-scaling activities. Which solution will meet these requirements? A. Schedule an Amazon SageMaker batch transform job by using AWS Lambda. B. Configure an Auto Scaling group of Amazon EC2 instances to use scheduled scaling. C. Use Amazon SageMaker Serverless Inference with provisioned concurrency. D. Run the model on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster on Amazon EC2 with pod auto scaling.
C The correct answer is C because Amazon SageMaker Serverless Inference with provisioned concurrency best fits the described needs. The predictable and sustained 2-hour load daily makes serverless a cost-effective choice, eliminating the need to pay for idle resources for the remaining 22 hours. Provisioned concurrency ensures quick responses by pre-warming the necessary resources, meeting the requirement for quick response times during the active period. AWS manages the underlying infrastructure and auto-scaling. Option A is incorrect because batch transform jobs are not ideal for real-time, low-latency predictions needed for quick responses. Option B is inefficient for a predictable 2-hour window; it involves managing and paying for resources that sit idle for the majority of the day. Option D, while capable of handling the load, requires more management overhead than the serverless approach in option C and is less cost-effective for this specific use case.
68
A company's ML engineer has deployed an ML model for sentiment analysis to an Amazon SageMaker endpoint. The ML engineer needs to explain to company stakeholders how the model makes predictions. Which solution will provide an explanation for the model's predictions? A. Use SageMaker Model Monitor on the deployed model. B. Use SageMaker Clarify on the deployed model. C. Show the distribution of inferences from A/B testing in Amazon CloudWatch. D. Add a shadow endpoint. Analyze prediction differences on samples.
B SageMaker Clarify is the correct answer because it's specifically designed to provide explanations for model predictions, including feature importance and bias detection. This directly addresses the need to explain the model's predictions to stakeholders. Option A is incorrect because SageMaker Model Monitor is for monitoring model performance over time, not for explaining individual predictions. Option C is incorrect because A/B testing results show overall performance differences, not explanations of individual predictions. Option D, while potentially useful for comparing models, doesn't directly provide explanations for how a *specific* model makes its predictions.
69
An ML engineer is using Amazon SageMaker to train a deep learning model that requires distributed training. After some training attempts, the ML engineer observes that the instances are not performing as expected due to communication overhead between the training instances. What should the ML engineer do to MINIMIZE the communication overhead between the instances? A. Place the instances in the same VPC subnet. Store the data in a different AWS Region from where the instances are deployed. B. Place the instances in the same VPC subnet but in different Availability Zones. Store the data in a different AWS Region from where the instances are deployed. C. Place the instances in the same VPC subnet. Store the data in the same AWS Region and Availability Zone where the instances are deployed. D. Place the instances in the same VPC subnet. Store the data in the same AWS Region but in a different Availability Zone from where the instances are deployed.
C The correct answer is C because placing the instances and the data in the same Availability Zone minimizes network latency and therefore communication overhead. Options A, B, and D all involve storing data in a different location than the instances, significantly increasing the network distance and thus the communication overhead. Option B also introduces the overhead of inter-AZ communication.
70
A company is running ML models on premises using custom Python scripts, proprietary datasets, and PyTorch. They need to move these models to AWS with the LEAST amount of effort. Which solution best meets these requirements? A. Use SageMaker built-in algorithms to train the proprietary datasets. B. Use SageMaker script mode and premade images for ML frameworks. C. Build a container on AWS that includes custom packages and a choice of ML frameworks. D. Purchase similar production models through AWS Marketplace.
B The best solution is B because it leverages SageMaker's script mode, allowing the company to use their existing custom Python scripts with minimal code changes. The pre-built PyTorch images provided by SageMaker eliminate the need to build and manage a custom container, significantly reducing the effort required for migration. Option A is incorrect because it requires retraining the models using SageMaker's built-in algorithms, which would likely involve substantial modifications to the existing code and potentially lead to performance differences. Option C is incorrect because building a custom container requires significant effort in packaging dependencies, configuring the environment, and testing the deployment. This is more complex than using SageMaker's pre-built images. Option D is incorrect because it involves purchasing entirely new models, rather than migrating the existing ones. This doesn't meet the requirement of moving the *company's* models to AWS.
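A sketch of script mode with the prebuilt PyTorch image, assuming the existing training script can read hyperparameters and data paths from the SageMaker environment (the entry point, role, framework versions, and S3 paths are hypothetical):

```python
from sagemaker.pytorch import PyTorch

# Script mode: reuse the existing train.py largely unchanged on a managed PyTorch image.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",                      # directory containing the existing custom scripts
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    framework_version="2.1",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.xlarge",
    hyperparameters={"epochs": 10, "lr": 1e-3},
)

estimator.fit({"training": "s3://my-bucket/proprietary-dataset/"})
```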
71
A company is using Amazon SageMaker and millions of files (several megabytes each) stored in an Amazon S3 bucket to train an ML model. They need to improve training performance as quickly as possible. Which solution will meet these requirements in the LEAST amount of time? A. Transfer the data to a new S3 bucket that provides S3 Express One Zone storage. Adjust the training job to use the new S3 bucket. B. Create an Amazon FSx for Lustre file system. Link the file system to the existing S3 bucket. Adjust the training job to read from the file system. C. Create an Amazon Elastic File System (Amazon EFS) file system. Transfer the existing data to the file system. Adjust the training job to read from the file system. D. Create an Amazon ElastiCache (Redis OSS) cluster. Link the Redis OSS cluster to the existing S3 bucket. Stream the data from the Redis OSS cluster directly to the training job.
B The best solution is B because it offers the fastest improvement in training performance with minimal data movement. FSx for Lustre is designed for high-performance computing and provides low-latency, high-throughput access to data, significantly speeding up training. Linking the file system to the existing S3 bucket lets it lazy-load the objects, avoiding an up-front transfer of millions of files. Option A requires copying millions of files to a new S3 Express One Zone bucket before any benefit is realized, so it is not the quickest path to better performance. Option C is slower than B because it requires transferring all the data into the EFS file system first, a time-consuming task at this scale, and EFS offers lower throughput than FSx for Lustre. Option D is incorrect because ElastiCache is an in-memory caching service, not a high-performance file system, and it is not suited to streaming training data at this volume.
72
A company wants to develop an ML model using tabular customer data containing ordered features and sensitive information that must not be discarded. Which solution best masks this sensitive data before model development begins? A. Use Amazon Macie to categorize the sensitive data. B. Prepare the data by using AWS Glue DataBrew. C. Run an AWS Batch job to change the sensitive data to random values. D. Run an Amazon EMR job to change the sensitive data to random values.
B AWS Glue DataBrew is the best solution because it's designed for data preparation, specifically handling tabular data. It offers data masking capabilities while preserving the order and structure of the features, unlike options C and D which would likely disrupt the data's integrity. Option A, Amazon Macie, is not a data masking tool; it's a data security service.
73
An ML engineer needs to deploy ML models to get inferences from large datasets in an asynchronous manner. The ML engineer also needs to implement scheduled monitoring of the data quality of the models and receive alerts when changes in data quality occur. Which solution will meet these requirements? A. Deploy the models by using scheduled AWS Glue jobs. Use Amazon CloudWatch alarms to monitor the data quality and to send alerts. B. Deploy the models by using scheduled AWS Batch jobs. Use AWS CloudTrail to monitor the data quality and to send alerts. C. Deploy the models by using Amazon Elastic Container Service (Amazon ECS) on AWS Fargate. Use Amazon EventBridge to monitor the data quality and to send alerts. D. Deploy the models by using Amazon SageMaker batch transform. Use SageMaker Model Monitor to monitor the data quality and to send alerts.
D Amazon SageMaker batch transform is designed for asynchronous inference on large datasets, and SageMaker Model Monitor is purpose-built for scheduled monitoring of the data quality of deployed models, sending alerts when data quality changes. Options A, B, and C are incorrect because they rely on services that are not designed for model data quality monitoring or lack the asynchronous batch processing needed for large datasets: CloudWatch monitors operational metrics rather than data quality, CloudTrail is an API audit trail, and EventBridge is an event bus, not a model monitoring service.
74
An ML engineer normalized training data by using min-max normalization in AWS Glue DataBrew. The ML engineer must normalize the production inference data in the same way as the training data before passing the production inference data to the model for predictions. Which solution will meet this requirement? A. Apply statistics from a well-known dataset to normalize the production samples. B. Keep the min-max normalization statistics from the training set. Use these values to normalize the production samples. C. Calculate a new set of min-max normalization statistics from a batch of production samples. Use these values to normalize all the production samples. D. Calculate a new set of min-max normalization statistics from each production sample. Use these values to normalize all the production samples.
B The correct answer is B because using the same min-max normalization statistics from the training set ensures consistency in data preprocessing between training and inference. This consistency is crucial for accurate model predictions as models are sensitive to data distribution. Options A, C, and D introduce inconsistencies by using different normalization parameters, potentially leading to inaccurate or unreliable predictions. Option A uses external statistics irrelevant to the model's training data. Options C and D recalculate statistics for each batch or sample, introducing variability and affecting model performance.
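A small scikit-learn sketch of the principle: fit the min-max statistics once on the training data, persist them, and only transform production samples (the arrays and file name are made up):

```python
import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [5.0, 400.0], [9.0, 600.0]])
X_prod = np.array([[3.0, 500.0]])

# Training time: learn the per-feature min/max and persist them.
scaler = MinMaxScaler().fit(X_train)
joblib.dump(scaler, "minmax_scaler.joblib")

# Inference time: reuse the saved statistics; never re-fit on production data.
scaler = joblib.load("minmax_scaler.joblib")
print(scaler.transform(X_prod))
```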
75
A company is planning to use Amazon SageMaker to create image-based classification ratings. They have 6 TB of training data stored on an Amazon FSx for NetApp ONTAP system virtual machine (SVM) within the same VPC as SageMaker. An ML engineer needs to make this training data accessible to ML models in the SageMaker environment. Which solution best meets these requirements? A. Mount the FSx for ONTAP file system as a volume to the SageMaker instance. B. Create an Amazon S3 bucket. Use Mountpoint for Amazon S3 to link the S3 bucket to the FSx for NetApp ONTAP file system. C. Create a catalog connection from SageMaker Data Wrangler to the FSx for ONTAP file system. D. Create a direct connection from SageMaker Data Wrangler to the FSx for NetApp ONTAP file system.
A A is correct because mounting the FSx for ONTAP file system directly to the SageMaker instance provides the fastest and most direct access to the training data, eliminating the need for data transfer or intermediary services. This is especially beneficial given the 6TB size of the dataset. B is incorrect because introducing an S3 bucket adds unnecessary complexity and latency. Transferring 6TB of data to S3 and then accessing it would be significantly slower than direct mounting. C and D are incorrect because they involve SageMaker Data Wrangler, a tool primarily for data preparation and transformation, not for directly accessing and mounting large datasets for model training. Using Data Wrangler would add unnecessary steps and potentially increase processing time.
76
A company regularly receives new training data from the vendor of an ML model. The vendor delivers cleaned and prepared data to the company's Amazon S3 bucket every 3-4 days. The company has an Amazon SageMaker pipeline to retrain the model. An ML engineer needs to implement a solution to run the pipeline when new data is uploaded to the S3 bucket. Which solution will meet these requirements with the LEAST operational effort? A. Create an S3 Lifecycle rule to transfer the data to the SageMaker training instance and to initiate training. B. Create an AWS Lambda function that scans the S3 bucket. Program the Lambda function to initiate the pipeline when new data is uploaded. C. Create an Amazon EventBridge rule that has an event pattern that matches the S3 upload. Configure the pipeline as the target of the rule. D. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the pipeline when new data is uploaded.
C The correct answer is C because Amazon EventBridge offers a serverless, event-driven architecture that directly integrates with Amazon S3 and SageMaker pipelines. When new data is uploaded to S3, EventBridge detects the event and automatically triggers the SageMaker pipeline, minimizing operational overhead. Option A is incorrect because S3 Lifecycle rules are primarily designed for data management tasks like archiving or deleting objects, not for triggering workflows. It would require additional components to initiate the SageMaker pipeline. Option B is incorrect because while a Lambda function could monitor the S3 bucket and trigger the pipeline, it involves more operational effort in terms of coding, testing, and maintaining the Lambda function. EventBridge provides a more streamlined solution. Option D is incorrect because Amazon MWAA is a more complex orchestration tool better suited for more intricate and demanding workflows. For simply triggering a pipeline based on S3 uploads, it's an overkill compared to the simplicity and efficiency of EventBridge.
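A sketch of the EventBridge rule and SageMaker pipeline target, assuming EventBridge notifications are enabled on the bucket (the bucket, rule, pipeline, and role names are hypothetical):

```python
import boto3

events = boto3.client("events")

# Match "Object Created" events from the vendor's bucket.
events.put_rule(
    Name="retrain-on-new-training-data",
    EventPattern="""{
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["vendor-training-data"]}}
    }""",
)

# Start the SageMaker pipeline whenever the rule matches.
events.put_targets(
    Rule="retrain-on-new-training-data",
    Targets=[
        {
            "Id": "sagemaker-retraining-pipeline",
            "Arn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/retraining-pipeline",
            "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeSageMakerRole",
            "SageMakerPipelineParameters": {"PipelineParameterList": []},
        }
    ],
)
```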
77
An ML engineer is developing a fraud detection model using the Amazon SageMaker XGBoost algorithm. The model classifies transactions as either fraudulent or legitimate. During testing, the model excels at identifying fraud in the training dataset, but performs poorly when identifying fraud in new, unseen transactions. What should the ML engineer do to improve the fraud detection for new transactions? A. Increase the learning rate. B. Remove some irrelevant features from the training dataset. C. Increase the value of the max_depth hyperparameter. D. Decrease the value of the max_depth hyperparameter.
D The correct answer is D because the problem described is overfitting. The model is too complex and has memorized the training data, leading to poor generalization to unseen data. Decreasing the `max_depth` hyperparameter reduces the complexity of the XGBoost model, preventing it from overfitting and improving its ability to generalize to new transactions. Option A is incorrect because increasing the learning rate can actually worsen overfitting. Option B is incorrect because while feature selection is important for model performance, it doesn't directly address the overfitting problem presented. Option C is incorrect because increasing `max_depth` would further increase the model's complexity and exacerbate the overfitting.
78
A company has a binary classification model in production. An ML engineer needs to develop a new version of the model that maximizes correct predictions of both positive and negative labels. Which metric should the ML engineer use for model recalibration? A. Accuracy B. Precision C. Recall D. Specificity
A. Accuracy Accuracy is the correct answer because it measures the overall proportion of correctly classified instances, encompassing both positive and negative predictions. The problem statement explicitly requires maximizing correct predictions for both labels, which is precisely what accuracy measures. Precision focuses solely on the positive predictions, while recall focuses only on correctly identifying all actual positives. Specificity focuses only on correctly identifying all actual negatives. Therefore, none of these are as suitable as accuracy for balancing the need to correctly classify both positive and negative instances.
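For reference, the four candidate metrics expressed with confusion-matrix counts (true positives $TP$, true negatives $TN$, false positives $FP$, false negatives $FN$); only accuracy rewards correct predictions of both labels:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP}$$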
79
A company wants to reduce the cost of its containerized ML applications. The applications use ML models that run on Amazon EC2 instances, AWS Lambda functions, and an Amazon Elastic Container Service (Amazon ECS) cluster. The EC2 workloads and ECS workloads use Amazon Elastic Block Store (Amazon EBS) volumes to save predictions and artifacts. An ML engineer must identify resources that are being used inefficiently and generate recommendations to reduce the cost of these resources. Which solution will meet these requirements with the LEAST development effort? A. Create code to evaluate each instance's memory and compute usage. B. Add cost allocation tags to the resources. Activate the tags in AWS Billing and Cost Management. C. Check AWS CloudTrail event history for the creation of the resources. D. Run AWS Compute Optimizer.
D AWS Compute Optimizer analyzes CloudWatch utilization metrics for EC2 instances, EBS volumes, Lambda functions, and ECS workloads and automatically generates rightsizing recommendations, so no custom code is required. Writing custom evaluation code (A) adds development effort, cost allocation tags (B) only help attribute spend rather than produce recommendations, and CloudTrail (C) records API activity, not resource utilization.
80
A company has developed a new ML model and needs to perform online validation on 10% of the traffic before full production release. The company uses an Amazon SageMaker endpoint behind an Application Load Balancer (ALB). Which solution provides the required online validation with the LEAST operational overhead? A. Use production variants to add the new model to the existing SageMaker endpoint. Set the variant weight to 0.1 for the new model. Monitor the number of invocations using Amazon CloudWatch. B. Use production variants to add the new model to the existing SageMaker endpoint. Set the variant weight to 1 for the new model. Monitor the number of invocations using Amazon CloudWatch. C. Create a new SageMaker endpoint. Use production variants to add the new model to the new endpoint. Monitor the number of invocations using Amazon CloudWatch. D. Configure the ALB to route 10% of the traffic to the new model at the existing SageMaker endpoint. Monitor the number of invocations using AWS CloudTrail.
A A is correct because SageMaker production variants offer built-in traffic splitting, allowing for easy A/B testing and online model validation with minimal operational overhead. Setting the variant weight to 0.1 directs 10% of traffic to the new model, fulfilling the requirement. CloudWatch is the appropriate service for monitoring invocations. B is incorrect because setting the variant weight to 1 sends all traffic to the new model, bypassing the validation phase. C is incorrect because creating a new SageMaker endpoint increases operational complexity and cost unnecessarily. D is incorrect because configuring ALB routing for traffic splitting is more complex than using SageMaker's built-in functionality; it adds unnecessary operational overhead. Furthermore, CloudTrail is not the ideal service for monitoring model invocations; CloudWatch is better suited for this purpose.
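A rough sketch of the variant-weight setup with boto3 (endpoint, config, and model names are placeholders; both models are assumed to already exist via CreateModel):

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="fraud-endpoint-config-v2",
    ProductionVariants=[
        {
            "VariantName": "current-model",
            "ModelName": "fraud-model-v1",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "InitialVariantWeight": 0.9,
        },
        {
            "VariantName": "candidate-model",
            "ModelName": "fraud-model-v2",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "InitialVariantWeight": 0.1,  # ~10% of traffic for online validation
        },
    ],
)

# Applying the new config to the existing endpoint keeps the same endpoint name.
sm.update_endpoint(
    EndpointName="fraud-endpoint",
    EndpointConfigName="fraud-endpoint-config-v2",
)
```

Because the existing endpoint is updated in place, clients behind the ALB are unaffected while roughly 10% of invocations flow to the candidate variant, whose Invocations metrics can be watched per variant in CloudWatch.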
81
A company wants to host an ML model on Amazon SageMaker. An ML engineer is configuring a continuous integration and continuous delivery (CI/CD) pipeline in AWS CodePipeline to deploy the model. The pipeline must run automatically when new training data for the model is uploaded to an Amazon S3 bucket. Select and order the pipeline's correct steps from the following list. Each step should be selected one time or not at all. (Select and order three.) • An S3 event notification invokes the pipeline when new data is uploaded. • S3 Lifecycle rule invokes the pipeline when new data is uploaded. • SageMaker retrains the model by using the data in the S3 bucket. • The pipeline deploys the model to a SageMaker endpoint. • The pipeline deploys the model to SageMaker Model Registry.
1. An S3 event notification invokes the pipeline when new data is uploaded. 2. SageMaker retrains the model by using the data in the S3 bucket. 3. The pipeline deploys the model to a SageMaker endpoint.
82
An ML engineer is working on an ML model to predict the prices of similarly sized homes. The model will base predictions on several features. The ML engineer will use the following feature engineering techniques to estimate the prices of the homes: • Feature splitting • Logarithmic transformation • One-hot encoding • Standardized distribution Select the correct feature engineering techniques for the following list of features. Each feature engineering technique should be selected one time or not at all. (Select three.) [Image](https://img.examtopics.com/aws-certified-machine-learning-engineer-associate-mla-c01/image9.png)
City: One-hot encoding
Type_year: Feature splitting
Size of the building: Standardized distribution (or logarithmic transformation if the size values are heavily right-skewed; standardization is the better fit when the distribution is roughly normal)
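A small illustrative sketch of the three techniques in pandas/scikit-learn (the column names and sample values are made up to mirror the features in the image):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample of the housing features.
df = pd.DataFrame({
    "City": ["Seattle", "Austin", "Seattle"],
    "Type_year": ["Condo_2015", "House_1998", "House_2020"],
    "Size": [850.0, 2400.0, 12000.0],
})

# One-hot encoding for the categorical City feature.
df = pd.get_dummies(df, columns=["City"], prefix="City")

# Feature splitting: break the combined Type_year field into two columns.
df[["Type", "Year_built"]] = df["Type_year"].str.split("_", expand=True)
df["Year_built"] = df["Year_built"].astype(int)
df = df.drop(columns=["Type_year"])

# Logarithmic transformation compresses a right-skewed size distribution;
# standardization rescales to zero mean and unit variance.
df["Size_log"] = np.log1p(df["Size"])
df["Size_std"] = StandardScaler().fit_transform(df[["Size"]]).ravel()

print(df)
```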
83
A company stores historical data in .csv files in Amazon S3. Only some of the rows and columns in the .csv files are populated. The columns are not labeled. An ML engineer needs to prepare and store the data so that the company can use the data to train ML models. Select and order the correct steps from the following list to perform this task. Each step should be selected one time or not at all. (Select and order three.) • Create an Amazon SageMaker batch transform job for data cleaning and feature engineering. • Store the resulting data back in Amazon S3. • Use Amazon Athena to infer the schemas and available columns. • Use AWS Glue crawlers to infer the schemas and available columns. • Use AWS Glue DataBrew for data cleaning and feature engineering.
1. Use AWS Glue crawlers to infer the schemas and available columns.
2. Use AWS Glue DataBrew for data cleaning and feature engineering.
3. Store the resulting data back in Amazon S3.

This sequence reflects a logical and efficient workflow for preparing the data for ML model training. First, AWS Glue crawlers discover the schema of the unlabeled, partially populated .csv files in S3, producing metadata about the data's structure that the later steps rely on. Next, AWS Glue DataBrew handles the data cleaning and feature engineering; its transformations are well suited to sparsely populated, unlabeled columns. Finally, the cleaned and engineered data is stored back in S3, where it is readily accessible for model training.

The other options are incorrect:

* **Create an Amazon SageMaker batch transform job for data cleaning and feature engineering:** Batch transform applies an already trained model to data; it is not a data preparation tool, and cleaning must happen before training.
* **Use Amazon Athena to infer the schemas and available columns:** Athena can query data, but schema inference over raw S3 files is the job of Glue crawlers, which populate the Data Catalog that other services consume.
84
An ML engineer needs to use Amazon SageMaker Feature Store to create and manage features to train a model. Select and order the three steps from the following list to create and use the features in Feature Store. Each step should be selected one time. • Access the store to build datasets for training. • Create a feature group. • Ingest the records.
1. Create a feature group; 2. Ingest the records; 3. Access the store to build datasets for training.
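A sketch of the three steps with the SageMaker Python SDK, assuming a hypothetical customer feature group, placeholder role, and S3 locations (string columns are cast to the pandas `string` dtype, which the SDK expects when inferring feature definitions):

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

df = pd.DataFrame({
    "customer_id": pd.Series(["c1", "c2"], dtype="string"),
    "spend_30d": [120.5, 87.0],
    "event_time": [time.time(), time.time()],
})

# Step 1: create a feature group from the DataFrame's schema.
fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri="s3://my-bucket/feature-store",          # offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)
while fg.describe().get("FeatureGroupStatus") == "Creating":
    time.sleep(15)

# Step 2: ingest the records.
fg.ingest(data_frame=df, max_workers=1, wait=True)

# Step 3: access the (offline) store to build a training dataset via Athena.
# Offline-store records can take a few minutes to appear after ingestion.
query = fg.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}"',
    output_location="s3://my-bucket/feature-store/query-results",
)
query.wait()
training_df = query.as_dataframe()
```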
85
An ML engineer is building a generative AI application on Amazon Bedrock by using large language models (LLMs). Select the correct generative AI term from the following list for each description. Each term should be selected one time or not at all. (Select three.) • Embedding • Retrieval Augmented Generation (RAG) • Temperature • Token [Image](https://img.examtopics.com/aws-certified-machine-learning-engineer-associate-mla-c01/image7.png)
The correct answer is to match the following descriptions with the corresponding terms:

* **"Represents a unit of text used in processing and generating responses by the model."** — Token
* **"Converts text into vector representations to capture semantic meaning, enhancing the model's ability to understand and generate coherent content."** — Embedding
* **"Combines generated content with retrieved external information to enrich the output."** — Retrieval Augmented Generation (RAG)

Tokens are the basic units of text that LLMs process, embeddings convert text into vectors that capture semantic meaning, and RAG augments generation with retrieved external information. Temperature, which controls output randomness, does not match any of the descriptions in the image.
86
A company has an Amazon S3 bucket that contains 1 TB of files from different sources. The S3 bucket contains the following file types in the same S3 folder: CSV, JSON, XLSX, and Apache Parquet. An ML engineer must implement a solution that uses AWS Glue DataBrew to process the data. The ML engineer also must store the final output in Amazon S3 so that AWS Glue can consume the output in the future. Which solution will meet these requirements? A. Use DataBrew to process the existing S3 folder. Store the output in Apache Parquet format. B. Use DataBrew to process the existing S3 folder. Store the output in AWS Glue Parquet format. C. Separate the data into a different folder for each file type. Use DataBrew to process each folder individually. Store the output in Apache Parquet format. D. Separate the data into a different folder for each file type. Use DataBrew to process each folder individually. Store the output in AWS Glue Parquet format.
A. Use DataBrew to process the existing S3 folder. Store the output in Apache Parquet format. AWS Glue performs best with Parquet files because they are optimized for analytical queries. DataBrew can handle mixed file types within a single folder; therefore, separating the files into different folders is unnecessary and adds extra work. Option B is incorrect because "AWS Glue Parquet format" is not a valid term; Apache Parquet is the correct format. Options C and D are incorrect because they introduce unnecessary complexity by requiring the data to be reorganized into separate folders before processing.
87
A manufacturing company uses an ML model to determine whether products meet a standard for quality. The model produces an output of "Passed" or "Failed." Robots separate the products into the two categories by using the model to analyze photos on the assembly line. Which metrics should the company use to evaluate the model's performance? (Choose two.) A. Precision and recall B. Root mean square error (RMSE) and mean absolute percentage error (MAPE) C. Accuracy and F1 score D. Bilingual Evaluation Understudy (BLEU) score E. Perplexity
A and C

The correct answers are A (Precision and Recall) and C (Accuracy and F1 score). These metrics are appropriate for a binary classification problem (Passed/Failed) where the goal is to assess the model's ability to correctly identify positive and negative instances.

* **A. Precision and Recall:** Fundamental metrics for evaluating a binary classification model. Precision measures the accuracy of positive predictions, while recall measures the model's ability to find all positive instances. Both are crucial for assessing the quality control process.
* **C. Accuracy and F1 score:** Accuracy represents the overall correctness of the model's predictions. The F1 score provides a balanced measure considering both precision and recall, which is valuable when dealing with imbalanced datasets (e.g., if many more products pass than fail).
* **B. RMSE and MAPE:** These metrics are used for regression problems, not classification. They measure the difference between predicted and actual *continuous* values, which isn't applicable here.
* **D. BLEU score:** Used for evaluating machine translation and other natural language processing tasks, not relevant to this quality control scenario.
* **E. Perplexity:** Assesses the performance of language models, again not relevant to this quality control application.
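A tiny worked example with scikit-learn on made-up pass/fail labels (1 = Failed, 0 = Passed), showing how the two chosen metric pairs are computed:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and model predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```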
88
A company shares Amazon SageMaker Studio notebooks accessible through a VPN. The company must prevent malicious actors from exploiting presigned URLs to access these notebooks. Which solution best meets these requirements? A. Set up Studio client IP validation using the `aws:sourceIp` IAM policy condition. B. Set up Studio client VPC validation using the `aws:sourceVpc` IAM policy condition. C. Set up Studio client role endpoint validation using the `aws:PrimaryTag` IAM policy condition. D. Set up Studio client user endpoint validation using the `aws:PrincipalTag` IAM policy condition.
A A is correct because using the `aws:sourceIp` IAM policy condition allows restricting access based on the client's IP address. This is ideal for VPN environments where the IP range of authorized users is known and controlled, thus preventing access from outside the VPN even with a pre-signed URL. B is incorrect because VPC validation checks the source VPC, not the IP address. Pre-signed URLs can still be used outside the VPC. C and D are incorrect because `aws:PrimaryTag` and `aws:PrincipalTag` are not used for IP address validation and are irrelevant to this scenario. They relate to tagging resources and principals, not source IP addresses.
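A sketch of such a policy statement, expressed as a Python dict (the CIDR range is a placeholder for the VPN's egress range; `aws:SourceIp` is the global condition key the answer refers to):

```python
import json

# Hypothetical deny statement: block generation of Studio presigned URLs
# from any address outside the corporate VPN range.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyPresignedUrlOutsideVpn",
            "Effect": "Deny",
            "Action": "sagemaker:CreatePresignedDomainUrl",
            "Resource": "*",
            "Condition": {
                "NotIpAddress": {"aws:SourceIp": ["192.0.2.0/24"]}
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```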
89
A company needs to develop a machine learning model that can identify an item within an image and provide the location of that item. Which Amazon SageMaker algorithm best meets these requirements? A. Image classification B. XGBoost C. Object detection D. K-nearest neighbors (k-NN)
C. Object detection Object detection is the correct answer because it specifically addresses the need to both identify an object *and* locate it within an image using bounding boxes. Image classification only identifies the object, not its location. XGBoost is a gradient boosting algorithm unsuitable for image data. K-nearest neighbors is a classification/regression algorithm, also not designed for object localization within images.
90
An ML engineer needs to encrypt all data in transit when an ML training job runs in Amazon SageMaker. The engineer must ensure that encryption in transit is applied to all processes used during the training job. Which solution will meet these requirements? A. Encrypt communication between nodes for batch processing. B. Encrypt communication between nodes in a training cluster. C. Specify an AWS Key Management Service (AWS KMS) key during creation of the training job request. D. Specify an AWS Key Management Service (AWS KMS) key during creation of the SageMaker domain.
B Encrypting communication between nodes in a training cluster (inter-container traffic encryption) protects data in transit between the instances that run the training job, which covers all processes used during training. AWS KMS keys (C and D) encrypt data at rest, such as volumes and artifacts, not data in transit, and option A applies to batch processing rather than the training cluster.
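In the SageMaker Python SDK this maps to a single estimator flag (shown as a sketch; the container, role, and S3 path are placeholders), which sets EnableInterContainerTrafficEncryption on the underlying CreateTrainingJob call:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=2,                        # distributed training cluster
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
    # Encrypts traffic between the training cluster's nodes while the job runs.
    encrypt_inter_container_traffic=True,
    sagemaker_session=session,
)
```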
91
A company runs training jobs on Amazon SageMaker using a compute-optimized instance. Demand for training runs will remain constant for the next 55 weeks. The instance needs to run for 35 hours each week. The company needs to reduce its model training costs. Which solution will meet these requirements? A. Use a serverless endpoint with a provisioned concurrency of 35 hours for each week. Run the training on the endpoint. B. Use SageMaker Edge Manager for the training. Specify the instance requirement in the edge device configuration. Run the training. C. Use the heterogeneous cluster feature of SageMaker Training. Configure the instance_type, instance_count, and instance_groups arguments to run training jobs. D. Opt in to a SageMaker Savings Plan with a 1-year term and an All Upfront payment. Run a SageMaker Training job on the instance.
D The correct answer is D because SageMaker Savings Plans offer significant discounts (up to 64%) for consistent, long-term usage of SageMaker instances. Given the company's predictable workload of 35 hours per week for 55 weeks, a 1-year Savings Plan with an All Upfront payment provides the most cost-effective solution. Option A is incorrect because serverless endpoints are designed for inference, not training. Option B is incorrect because SageMaker Edge Manager is for deploying models to edge devices, not for running training jobs in the cloud. Option C is incorrect because while it can optimize training jobs, it doesn't address the cost reduction requirement as effectively as a Savings Plan.
92
A company deployed an ML model using the XGBoost algorithm to predict product failures. The model is hosted on an Amazon SageMaker endpoint and trained on normal operating data. An AWS Lambda function provides predictions to the company's application. An ML engineer must implement a solution to detect decreased model accuracy over time using incoming live data. Which solution will meet these requirements? A. Use Amazon CloudWatch to create a dashboard that monitors real-time inference data and model predictions. Use the dashboard to detect drift. B. Modify the Lambda function to calculate model drift by using real-time inference data and model predictions. Program the Lambda function to send alerts. C. Schedule a monitoring job in SageMaker Model Monitor. Use the job to detect drift by analyzing the live data against a baseline of the training data statistics and constraints. D. Schedule a monitoring job in SageMaker Debugger. Use the job to detect drift by analyzing the live data against a baseline of the training data statistics and constraints.
C SageMaker Model Monitor is the best solution because it's designed specifically for monitoring model performance in production and detecting concept drift. It automatically compares live data against the baseline established during training, providing alerts when significant deviations occur. Option A is less suitable because manual detection of drift from a dashboard is less efficient and prone to human error. Option B puts unnecessary load on the Lambda function, which is better suited for prediction delivery. Option D is incorrect because SageMaker Debugger is used for debugging model training, not for ongoing monitoring of model performance in production.
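A sketch of the Model Monitor setup with the SageMaker Python SDK, assuming data capture is already enabled on the endpoint and using placeholder names, paths, and role ARN:

```python
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# 1. Baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
    wait=True,
)

# 2. Scheduled job that compares captured live traffic to the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="xgboost-data-drift",
    endpoint_input="failure-prediction-endpoint",
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```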
93
A company has an ML model using historical transaction data to predict customer behavior. An ML engineer is optimizing this model in Amazon SageMaker to improve its predictive accuracy. The engineer needs to analyze input data and predictions to identify trends that might skew model performance across different demographics. Which solution best provides this analysis? A. Use Amazon CloudWatch to monitor network metrics and CPU metrics for resource optimization during model training. B. Create AWS Glue DataBrew recipes to correct the data based on statistics from the model output. C. Use SageMaker Clarify to evaluate the model and training data for underlying patterns that might affect accuracy. D. Create AWS Lambda functions to automate data pre-processing and to ensure consistent quality of input data for the model.
C SageMaker Clarify is the correct answer because it's designed for bias detection and model explainability. It analyzes both training data and model predictions to pinpoint potential biases and understand how the model impacts different demographic groups. Option A focuses on resource monitoring, not model accuracy or bias. Option B involves data correction after model output, which is not a proactive approach to identifying demographic bias. Option D addresses data preprocessing but doesn't offer the analysis needed to detect demographic skews in the model's performance.
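A sketch of a SageMaker Clarify bias analysis (dataset path, facet/label column names, model name, and role are all hypothetical placeholders):

```python
from sagemaker import Session, clarify

session = Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/report",
    label="churned",
    headers=["age_group", "region", "tenure_months", "churned"],
    dataset_type="text/csv",
)

# Check whether outcomes differ across a demographic facet (here, age_group).
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="age_group",
)

model_config = clarify.ModelConfig(
    model_name="customer-behavior-model",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.5)

# Runs both pre-training (data) and post-training (prediction) bias metrics.
processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
)
```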
94
A company uses 10 Reserved Instances of accelerated instance types to serve the current version of an ML model. An ML engineer needs to deploy a new version of the model to an Amazon SageMaker real-time inference endpoint. The solution must use the original 10 instances to serve both versions of the model. The solution also must include one additional Reserved Instance that is available to use in the deployment process. The transition between versions must occur with no downtime or service interruptions. Which solution will meet these requirements? A. Configure a blue/green deployment with all-at-once traffic shifting. B. Configure a blue/green deployment with canary traffic shifting and a size of 10%. C. Configure a shadow test with a traffic sampling percentage of 10%. D. Configure a rolling deployment with a rolling batch size of 1.
B. Configure a blue/green deployment with canary traffic shifting and a size of 10%. A blue/green deployment with canary traffic shifting lets the new version (green) come up alongside the existing version (blue) using the one additional Reserved Instance, so there is no downtime. Canary shifting moves traffic to the new version gradually; starting with 10% allows monitoring and automatic rollback if issues arise. The other options are incorrect: A shifts all traffic at once, which provides no validation window and more risk; C (shadow testing) only mirrors traffic for comparison and never serves production responses from the new version; and D (a rolling deployment with a batch size of 1) replaces instances one at a time, which is slower and does not make use of the single spare instance as cleanly as the canary approach. Option B uses the 11 instances efficiently (10 serving traffic plus 1 for the deployment) while minimizing the risk of interruption.
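A sketch of the deployment call (endpoint, config, and alarm names are placeholders); the canary size of 10% corresponds to the one spare instance out of the 11 available:

```python
import boto3

sm = boto3.client("sagemaker")

# Shift 10% of capacity to the new endpoint config first, wait, then shift the rest.
sm.update_endpoint(
    EndpointName="prod-endpoint",
    EndpointConfigName="model-v2-config",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
        },
        # Roll back automatically if this CloudWatch alarm fires during the shift.
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "model-v2-error-rate"}]
        },
    },
)
```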
95
An IoT company uses Amazon SageMaker to train and test an XGBoost model for object detection. ML engineers need to monitor performance metrics when they train the model with variants in hyperparameters. The ML engineers also need to send Short Message Service (SMS) text messages after training is complete. Which solution will meet these requirements? A. Use Amazon CloudWatch to monitor performance metrics. Use Amazon Simple Queue Service (Amazon SQS) for message delivery. B. Use Amazon CloudWatch to monitor performance metrics. Use Amazon Simple Notification Service (Amazon SNS) for message delivery. C. Use AWS CloudTrail to monitor performance metrics. Use Amazon Simple Queue Service (Amazon SQS) for message delivery. D. Use AWS CloudTrail to monitor performance metrics. Use Amazon Simple Notification Service (Amazon SNS) for message delivery.
B CloudWatch is the appropriate service for monitoring the performance metrics of the XGBoost model during training in SageMaker. Amazon SNS is designed for message delivery, including SMS, making it suitable for sending notifications upon training completion. Options A, C, and D are incorrect because: A and C incorrectly use SQS, which is a message queuing service, not a message delivery service capable of directly sending SMS messages. D incorrectly uses CloudTrail, which is a logging service, not a performance monitoring service. CloudTrail logs API calls, not model performance metrics.
96
A company is working on an ML project that will include Amazon SageMaker notebook instances. An ML engineer must ensure that the SageMaker notebook instances do not allow root access. Which solution will prevent the deployment of notebook instances that allow root access? A. Use IAM condition keys to stop deployments of SageMaker notebook instances that allow root access. B. Use AWS Key Management Service (AWS KMS) keys to stop deployments of SageMaker notebook instances that allow root access. C. Monitor resource creation by using Amazon EventBridge events. Create an AWS Lambda function that deletes all deployed SageMaker notebook instances that allow root access. D. Monitor resource creation by using AWS CloudFormation events. Create an AWS Lambda function that deletes all deployed SageMaker notebook instances that allow root access.
A A is correct because IAM condition keys provide a preventative control. By using the `sagemaker:RootAccess` condition key in an IAM policy, you can prevent the creation of SageMaker notebook instances with root access enabled. This stops the problem before it occurs. B is incorrect because AWS KMS is for managing encryption keys, not controlling access to SageMaker resources. C and D are incorrect because they are reactive solutions. While they would eventually delete non-compliant instances, they allow root access instances to be created first, creating a security vulnerability during the time between creation and deletion. A preventative solution is far more secure.
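A sketch of such a deny statement as a Python dict, using the `sagemaker:RootAccess` service-specific condition key (statement ID and scope are illustrative):

```python
import json

# Hypothetical deny statement: refuse notebook instance creation or update
# unless root access is disabled.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyRootAccessNotebooks",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateNotebookInstance",
                "sagemaker:UpdateNotebookInstance",
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {"sagemaker:RootAccess": "Enabled"}
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```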
97
A company is using Amazon SageMaker to develop ML models. The company stores sensitive training data in an Amazon S3 bucket. The model training must have network isolation from the internet. Which solution will meet this requirement? A. Run the SageMaker training jobs in private subnets. Create a NAT gateway. Route traffic for training through the NAT gateway. B. Run the SageMaker training jobs in private subnets. Create an S3 gateway VPC endpoint. Route traffic for training through the S3 gateway VPC endpoint. C. Run the SageMaker training jobs in public subnets that have an attached security group. In the security group, use inbound rules to limit traffic from the internet. Encrypt SageMaker instance storage by using server-side encryption with AWS KMS keys (SSE-KMS). D. Encrypt traffic to Amazon S3 by using a bucket policy that includes a value of True for the aws:SecureTransport condition key. Use default at-rest encryption for Amazon S3. Encrypt SageMaker instance storage by using server-side encryption with AWS KMS keys (SSE-KMS).
B The correct answer is B because it uses private subnets and an S3 gateway VPC endpoint. Private subnets prevent direct internet access. The S3 gateway endpoint allows communication with S3 without traversing the public internet, ensuring network isolation. Option A is incorrect because while using private subnets is a good start, a NAT gateway still requires internet access to route traffic, defeating the purpose of network isolation. Option C is incorrect because it uses public subnets, which directly contradicts the requirement for network isolation. Even though inbound rules limit traffic, the instances are still accessible from the internet. Option D is incorrect because it focuses only on encryption in transit and at rest, addressing data security but not network isolation. While encryption is important, it doesn't prevent network access.
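A rough sketch of the two pieces (all IDs, ARNs, and bucket names below are placeholders): a gateway endpoint for S3 attached to the VPC's route tables, and a training job launched into the private subnets via the estimator's VPC settings:

```python
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# 1. Gateway endpoint keeps S3 traffic on the AWS network.
ec2 = boto3.client("ec2")
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# 2. Launch the training job inside the private subnets (VpcConfig).
session = sagemaker.Session()
estimator = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://sensitive-training-bucket/output",
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
    sagemaker_session=session,
)
```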
98
A company needs to use Retrieval Augmented Generation (RAG) to supplement an open source large language model (LLM) that runs on Amazon Bedrock. The company's data for RAG is a set of documents in an Amazon S3 bucket. The documents consist of .csv files and .docx files. Which solution will meet these requirements with the LEAST operational overhead? A. Create a pipeline in Amazon SageMaker Pipelines to generate a new model. Call the new model from Amazon Bedrock to perform RAG queries. B. Convert the data into vectors. Store the data in an Amazon Neptune database. Connect the database to Amazon Bedrock. Call the Amazon Bedrock API to perform RAG queries. C. Fine-tune an existing LLM by using an AutoML job in Amazon SageMaker. Configure the S3 bucket as a data source for the AutoML job. Deploy the LLM to a SageMaker endpoint. Use the endpoint to perform RAG queries. D. Create a knowledge base for Amazon Bedrock. Configure a data source that references the S3 bucket. Use the Amazon Bedrock API to perform RAG queries.
D The correct answer is D because it directly leverages Amazon Bedrock's built-in RAG capabilities. Options A, B, and C require significant additional steps and infrastructure setup, increasing operational overhead. A requires creating and managing a new model pipeline. B necessitates vectorizing data and managing a Neptune database. C involves fine-tuning a model, deploying it to a SageMaker endpoint, and managing that infrastructure. Option D is the simplest and most efficient approach for integrating the S3 data with the Bedrock LLM for RAG.
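Assuming a knowledge base has already been created with the S3 bucket as its data source, querying it is a single API call. The knowledge base ID and model ARN below are hypothetical placeholders, and the exact request shape should be checked against the current Bedrock Agent Runtime documentation:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.retrieve_and_generate(
    input={"text": "Summarize last quarter's customer escalations."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

# Generated answer plus the retrieved source passages it was grounded on.
print(response["output"]["text"])
for citation in response.get("citations", []):
    print(citation)
```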
99
A company plans to deploy an ML model for production inference on an Amazon SageMaker endpoint. The average inference payload size will vary from 100 MB to 300 MB. Inference requests must be processed in 60 minutes or less. Which SageMaker inference option will meet these requirements? A. Serverless inference B. Asynchronous inference C. Real-time inference D. Batch transform
B. Asynchronous inference Asynchronous inference is the best option because it supports large payloads (up to 1 GB) and long processing times (up to one hour per request), which covers both the 100-300 MB payloads and the 60-minute processing requirement. The other options are unsuitable: real-time inference limits payloads to about 6 MB with a 60-second timeout, serverless inference limits payloads to about 4 MB, and batch transform is designed for offline processing of entire datasets rather than individual inference requests that need a response within a bounded time.
100
A company receives daily .csv files about customer interactions with its ML model. The company stores the files in Amazon S3 and uses the files to retrain the model. An ML engineer needs to implement a solution to mask credit card numbers in the files before the model is retrained. Which solution will meet this requirement with the LEAST development effort? A. Create a discovery job in Amazon Macie. Configure the job to find and mask sensitive data. B. Create Apache Spark code to run on an AWS Glue job. Use the Sensitive Data Detection functionality in AWS Glue to find and mask sensitive data. C. Create Apache Spark code to run on an AWS Glue job. Program the code to perform a regex operation to find and mask sensitive data. D. Create Apache Spark code to run on an Amazon EC2 instance. Program the code to perform an operation to find and mask sensitive data.
B The best answer is B because AWS Glue's built-in Sensitive Data Detection functionality directly addresses the need to identify and mask PII, including credit card numbers, minimizing development effort. Option A is incorrect because Amazon Macie is primarily a discovery service, not a transformation service; it identifies sensitive data but doesn't automatically mask it. Options C and D require writing custom code to implement the masking logic, increasing development time and effort compared to using Glue's built-in functionality. Option D also introduces the overhead of managing an EC2 instance.
101
A medical company is using AWS to build a tool to recommend treatments for patients. The company has obtained health records and self-reported textual information in English from patients. The company needs to use this information to gain insight about the patients. Which solution will meet this requirement with the LEAST development effort? A. Use Amazon SageMaker to build a recurrent neural network (RNN) to summarize the data. B. Use Amazon Comprehend Medical to summarize the data. C. Use Amazon Kendra to create a quick-search tool to query the data. D. Use the Amazon SageMaker Sequence-to-Sequence (seq2seq) algorithm to create a text summary from the data.
B Amazon Comprehend Medical is a fully managed NLP service that extracts insights (entities, relationships, and traits) from unstructured English clinical text, so it meets the requirement with no model development. Building an RNN in SageMaker (A) or training a seq2seq model (D) requires substantial development effort, and Amazon Kendra (C) provides intelligent search rather than extracting insights from the text.
102
A company needs to extract entities from a PDF document to build a classifier model. Which solution will extract and store the entities in the LEAST amount of time? A. Use Amazon Comprehend to extract the entities. Store the output in Amazon S3. B. Use an open source AI optical character recognition (OCR) tool on Amazon SageMaker to extract the entities. Store the output in Amazon S3. C. Use Amazon Textract to extract the entities. Use Amazon Comprehend to convert the entities to text. Store the output in Amazon S3. D. Use Amazon Textract integrated with Amazon Augmented AI (Amazon A2I) to extract the entities. Store the output in Amazon S3.
A A is correct because Amazon Comprehend can directly extract entities from PDF documents, making it the fastest solution. Options B and C introduce extra steps (using an open-source OCR tool or a two-step process with Textract and Comprehend respectively) which increase processing time. Option D adds the human-in-the-loop element of Amazon A2I, significantly increasing processing time.
103
An ML engineer has deployed an Amazon SageMaker model to a serverless endpoint in production. The model is invoked by the InvokeEndpoint API operation. The model's latency in production is higher than the baseline latency in the test environment. The ML engineer thinks that the increase in latency is because of model startup time. What should the ML engineer do to confirm or deny this hypothesis? A. Schedule a SageMaker Model Monitor job. Observe metrics about model quality. B. Schedule a SageMaker Model Monitor job with Amazon CloudWatch metrics enabled. C. Enable Amazon CloudWatch metrics. Observe the ModelSetupTime metric in the SageMaker namespace. D. Enable Amazon CloudWatch metrics. Observe the ModelLoadingWaitTime metric in the SageMaker namespace.
C The correct answer is C because ModelSetupTime directly measures the time it takes to launch the compute resources for a serverless endpoint, which is the key factor contributing to increased latency due to model startup time (cold starts). Options A and B are incorrect because they don't directly address model startup time; they focus on model quality and don't pinpoint the source of latency. Option D is incorrect because ModelLoadingWaitTime is relevant for multi-model endpoints, not single-model serverless endpoints as described in the question.
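A sketch of pulling the metric with boto3 (the endpoint name is a placeholder, and the dimension names assume the standard SageMaker endpoint invocation dimensions):

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelSetupTime",
    Dimensions=[
        {"Name": "EndpointName", "Value": "serverless-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)

# High or frequent ModelSetupTime values point to cold-start overhead.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```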
104
A company is building a real-time data processing pipeline for an ecommerce application. The application generates a high volume of clickstream data that must be ingested, processed, and visualized in near real time. The company needs a solution that supports SQL for data processing and Jupyter notebooks for interactive analysis. Which solution will meet these requirements? A. Use Amazon Data Firehose to ingest the data. Create an AWS Lambda function to process the data. Store the processed data in Amazon S3. Use Amazon QuickSight to visualize the data. B. Use Amazon Kinesis Data Streams to ingest the data. Use Amazon Data Firehose to transform the data. Use Amazon Athena to process the data. Use Amazon QuickSight to visualize the data. C. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to ingest the data. Use AWS Glue with PySpark to process the data. Store the processed data in Amazon S3. Use Amazon QuickSight to visualize the data. D. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to ingest the data. Use Amazon Managed Service for Apache Flink to process the data. Use the built-in Flink dashboard to visualize the data.
D
105
An ML engineer needs to use metrics to assess the quality of a time-series forecasting model. Which metrics apply to this model? (Choose two.) A. Recall B. LogLoss C. Root mean square error (RMSE) D. Inference Latency E. Average weighted quantile loss (wQL)
C and E Root mean square error (RMSE) and average weighted quantile loss (wQL) are standard accuracy metrics for time-series forecasting models (Amazon Forecast, for example, reports both). Recall (A) and LogLoss (B) are classification metrics, and inference latency (D) measures serving performance rather than forecast quality.
106
A company runs Amazon SageMaker ML models that use accelerated instances. The models require real-time responses. Each model has different scaling requirements. The company must not allow a cold start for the models. Which solution will meet these requirements? A. Create a SageMaker Serverless Inference endpoint for each model. Use provisioned concurrency for the endpoints. B. Create a SageMaker Asynchronous Inference endpoint for each model. Create an auto scaling policy for each endpoint. C. Create a SageMaker endpoint. Create an inference component for each model. In the inference component settings, specify the newly created endpoint. Create an auto scaling policy for each inference component. Set the parameter for the minimum number of copies to at least 1. D. Create an Amazon S3 bucket. Store all the model artifacts in the S3 bucket. Create a SageMaker multi-model endpoint. Point the endpoint to the S3 bucket. Create an auto scaling policy for the endpoint. Set the parameter for the minimum number of copies to at least 1.
C Explanation: Option C is correct because it leverages SageMaker inference components, allowing independent scaling for each model hosted on a single endpoint. Setting the minimum number of copies to at least 1 ensures no cold starts. Option A (Serverless Inference) is unsuitable for real-time requirements due to potential latency. Option B (Asynchronous Inference) is not suitable for real-time responses. Option D (Multi-model endpoint with S3) doesn't offer independent scaling for each model and may introduce cold starts.
107
A company uses Amazon SageMaker for its ML process. A compliance audit discovers that an Amazon S3 bucket for training data uses server-side encryption with S3 managed keys (SSE-S3). The company requires customer managed keys. An ML engineer changes the S3 bucket to use server-side encryption with AWS KMS keys (SSE-KMS). The ML engineer makes no other configuration changes. After the change to the encryption settings, SageMaker training jobs start to fail with AccessDenied errors. What should the ML engineer do to resolve this problem? A. Update the IAM policy that is attached to the execution role for the training jobs. Include the s3:ListBucket and s3:GetObject permissions. B. Update the S3 bucket policy that is attached to the S3 bucket. Set the value of the aws:SecureTransport condition key to True. C. Update the IAM policy that is attached to the execution role for the training jobs. Include the kms:Encrypt and kms:Decrypt permissions. D. Update the IAM policy that is attached to the user that created the training jobs. Include the kms:CreateGrant permission.
C After the bucket switches to SSE-KMS with a customer managed key, the training job's execution role must also be allowed to use that KMS key; adding kms:Encrypt and kms:Decrypt (and typically kms:GenerateDataKey) to the role's policy resolves the AccessDenied errors. The S3 permissions in option A were already in place before the change, option B's aws:SecureTransport condition relates to HTTPS rather than KMS, and option D targets the wrong principal, since the jobs run under the execution role, not the user. A sketch of the added policy statement follows.
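The statement below is a sketch to attach to the execution role's policy; the key ARN is a placeholder, and kms:GenerateDataKey and kms:DescribeKey are commonly included alongside the two permissions named in the answer when the job also writes SSE-KMS objects back to S3:

```python
import json

kms_statement = {
    "Effect": "Allow",
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey",
    ],
    # Customer managed key used for SSE-KMS on the training data bucket.
    "Resource": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
}

print(json.dumps(kms_statement, indent=2))
```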
108
A company needs an AWS solution that will automatically create versions of ML models as the models are created. Which solution will meet this requirement? A. Amazon Elastic Container Registry (Amazon ECR) B. Model packages from Amazon SageMaker Marketplace C. Amazon SageMaker ML Lineage Tracking D. Amazon SageMaker Model Registry
D. Amazon SageMaker Model Registry Amazon SageMaker Model Registry provides automatic versioning of ML models, which directly addresses the company's requirement. Option A, Amazon ECR, is for storing and managing container images, not ML models. Option B, Model packages from Amazon SageMaker Marketplace, offers pre-trained models but doesn't inherently provide automatic versioning. Option C, Amazon SageMaker ML Lineage Tracking, tracks model lineage and dependencies, but doesn't automatically create versions of the models themselves.
109
An ML engineer notices class imbalance in an image classification training job. What should the ML engineer do to resolve this issue? A. Reduce the size of the dataset. B. Transform some of the images in the dataset. C. Apply random oversampling on the dataset. D. Apply random data splitting on the dataset.
C. Apply random oversampling on the dataset. This is the correct answer because class imbalance in a dataset means that some classes have significantly more examples than others. Oversampling artificially increases the number of instances in the minority classes, balancing the dataset and improving the model's ability to learn from underrepresented classes. A is incorrect because reducing the dataset size would likely exacerbate the class imbalance problem, not solve it. B is incorrect because transforming images might help with other issues (e.g., improving image quality), but it doesn't directly address the class imbalance. D is incorrect because random data splitting is for creating training, validation, and testing sets, and doesn't modify class distribution within the dataset.
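A small illustrative sketch of random oversampling on an image-metadata table (the file names and labels are made up); libraries such as imbalanced-learn offer the same idea as a ready-made RandomOverSampler:

```python
import pandas as pd

# Hypothetical image-metadata frame with an imbalanced "label" column.
df = pd.DataFrame({
    "image_path": [f"img_{i}.jpg" for i in range(10)],
    "label": ["cat"] * 8 + ["dog"] * 2,
})

# Randomly oversample every minority class up to the majority class count.
max_count = df["label"].value_counts().max()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(max_count, replace=True, random_state=42))
      .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```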
110
An ML engineer needs to merge and transform data from two sources to retrain an existing ML model. One data source consists of .csv files stored in an Amazon S3 bucket (millions of records per file). The other data source is an Amazon Aurora DB cluster. The merged and transformed data must be written to a second S3 bucket weekly. Which solution offers the LEAST operational overhead? A. Create a transient Amazon EMR cluster every week. Use the cluster to run an Apache Spark job to merge and transform the data. B. Create a weekly AWS Glue job that uses the Apache Spark engine. Use DynamicFrame native operations to merge and transform the data. C. Create an AWS Lambda function that runs Apache Spark code every week to merge and transform the data. Configure the Lambda function to connect to the initial S3 bucket and the DB cluster. D. Create an AWS Batch job that runs Apache Spark code on Amazon EC2 instances every week. Configure the Spark code to save the data from the EC2 instances to the second S3 bucket.
B The correct answer is B because AWS Glue is a fully managed ETL service. It handles scheduling, resource management, and integration with S3 and Aurora, minimizing operational overhead compared to managing EMR clusters (A), Lambda functions (C), or AWS Batch jobs (D). Options A, C, and D require more manual configuration and management of infrastructure, increasing operational overhead. Lambda (C) especially has limitations on execution time and memory that might be problematic for large datasets.
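A sketch of what the Glue job script might look like (database, table, key, and bucket names are placeholders; the Aurora table is assumed to be cataloged through a JDBC connection, and the weekly cadence would come from a Glue schedule trigger):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source 1: the .csv files in S3, cataloged by a crawler.
csv_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="s3_transactions_csv"
)

# Source 2: the Aurora table, cataloged through a JDBC connection.
aurora_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="aurora_customers"
)

# Merge on the shared key and write the result to the second bucket as Parquet.
merged = Join.apply(csv_dyf, aurora_dyf, "customer_id", "customer_id")
glue_context.write_dynamic_frame.from_options(
    frame=merged,
    connection_type="s3",
    connection_options={"path": "s3://merged-output-bucket/weekly/"},
    format="parquet",
)

job.commit()
```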
111
An ML engineer needs to ensure that a dataset complies with regulations for personally identifiable information (PII). The ML engineer will use the data to train an ML model on Amazon SageMaker instances. SageMaker must not use any of the PII. Which solution will meet these requirements in the MOST operationally efficient way? A. Use the Amazon Comprehend DetectPiiEntities API call to redact the PII from the data. Store the data in an Amazon S3 bucket. Access the S3 bucket from the SageMaker instances for model training. B. Use the Amazon Comprehend DetectPiiEntities API call to redact the PII from the data. Store the data in an Amazon Elastic File System (Amazon EFS) file system. Mount the EFS file system to the SageMaker instances for model training. C. Use AWS Glue DataBrew to cleanse the dataset of PII. Store the data in an Amazon Elastic File System (Amazon EFS) file system. Mount the EFS file system to the SageMaker instances for model training. D. Use Amazon Macie for automatic discovery of PII in the data. Remove the PII. Store the data in an Amazon S3 bucket. Mount the S3 bucket to the SageMaker instances for model training.
A The most operationally efficient solution is A. Amazon S3 is designed for scalability and performance when accessing large datasets, making it ideal for serving data to SageMaker instances. Directly accessing S3 from SageMaker is more efficient than mounting an EFS file system (options B and C), which adds network latency and complexity. Option D is less efficient because while Amazon Macie can discover PII, it doesn't inherently remove it, so an additional step is required before storage and access. Option A uses Comprehend, which is designed for PII detection, to redact the data before it is stored in S3, creating a streamlined process. Options B and C introduce the overhead of EFS, which is unnecessary for this task.
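A minimal sketch of offset-based redaction with the synchronous DetectPiiEntities API (for whole files, the asynchronous PII detection job would be the closer fit; the sample text is made up):

```python
import boto3

comprehend = boto3.client("comprehend")

def redact_pii(text: str) -> str:
    """Replace each detected PII span with its entity type, e.g. [NAME]."""
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Redact from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (
            text[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + text[entity["EndOffset"] :]
        )
    return text

print(redact_pii("Contact Jane Doe at jane@example.com or 555-0100."))
```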
112
A company must install a custom script on any newly created Amazon SageMaker notebook instance. Which solution will meet this requirement with the LEAST operational overhead? A. Create a lifecycle configuration script to install the custom script when a new SageMaker notebook is created. Attach the lifecycle configuration to every new SageMaker notebook as part of the creation steps. B. Create a custom Amazon Elastic Container Registry (Amazon ECR) image that contains the custom script. Push the ECR image to a Docker registry. Attach the Docker image to a SageMaker Studio domain. Select the kernel to run as part of the SageMaker notebook. C. Create a custom package index repository. Use AWS CodeArtifact to manage the installation of the custom script. Set up AWS PrivateLink endpoints to connect CodeArtifact to the SageMaker instance. Install the script. D. Store the custom script in Amazon S3. Create an AWS Lambda function to install the custom script on new SageMaker notebooks. Configure Amazon EventBridge to invoke the Lambda function when a new SageMaker notebook is initialized.
A A is the correct answer because lifecycle configurations are designed specifically for automating tasks during the creation and modification of SageMaker notebook instances. This method directly addresses the requirement with minimal additional infrastructure or management. B is incorrect because using ECR and Docker images introduces unnecessary complexity for a simple script installation. It adds the overhead of managing container images and potentially increases startup time. C is incorrect because using CodeArtifact and PrivateLink introduces significant complexity and operational overhead for managing a simple script installation. This solution is overkill for the problem. D is incorrect because using Lambda and EventBridge adds more moving parts and introduces latency compared to the direct approach of lifecycle configurations. It creates a more complex system to manage and increases potential points of failure.
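A sketch of creating and attaching a lifecycle configuration with boto3 (the script contents, names, and role ARN are placeholders; the script content must be base64-encoded):

```python
import base64
import boto3

sm = boto3.client("sagemaker")

# Placeholder for the company's custom installation script.
on_create_script = """#!/bin/bash
set -e
echo "installing custom tooling" >> /home/ec2-user/setup.log
"""

sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-custom-script",
    OnCreate=[{"Content": base64.b64encode(on_create_script.encode()).decode()}],
)

# Reference the lifecycle configuration when creating each notebook instance.
sm.create_notebook_instance(
    NotebookInstanceName="team-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    LifecycleConfigName="install-custom-script",
)
```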
113
A medical company needs to store clinical data that includes personally identifiable information (PII) and protected health information (PHI). An ML engineer needs to implement a solution to ensure that the PII and PHI are not used to train ML models. Which solution will meet these requirements? A. Store the clinical data in Amazon S3 buckets. Use AWS Glue DataBrew to mask the PII and PHI before the data is used for model training. B. Upload the clinical data to an Amazon Redshift database. Use built-in SQL stored procedures to automatically classify and mask the PII and PHI before the data is used for model training. C. Use Amazon Comprehend to detect and mask the PII before the data is used for model training. Use Amazon Comprehend Medical to detect and mask the PHI before the data is used for model training. D. Create an AWS Lambda function to encrypt the PII and PHI. Program the Lambda function to save the encrypted data to an Amazon S3 bucket for model training.
C C is correct because Amazon Comprehend and Amazon Comprehend Medical are specifically designed to identify and mask PII and PHI respectively. This directly addresses the requirement of preventing the use of this sensitive data in model training. A is incorrect because while DataBrew can perform data masking, it's not specifically designed for identifying PII and PHI with the same accuracy and precision as Comprehend and Comprehend Medical. It relies on user-defined rules and might miss some instances. B is incorrect because while Redshift can store data and potentially have stored procedures for masking, it doesn't inherently possess the capabilities of specialized services like Comprehend and Comprehend Medical for reliably identifying and masking PII and PHI. D is incorrect because encrypting the data prevents direct access to PII and PHI, but it doesn't prevent metadata leakage or the possibility of unintended data exposure during the model training process if the encrypted data is still used. The question specifies that PII and PHI should *not* be used for training.