amazon-certified-machine-learning-specialty Flashcards

(368 cards)

1
Q

1 - A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incentive. The model produces the following confusion matrix after evaluating on a test dataset of 100 customers: Based on the model evaluation results, why is this a viable model for production?

[https://www.examtopics.com/assets/media/exam-media/04145/0000200001.jpg] - A.. The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives.
B.. The precision of the model is 86%, which is less than the accuracy of the model.
C.. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.
D.. The precision of the model is 86%, which is greater than the accuracy of the model.

A

C - The Answer is A.
Reasons:
1. Accuracy is 86%.
2. FN = 4, FP = 10. The question asks why this is a feasible model, i.e., why it is working. So it is not asking for an explanation of why the unit cost of churn (FN) is greater than the cost of the incentive (FP). It is asking, from the matrix's results, the numbers themselves: FN (4) is less than FP (10). The model successfully keeps the number of FNs smaller than the number of FPs.

Such a question cannot be answered, because we do not know how much greater the cost of churn is than the cost of the incentive.
CoC - Cost of Churn
CoI - Cost of Incentive
cost incurred by the company as a result of false positives = CoI * 10
cost incurred by the company as a result of false negatives = CoC * 4
So is it the case that CoI * 10 > CoC * 4, i.e., CoI > 0.4 * CoC, or rather CoI < 0.4 * CoC? We don't know, because we don't know what "far greater" means: is it 100% greater, 500% greater, or some other amount?

The answer is A

The answer is C.

Even though there are 10 false positives compared to 4 false negatives, the cost incurred by offering an incentive unnecessarily (false positive) is significantly less than the cost of losing a customer (false negative). This risk management aligns well with the company’s strategy to minimize expensive churn events.

Thus, the model is viable for production because it achieves 86% accuracy and, importantly, the cost of false positives (incentives given) is much lower than the cost associated with false negatives (lost customers).

Option C is indeed the correct choice. The model is 86% accurate, and the cost of false positives (offering incentives) is less than the cost of false negatives (losing customers). This makes the model viable for production.

Changing my earlier answer from A to C. The cost of the 10 FPs is lower than the cost of the 4 FNs.

Some people who voted A have the right idea, but they chose the wrong option because they need to read the question again. We all agree that the cost of churn is much higher. A false negative means a customer churned and you didn't do anything about it (because your model said "churn = no"). A false positive means you tried to keep a customer who was not going to leave anyway (because your model said "churn = yes"). As you can see, a false negative is far costlier and should be avoided; therefore the answer is C.

The cost incurred for churn is higher than the incentive, so the cost of an FN is higher than that of an FP. And accuracy is 86%.

what tomatoteacher said

Accuracy is 86%, so it should be A or C. The loss from churn is very high compared to the incentive, meaning it is okay to give the incentive to customers who are not going to leave, which is the false-positive portion.

Should be A. Since the cost of churn is much higher, the priority should be on minimizing FN, and a viable model should be one with FN < FP, shouldn't it?

Definitely C. If you look at the same question in https://aws.amazon.com/blogs/machine-learning/predicting-customer-churn-with-amazon-machine-learning/, it is the same question, but the confusion matrix is flipped in that case (TP top left, TN bottom right). When you miss an actual churn (FN), it costs the company more. Therefore the answer is C, 100%. I will die on this hill. I spent 20 minutes researching this to be certain. Most people who put A are incorrectly saying FPs are actual churns that were predicted as no churn; that is what an FN is. You can trust me on this.

There are more FP’s than FN’s, however the costs of FN’s are far larger than that of FP’s. So:
numberof(FP) > numberof(FN), costperunit(FP) &laquo_space;costperunit(FN). This itself could suggest that totalcosts(FP) < totalcosts(FN), but would be somewhat subjective, since it is not stated how far the unitary costs are.

What is suggested, however, is that the model is indeed viable (question asks WHY the model is viable, and not WHETHER it’s viable).

If the model didn’t exist, there would be no way that there are FP’s or FN’s, but churns would still exist, which have the same cost as FN’s.

So it means the total costs with FP’s must be less than the total costs with FN’s (churns).

Correct Answer C.
Explanation: The model’s accuracy is calculated as (True Positives + True Negatives) / Total predictions, which is (10 + 76) / 100 = 0.86, or 86%. The cost of false positives (customers predicted to churn but don’t) is less than the cost of false negatives (customers who churn but were not predicted to). Offering incentives to the false positives incurs less cost than losing customers due to false negatives. Therefore, this model is viable for production.

A. NO - accuracy is TP+TN / Total = (76+10)/100 = 86%; we know the model is working, so the cost of giving incentives to the wrong customers (FP) is less than the cost of customers we missed (FN), cost(FP) < cost(FN)
B. NO - accuracy is 86%, precision is TP / (TP+FP) = 10 /(10+10) = 50%
C. YES - accuracy is TP+TN / Total = (76+10)/100 = 86%; we know the model is working, so the cost of giving incentives to the wrong customers (FP) is less than the cost of customers we missed (FN), cost(FP) < cost(FN)
D. NO - accuracy is 86%, precision is TP / (TP+FP) = 10 /(10+10) = 50%
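
For reference, a minimal Python sketch of the metric arithmetic quoted above, using the counts read from the question's confusion matrix (TP = 10, TN = 76, FP = 10, FN = 4); the variable names are just for illustration:

```python
# Confusion-matrix counts as read from the question's image (per the thread above).
tp, tn, fp, fn = 10, 76, 10, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (10 + 76) / 100 = 0.86
precision = tp / (tp + fp)                   # 10 / 20 = 0.50
recall = tp / (tp + fn)                      # 10 / 14 ≈ 0.71

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```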

C is the correct answer - https://www.examtopics.com/discussions/amazon/view/43814-exam-aws-certified-machine-learning-specialty-topic-1/

2
Q

2 - A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users’ behavior and product preferences to predict which products users would like based on the users’ similarity to other users. What should the Specialist do to meet this objective? - A.. Build a content-based filtering recommendation engine with Apache Spark ML on Amazon EMR
B.. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
C.. Build a model-based filtering recommendation engine with Apache Spark ML on Amazon EMR
D.. Build a combinative filtering recommendation engine with Apache Spark ML on Amazon EMR

A

B - B
see https://en.wikipedia.org/wiki/Collaborative_filtering#Model-based

Content-based filtering relies on similarities between the features of items, whereas collaborative filtering relies on preferences from other users and how they respond to similar items.

Answer is B : Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
Collaborative filtering focuses on user behavior and preferences therefore it is perfect for predicting products based on user similarities.

B. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.

Collaborative filtering is a technique used to recommend products to users based on their similarity to other users. It is a widely used method for building recommendation engines. Apache Spark ML is a distributed machine learning library that provides scalable implementations of collaborative filtering algorithms. Amazon EMR is a managed cluster platform that provides easy access to Apache Spark and other distributed computing frameworks.

Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR. (TRUE)

Collaborative filtering is a commonly used method for recommendation systems that aims to predict the preferences of a user based on the behavior of similar users. In the case described, the objective is to use users’ behavior and product preferences to predict which products they want, making collaborative filtering a good fit.

Apache Spark ML is a machine learning library that provides scalable, efficient algorithms for building recommendation systems, while Amazon EMR provides a cloud-based platform for running Spark applications.
You can find more detail in https://www.udemy.com/course/aws-certified-machine-learning-specialty-2023

collaborative filtering

‘Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.’

Source: https://realpython.com/build-recommendation-engine-collaborative-filtering/#what-is-collaborative-filtering

A. NO - content-based filtering looks at similarities with items the user already looked at, not activities of other users
B. YES - state of the art
C. NO - too generic terms, everything is a model
D. NO - combinative filtering does not exist

Collaborative filtering is a technique used by recommendation engines to make predictions about the interests of a user by collecting preferences or taste information from many users. The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.

B. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.

I think it should be B.

Content-based recommendations rely on product similarity. If a user likes a product, products that are similar to that one will be recommended. Collaborative recommendations are based on user similarity. If you and other users have given similar reviews to a range of products, the model assumes it is likely that other products those other people have liked but that you haven’t purchased should be a good recommendation for you.
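
For illustration, a minimal sketch of collaborative filtering with Spark ML's ALS implementation, the kind of job that would run on an EMR cluster; the column names and toy ratings below are placeholders, not from the question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("collab-filtering-demo").getOrCreate()

# Toy user-item interactions (placeholder data): userId, itemId, rating.
ratings = spark.createDataFrame(
    [(0, 1, 4.0), (0, 3, 5.0), (1, 1, 3.0), (1, 2, 4.0), (2, 2, 5.0), (2, 3, 1.0)],
    ["userId", "itemId", "rating"],
)

# ALS factorizes the user-item matrix so similar users get similar latent vectors.
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=8, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-2 item recommendations per user, based on other users' preferences.
model.recommendForAllUsers(2).show(truncate=False)
```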

feature engineering is required, use model based

Answer is “B”

go for B

B is correct
https://aws.amazon.com/blogs/big-data/building-a-recommendation-engine-with-spark-ml-on-amazon-emr-using-zeppelin/ - https://www.examtopics.com/discussions/amazon/view/11248-exam-aws-certified-machine-learning-specialty-topic-1/

3
Q

3 - A Mobile Network Operator is building an analytics platform to analyze and optimize a company’s operations using Amazon Athena and Amazon S3. The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement? - A.. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet
B.. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet.
C.. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet.
D.. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.

A

D - Answer is B

You cannot use AWS Glue for streaming data. Clearly, B is incorrect.

Even if the exam's answer is based on a solution from before AWS added the capability for AWS Glue to process streaming data, this answer is still correct, as Kinesis would output the data to S3 and Glue would pick it up from there and convert it to Parquet. The question does not say the data must be converted to Parquet in real time; it only says the CSV data is received as a stream in real time.

Actually, the question says "The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3," which is the same as saying the data must be converted in real time.

AWS Glue can do it now (May 2020):
https://aws.amazon.com/jp/blogs/news/new-serverless-streaming-etl-with-aws-glue/

This link is in Japanese

In support of B:
https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/

D is wrong, as Kinesis Data Firehose can convert from JSON to Parquet, but here we have CSV.
B is correct, and here is another proof link: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f

https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
You are right.

https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first

But there is no Lambda in D

But there’s a D in Lambda

Kinesis Data Firehose supports real-time streaming ingestion and can automatically convert CSV to Parquet before storing it in S3.

Amazon Kinesis Data Streams + Amazon Kinesis Data Firehose
Effort: Lowest effort
Why?
Amazon Kinesis Data Firehose natively supports real-time CSV ingestion and automatic conversion to Parquet.
Fully managed, serverless, and directly integrates with Amazon S3.
Requires zero infrastructure management compared to other solutions.

I take this back; the answer should be B. On researching further, it is JSON that Kinesis Data Firehose supports converting (to Parquet or ORC). So the answer is B: not optimal, but the closest suitable option.
Amazon Kinesis Data Streams + AWS Glue: AWS Glue can batch-process CSV and convert it to Parquet for S3. However, Glue is traditionally batch-oriented, not real-time.
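
As a rough sketch of the Glue route discussed above (not an official recipe), a Glue PySpark job that reads the delivered CSV from S3 and writes it back as Parquet; the bucket paths are placeholders, and the awsglue imports assume the script runs inside the Glue job runtime:

```python
# Runs inside the AWS Glue job runtime, where the awsglue libraries are provided.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw CSV files that the stream delivered to S3 (placeholder path).
csv_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/csv/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same records back out as Parquet for Athena to query (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=csv_frame,
    connection_type="s3",
    connection_options={"path": "s3://example-analytics-bucket/parquet/"},
    format="parquet",
)
```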

Although I’d go with Glue and option B I’m pretty sure that this is one of those “15 unscored questions that do not affect your score. AWS collects information about performance on these unscored questions to evaluate these questions for future use as scored questions”

Just for fun I asked perplexity, chatgpt, gemini, deepseek and claude: all gave D as first response

When I pointed out that "according to this https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html Kinesis can't convert CSV directly to Parquet; it needs a Lambda," each model responded in a different way (some of them contradictory).

My reasoning is that D (Kinesis + Firehose) is incorrect because Firehose does not support direct CSV-to-Parquet conversion and needs a Lambda that is not mentioned in the option. But arguing about questions like this one is nothing but a big waste of time ;-P

D
Kinesis Data Firehose is designed specifically for streaming data delivery to destinations like S3. It has built-in support for data format conversion, including CSV to Parquet. This eliminates the need for managing separate transformation services like Glue or Spark. The setup is significantly simpler: you configure a Firehose delivery stream, specify the data format conversion, and point it to your S3 bucket.

Therefore, option D requires the least implementation effort because it leverages a fully managed service (Kinesis Data Firehose) with built-in functionality for data format conversion.

Amazon Kinesis Data Firehose can only convert from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3.

Answer B.
Yes, Amazon Kinesis Data Firehose can convert CSV to Apache Parquet, but you need to use a Lambda function to transform the CSV to JSON first. Here the question asks for the least effort to build, so B is the right answer.

Use Amazon Kinesis Data Streams to ingest customer data and configure a Kinesis Data Firehose delivery stream as a consumer to convert the data into Apache Parquet is incorrect. Although this could be a valid solution, it entails more development effort as Kinesis Data Firehose does not support converting CSV files directly into Apache Parquet, unlike JSON.

Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first.
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html

Between B and D chose D.
Because Firehose can’t handle csv directly.

Between B and D chose B.
Because Firehose can’t handle csv directly.

Answer is B.
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
“If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first.”

You need Glue to convert to Parquet.

D for sure; Firehose can convert CSV to Parquet.

The answer is unfortunately B. Firehose cannot convert comma-separated CSV to Parquet directly.

B is not good, but given the context of "finding the solution that requires the least effort to implement," option D is the most suitable choice. Ingesting data from Amazon Kinesis Data Streams and using Amazon Kinesis Data Firehose to convert the data to Parquet format is a serverless approach. It allows for automatic data transformation and storage in Amazon S3 without the need for additional development or management of data conversion logic. Therefore, under the given conditions, option D is considered the solution that requires the "least effort" to implement.

Kinesis Data Firehose doesn’t convert anything, it rather calls a lambda function to do so which is the overhead we want to avoid. B is the correct answer.

Amazon Kinesis Data Streams is a service that can capture, store, and process streaming data in real time. Amazon Kinesis Data Firehose is a service that can deliver streaming data to various destinations, such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Amazon Kinesis Data Firehose can also transform the data before delivering it, such as converting the data format, compressing the data, or encrypting the data. One of the supported data formats that Amazon Kinesis Data Firehose can convert to is Apache Parquet, which is a columnar storage format that can improve the performance and cost-efficiency of analytics queries. By using Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose, the Mobile Network Operator can ingest the .CSV data from the source systems and use Amazon Kinesis Data Firehose to convert the data into Parquet before storing it on Amazon S3.

Firehose cannot natively do the conversion. It requires a Lambda function for that purpose. - https://www.examtopics.com/discussions/amazon/view/8303-exam-aws-certified-machine-learning-specialty-topic-1/

4
Q

4 - A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminants for the next 2 days in the city. As this is a prototype, only daily data from the last year is available. Which model is MOST likely to provide the best results in Amazon SageMaker? - A.. Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
B.. Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.
C.. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
D.. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier.

A

C - answer should be C

go for C

Amazon SageMaker Linear Learner (Regressor)
Why?
The Linear Learner algorithm can be used for time series regression.
Using predictor_type=regressor, it learns trends and patterns in historical data and extrapolates future values.
Given limited historical data (only 1 year), a simple linear regression model might perform well as a baseline.
While deep learning models (like Amazon Forecast) may be more advanced, Linear Learner is easier to implement and train for a prototype.

A. NO - kNN is not for forecasting; it is based on similarities
B. NO - RCF is for anomaly detection
C. YES - linear regression is good for forecasting
D. NO - we don't want to classify

The reason for this choice is that the Linear Learner algorithm is a versatile algorithm that can be used for both regression and classification tasks. Regression is a type of supervised learning that predicts a continuous numeric value, such as the air quality in parts per million. The predictor_type parameter specifies whether the algorithm should perform regression or classification. Since the goal is to forecast a numeric value, the predictor_type should be set to regressor.
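
A minimal SageMaker Python SDK (v2) sketch of training the built-in Linear Learner with predictor_type set to regressor, as described above; the role ARN, bucket paths, and instance type are placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Built-in Linear Learner container image for the current region.
container = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/air-quality/output/",       # placeholder bucket
    sagemaker_session=session,
)

# predictor_type='regressor' tells Linear Learner to predict a continuous value (ppm).
estimator.set_hyperparameters(predictor_type="regressor")

# CSV training data with the target value in the first column (placeholder location).
train_input = TrainingInput("s3://example-bucket/air-quality/train/", content_type="text/csv")
estimator.fit({"train": train_input})
```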

This blog post is written in Japanese, but it describes using Linear Learner for air pollution prediction:
https://aws.amazon.com/jp/blogs/news/build-a-model-to-predict-the-impact-of-weather-on-urban-air-quality-using-amazon-sagemaker/

The predictor_type hyperparameter is either "binary_classifier", "multiclass_classifier", or "regressor"; there is no plain "classifier", so the answer is C.

The answer should be C.

kNN will require a large value of k to avoid overfitting, and we only have one year's worth of data; kNN also has a difficult time extrapolating if the air quality series contains a trend.

If we had assurances there is no trend in the air quality series (no extrapolation), and we had enough data, then kNN should beat a linear model … I am inclined to go for C just going off of the cue that “only daily data from last year is available”

I agree with your analysis; to expand on it further: we have no info about the dataset's features. "Only daily data from the last year is available" makes me think we could be in a situation where our dataset consists of just a timestamp and a pollution value, so kNN would be pretty useless here.

Random cut forests in timeseries are used for anomaly detection, and not for forecasting. KNN’s are classification algorithms. You would use the Linear Learner as a regressor, since forecasting falls into the domain of regression.

I mean, you could use KNN’s for regression, but for forecasting I don’t think so

KNN isn’t for time series predicting, go for A!

Im sorry, I wanted to say go for C!

Creating a machine learning model to predict air quality
To start small, we will follow the second approach, where we will build a model that will predict the NO2 concentration of any given day based on wind speed, wind direction, maximum temperature, pressure values of that day, and the NO2 concentration of the previous day. For this we will use the Linear Learner algorithm provided in Amazon SageMaker, enabling us to quickly build a model with minimal work.

Our model will consist of taking all of the variables in our dataset and using them as features of the Linear Learner algorithm available in Amazon SageMaker

Answer should be A.

k-Nearest-Neighbors (kNN) algorithm will provide the best results for this use case as it is a good fit for time series data, especially for predicting continuous values. The predictor_type of regressor is also appropriate for this task, as the goal is to forecast a continuous value (air quality in parts per million of contaminants). The other options are also viable, but may not provide as good of results as the kNN algorithm, especially with limited data.

using the Amazon SageMaker Linear Learner algorithm with a predictor_type of regressor, may still provide reasonable results, but it assumes a linear relationship between the input features and the target variable (air quality), which may not always hold in practice, especially with complex time series data. In such cases, non-linear models like kNN may perform better. Furthermore, the kNN algorithm can handle irregular patterns in the data, which may be present in the air quality data, and provide more accurate predictions.

Answer is “C” !!!

answer C

I go with A. Linear regression is not suitable for time series data. There is a library that implements kNN for time series: https://cran.r-project.org/web/packages/tsfknn/vignettes/tsfknn.html

I mean the air quality has many feature correlations that are not linear. - https://www.examtopics.com/discussions/amazon/view/12382-exam-aws-certified-machine-learning-specialty-topic-1/

5
Q

5 - A Data Engineer needs to build a model using a dataset containing customer credit card information. How can the Data Engineer ensure the data remains encrypted and the credit card information is secure? - A.. Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC. Use the SageMaker DeepAR algorithm to randomize the credit card numbers.
B.. Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers.
C.. Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC. Use the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers.
D.. Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.

A

D - Why not D? When the data is encrypted on S3 and SageMaker uses the same AWS KMS key, it can use the encrypted data there.

should be D

Should be D.
Use Glue to do ETL to Hash the card number

Answer would be D

D is correct

https://aws.amazon.com/blogs/big-data/detect-and-process-sensitive-data-using-aws-glue-studio/
AWS Glue can be used for detecting and processing sensitive data.

Use AWS KMS for encryption and AWS Glue to redact credit card numbers
Reasoning:
AWS KMS (Key Management Service) encrypts data at rest in Amazon S3 and during processing in Amazon SageMaker.
AWS Glue can be used to redact sensitive data before processing, ensuring that credit card numbers are removed from datasets before being used for ML.
Complies with PCI DSS requirements for handling payment information securely.

The reason for this choice is that AWS KMS is a service that allows you to easily create and manage encryption keys and control the use of encryption across a wide range of AWS services and in your applications. By using AWS KMS, you can encrypt the data on Amazon S3, which is a durable, scalable, and secure object storage service, and on Amazon SageMaker, which is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. This way, you can protect the data at rest and in transit.
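
As a hedged sketch of the KMS side of option D (not the exact setup from the question): encrypt the redacted dataset with a KMS key when writing it to S3, and pass the same key to a SageMaker estimator for its training volume and output artifacts. The key ARN, bucket, image URI, and role below are placeholders:

```python
import boto3
from sagemaker.estimator import Estimator

# Placeholder KMS key ARN.
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"

# Server-side encrypt the (already redacted) dataset with the KMS key when writing to S3.
s3 = boto3.client("s3")
with open("customers_redacted.csv", "rb") as f:
    s3.put_object(
        Bucket="example-secure-bucket",
        Key="redacted/customers.csv",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId=kms_key_id,
    )

# Use the same key to encrypt the SageMaker training volume and the model artifacts.
estimator = Estimator(
    image_uri="<algorithm-image-uri>",                     # placeholder container image
    role="arn:aws:iam::123456789012:role/ExampleRole",     # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_kms_key=kms_key_id,
    output_kms_key=kms_key_id,
    output_path="s3://example-secure-bucket/model-output/",
)
```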

A. NO - no need for custom encryption
B. NO - IAM Policies are not to encrypt
C. NO - launch configuration is not to encrypt
D. YES

I think D is correct.

It’s D, KMS key can be used for encrypting the data at rest!

agreed with D

IMHO, the problem with the question is that it is not clear whether the credit card number is used in the model. In that case discarding is never a good option. Hashing should be a safe option to keep it in the learning path

It’s gotta be D but C is a clever fake answer. Use PCA to reduce the length of the credit card number? That’s a clever joke, as if reducing the length of a character string is the same as reducing dimensionality in a feature set.

Can Glue do redaction?

Just have the Glue job remove the credit card column.

Encryption on AWS can be done using KMS so D is the answer

D is correct

D..KMS fully managed and other options are too whacky..

D is correct

Ans D is correct - https://www.examtopics.com/discussions/amazon/view/9818-exam-aws-certified-machine-learning-specialty-topic-1/

6
Q

6 - A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance’s Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance’s EBS volume or Amazon EC2 instance within the VPC. Why is the ML Specialist not seeing the instance visible in the VPC? - A.. Amazon SageMaker notebook instances are based on the EC2 instances within the customer account, but they run outside of VPCs.
B.. Amazon SageMaker notebook instances are based on the Amazon ECS service within customer accounts.
C.. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
D.. Amazon SageMaker notebook instances are based on AWS ECS instances running within AWS service accounts.

A

C - I think the answer should be C

The correct answer HAS TO be A

The instances are running in customer accounts, but in an AWS-managed VPC, while exposing an ENI to the customer VPC if one was chosen.
See explanation at https://aws.amazon.com/blogs/machine-learning/understanding-amazon-sagemaker-notebook-instance-networking-configurations-and-advanced-routing-options/

Can’t be A because A says “but they run outside of VPCs”, which is not correct. They are attached to VPC, but it can either be AWS Service VPC or Customer VPC, or Both, as per the explanation url you provided.

This is exactly right. According to that document, if the notebook instance is not in a customer VPC, then it has to be in the Sagemaker managed VPC. See Option 1 in that document.

Actually your link says: The notebook instance is running in an Amazon SageMaker managed VPC as shown in the above diagram. That means the correct answer is C. An Amazon SageMaker managed VPC can only be created in an Amazon managed Account.

C. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
Why?
Amazon SageMaker does use EC2 instances, but they are not directly managed within the customer’s AWS account.
Instead, these instances are provisioned within AWS-managed service accounts, which is why they do not appear within the customer’s VPC or EC2 console.
The only way to access the underlying EBS volume is via SageMaker APIs, rather than the EC2 console.

Although I’d go with Glue and option B I’m pretty sure that this is one of those “15 unscored questions that do not affect your score. AWS collects information about performance on these unscored questions to evaluate these questions for future use as scored questions”

Just for fun I asked perplexity, chatgpt, gemini, deepseek and claude: all gave D as first response

When I pointed out that “according to this https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html Kinesis can’t convert directly cvs to parquet. It needs a Lambda” each model responded in a different way (some of them contradictory).

My reasoning is that D (Kinesis + Firehose) is incorrect because Firehose does not support direct CSV-to-Parquet conversion and needs a Lambda not mentioned in the option. But discussing about questions like this one is nothing but I big waste of time ;-P

Forget about this please I posted this here incorrectly. This corresponds to Question 3. Apologies

Amazon SageMaker notebook instances are indeed based on EC2 instances, but they are managed by the SageMaker service and do not appear as standard EC2 instances in the customer’s VPC. Instead, they run in a managed environment that abstracts away the underlying EC2 instances, which is why the ML Specialist cannot see the instance in the VPC.

The explanation for this choice is that Amazon SageMaker notebook instances are fully managed by AWS and run on EC2 instances that are not visible to customers. These EC2 instances are launched in AWS-owned accounts and are isolated from customer accounts by using AWS PrivateLink. This means that customers cannot access or manage these EC2 instances directly, nor can they see the EBS volumes attached to them.

A. NO - EC2 instances within the customer account are necessarily in a VPC
B. NO - Amazon ECS service is not within customer accounts
C. YES - EC2 instances running within AWS service accounts are not visible to customer account
D. NO - SageMaker manages EC2 instance, not ECS

A. NO. If the EC2 instance of the notebook was in the customer account, customer would be able to see it. Also, “they run outside VPCs” isn’t true as they run in service managed VPC or can be also attached to customer provided VPC -> https://aws.amazon.com/blogs/machine-learning/understanding-amazon-sagemaker-notebook-instance-networking-configurations-and-advanced-routing-options/
B. NO, Notebooks are based on EC2 + EBS
C. YES -> https://aws.amazon.com/blogs/machine-learning/understanding-amazon-sagemaker-notebook-instance-networking-configurations-and-advanced-routing-options/
D. NO, Notebooks are based on EC2 + EBS

I also actually tested it in my account: I created a notebook and attached it to my VPC. I was not able to see the EC2 instance behind the notebook, but I was able to see its ENI with the following description: "[Do not delete] Network Interface created to access resources in your VPC for SageMaker Notebook Instance …"

already given below

I am pretty sure the answer is A : Amazon SageMaker notebook instances are indeed based on EC2 instances, and these instances are within your AWS customer account. However, by default, SageMaker notebook instances run outside of your VPC (Virtual Private Cloud), which is why they may not be visible within your VPC. SageMaker instances are designed to be easily accessible for data science and machine learning tasks, which is why they typically do not reside within a VPC. If you need them to operate within a VPC, you can configure them accordingly, but this is not the default behavior.

I think it should be C.

Per https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html it’s C

Notebooks can run inside AWS managed VPC or customer managed VPC

C, check the diagram in https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html

When a SageMaker notebook instance is launched in a VPC, it creates an Elastic Network Interface (ENI) in the subnet specified, but the underlying EC2 instance is not visible in the VPC. This is because the EC2 instance is managed by AWS, and it is outside of the VPC. The ENI acts as a bridge between the VPC and the notebook instance, allowing network connectivity between the notebook instance and other resources in the VPC. Therefore, the EBS volume of the notebook instance is also not visible in the VPC, and you cannot take a snapshot of the volume using VPC-based tools. Instead, you can create a snapshot of the EBS volume directly from the SageMaker console, AWS CLI, or SDKs.

what you described is C
“This is because the EC2 instance is managed by AWS, and it is outside of the VPC.”

Notebooks run inside a VPC not outside!

Definitely C - https://www.examtopics.com/discussions/amazon/view/11559-exam-aws-certified-machine-learning-specialty-topic-1/

7
Q

7 - A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant. Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test? - A.. Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced.
B.. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker.
C.. Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the log data as it is generated by Amazon SageMaker.
D.. Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.

A

B - Agreed. Ans is B

Generate an Amazon CloudWatch dashboard to create a single view for latency, memory utilization, and CPU utilization
Why?
Amazon SageMaker automatically pushes latency and instance utilization metrics to CloudWatch.
CloudWatch dashboards provide a single real-time view of these key metrics during load testing.
You can configure custom CloudWatch alarms to trigger auto scaling based on the load.

The question is clear that the Specialist is looking for latency, memory utilization, and CPU utilization during the load test, and the ideal answer for all of these is Amazon CloudWatch, which provides all of these metrics:
https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html

The reason for this choice is that Amazon CloudWatch is a service that monitors and manages your cloud resources and applications. It collects and tracks metrics, which are variables you can measure for your resources and applications. Amazon SageMaker automatically reports metrics such as latency, memory utilization, and CPU utilization to CloudWatch. You can use these metrics to monitor the performance and health of your SageMaker endpoint during the load test.
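
For illustration, a small boto3 sketch that pulls the kind of metrics a CloudWatch dashboard would chart during the load test. It assumes the usual SageMaker namespaces (AWS/SageMaker for invocation latency, /aws/sagemaker/Endpoints for instance CPU and memory); the endpoint and variant names are placeholders:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)
dims = [
    {"Name": "EndpointName", "Value": "example-endpoint"},  # placeholder endpoint
    {"Name": "VariantName", "Value": "AllTraffic"},
]

def avg(namespace, metric):
    """Average of a metric over the last hour, in 1-minute periods."""
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace, MetricName=metric, Dimensions=dims,
        StartTime=start, EndTime=end, Period=60, Statistics=["Average"],
    )
    return resp["Datapoints"]

latency = avg("AWS/SageMaker", "ModelLatency")             # microseconds per invocation
cpu = avg("/aws/sagemaker/Endpoints", "CPUUtilization")    # per-instance CPU %
memory = avg("/aws/sagemaker/Endpoints", "MemoryUtilization")
print(len(latency), len(cpu), len(memory), "datapoints retrieved")
```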

I think it should be B.

It’s B, even the resources that aren’t visible in a first try are visible if you use cloudwatch agent.

Should be B

agreed with B

B is the ans

Should be C, right? CloudWatch does not have metrics for memory utilization.

After further research, I think the answer is B. While it is true that CloudWatch does not have metrics for memory utilization by default, you can achieve this by installing the CloudWatch agent on the EC2 instance. The EC2 instances used by SageMaker come with the CloudWatch agent pre-installed.

I do not think that CloudWatch, by default, logs memory utilization. It does log CPU utilization. If memory utilization is required, then a separate agent needs to be installed to watch memory. Hence, in this case, we would have to install an agent if the answer is to be B; otherwise, C looks to be a better solution.

answer is B

Answer is B 100%; very straightforward method

B is correct. Don’t need to use Kibana or QuickSight.

ans is B

B is correct - https://www.examtopics.com/discussions/amazon/view/11560-exam-aws-certified-machine-learning-specialty-topic-1/

8
Q

8 - A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data? - A.. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
B.. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
C.. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.
D.. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

A

B - B is correct

The correct answer HAS TO be B
Using AWS Glue to catalogue the data and Amazon Athena to run queries against data on S3 are very typical use cases for those services.

D is not ideal, Lambda can surely do many things but it requires development/testing effort, and Amazon Kinesis Data Analytics is not ideal for ad-hoc queries.

B. Use AWS Glue to catalog the data and Amazon Athena to run queries.
Why is this the best choice?
AWS Glue can automatically catalog both structured and unstructured data in S3.
Amazon Athena is a serverless SQL query service that allows direct SQL queries on S3 data without moving it.
No infrastructure setup is required—just define a Glue Data Catalog and start querying with Athena.

SQL query on S3 = Athena; to catalog the data, use Glue.

AWS Glue is a fully managed ETL service that makes it easy to move data between data stores. It can automatically crawl, catalogue, and classify data stored in Amazon S3, and make it available for querying and analysis. With AWS Glue, you don’t have to worry about the underlying infrastructure and can focus on your data.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It integrates with AWS Glue, so you can use the catalogued data directly in Athena without any additional data movement or transformation.

The reason for this choice is that AWS Glue is a fully managed service that provides a data catalogue to make your data in S3 searchable and queryable. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. You can use AWS Glue to catalogue both structured and unstructured data, such as relational data, JSON, XML, CSV files, images, or media files.
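
For illustration, a minimal boto3 sketch that runs a SQL query with Athena against a table already cataloged by a Glue crawler; the database, table, column, and result-bucket names are placeholders:

```python
import time
import boto3

athena = boto3.client("athena")

# Query a table that the Glue crawler created over the S3 data (placeholder names).
query = athena.start_query_execution(
    QueryString="SELECT plant_id, COUNT(*) AS readings FROM sensor_data GROUP BY plant_id LIMIT 10",
    QueryExecutionContext={"Database": "example_manufacturing_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows[:5])
```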

I think it should be B.

B is the easiest. We can use Glue crawler.

Answer B

Querying data in S3 with SQL is almost always Athena.

If AWS asks the question of querying unstructured data in an efficient manner, it is almost always Athena

B. I don’t think that you even need Glue to transform anything. Just use Glue to define the schemas and then use Athena to query based on those schemas.

answer is B

SQL on S3 is Athena so answer is B for sure

B is right

Answer is B.

Queries Against an Amazon S3 Data Lake
Data lakes are an increasingly popular way to store and analyze both structured and unstructured data. If you want to build your own custom Amazon S3 data lake, AWS Glue can make all your data immediately available for analytics without moving the data.

https://aws.amazon.com/glue/

The correct answer is D… Kinesis Data Analytics can use Lambda to transform the data and then run the SQL queries.

May I know why you are taking the complex route? - https://www.examtopics.com/discussions/amazon/view/11771-exam-aws-certified-machine-learning-specialty-topic-1/

9
Q

9 - A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance. Which approach allows the Specialist to use all the data to train the model? - A.. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
B.. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset
C.. Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
D.. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train the full dataset.

A

A - The answer is A. This question is about Pipe mode from S3, so the only candidates are A and C, and AWS Glue cannot be used to create models, which rules out option C.
The correct answer is A.

Answer is A.

Training locally on a small dataset ensures the training script and model parameters are working correctly.
Amazon SageMaker training jobs allow direct access to S3 data without downloading everything.
Pipe input mode efficiently streams data from S3 to the training instance, reducing disk space requirements and speeding up training.

Only Pipe mode can stream data from S3

The reason for this choice is that Pipe input mode is a feature of Amazon SageMaker that allows you to stream data directly from an Amazon S3 bucket to your training instances without downloading it first. This way, you can avoid the time and space limitations of loading a large dataset onto your notebook instance. Pipe input mode also offers faster start times and better throughput than File input mode, which downloads the entire dataset before training.
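
A minimal SageMaker Python SDK sketch of launching the training job in Pipe input mode so the full S3 dataset is streamed to the training container instead of downloaded; the image URI, role, paths, and content type are placeholders/assumptions:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",                      # placeholder container image
    role="arn:aws:iam::123456789012:role/ExampleRole",     # placeholder role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",                                     # stream data instead of downloading it
    output_path="s3://example-bucket/video-rec/output/",   # placeholder bucket
)

# The full dataset stays in S3; records are piped to the container during training.
train_input = TrainingInput(
    "s3://example-bucket/video-rec/train/",                # placeholder training prefix
    content_type="application/x-recordio",                 # assumed format for this sketch
    input_mode="Pipe",
)
estimator.fit({"train": train_input})
```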

A. YES - Pipe mode is best for starting training before the entire dataset is transferred; the only drawback is that if multiple training jobs are run in sequence (e.g., with different hyperparameters), the data will be streamed again
B. NO - we want to use SageMaker first for initial training
C. NO - We first want to test things in SageMaker
D. NO - the SageMaker notebook will not use the AMI so the testing done is useless

I think it should be A.

It’s A, pipe mode is for dealing with very big data.

A; Pipe mode is designed for that sort of workload.

When the data is already in S3 and needs to move to SageMaker, option A is suitable.

Answer is A. B, C & D can be dropped because there is no integration from/to Sage Maker train job (model).

Gotta be A. You need to use Pipe mode but Glue cannot train a model.

A.

ans is A

Will you run AWS Deep Learning AMI for all cases where the data is very large in S3? Also what role is Glue playing here? Is there a transformation? These are the two issues for options B C and D. I believe they do not represent what is required to satisfy the requirements in the question. The answer definitely requires the pipe mode, but not with Glue. I go with A https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/

go for A - https://www.examtopics.com/discussions/amazon/view/9656-exam-aws-certified-machine-learning-specialty-topic-1/

10
Q

10 - A Machine Learning Specialist has completed a proof of concept for a company using a small data sample, and now the Specialist is ready to implement an end- to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS. Which approach should the Specialist use for training a model using that data? - A.. Write a direct connection to the SQL database within the notebook and pull data in
B.. Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.
C.. Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in.
D.. Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the notebook to pull data in for fast access.

A

B - The answer is B, as the data for SageMaker needs to come from S3, and option B is the only option that says so. The only issue with option B is that it talks about moving data from Microsoft SQL Server, not RDS.

https://www.slideshare.net/AmazonWebServices/train-models-on-amazon-sagemaker-using-data-not-from-amazon-s3-aim419-aws-reinvent-2018

Please look at slide 14 of that link: although the data source is DynamoDB or RDS, you still need to use AWS Glue to move the data to S3 for SageMaker to use.

So, the right answer should be B.

I agree. From the ML developer guide I just read, it is MySQL on RDS that can be used as a SQL data source.

Amazon SageMaker does not natively connect to Amazon RDS. Instead, training jobs work best with data stored in Amazon S3.
Amazon S3 is the preferred data source for SageMaker because:
It integrates seamlessly with SageMaker’s training job infrastructure.
It supports distributed training for large datasets.
It is cost-effective and decouples storage from compute.
Best practice → Export RDS data to Amazon S3 and train using SageMaker.

B is the correct answer.

Official AWS Documentation:
“Amazon ML allows you to create a datasource object from data stored in a MySQL database in Amazon Relational Database Service (Amazon RDS). When you perform this action, Amazon ML creates an AWS Data Pipeline object that executes the SQL query that you specify, and places the output into an S3 bucket of your choice. Amazon ML uses that data to create the datasource.”

In Option B approach, the Specialist can use AWS Data Pipeline to automate the movement of data from Amazon RDS to Amazon S3. This allows for the creation of a reliable and scalable data pipeline that can handle large amounts of data and ensure the data is available for training.

In the Amazon SageMaker notebook, the Specialist can then access the data stored in Amazon S3 and use it for training the model. Using Amazon S3 as the source of training data is a common and scalable approach, and it also provides durability and high availability of the data.

This approach is the most scalable and reliable way to train a model using data stored in Amazon RDS. Amazon S3 is a highly scalable and durable object storage service, and Amazon Data Pipeline is a managed service that makes it easy to move data between different AWS services. By pushing the data to Amazon S3, the Specialist can ensure that the data is available for training the model even if the Amazon RDS instance is unavailable.
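
Once Data Pipeline has exported the table to S3, pointing the notebook at the data is straightforward; a tiny sketch assuming the export landed as CSV under a placeholder bucket/prefix (pandas reads s3:// paths when s3fs is installed):

```python
import pandas as pd

# Placeholder location written by the AWS Data Pipeline export job.
training_df = pd.read_csv("s3://example-bucket/rds-export/training_data.csv")

print(training_df.shape)   # quick sanity check before handing the data to a training job
print(training_df.head())
```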

A. NO - SageMaker can only read from S3
B. YES - AWS Data Pipeline can move data from SQL Server to S3
C. NO - SageMaker can only read from S3 and not DynamoDB
D. NO - SageMaker can only read from S3 and not ElastiCache

Option B (exporting to S3) is typically more flexible and cost-effective for large-scale or complex data needs (Which is our case - production), while Option A (direct connection) can be simpler and more immediate for real-time or smaller-scale scenarios like testing.

A. NO. It is doable, but this is not the best approach.
B. YES
C. NO. Pushing data to DynamoDB would not make it easier to access data
D. NO. Pushing data to ElastiCache would not make it easier to access data

For Amazon S3, you can import data from an Amazon S3 bucket as long as you have permissions to access the bucket.

For Amazon Athena, you can access databases in your AWS Glue Data Catalog as long as you have permissions through your Amazon Athena workgroup.

For Amazon RDS, if you have the AmazonSageMakerCanvasFullAccess policy attached to your user’s role, then you’ll be able to import data from your Amazon RDS databases into Canvas.

https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-connecting-external.html

https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-sagemaker-studio-notebooks-data-sql-query/

I think it should be B.

It’s B, even if Microsoft SQL Server is a strange name for RDS, it’s a possible database to use there and the data for sagemaker needs to be in S3!

While B is a valid answer, It is also possible to make a SQL connection in a notebook and create a data object so A could be a valid answer too
https://stackoverflow.com/questions/36021385/connecting-from-python-to-sql-server
https://www.mssqltips.com/sqlservertip/6120/data-exploration-with-python-and-sql-server-using-jupyter-notebooks/

you need to choose the best answer, not any valid answer. Often, many of the answers are valid solutions, but are not best practice.

B is correct. MS SQL Server is also under RDS.

B is right

B it is

I’ll go with B - https://www.examtopics.com/discussions/amazon/view/11376-exam-aws-certified-machine-learning-specialty-topic-1/

11
Q

11 - A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences, and trends to enhance the website for better service and smart recommendations. Which solution should the Specialist recommend? - A.. Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.
B.. A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database.
C.. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database.
D.. Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database.

A

C - answer should be C
Collaborative filtering is for recommendation, LDA is for topic modeling

In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set

Neural network is used for image detection

Answer is C

Collab filtering it is..
Collaborative filtering is the most widely used approach for recommendation systems.
It uses customer interactions (purchases, clicks, ratings) to determine preferences based on similar users or items.
Implicit collaborative filtering (based on user behavior) and explicit collaborative filtering (based on ratings) can effectively personalize recommendations.

A. NO - LDA is for topic modeling
B. NO - a plain neural network is too generic a term; you would want neural collaborative filtering
C. YES - collaborative filtering is the best fit
D. NO - Random Cut Forest (RCF) is for anomalies

Collaborative filtering is a machine learning technique that recommends products or services to users based on the ratings or preferences of other users. This technique is well-suited for identifying customer shopping patterns and preferences because it takes into account the interactions between users and products.

From the doc: “You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music.”

https://docs.aws.amazon.com/sagemaker/latest/dg/lda-how-it-works.html

I think it should be C.

C; whenever recommendations come up, think of collaborative filtering!

A
LDA was used before collaborative filtering became widely adopted.
1) the input data that we have doesn’t lend itself to collaborative filtering - it requires a set of items and a set of users who have reacted to some of the items, which is NOT what we have
2) recommendation is just one thing that we want to do. What about trends?
3) collaborative filtering isn’t one of the pre-built algorithms (weak argument, admittedly)

collaborative

C. Easy question.

It's an appropriate use case for collaborative filtering

this is C

I’m thinking that it is A because:
1) the input data that we have doesn’t lend itself to collaborative filtering - it requires a set of items and a set of users who have reacted to some of the items, which is NOT what we have
2) recommendation is just one thing that we want to do. What about trends?
3) collaborative filtering isn’t one of the pre-built algorithms (weak argument, admittedly)

Answer is C, demographics, past visits, and locality information data, LDA is appropriate

Collaborative filtering is appropriate

Answer A might be more suitable than the others

https://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/lda-how-it-works.html

Not convinced with A. Answer C seems to be a better fit than A for recommendation model (LDA appears to be a topic-based model on unavailable data with similar patterns)
https://aws.amazon.com/blogs/machine-learning/extending-amazon-sagemaker-factorization-machines-algorithm-to-predict-top-x-recommendations/ - https://www.examtopics.com/discussions/amazon/view/8304-exam-aws-certified-machine-learning-specialty-topic-1/

12
Q

12 - A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist. Which machine learning model type should the Specialist use to accomplish this task? - A.. Linear regression
B.. Classification
C.. Clustering
D.. Reinforcement learning

A

B - B seems to be okay

CLASSIFICATION - Binary Classification - Supervised Learning to be precise
The company wants to predict customer churn (whether a customer will leave or stay).
The data is labeled, meaning we have historical outcomes (churn or no churn).
The task involves categorizing customers into two groups:
Customers who will churn (leave)
Customers who will not churn (stay)
This means the problem is a Supervised Learning problem, specifically a binary classification problem.

The reason for this choice is that classification is a type of supervised learning that predicts a discrete categorical value, such as yes or no, spam or not spam, or churn or not churn. Classification models are trained using labeled data, which means that the input data has a known target attribute that indicates the correct class for each instance. For example, a classification model that predicts customer churn would use data that has a label indicating whether the customer churned or not in the past.

Classification models can be used for various applications, such as sentiment analysis, image recognition, fraud detection, and customer segmentation. Classification models can also handle both binary and multiclass problems, depending on the number of possible classes in the target attribute.
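
A minimal sketch of what that supervised binary-classification setup looks like in practice; the synthetic churn features and the choice of logistic regression are illustrative assumptions:

```python
# Train a binary churn classifier on labeled data, then evaluate it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # e.g. tenure, spend, support calls, usage
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic churn / no-churn label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```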

Option B. This is a scenario for a supervised learning model as the data is labelled, and only A and B are supervised learning approaches among the options. Linear regression predicts a continuous value, while classification selects which class the input belongs to. Hence the most suitable choice here is a binary classification model.

A. NO - Linear regression is not best for classification
B. YES - Classification
C. NO - we want supervised classification
D. NO - there is nothing to Reinforce from

The question is not clear. Actually we have 2 tasks here - group into categories (clustering) and predict if customers will churn/not churn (classification). If we simply had to do classification, why was grouping into categories mentioned?

This is definitely a classification problem

B is correct

B - it’s a Binary Classification problem. Will the customer churn: Yes or No

100% is B since it is about labelled data

i think the key is “the company has labeled the data” so this is classification, so it’s B

B is okay

B is correct - https://www.examtopics.com/discussions/amazon/view/10005-exam-aws-certified-machine-learning-specialty-topic-1/

13
Q

13 - The displayed graph is from a forecasting model for testing a time series. Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?

[https://www.examtopics.com/assets/media/exam-media/04145/0000900001.jpg] - A.. The model predicts both the trend and the seasonality well
B.. The model predicts the trend well, but not the seasonality.
C.. The model predicts the seasonality well, but not the trend.
D.. The model does not predict the trend or the seasonality well.

A

A - A is correct answer.
Please Refer: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
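
A quick way to check trend and seasonality separately is a classical decomposition; below is a minimal sketch with statsmodels on a synthetic monthly series (the series itself is an assumption used only for illustration):

```python
# Decompose a synthetic monthly series into trend, seasonal, and residual parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = np.linspace(10, 30, 48) + 5 * np.sin(2 * np.pi * np.arange(48) / 12)
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())    # upward trend component
print(result.seasonal.head(12))        # repeating seasonal component
```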

A; the problem is bias, not trends

B. The model predicts the trend well, but not the seasonality.

Here’s what we can observe:
The predicted mean line closely follows the general upward trend of the observed line.
The predicted mean line does not capture the high frequency up and down changes of the observed line.

agreed, this seems to be A. there is similarity between the blue and green lines as far as capturing trend and seasonality is concerned. It just seems that if assumption is that the model is a linear regression model then just the intercept is off by a few units.

A. The model predicts both the trend and the seasonality well

The problem is bias, not trend or seasonality!

A is right, both trend (rising) and seasonality is there

C is correct answer

A is correct answer. Not C

The trend is up, so isn't it correctly predicted? And the seasonality is also in sync; only the amplitude is wrong.

A is right. trend and seasonality are fine, level is the one the model gets wrong

Should be C

Should be A - https://www.examtopics.com/discussions/amazon/view/45385-exam-aws-certified-machine-learning-specialty-topic-1/

14
Q

14 - A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST accuracy?

[https://www.examtopics.com/assets/media/exam-media/04145/0001000001.jpg] - A.. Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)
B.. Logistic regression
C.. Support vector machine (SVM) with non-linear kernel
D.. Single perceptron with tanh activation function

A

C - Answer is C. A classic SVM use case is to project the features into a higher-dimensional space where a hyperplane can separate them. Seeing how separable the data is, SVM can be used here.

Support Vector Machine (SVM) with Non-Linear Kernel –> Non-linear Data
Why?
SVM is powerful for classification and works well even with small datasets.
If the data has a non-linear decision boundary, using an SVM with a non-linear kernel (like RBF or polynomial) can improve accuracy.
Works well in low-dimensional feature spaces (since we have only 2 features: age of account & transaction month).
Optimal choice if the data has a non-linear decision boundary.
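
A minimal sketch of the non-linear-kernel point made above; the moon-shaped toy data is an assumption standing in for the figure in the question:

```python
# Compare a linear model with an RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
linear = LogisticRegression().fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

print("Linear accuracy :", linear.score(X, y))
print("RBF SVM accuracy:", rbf_svm.score(X, y))   # typically much higher on this data
```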

SVMs are particularly effective for binary classification tasks and can handle non-linear relationships between features.

You can use a support vector machine (SVM) when your data has exactly two classes. An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM means the one with the largest margin between the two classes. Margin means the maximal width of the slab parallel to the hyperplane that has no interior data points.

Well, C is the correct answer. This example is a classical one to use SVM.

SVM with an RBF kernel is the answer!

Answer is C

Textbook C

C. more reading for using non-linear kernel and separate samples with a hyperplane in a higher dimension space: https://medium.com/pursuitnotes/day-12-kernel-svm-non-linear-svm-5fdefe77836c

C seems right

answer is C

Agree. The answer is A.

https://www.surveypractice.org/article/2715-using-support-vector-machines-for-survey-research

This is a good explanation of SVM
https://uk.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html - https://www.examtopics.com/discussions/amazon/view/43907-exam-aws-certified-machine-learning-specialty-topic-1/

15
Q

15 - A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset: ✑ Must be accessible from a VPC only. ✑ Must not traverse the public internet. How can these requirements be satisfied? - A.. Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.
B.. Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance.
C.. Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance.
D.. Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance

A

A - Important things to note here are that

  1. “The Data in S3 Needs to be Accessible from VPC”
  2. “Traffic should not Traverse internet”

To fulfill Requirement #2 we need a VPC endpoint
To RESTRICT the access to S3/Bucket
- Access allowed only from VPC via VPC Endpoint

Even though Sagemaker uses EC2 - we are NOT asked to secure the EC2 :)

So the answer is A

Between A & B, the answer should be A. From here:
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html#vpc-endpoints-s3-bucket-policies
We can see that we restrict access using DENY if sourceVpce (vpc endpoint), or sourceVpc (vpc) is not equal to our VPCe/VPC. So we are using a DENY (choice A) and not an ALLOW policy (choice B).

Choices C, D we eliminate because they don’t address S3 access at all.

Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.
Why is this correct?
VPC endpoint for S3 allows private connectivity between Amazon S3 and the VPC without using the public internet.
Bucket access policy can be written to allow access only from this VPC endpoint.
This ensures maximum security by:
Preventing access from outside the VPC.
Blocking public access.

In Option A, the Machine Learning Specialist would create a VPC endpoint for Amazon S3, which would allow traffic to flow directly between the VPC and Amazon S3 without traversing the public internet. Access to the S3 bucket containing PII can then be restricted to the VPC endpoint and the VPC using a bucket access policy. This would ensure that only instances within the VPC can access the data, and that the data does not traverse the public internet.

Option B and D, allowing access from an Amazon EC2 instance, would not meet the requirement of not traversing the public internet, as the EC2 instance would be accessible from the internet. Option C, using Network Access Control Lists (NACLs) to allow traffic between only the VPC endpoint and an EC2 instance, would also not meet the requirement of not traversing the public internet, as the EC2 instance would still be accessible from the internet.

A. YES - We first create a S3 endpoint in the VPC subnet so traffic does not flow through the Internet, then on the S3 bucket create an access policy that restricts access to the given VPC based on its ID
B. NO - we don’t want to be specific to an instance
C. NO - the S3 bucket is on AWS network, you cannot change the NACL for it
D. NO - not all instances in a VPC will necessarily have the same principal that can be specified in the policy

Definitely A

Well, by elimination only A remains: the question never mentions EC2

Per https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies-vpc-endpoint.html it’s A

The question do not mention EC2 at all, so should be A

I think it should be B. Traning instance is a EC2 instance and need to be set an endpoint to load the data from S3.

AWS security is a conservative security model, which implies that access is denied by default rather than granted by default. We have to explicitly allow access to an AWS resource. Additionally, B talks about allowing access FROM the VPC to S3 while A talks about allowing access from S3 to VPC (which is not what we need).
So, B.

Um, no. A VPC endpoint is outbound from the VPC to a supported AWS service.

Will go with B

Betting on B here, we should control access from VPC, not to VPC.

A!
Restricting access to a specific VPC endpoint
The following is an example of an Amazon S3 bucket policy that restricts access to a specific bucket, awsexamplebucket1, only from the VPC endpoint with the ID vpce-1a2b3c4d. The policy denies all access to the bucket if the specified endpoint is not being used. The aws:SourceVpce condition is used to specify the endpoint.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies-vpc-endpoint.html
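
A hedged sketch of applying that kind of policy with boto3; the bucket name and endpoint ID are the placeholders from the documentation example, not real resources:

```python
# Deny all access to the bucket unless the request arrives through the given VPC endpoint.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyIfNotFromVpcEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::awsexamplebucket1",
            "arn:aws:s3:::awsexamplebucket1/*",
        ],
        "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-1a2b3c4d"}},
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="awsexamplebucket1", Policy=json.dumps(policy)
)
```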

Can't be B. You simply cannot enable access to an endpoint for only a selected instance. So A.

We shouldn't use a private IP in a bucket policy.

B does not say enable access TO the VPC endpoint. It says to allow access FROM the endpoint. So B is the correct answer. A talks about restricting access TO the VPC endpoint, so that option is irrelevant. We’re worried about access TO the S3 bucket, not access to the VPC. The question is not poorly-worded, but it is tricky and you need to read it carefully.

I also vote A.

A
found here
“You can control which VPCs or VPC endpoints have access to your buckets by using Amazon S3 bucket policies. For examples of this type of bucket policy access control, see the following topics on restricting access.”
https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies-vpc-endpoint.html - https://www.examtopics.com/discussions/amazon/view/11279-exam-aws-certified-machine-learning-specialty-topic-1/

16
Q

16 - During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue? - A.. The class distribution in the dataset is imbalanced.
B.. Dataset shuffling is disabled.
C.. The batch size is too big.
D.. The learning rate is very high.

A

D - Answer is D.
Should the weight be increased or reduced so that the error is smaller than the current value? You need to examine the amount of change to know that. Therefore, we differentiate and check whether the slope of the tangent is positive or negative, and update the weight value in the direction to reduce the error. The operation is repeated over and over so as to approach the optimal solution that is the goal. The width of the update amount is important at this time, and is determined by the learning rate.

maybe D ?

D. The learning rate is very high.

Explanation:
When the learning rate is too high, the optimization process may overshoot the optimal weights in parameter space. Instead of gradually converging, the model updates weights in a highly unstable manner, causing fluctuations in training accuracy. The network fails to settle into a minimum because the updates are too aggressive.

A high learning rate can cause oscillations in the training accuracy because the optimizer makes large updates to the model parameters in each iteration, which can cause overshooting the optimal values. This can result in the model oscillating back and forth across the optimal solution.

If the learning rate is too high, the model weights may overshoot the optimal values and bounce back and forth around the minimum of the loss function. This can cause the training accuracy to oscillate and prevent the model from converging to a stable solution. The training accuracy is the proportion of correct predictions made by the model on the training data.

When the learning rate is set too high, it can lead to oscillations or divergence during training. Here’s why:

High Learning Rate: A high learning rate means that the model’s parameters are updated by a large amount in each training step. This can cause the model to overshoot the optimal parameter values, leading to instability in training.

Oscillations: If the learning rate is excessively high, the model’s updates can become unstable, causing it to oscillate back and forth between parameter values. This oscillation can prevent the model from converging to an optimal solution.

To address this issue, you can try reducing the learning rate. It’s often necessary to experiment with different learning rates to find the one that works best for your specific problem and dataset. Learning rate scheduling techniques, such as reducing the learning rate over time, can also help stabilize training.
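
The effect is easy to reproduce with plain gradient descent on a simple quadratic loss; this is a toy illustration, not part of the question:

```python
# Gradient descent on loss(w) = w**2: a high learning rate oscillates/diverges,
# a small one converges smoothly toward the minimum at w = 0.
def descend(lr, steps=10, w=1.0):
    path = [w]
    for _ in range(steps):
        grad = 2 * w          # d/dw of w**2
        w = w - lr * grad
        path.append(round(w, 3))
    return path

print("lr=1.1 :", descend(1.1))   # magnitudes grow and flip sign -> oscillation
print("lr=0.1 :", descend(0.1))   # steadily approaches the minimum
```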

Answer is D.
A high learning rate means that the model parameters are being updated by large magnitudes in each iteration. As a result, the optimization process may struggle to converge to the optimal solution, leading to erratic behavior and fluctuations in training accuracy.

If the learning rate is high, the accuracy fluctuates because the value of the loss function moves back and forth over the global minimum.

A big learning rate overshoots the true minimum.

D Learning rate is too high. Textbook example of learning rate being too high. Lower Learning_rate will take more iterations, or longer to train, but will settle in place.

12-sep exam

D: of course

A company sells thousands of products on a public website and wants to automatically identify products with potential durability problems. The company has 1,000 reviews with date, star rating, review text, review summary, and customer email fields, but many reviews are incomplete and have empty fields. Each review has already been labeled with the correct durability result. A machine learning specialist must train a model to identify reviews expressing concerns over product durability. The first model needs to be trained and ready to review in 2 days. What is the MOST direct approach to solve this problem within 2 days?
A. Train a custom classifier by using Amazon Comprehend.
B. Build a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet.
C. Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker.
D. Use a built-in seq2seq model in Amazon SageMaker.

Is A valid option?

D is correct. A big batch size leads to local minima.

it is a multiple answer question and answer should be both A and D

Answer is D 100%; learning rate too high will cause such an event

The answer is D, from the Coursera deep learning specialization (course 2 - improving Deep NN) - https://www.examtopics.com/discussions/amazon/view/12378-exam-aws-certified-machine-learning-specialty-topic-1/

17
Q

17 - An employee found a video clip with audio on a company’s social media feed. The language used in the video is Spanish. English is the employee’s first language, and they do not understand Spanish. The employee wants to do a sentiment analysis. What combination of services is the MOST efficient to accomplish the task? - A.. Amazon Transcribe, Amazon Translate, and Amazon Comprehend
B.. Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq
C.. Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)
D.. Amazon Transcribe, Amazon Translate and Amazon SageMaker BlazingText

A

A - the MOST efficient means you don't need to write code or build infrastructure.
All of these services are managed by AWS, which is good:
Amazon Transcribe, Amazon Translate, and Amazon Comprehend

Answer is A

Agree, Answer is A

A is not 100% correct. You don’t need to translate Spanish. Amazon Comprehend supports Spanish.

Arguably, you still need a translation since the person doesn’t speak Spanish.

I think there is no need to use Amazon translate because sometimes the translation is not accurate.
It means some information gets lost.

Given the question, I believe it is necessary: look at the emphasis on not understanding Spanish. Besides that, even with some information lost, you will at least understand something.

A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend
Explanation of the Process:
Amazon Transcribe – Converts the Spanish audio in the video into text.
Amazon Translate – Translates the Spanish text to English.
Amazon Comprehend – Performs sentiment analysis on the translated English text.
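
Roughly what that pipeline looks like with boto3; the bucket, job name, and file URI are placeholders, and since Transcribe runs asynchronously the sketch skips polling for the transcript and uses a hard-coded Spanish string for the later steps:

```python
# Spanish audio -> Spanish text -> English text -> sentiment.
import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")
comprehend = boto3.client("comprehend")

# 1. Transcribe the Spanish audio (asynchronous; poll get_transcription_job for the result).
transcribe.start_transcription_job(
    TranscriptionJobName="social-clip-demo",                 # placeholder name
    Media={"MediaFileUri": "s3://example-bucket/clip.mp4"},  # placeholder URI
    MediaFormat="mp4",
    LanguageCode="es-ES",
)

# 2. Translate the transcript text (example string stands in for the real transcript).
spanish_text = "Me encanta este producto."
english = translate.translate_text(
    Text=spanish_text, SourceLanguageCode="es", TargetLanguageCode="en"
)["TranslatedText"]

# 3. Sentiment analysis on the English text.
sentiment = comprehend.detect_sentiment(Text=english, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])
```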

It’s A:
1.Amazon Transcribe - to convert Spanish speech to Spanish text.
2.Amazon Translate - to translate Spanish text to English text
3.Amazon Comprehend - to analyze text for sentiments

A. YES - Comprehend is supervised so user must understand through Translate
B. NO - seq2seq is for generation and not classification
C. NO - Amazon SageMaker Neural Topic Model is unsupervised topic extraction, will not give sentiment against user-defined classes
D. NO - BlazingText is word2vec, does not give sentiment classes

It’s A 100%

Transcribe: Speech to text
Translate: Any language to any language
Comprehend: offers a range of capabilities for extracting insights and meaning from unstructured text data. Ex: Sentiment analysis, entity recognition, KeyPhrase Extraction, Language Detection, Document Classification

We absolutely need STT (Transcribe), translation (Translate), and sentiment analysis (Comprehend)

A - confirmed by ACG

I agree that the answer is A

answer is a

A; D is wrong because The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms.

The question/answer is not poorly worded as someone has claimed.
Even though Comprehend can do the analysis directly on Spanish (no need for Translate), if Comprehend does the analysis and the resulting words are still in Spanish, it will not help the employee, as they do not know Spanish. So the Translate step after Transcribe helps the employee understand what is being analyzed by Comprehend in the next step.
So read the question carefully before jumping to conclusions. It will save you an exam :)

I don't get this question. Comprehend supports Spanish natively. There is no need for Translate, and Translate would actually reduce the effectiveness of the sentiment analysis. However, B, C, and D are all invalid choices.

A
because Comprehend can provide sentiment analysis

A,
https://aws.amazon.com/getting-started/hands-on/analyze-sentiment-comprehend/ - https://www.examtopics.com/discussions/amazon/view/8306-exam-aws-certified-machine-learning-specialty-topic-1/

18
Q

18 - A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs. What does the Specialist need to do? - A.. Bundle the NVIDIA drivers with the Docker image.
B.. Build the Docker container to be NVIDIA-Docker compatible.
C.. Organize the Docker container’s file structure to execute on GPU instances.
D.. Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body.

A

B - https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf
page 55:
If you plan to use GPU devices, make sure that your containers are nvidia-docker compatible. Only the
CUDA toolkit should be included on containers. Don’t bundle NVIDIA drivers with the image. For more
information about nvidia-docker, see NVIDIA/nvidia-docker.

So the answer is B

Yeah, it’s B. But the page in the developer guide is page number 201 (209 in pdf). Second bullet point at the top.

Answer is B. below is from AWS documentation,
If you plan to use GPU devices for model training, make sure that your containers are nvidia-docker compatible. Only the CUDA toolkit should be included on containers; don’t bundle NVIDIA drivers with the image.
https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html

When using Amazon SageMaker with GPU-based EC2 instances (e.g., P3 instances), you must ensure that your custom Docker container can leverage NVIDIA GPUs. NVIDIA-Docker (now part of Docker with nvidia-container-runtime) allows containers to access GPU resources without needing to bundle NVIDIA drivers inside the container.

To make a custom Docker container GPU-compatible, the Machine Learning Specialist should:

Use NVIDIA CUDA and cuDNN in the Dockerfile.
Ensure the container is built using the NVIDIA Container Toolkit (nvidia-docker).
Use nvidia-container-runtime as the runtime.

To leverage the NVIDIA GPUs on Amazon EC2 P3 instances for training with Amazon SageMaker, the Docker container must be built to be compatible with NVIDIA-Docker.

NVIDIA-Docker is a wrapper around Docker that makes it easier to use GPUs in containers by providing GPU-aware functionality.

To build a Docker container that is compatible with NVIDIA-Docker, the Specialist should include only the CUDA toolkit in the container (not the NVIDIA drivers) and rely on the NVIDIA-Docker runtime on the EC2 instances.

NVIDIA-Docker is a Docker container runtime plugin that allows the Docker container to access the GPU resources on the host machine. By building the Docker container to be NVIDIA-Docker compatible, the Docker container will have access to the NVIDIA GPU resources on the Amazon EC2 P3 instances, allowing for accelerated training of the ResNet model.

The reason for this choice is that NVIDIA-Docker is a tool that enables GPU-accelerated containers by automatically configuring the container runtime to use NVIDIA GPUs. NVIDIA-Docker allows you to build and run Docker containers that can fully access the GPUs on your host system. This way, you can run GPU-intensive applications, such as deep learning frameworks, inside containers without any performance loss or compatibility issues.

A. NO - the drivers are not necessary (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html)
B. YES - it is about using the CUDA library, need to use proper base image (https://medium.com/@jgleeee/building-docker-images-that-require-nvidia-runtime-environment-1a23035a3a58)
C. NO - file structure is irrelevant to GPU
D. NO - SageMaker config, irrelevant to Docker

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf
page 55

page 570
On a GPU instance, the image is run with the --gpus option. Only the CUDA toolkit should be included in the image, not the NVIDIA drivers. For more information, see the NVIDIA User Guide.

Answer B
Load the CUDA toolkit only, not the drivers. Ref GPU section : https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi-specs.html

I think it should be b

B is correct!

As per aws documentation, answer is B, and A is even explicitly not recommended

As referred in other comments ans is B

ANS B
As mentioned by other users

As per me answer is B

The answer is for sure B - as mentioned by others. And this is clearly stated in the docs

Ans. is B. - https://www.examtopics.com/discussions/amazon/view/9805-exam-aws-certified-machine-learning-specialty-topic-1/

19
Q

19 - A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model’s performance? - A.. Receiver operating characteristic (ROC) curve
B.. Misclassification rate
C.. Root Mean Square Error (RMSE)
D.. L1 norm

A

A - Ans. A is correct

Answer is A.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds

Explanation:
The ROC curve is the best technique to evaluate how different classification thresholds impact the model’s performance. It plots True Positive Rate (TPR) against False Positive Rate (FPR) at various threshold values.

Why the ROC Curve?
Logistic regression outputs probabilities, and we need to select a classification threshold to decide between “order pizza” (1) and “not order pizza” (0).
Changing the threshold impacts the trade-off between sensitivity (recall) and specificity.
The ROC curve helps visualize this trade-off and select the best threshold based on the business goal (e.g., maximizing recall vs. minimizing false positives).
The Area Under the ROC Curve (AUC-ROC) is a useful metric to measure the model’s discrimination ability.

A is indeed correct see https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
• True Positive Rate
• False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
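
A minimal sketch of inspecting those threshold trade-offs with scikit-learn; the labels and scores below are made-up values for illustration:

```python
# TPR vs FPR across all classification thresholds for a probabilistic classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("AUC:", roc_auc_score(y_true, y_score))
```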

The reason for this choice is that a ROC curve is a graphical plot that illustrates the performance of a binary classifier across different values of the classification threshold. A ROC curve plots the true positive rate (TPR) or sensitivity against the false positive rate (FPR) or 1-specificity for various threshold values. The TPR is the proportion of positive instances that are correctly classified, while the FPR is the proportion of negative instances that are incorrectly classified.

ROC curve is for defining the threshold.

A surely

Question is about classification so confusion matrix would come into mind; A is the answer

It is A.

obviously A

Root Mean Square Error (RMSE) Ans. c

I think RMSE is for regression model - https://www.examtopics.com/discussions/amazon/view/10011-exam-aws-certified-machine-learning-specialty-topic-1/

20
Q

20 - An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget. What should the Specialist do to meet these requirements? - A.. Create one-hot word encoding vectors.
B.. Produce a set of synonyms for every word using Amazon Mechanical Turk.
C.. Create word embedding vectors that store edit distance with every other word.
D.. Download word embeddings pre-trained on a large corpus.

A

D - The solution is word embeddings. As it is an interactive online dictionary, we need pre-trained word embeddings, thus the answer is D. In addition, there is no mention that the online dictionary is unique and lacks a suitable pre-trained word embedding.
Thus I strongly feel the answer is D

D is correct. It is not a specialized dictionary so use the existing word corpus to train the model

D. Download word embeddings pre-trained on a large corpus.
Reason :
For a nearest neighbor model that finds words used in similar contexts, word embeddings are the best choice. Pre-trained word embeddings capture semantic relationships and contextual similarity between words based on a large text corpus (e.g., Wikipedia, Common Crawl).

The Specialist should:

Use pre-trained word embeddings like Word2Vec, GloVe, or FastText.
Load the embeddings into the model for efficient similarity comparisons.
Use a nearest neighbor search algorithm (e.g., FAISS, k-d tree, Annoy) to quickly find similar words.
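
A hedged sketch of that approach using gensim's downloader; the specific GloVe model name is an assumption about what is available, and the download is a few hundred MB:

```python
# Load pre-trained GloVe vectors and find words used in similar contexts.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # pre-trained on a large corpus
for word, score in vectors.most_similar("dictionary", topn=5):
    print(f"{word:15s} {score:.3f}")
```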

D. Download word embeddings pre-trained on a large corpus.

Word embeddings are a type of dense representation of words, which encode semantic meaning in a vector form. These embeddings are typically pre-trained on a large corpus of text data, such as a large set of books, news articles, or web pages, and capture the context in which words are used. Word embeddings can be used as features for a nearest neighbor model, which can be used to find words used in similar contexts.

Downloading pre-trained word embeddings is a good way to get started quickly and leverage the strengths of these representations, which have been optimized on a large amount of data. This is likely to result in more accurate and reliable features than other options like one-hot encoding, edit distance, or using Amazon Mechanical Turk to produce synonyms.

A. NO - one-hot encoding is a very early featurization stage
B. NO - we don’t want human labelling
C. NO - too costly to do from scratch
D. YES - leverage existing training; the word embeddings will provide vectors that can be used to measure distance in the downstream nearest neighbor model

Pre-trained word embeddings, such as Word2Vec, GloVe, or FastText, capture the semantic and contextual meaning of words based on a large corpus of text data. By downloading pre-trained word embeddings, the Specialist can leverage the semantic relationships between words to provide meaningful word features for the nearest neighbor model powering the widget. Utilizing pre-trained word embeddings allows the model to understand and display words used in similar contexts effectively.

A. One-hot word encoding vectors: These vectors represent words by marking them as present or absent in a fixed-length binary vector. However, they don’t capture relationships between words or their meanings.

B. Producing synonyms: This would involve generating similar words for each word manually, which could be time-consuming and might not cover all possible contexts.

C. Word embedding vectors based on edit distance: This approach focuses on how similar words are in terms of their spelling or characters, not necessarily their meaning or context in sentences.

D. Downloading pre-trained word embeddings: These are vectors that represent words based on their contextual usage in a large dataset, capturing relationships between words and their meanings.

Correct is D

words that are used in similar contexts will have vectors that are close in the embedding space

D is correct

I also believe that D is the correct answer. No reason to create word embeddings from scratch

  1. One-hot encoding will blow up the feature space - it is not recommended for a high cardinality problem domain.
  2. One still needs to train the word features on large bodies of text to map context to each word

12-sep exam

DDDDDDDDDDDDD

D for sure

Definitely D.

A)It requires that document text be cleaned and prepared such that each word is one-hot encoded.
Ref:https://machinelearningmastery.com/what-are-word-embeddings/ - https://www.examtopics.com/discussions/amazon/view/9825-exam-aws-certified-machine-learning-specialty-topic-1/

21
Q

21 - A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked. Which services are integrated with Amazon SageMaker to track this information? (Choose two.) - A.. AWS CloudTrail
B.. AWS Health
C.. AWS Trusted Advisor
D.. Amazon CloudWatch
E.. AWS Config

A

AD - AD is correct

CloudTrail is used to track how often the scientists deploy a model
CloudWatch is for monitoring GPU and CPU
so the answer is A & D

To monitor SageMaker model deployments, track resource utilization, and log errors, the Machine Learning Specialist should use:

AWS CloudTrail – Tracks API activity, such as:
Model deployments (e.g., CreateModel, CreateEndpoint)
Notebook access and actions
SageMaker job executions

Amazon CloudWatch – Monitors and logs operational metrics, such as:
CPU & GPU utilization of SageMaker endpoints
Invocation errors and latencies
Custom metrics from deployed models
Logs from training jobs and inference endpoints (via CloudWatch Logs)
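
A small boto3 sketch of pulling one of those endpoint metrics; the endpoint and variant names are placeholders, and the namespace/metric names follow the published SageMaker invocation metrics:

```python
# Sum the 4XX invocation errors for a SageMaker endpoint over the last hour.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocation4XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```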

I think AWS Config is still not the service designed to track how often Data Scientists are deploying models, nor does it track operational performance metrics like GPU and CPU utilization or the invocation errors of SageMaker endpoints.
and AWS CloudTrail continues to be the service that will track and record user activity and API usage, which includes deploying models in Amazon SageMaker.
So the answers are still A and D - CloudTrail and CloudWatch.

AD is correct

A. YES - to track deployments
B. NO - AWS Health is to track the AWS Cloud itself (e.g., is a zone down?)
C. NO - AWS Trusted Advisor gives recommendations on infra
D. YES - for CPU/GPU utilization and errors
E. NO - AWS Config records resource configuration changes, not deployments, utilization, or invocation errors

I also believe that A and D are correct. Can someone please explain to me the main differences between CloudWatch and CloudTrail? I find the documentation a bit confusing about it

Option E: AWS Config can record all resource types, so new resources will be automatically recorded in your account.
Option A: CloudTrail is used to track how often the scientists deploy a model.
Option D: CloudWatch is for monitoring GPU and CPU.

Log Amazon Sagemaker API Calls with AWS CloudTrail - https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html

I wouldn’t be so sure about CloudTrail, AWS Configs also tracks Sagemaker and the resource “AWS::Sagemaker::Model”

Just seen: this was released 4 days ago…
https://aws.amazon.com/about-aws/whats-new/2022/06/aws-config-15-new-resource-types/

A&D
CloudWatch and CloudTrail

AD Are Correct.

absolutely

cloudtrail and cloudwatch, no thinking - https://www.examtopics.com/discussions/amazon/view/10012-exam-aws-certified-machine-learning-specialty-topic-1/

22
Q

22 - A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily. Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort? - A.. Require that the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation.
B.. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3.
C.. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3.
D.. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.

A

D - D is correct. The question says "simple transformations, and some attributes will be combined" and asks for the LEAST development effort. Kinesis Analytics can get data from Firehose, transform it, and write to S3
https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-s3.html

Best explanation here, kudos.

I can't find any information that indicates Kinesis Data Analytics can take data from Firehose

The best place to transform data is before it arrives in S3, so D should be the best answer. But D is not complete: it should have another Firehose delivery stream to write the results to S3.

D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
Explanation:
Since the data is already flowing through Amazon Kinesis Data Firehose, the least development effort solution is to use Amazon Kinesis Data Analytics, which supports SQL-based transformations on streaming data without requiring new infrastructure.

Why is this the best choice?
No major architectural changes – Data continues flowing from stores into Kinesis Data Firehose and then to Amazon S3.
Simple SQL transformations – Since the changes are simple (e.g., attribute combinations), SQL is sufficient.
Low operational overhead – No need to manage clusters or instances.
Real-time processing – Transformed records immediately enter Amazon S3 for training.

Ans is D
Amazon Kinesis Data Analytics provides a serverless option for real-time data processing using SQL queries. In this case, by inserting a Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream, the retail chain can easily perform the required simple transformations on the ingested purchasing records.

The best answer would be to use a Lambda, but option D can do it very well too in the absence of a Lambda option.

I go with D. A tough question, though. A and C are definitely out. The key to the question is that it does not say that the transformed data needs to be stored again in S3. It just needs to be sent to the model for training after being transformed. So a Kinesis Data Analytics stream is appropriate to do the transformation.

Legacy data -> Firehose -> Kinesis Analytics -> S3. This happens in near real time before the data ends up in S3.
Legacy data -> Firehose -> S3 is already happening (mentioned in the first line of the question), so adding Kinesis Data Analytics to do simple transformations/joins using SQL on the incoming data is the LEAST amount of work needed.
Kinesis Data Analytics can write to S3. Here is the AWS link with a working example, even though the Udemy tutorial said it cannot write directly to S3 :).

https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-s3.html

It seems that this is the LEAST development effort:
https://aws.amazon.com/fr/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/

and this is the GREATEST development effort:
https://aws.amazon.com/fr/blogs/big-data/optimizing-downstream-data-processing-with-amazon-kinesis-data-firehose-and-amazon-emr-running-apache-spark/

It’s D

https://aws.amazon.com/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/

“In some scenarios, you may need to enhance your streaming data with additional information, before you perform your SQL analysis. Kinesis Analytics gives you the ability to use data from Amazon S3 in your Kinesis Analytics application, using the Reference Data feature. However, you cannot use other data sources from within your SQL query.”

I believe Kinesis should be used only in the case of a live data stream, and this is not the case here. So as per me, D shouldn't be the answer. I think A should be the answer, as AWS Storage Gateway is used along with on-premises applications to move data to S3. Then Glue can be used to transform the data.

With option A, you would be changing the legacy data ingestion, a huge development effort. Remember, you’re talking about 20,000 stores.

It is D.

I think the answer is D, because it requires the LEAST amount of development effort.

It's D; Kinesis Analytics can easily connect with Firehose

why not A. it seems good to me

“require stores to capture data locally using S3 gateway” - for 20k stores this creates a HUUUGE operational overhead and development effort, definitely wrong

D is correct… the rest all need some kind of manual intervention and are not simple. Firehose allows transformation as well as moving the data into S3

I think the answer is B. D would be correct if they didn’t want to transform the legacy data from before the switch, but it seems like they do. Choosing D would mean that you’d have to use an EC2 instance or something else to transform the legacy data along with adding the Kinesis data analytics functionality. Also, there is no real-time requirement so daily transformation is fine.

It's D, because with KDA you can transform the data with SQL while with EMR you need to write code; considering the requirement of "least development effort", D it is

“LEAST amount of development effort” , EMR is no complicated to LEAST

If the question were "least cost" then B, but the question is "least development effort", so you want to keep the original architecture. I agree that for daily ETL instead of real-time, and a large dataset, B is a better option.

You can use Lambda instead of EC2. So D should be OK.
https://aws.amazon.com/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
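
For reference, the Lambda-based preprocessing that the blog above describes boils down to a record-transformation handler along these lines; the field names being combined are assumptions, not from the question:

```python
# Kinesis Data Firehose transformation Lambda: decode each record, apply a simple
# transformation (combine two attributes), and return it re-encoded.
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: combine two attributes into one (field names assumed).
        payload["store_sku"] = f"{payload.get('store_id')}-{payload.get('sku')}"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```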

can be B - https://www.examtopics.com/discussions/amazon/view/9826-exam-aws-certified-machine-learning-specialty-topic-1/

23
Q

23 - A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output? - A.. Dropout
B.. Smooth L1 loss
C.. Softmax
D.. Rectified linear units (ReLU)

A

C - C is much more suitable;
softmax turns numbers into probabilities.

https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d

C is right. Softmax function is used for multi-class predictoins

In a multiclass classification problem (such as classifying an image into one of 10 animal categories), the model should output a probability distribution over the classes. The Softmax function achieves this by:

Taking the raw scores (logits) from the final dense layer (10 nodes, one per class).
Exponentiating each score and normalizing them so they sum to 1, effectively turning them into probabilities.
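
That is all softmax does; here is a minimal NumPy sketch (the example logits are made up):

```python
# Convert raw logits from the final 10-node layer into a probability distribution.
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()          # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 1.5, -0.3, 0.8, -2.0])
probs = softmax(logits)
print(probs.round(3), "sum =", probs.sum())  # probabilities sum to 1
```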

A. NO - Dropout is to prevent overfitting
B. NO - L1 regularization is to prevent overfitting
C. YES - Softmax will give probabilities for each class
D. NO - Rectified linear units (ReLU) is an activation function

Softmax is the correct answer.

Multiclass classification with probabilities is about softmax!

Softmax is for probability distribution

it should be C. Softmax
Softmax converts outputs to probabilities for each class

absolutely C

Absolute C.

This is as easy a question as you will likely see on the exam, Everyone has the right answer here.

C –> Softmax.

Let’s go over the alternatives:
A. Dropout –> Not really a function, but rather a method to avoid overfitting. It consists of dropping some neurons during the training process, so that the performance of our algorithm does not become very dependent on any single neuron.
B. Smooth L1 loss –> It’s a loss function, thus a function to be minimized by the entire neural network. It’s not an activation function.
C. Softmax –> This is the traditional function used for multi-class classification problems (such as classifying an animal into one of 10 categories)
D. Rectified linear units (ReLU) –> This activation function is often used on the first and intermediate (hidden) layers, not on the final layer. In any case, it wouldn’t make sense to use it for classification because its values can exceed 1 (and probabilities can’t)

C, Softmax is the best suitable answer
Ref: The softmax function, also known as softargmax[1]:184 or normalized exponential function,[2]:198 is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce’s choice axiom.

You guys are right, the answer is C since it automatically provides the output with a confidence interval…

Relu could be used as well but it needs to be coded in to provide the probabilities

https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e

Definitely C

Definitely softmax.

Are you sure it is C?
The output should be “[the probability that] the input image belongs to each of the 10 classes.” And not the most likely class with the highest probability, which would be the result of softmax layer.

Yes, softmax returns indeed a vector of probabilities. - https://www.examtopics.com/discussions/amazon/view/8307-exam-aws-certified-machine-learning-specialty-topic-1/

24
Q

24 - A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target. What option can the Specialist use to determine whether it is overestimating or underestimating the target value? - A.. Root Mean Square Error (RMSE)
B.. Residual plots
C.. Area under the curve
D.. Confusion matrix

A

B - RMSE tells you about the magnitude of the error but not its sign. The question is to find whether the model overestimates or underestimates - I guess residual plots clearly show that

answer B

Answer is B. Residual plot distribution indicates over or under-estimations

A residual plot helps determine whether a regression model is overestimating or underestimating the target value.

Residual = Actual Value - Predicted Value
Positive residual → The model underestimated the target.
Negative residual → The model overestimated the target.
By plotting residuals, the Machine Learning Specialist can see patterns that indicate bias:

More positive residuals → The model is underestimating.
More negative residuals → The model is overestimating.
Randomly scattered residuals around zero → The model is well-calibrated.
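
A minimal sketch of such a plot, using synthetic predictions purely for illustration:

```python
# Residuals = actual - predicted; a positive mean residual means the model underestimates.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
actual = rng.normal(100, 10, size=200)
predicted = actual - 3 + rng.normal(0, 2, size=200)   # a model that underestimates by ~3

residuals = actual - predicted
print("Mean residual:", residuals.mean())             # > 0 -> underestimating on average

plt.scatter(predicted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residual plot")
plt.show()
```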

Residual plots show each error individually!

B - Residual plots it is - https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html

Residual Plots (B).
AUC and Confusion Matrices are used for classification problems, not regression.
And RMSE does not tell us if the target is being over or underestimated, because residuals are squared! So we actually have to look at the residuals themselves. And that’s B.

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.

1) Squaring the residuals.
2) Finding the average of the residuals.
3) Taking the square root of the result.

Residual Plots (B). would have to be my answer

residual plot
https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html

https://stattrek.com/statistics/dictionary.aspx?definition=residual%20plot#:~:text=A%20residual%20plot%20is%20a,nonlinear%20model%20is%20more%20appropriate.

Answer is B

without a second thought residual plot

The answer is B. Refer to Exercise 7.2.1.A
https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Book%3A_OpenIntro_Statistics_(Diez_et_al)./07%3A_Introduction_to_Linear_Regression/7.02%3A_Line_Fitting%2C_Residuals%2C_and_Correlation

Residual plot it is Option B

Residual plot

B is the correct answer!!!!
RMSE has the S in it, the square… squaring removes the above/below (over/under) sign of the prediction error.
Answers C and D are for other types of problems

It should be B. The residual plot will show whether the target value is overestimated or underestimated.

Answer is C.

https://www.youtube.com/watch?v=MrjWcywVEiU

Answer is B.
Your vid shows a technique that is useful for defining integrals and has NOTHING to do with linear regression. Also, it over-/underestimates the area under the curve, NOT the target value.

Good grief, AUC is used for classification not regression.

B.. Residual plots help to find out whether the model is underestimating or overestimating

answer is B - https://www.examtopics.com/discussions/amazon/view/8308-exam-aws-certified-machine-learning-specialty-topic-1/

25
25 - A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?

[https://www.examtopics.com/assets/media/exam-media/04145/0001600001.jpg] - A.. Decision tree
B.. Linear support vector machine (SVM)
C.. Naive Bayesian classifier
D.. Single Perceptron with sigmoidal activation function
A - C is the correct answer because gaussian naive Bayes can do this nicely. of course it doesn't mention the gaussian here and refers to naive bayes in general, but I'm still positive with C. Answer should be A:. B: LINEAR SVM is a linear classifier -> All of these have a linear decision boundary (so it's just a line y = mx+b). This leads to a bad recall and so A must be the right choice. Decision Tree (Best Choice) ✅ Highly flexible: Can capture non-linear decision boundaries, making it effective when the class distribution is not linearly separable. Maximizes recall: A decision tree can prioritize minimizing false negatives by adjusting its splits. Handles imbalanced classes well using class weighting or pruning techniques. Gaussian naive Bayes is correct one A non-linear problem would be a case where linear classifiers, such as naive Bayes, would not be suitable since the classes are not linearly separable. In such a scenario, non-linear classifiers (e.g.,instance-based nearest neighbour classifiers) should be preferred. decision tree can effectively maximize the recall by drawing a square (3 <= month <= 7, 3 <= age <= 7) Option C. Naive Bayesian classifier is not the best choice for achieving the highest recall for the fraudulent class because it makes strong assumptions about the independence of features. In many real-world scenarios, especially with complex data like user behavior, these assumptions do not hold true, which can lead to suboptimal performance. In contrast, a Decision tree (Option A) can handle feature interactions and is more flexible in capturing the relationships between features, making it more effective in identifying fraudulent behavior and achieving higher recall Answer in my opinion is A A Decision Tree Classifier can handle complex decision boundaries and does not assume any particular distribution of data. It is well-suited for cases like this where the decision boundary is non-linear, as seen with the clear separation between the normal and fraudulent transactions. A Naive Bayesian classifier, on the other hand, assumes independence among features and typically performs better when data is normally distributed, which might not be the case here given the data's clustering pattern. From Claude 3 Haiku: A. NO, decision trees may struggle to capture the linear separability of the classes. B. NO, Linear SVM may not be able to fully exploit the class separation due to its linear decision boundary. C. YES, The Naive Bayesian classifier tends to perform well in situations where the classes are linearly separable. This model requires the features are independent and this is the case D. The single Perceptron with a sigmoidal activation function may not be able to capture the complex class distributions as effectively as the Naive Bayesian classifier. Funny that if you ask Haiku to explain its reason step by step, it will chose A instead of C ``` Based on the information provided, the model that is likely to have the highest recall with respect to the fraudulent class is the **Decision Tree (Most Voted)**. ``` Answer by Claude3: In contrast, the Decision Tree (A) and Linear SVM (B) models are generally more robust to overfitting and can achieve a better balance between recall and precision, but they may not necessarily have the highest recall for the minority class. 
Considering the importance of maximizing recall for the fraudulent class in this use case, the Naive Bayesian Classifier (C) could be a valid choice, although it may come with the trade-off of lower precision and potentially higher false positive rates. highest recall. So A Only A (DT) is non-linear among the mentioned algorithms. Given the visualized data, the Decision tree (Option A) is likely the best model to achieve the highest recall for the fraudulent class. It can handle complex patterns and create rules that are more suited for clustered and potentially non-linearly separable classes. Recall is a measure of a model's ability to capture all actual positives, and a decision tree can be tuned to prioritize capturing more of the fraudulent cases at the expense of making more false-positive errors on the normal cases. if it was highest precision: Given these considerations, the best model for precision would likely be a Support Vector Machine with a non-linear kernel, such as the RBF (Radial Basis Function) kernel. This model can tightly fit the boundary around the fraudulent class, minimizing the inclusion of normal transactions in the fraudulent prediction space, and thus potentially achieving high precision. Precision is sensitive to the false positives, and the flexibility of SVMs with non-linear kernels to create a tight and precise boundary can help to minimize these. GPT 4 Answer is Decision Tree. Considering the goal is to achieve the highest recall for the fraudulent class, which means we aim to capture as many fraudulent cases as possible even if it means getting more false positives, a Decision Tree would likely be the best option. This is because it can adapt to the complex shape of the class distribution and encapsulate the majority of the fraudulent class within its decision boundaries. Recall is a measure of a model's ability to capture all actual positives, and the decision tree's complex boundary setting capabilities make it well-suited for maximizing recall in this case. I'm going with A. As pointed out in this article, Naive Bayes performs poorly with non-linear classification problems. The picture shows a case where the classes are not linearly separable. Decision Tree will probably give better results. https://sebastianraschka.com/Articles/2014_naive_bayes_1.html Highest recall for fraudulent class means that Precision for Fraudulent predictions can be low. So basically just two conditions Transaction Month nearly greater than 8 and age of accounts greater than 8 can help identify the fraudulent class but it will classify most of non-fraudulent cases as fraudulent. - https://www.examtopics.com/discussions/amazon/view/43910-exam-aws-certified-machine-learning-specialty-topic-1/
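As an illustration of the argument above, here is a minimal sketch comparing fraud-class recall for a decision tree versus a linear SVM on synthetic data that mimics the clustered, non-linearly-separable layout the discussion describes. The cluster positions and sample counts are invented for illustration, not taken from the exam figure.

```python
# Sketch: decision tree vs. linear SVM recall on a clustered "fraud" class.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
normal = rng.uniform(0, 10, size=(900, 2))   # normal transactions, spread widely
fraud = rng.uniform(4, 6, size=(100, 2))     # fraud clustered in a small square
X = np.vstack([normal, fraud])
y = np.array([0] * 900 + [1] * 100)          # 1 = fraudulent

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("linear SVM", LinearSVC())]:
    clf.fit(X, y)
    print(name, "fraud recall:", recall_score(y, clf.predict(X)))
```

On this kind of layout, a single linear boundary cannot isolate the inner cluster, so the linear models miss most fraud cases, while the tree can carve out the square and achieve much higher recall.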
26
26 - A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours. With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s). Which visualization will accomplish this? - A.. A histogram showing whether the most important input feature is Gaussian. B.. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension. C.. A scatter plot showing the performance of the objective metric over each training iteration. D.. A scatter plot showing the correlation between maximum tree depth and the objective metric.
D - This is a very tricky question. The idea is to reconfigure the ranges of the hyperparameters. A refers to a feature, not a hyperparameter. A is out. C refers to training the model, not optimizing the range of hyperparameters. C is out. Now it gets tricky. D will let you find determine what the approximately best tree depth is. That's good. That's what you're trying to do but it's only one of many hyperparameters. It's the best choice so far. B is tricky. t-SNE does help you visualize multidimensional data but option B refers to input variables, not hyperparameters. For this very tricky question, I would do with D. It's the only one that accomplishes the task of limiting the range of a hyperparameter, even if it is only one of them. It's good to see someone keeping a thoughtful and curious mind to this question. I too have the same conclusion, not an easy question. But, how do you optimize hyperparameters without training experiments? That is why C is the best option. You get a value for each unique combination of hyperparameters. B is also wrong as t-SNE picture is not actionable - good visual but ... that's it. try pictures here https://lvdmaaten.github.io/tsne/ When you are tuning hyperparameters you are literally training multiple models and searching for the best ones. B doesn't make sense I think it's D The goal is to reduce training time and costs by optimizing the hyperparameter tuning process. In tree-based ensemble models (e.g., XGBoost, Random Forest, or Gradient Boosting), tree depth is one of the most influential hyperparameters affecting: Model complexity: Deeper trees increase complexity but may lead to overfitting. Training time: More depth means more splits, significantly increasing computation. Performance (AUC score in this case): There is typically an optimal depth that balances underfitting and overfitting. A scatter plot showing the correlation between tree depth and the AUC metric will allow the ML Specialist to: Identify whether increasing depth leads to diminishing returns. Choose an optimal tree depth that balances performance with training efficiency. Reduce the search space of hyperparameters, speeding up tuning and lowering costs. A. No, doesn't help to set/reduce hyperparameter value/range B. No, honestly this is gibberish to me C. No, doesn't help to reduce hyperparameter value range D. YES, this help me understand how to set max tree depth hyperparameter Option C. See it is doing a scatter plot on the metric for each iteration. Each iteration is running with a certain set of hyper parameters. So if I plot this. and I find which iteration has the best metric, I could simply pick up those set of hyperparameters. D will only led to the tuning of maximum tree depth. I am not sure which option would satisfy the goal to decrease cost but just looking at maximum tree depth doesnt seem right to me. It might be a way to just look at the tree depth and tune just that parameter and since you are only tuning 1 paramter, it may be cheaper, but would that lead to a usable model? I think it should be option C. On what basis the correct answers are provided in this platform? Are they assuming this is the correct answer or it is taken from somewhere ? D IS THE CORRECT Option D, can also be useful in hyperparameter tuning for tree-based ensemble models, especially if the maximum tree depth is one of the hyperparameters you want to optimize. 
However, when the goal is to decrease training time and costs by reconfiguring input hyperparameter ranges, a scatter plot showing the performance of the objective metric over each training iteration (Option C) is generally more directly related to the hyperparameter tuning process. It helps you track how the model's performance changes during hyperparameter tuning, which is critical for making decisions about which hyperparameter ranges to explore further. Option D is valuable for understanding the relationship between maximum tree depth and the objective metric, but it might not provide as comprehensive insights into the overall hyperparameter tuning process compared to Option C. A. NO - it is about data discovery B. NO - it is about data discovery C. MIGHT - (NO) is a training iteration the overnight training the question is referring to ? (YES) Or each HPO training within each night ? D. YES - the less ambiguous answer I think that C should be the right answer. The specialist can monitor how the model works by changing hyperparameters' values in each training iteration. Option D A and B are wrong, because is totally out of question context. C is for monitoring a model, it doesn't help to change your HP range. D is the only answer that applies to the question. I think it should be c By plotting the performance of the objective metric (AUC) over each training iteration, the Specialist can analyze how different hyperparameter configurations affect the model's performance. This visualization helps in understanding which hyperparameter combinations lead to better results and allows the Specialist to identify areas of improvement. D: By analyzing this relationship, the Specialist can adjust the range of maximum tree depth values used during hyperparameter tuning to decrease training time and costs. D Seems like the best answer. When answer is considered correct who is making that call an is there any justification provided for us to learn from? It's about parameters, not about dimensionality. - https://www.examtopics.com/discussions/amazon/view/10264-exam-aws-certified-machine-learning-specialty-topic-1/
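For readers who want to see what option D looks like in practice, here is a hedged sketch using the SageMaker analytics API to plot one tuned hyperparameter against the objective metric. The tuning job name and the `max_depth` column are assumptions for illustration; the actual column names depend on which hyperparameters were tuned.

```python
# Sketch: inspect a completed tuning job to see how max tree depth relates to AUC,
# so the hyperparameter range can be narrowed for the nightly retraining job.
import matplotlib.pyplot as plt
from sagemaker.analytics import HyperparameterTuningJobAnalytics

analytics = HyperparameterTuningJobAnalytics("my-nightly-tuning-job")  # assumed name
df = analytics.dataframe()  # one row per training job: hyperparameters + FinalObjectiveValue

plt.scatter(df["max_depth"], df["FinalObjectiveValue"])
plt.xlabel("max_depth")
plt.ylabel("AUC (objective metric)")
plt.title("Objective metric vs. maximum tree depth")
plt.show()
```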
27
27 - A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions. Here is an example from the dataset: "The quck BROWN FOX jumps over the lazy dog." Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.) - A.. Perform part-of-speech tagging and keep the action verb and the nouns only. B.. Normalize all words by making the sentence lowercase. C.. Remove stop words using an English stopword dictionary. D.. Correct the typography on "quck" to "quick." E.. One-hot encode all words in the sentence. F.. Tokenize the sentence into words.
BCF - B C F should be correct. I will select B, C, F 1- Apply words stemming and lemmatization 2- Remove Stop words 3- Tokensize the sentences https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925 B. Normalize all words by making the sentence lowercase: Word2Vec treats words as distinct entities. If you don't convert everything to lowercase, "The" and "the" will be considered different words, which is generally not what you want. Lowercasing ensures consistency. D. Correct the typography on "quck" to "quick": Misspellings need to be corrected. Word2Vec learns embeddings based on the words it encounters. If "quck" remains, it will be treated as a separate word from "quick," and you'll lose the relationship between them. Correcting typos is crucial for data quality. F. Tokenize the sentence into words: Tokenization is the process of breaking down the sentence into individual words (or tokens). Word2Vec operates on individual words, so you need to split the sentence into its constituent parts. This is a fundamental step in NLP. While C - is debatable - not always necessary to remove stop words in Word2Vec - as sometimes the stop words do provide context ==================== For Word2Vec training, data preprocessing is essential to ensure that words are correctly represented, consistent, and free from unnecessary noise. The key steps are: Lowercasing the text (B) Word embeddings treat "FOX" and "fox" as different words. To avoid redundancy, lowercasing the text ensures consistency. Correcting typos (D) "quck" should be corrected to "quick" to prevent incorrect word representations in Word2Vec. Misspelled words can create meaningless embeddings. Tokenizing the sentence into words (F) Word2Vec operates at the word level, so breaking the sentence into individual tokens (words) is necessary. A. NO - word2vec works on raw data B. YES - case here is not significant C. YES - will help reduce dimensionality D. NO - word2vec will do it by itself E. NO - One-hot encoding is for classification F. YES - word2vec takes tokens as input Data need to be tokenized and cleaned! B, C F is the correct BCF correct. D is not correct (Pay attention to “in a repeatable manner” in the question.) B/C/F. D should not be performed because spell check is a subjective thing. You don't know for sure what the word was supposed to be if you have a typo. I saw this exact question on "whizlabs" practice exam and correct options were B/C/F https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281 Data Preparation — Define corpus, clean, normalise and tokenise words To begin, we start with the following corpus: “natural language processing and machine learning is fun and exciting” For simplicity, we have chosen a sentence without punctuation and capitalization. Also, we did not remove stop words “and” and “is”. In reality, text data are unstructured and can be “dirty”. Cleaning them will involve steps such as o removing stop words, o removing punctuations, o convert text to lowercase (actually depends on your use-case), o replacing digits, etc. 
o After preprocessing, we then move on to tokenising the corpus Answer: B, C, F BCF is 100% correct Correct answers are B, C and F The correct answer is B, C and F A: POS tagging has nothing to do with word2vec D: fixing "quck" to "quick" only works for that specific word F: word2vec can use CBOW or skipgram, so no need to have one-hot decoding here sorry E: word2vec can use CBOW or skipgram, so no need to have one-hot decoding here BCF is correct B, C F correct B, C, and F are correct answers. I have done this question many times in many practice tests. B, C, F are my choice. D is also possible but not as widely used as others. - https://www.examtopics.com/discussions/amazon/view/11820-exam-aws-certified-machine-learning-specialty-topic-1/
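A minimal sketch of the three sanitization steps the discussion converges on (B, C, F): lowercase, remove stop words, tokenize. The stop-word set below is a tiny stand-in for a full English stopword dictionary, and the regex tokenizer is one of several reasonable choices.

```python
# Sketch: lowercase (B), tokenize (F), and remove stop words (C) for Word2Vec input.
import re

STOP_WORDS = {"the", "a", "an", "over", "and", "is", "are"}  # illustrative subset only

def preprocess(sentence):
    sentence = sentence.lower()                          # B: normalize case
    tokens = re.findall(r"[a-z']+", sentence)            # F: tokenize into words
    return [t for t in tokens if t not in STOP_WORDS]    # C: drop stop words

print(preprocess("The quck BROWN FOX jumps over the lazy dog."))
# ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```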
28
28 - A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents? - A.. Convert current documents to SSML with pronunciation tags. B.. Create an appropriate pronunciation lexicon. C.. Output speech marks to guide in pronunciation. D.. Use Amazon Lex to preprocess the text files for pronunciation
B - SSML is specific to one particular document: for example, W3C can be pronounced as "World Wide Web Consortium" via SSML tags in that specific document, and when you create a new document you need to format it again. But with LEXICONS, you can upload a lexicon file once and ALL FUTURE documents can simply contain W3C and it will be pronounced as "World Wide Web Consortium". So the answer is B, because the question asks about "future" documents. The correct answer is B, as explained by VB. For the exact reason you state, the correct answer is A: for every different document, a particular acronym may mean something different, so you must have a solution that is document-specific. As the question states, "address this issue FOR FUTURE DOCUMENTS" - B addresses future documents, while A only addresses the issue case by case. It is the same business, so the acronyms are not expected to change from document to document; B is absolutely the correct choice. A: see the documentation section for "Pronouncing Acronyms and Abbreviations". Source: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html I think the answer is B. https://docs.aws.amazon.com/polly/latest/dg/managing-lexicons.html https://www.smashingmagazine.com/2019/08/text-to-speech-aws/ Lifted from the above link: "Your text might include an acronym, such as W3C. You can use a lexicon to define an alias for the word W3C so that it is read in the full, expanded form (World Wide Web Consortium)." Clearly this is the same use case. Explanation: Amazon Polly sometimes mispronounces acronyms because it reads them as regular words. The best way to correct mispronunciations in future documents is to create a pronunciation lexicon, which lets you define how specific words, acronyms, or abbreviations should be pronounced. How to use a pronunciation lexicon in Amazon Polly: define the correct pronunciation of acronyms in a lexicon XML file, use phonetic notation (e.g., IPA or SSML phoneme tags), upload the lexicon to Polly via the AWS Management Console or AWS SDK, and reference the lexicon in Polly API requests. Answer: B https://aws.amazon.com/blogs/machine-learning/customize-pronunciation-using-lexicons-in-amazon-polly/ The SSML tag is great for one-off customizations or testing purposes; AWS recommends using a lexicon to create a consistent set of pronunciations for frequently used words across your organization. This lets content writers spend time on writing instead of the tedious task of repeatedly adding phonetic pronunciations to the script. SSML supports phonetic pronunciation, so it seems to me A is correct too. https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#phoneme-tag B IS THE ANSWER. B is correct; where did A even come from? This issue can be addressed with both methods described in A and B, but answer A refers to the "current" documents while the question asks about "future" documents, so I think the right answer is B. Letter B is correct to ensure that acronyms or terms are pronounced correctly. Letter A works, but note the catch: the question asks about future documents, yet A mentions converting only the current ones to SSML, while future ones would still be plaintext. The company uses plaintext, and future documents means plaintext, so only a custom lexicon will help. https://docs.aws.amazon.com/polly/latest/dg/managing-lexicons.html explains acronyms; I believe it should be lexicon. Can you share how you tag the correct answer?
Key here being "for future documents": the answer should be B, as SSML applies only to a specific document. This should really be a multiple-answer question where both A and B apply. Response: B. A pronunciation lexicon is a list of words and their correct phonetic pronunciation that can be used to improve the accuracy of text-to-speech conversion. In this case, the Machine Learning Specialist can create a custom lexicon for the company's acronyms and upload it to Amazon Polly, which will ensure that the acronyms are pronounced correctly in future announcements. Should be B. With Amazon Polly's custom lexicons or vocabularies, you can modify the pronunciation of particular words, such as company names, acronyms, and foreign words. To customize these pronunciations, you upload an XML file with lexical entries. Pronunciation lexicons enable you to customize the pronunciation of words. Amazon Polly provides API operations that you can use to store lexicons in an AWS Region; those lexicons are then specific to that particular Region. References: https://docs.aws.amazon.com/polly/latest/dg/managing-lexicons.html https://aws.amazon.com/blogs/machine-learning/create-accessible-training-with-initiafy-and-amazon-polly/ - https://www.examtopics.com/discussions/amazon/view/10014-exam-aws-certified-machine-learning-specialty-topic-1/
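A minimal sketch of the lexicon approach using the real `put_lexicon` and `synthesize_speech` operations from boto3. The lexicon name, the W3C example acronym, the voice, and the region-level setup are illustrative assumptions.

```python
# Sketch: register a pronunciation lexicon once, then reference it for all future
# synthesis requests so acronyms are expanded consistently.
import boto3

LEXICON_XML = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""

polly = boto3.client("polly")
polly.put_lexicon(Name="companyAcronyms", Content=LEXICON_XML)  # stored per Region

response = polly.synthesize_speech(
    Text="The W3C published a new announcement today.",
    VoiceId="Joanna",
    OutputFormat="mp3",
    LexiconNames=["companyAcronyms"],  # applied to this and any future plaintext document
)
with open("announcement.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```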
29
29 - An insurance company is developing a new device for vehicles that uses a camera to observe drivers' behavior and alert them when they appear distracted. The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models. During the model evaluation, the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images. Which of the following should be used to resolve this issue? (Choose two.) - A.. Add vanishing gradient to the model. B.. Perform data augmentation on the training data. C.. Make the neural network architecture complex. D.. Use gradient checking in the model. E.. Add L2 regularization to the model.
BE - The model must have been overfitted. Regularization helps to solve the overfitting problem in machine learning (as does data augmentation). Correct answers should be BE. Agreed. 100% agree on BE. Answer: BE https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html 5 techniques to prevent overfitting: 1. Simplifying the model 2. Early stopping 3. Use data augmentation 4. Use regularization 5. Use dropout The issue described suggests that the model is overfitting to the training data: training error decreases quickly, meaning the model is learning the training set very well, while poor performance on unseen test data indicates overfitting. To resolve overfitting, the Machine Learning Specialist should: Perform data augmentation (B) - expands the training dataset artificially by applying transformations (e.g., rotations, flips, brightness changes, cropping) and helps the model generalize better by exposing it to more diverse variations of the same class. Add L2 regularization (E) - also known as weight decay, it penalizes large weights, preventing the model from memorizing the training data; it encourages simpler models, which reduces variance and improves generalization. Agreed with vetal. A. NO - a vanishing gradient is a problem that may occur during training and prevent convergence; it is a result of learning, not something we can add explicitly. B. YES - we have an overfitting problem, so more training examples will help. C. NO - we already have good accuracy on the training set. D. NO - gradient checking is for finding bugs in a model implementation. E. YES - we have an overfitting problem. B. Perform data augmentation on the training data (it should add validation data as well; data should be distributed among train, validation, and test). Answer B & E looks good. B & E is the correct answer. BE is exact. BE are the correct answers. Looks like B and D are correct. For D -> https://www.youtube.com/watch?v=P6EtCVrvYPU Gradient checking doesn't resolve the issue; adding it will only confirm or deny the issue, so it helps to validate the problem but not resolve it. I would say B, E are correct. L2 regularization tries to reduce the possibility of overfitting by keeping the values of the weights and biases small. Why not because of vanishing gradients? Vanishing gradients are a problem when training a NN; answer A suggests that the solution is to add them, which is not possible. Correct solution is BE. https://www.kdnuggets.com/2022/02/vanishing-gradient-problem.html This is L2 regularization... do you think this is the right answer? Agree, BE. - https://www.examtopics.com/discussions/amazon/view/10037-exam-aws-certified-machine-learning-specialty-topic-1/
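A minimal sketch of the two fixes the discussion agrees on - data augmentation (B) and L2 regularization (E) - expressed with tf.keras. The input shape, layer sizes, and augmentation settings are placeholders, not taken from the question.

```python
# Sketch: augmentation layers (B) plus L2 weight penalties (E) to curb overfitting.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # B: augment training images
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 3)),
    augment,                                                   # only active during training
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),   # E: L2 penalty on weights
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```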
30
30 - When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Choose three.) - A.. The training channel identifying the location of training data on an Amazon S3 bucket. B.. The validation channel identifying the location of validation data on an Amazon S3 bucket. C.. The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users. D.. Hyperparameters in a JSON array as documented for the algorithm used. E.. The Amazon EC2 instance class specifying whether training will be run using CPU or GPU. F.. The output path specifying where on an Amazon S3 bucket the trained model will persist.
ACF - THE ANSWER SHOUD BE CEF IAM ROLE, INSTANCE TYPE, OUTPUT PATH Why not A? You don't need to tell Sagemaker where the training data is located? You need to specify the InputDataConfig, but it does not need to be "S3" I think the reason why A and B are wrong, not because data location is not required, but because it doesn't need to be S3, it can be Amazon S3, EFS, or FSx location Should be C, E, F From the SageMaker notebook example: https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/semantic_segmentation_pascalvoc/semantic_segmentation_pascalvoc.ipynb # Create the sagemaker estimator object. ss_model = sagemaker.estimator.Estimator(training_image, role, train_instance_count = 1, train_instance_type = 'ml.p3.2xlarge', train_volume_size = 50, train_max_run = 360000, output_path = s3_output_location, base_job_name = 'ss-notebook-demo', sagemaker_session = sess) It says InstanceClass - CPU/GPU in the question, not InstanceType instance type has default value. From here https://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/API_CreateTrainingJob.html .. the only "Required: Yes" attributes are: 1. AlgorithmSpecification (in this TrainingInputMode is Required - i.e. File or Pipe) 2. OutputDataConfig (in this S3OutputPath is Required - where the model artifacts are stored) 3. ResourceConfig (in this EC2 InstanceType and VolumeSizeInGB are required) 4. RoleArn (..The Amazon Resource Name (ARN) of an IAM role that Amazon SageMaker can assume to perform tasks on your behalf...the caller of this API must have the iam:PassRole permission.) 5. StoppingCondition 6. TrainingJobName (The name of the training job. The name must be unique within an AWS Region in an AWS account.) From the given options in the questions.. we have 2, 3, and 4 above. so, the answer is CEF. This is the best explanation that CEF is the right answer, IMO. The document at that url is very informative. It also specifically states that InputDataConfig is NOT required. Having said that, I have no idea how the model will train if it doesn't know where to find the training data, but that is what the document says. If someone can explain that, I'd like to hear the explanation. If I see this question on the actual exam, I'm going with AEF. The model absolutely must know where the training data is. I have seen other documentation that does confirm that you need the location of the input data, the compute instance and location to output the model artifacts. but you also need to specify the service role sagemaker should use otherwise it will not be able to perform actions on your behalf like provisioning the training instances. Perfect explanation. It is CEF The question is asking about built in algorithms. It should be ADE. See https://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/API_CreateTrainingJob.html for "3. ResourceConfig", only VolumeSizeInGB is required. So, it's not about the instance type. Check: https://docs.aws.amazon.com/zh_tw/sagemaker/latest/APIReference/API_ResourceConfig.html Reason: When submitting Amazon SageMaker training jobs using built-in algorithms, the following parameters must be specified: Training Data Location (A) SageMaker requires the training dataset's location in Amazon S3. Provided as a channel input in the training job. IAM Role (C) SageMaker needs IAM permissions to access data from S3 and execute tasks on behalf of the user. Model Output Path (F) Specifies the S3 bucket location where the trained model artifacts will be stored. 
Instance type is required but not specific class CPU/GPU. Sagamkaer can handle that. These parameters ensure that the training job has access to the necessary data, permissions, and storage locations to function correctly. Options B, D, and E are important but not always mandatory for every training job. For example, validation data (Option B) is not always required, and hyperparameters (Option D) and instance types (Option E) can have default values or be optional depending on the specific algorithm and setup. import boto3 import sagemaker sess = sagemaker.Session() # Example for the linear learner linear = sagemaker.estimator.Estimator( container, role, # role (c) instance_count=1, instance_type="ml.c4.xlarge", # instance type (e) output_path=output_location, # output path (f) sagemaker_session=sess, ) Going with cef ANSWER IS CEF Here from Amazon docs InputDataConfig An array of Channel objects. Each channel is a named input source. InputDataConfig describes the input data and its location. Required: No OutputDataConfig Specifies the path to the S3 location where you want to store model artifacts. SageMaker creates subfolders for the artifacts. Required: Yes ResourceConfig - Identifies the resources, ML compute instances, and ML storage volumes to deploy for model training. In distributed training, you specify more than one instance. Required: Yes CEF https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestParameters Based on https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html Required parameters are: - AlgorithmSpecification (registry path of the Docker image with the training algorithm) - OutputDataConfig (path to the S3 location where you want to store model artifacts) - ResourceConfig (resources, including the ML compute instances and ML storage volumes, to use for model training) - RoleArn - StoppingCondition (time limit for training job) - TrainingJobName Thus, the answer is: C E F wording for option E is inaccurate "EC2 instance class specifying whether training will be run using CPU or GPU" but they do it on purpose The input channel and output channel are mandatory, as the training job needs to know where to get the input data from and where to publish the model artifact. IAM role is also needed, for AWS services. others are not mandatory, validation channel is not mandatory for instance in case of unsupervised learning, likewise hyper params can be be auto tuned for as well as the ec2 instance types can be default ones that will be picked As they narrowed it to S3, A is incorrect BUT when submitting Amazon SageMaker training jobs using one of the built-in algorithms, it is a MUST to identify the location of training data. While Amazon S3 is commonly used for storing training data, other sources like Docker containers, DynamoDB, or local disks of training instances can also be used. Therefore, specifying the location of training data is essential for SageMaker to know where to access the data during training. So the right answer is CEF for me for this case... However if A was saying identify the location of training data, I think option A would be included in the MUST parameter. InputDataConffig is optional in create_training_job.Please check thte parameters that are required. So answer is CEF: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html InputDataConffig is optional in create_training_job.Please check thte parameters that are required. 
So the answer is CEF: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html Input is required only when calling the fit method; when initializing the Estimator, we do not need the input. I opened SageMaker and tested it: A, C, F. B is not needed for unsupervised algorithms. - https://www.examtopics.com/discussions/amazon/view/8316-exam-aws-certified-machine-learning-specialty-topic-1/
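To make the API-reference argument above concrete, here is a hedged sketch of the required fields of the low-level CreateTrainingJob call, matching the list cited in the discussion. The image URI, role ARN, job name, and bucket are placeholders.

```python
# Sketch: the required parameters when submitting a built-in-algorithm training job.
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="xgboost-demo-001",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",   # C: IAM role
    AlgorithmSpecification={
        "TrainingImage": "<built-in-algorithm-image-uri>",             # placeholder
        "TrainingInputMode": "File",
    },
    ResourceConfig={                                                    # E: compute instances
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/model-artifacts/"},  # F: output path
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    # InputDataConfig (the training channel) is commonly supplied as well,
    # but the API reference marks it as not strictly required.
)
```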
31
31 - A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance. How should the records be stored in Amazon S3 to improve query performance? - A.. CSV files B.. Parquet files C.. Compressed JSON D.. RecordIO
B - Answer is B. Athena is best in Parquet format. You can improve the performance of your query by compressing, partitioning, or converting your data into columnar formats. Amazon Athena supports open source columnar data formats such as Apache Parquet and Apache ORC. Converting your data into a compressed, columnar format lowers your cost and improves query performance by enabling Athena to scan less data from S3 when executing your query Amazon Athena performs best when querying columnar storage formats like Apache Parquet. Given that 1 TB of data is generated every minute, optimizing storage format is critical for query performance and cost efficiency. Why Parquet (B) is the Best Choice? Columnar Storage: Parquet stores data by columns instead of rows, allowing Athena to scan only the needed columns, reducing the amount of data read. Compression Efficiency: Parquet automatically compresses data more efficiently than CSV or JSON. Smaller file sizes = Faster queries + Lower costs. Efficient Query Performance: Parquet supports predicate pushdown, meaning queries can skip irrelevant rows without scanning the entire dataset. Optimized for Big Data & Athena: Designed for big data workloads in Athena, Redshift Spectrum, and Presto. Works well with S3 partitioning to improve query speed. A. NO - slower B. YES - Parquet native in Aethena/Presto C. NO - Compressed JSON D. NO - no built-in support according to: https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and the query run time over parquet file was 6.78 seconds while it was 236 seconds on the same data but stored on csv file which mean that parquet file is 34x faster than csv file B it is Answer is B. https://aws.amazon.com/tw/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ But why does this question relate to Machine Learning? Because you must explore data very quickly using SQL in order to run EDA / analyze data for ML purposes. Those explorations can inform on selecting features that can be used for modeling purposes. - https://www.examtopics.com/discussions/amazon/view/14807-exam-aws-certified-machine-learning-specialty-topic-1/
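A minimal sketch of the conversion step: rewriting raw CSV metric records as partitioned Parquet so Athena scans less data. The bucket names, the `timestamp` column, and the daily partition key are illustrative; writing straight to S3 with pandas assumes the pyarrow and s3fs packages are installed.

```python
# Sketch: convert CSV metric records to partitioned Parquet for faster Athena queries.
import pandas as pd

df = pd.read_csv("s3://my-metrics-bucket/raw/2024-06-01.csv")
df["dt"] = pd.to_datetime(df["timestamp"]).dt.date.astype(str)

df.to_parquet(
    "s3://my-metrics-bucket/parquet/",   # Athena table location
    engine="pyarrow",
    partition_cols=["dt"],               # partition pruning further reduces scanned data
    index=False,
)
```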
32
32 - Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below. Given the dataset, the Specialist wants to convert the Day_Of_Week column to binary values. What technique should be used to convert this column to binary values? [https://www.examtopics.com/assets/media/exam-media/04145/0002100001.png] - A.. Binarization B.. One-hot encoding C.. Tokenization D.. Normalization transformation
B - I choose b Correct answer is B. Example: Mon | Tue | Wed .... 1 0 0 0 1 0 Easy Peasy Any categorical feature needs to be converted using One Hot Encoding and NOT label encoding. Originally I put A, (believing to be able to format it as (0,1,2,3,4,5,6), or something as it mentioned it to convert the column, but later I realized Binarization is only designed for continuous or numerical data. Even though one-hot encoding will create 6 more columns it is correct. B is correct. B 1000000 = Mon 0100000 = Tue 0010000 = Wed 0001000 = Thur 0000100 = Fri 0000010 = Sat 0000001=Sun why not A? 001 010, 011 i thought of this at first, but chatgpt's explanation changed my mind In summary, if the names of days represent nominal categorical variables, one-hot encoding is generally the preferred choice. It maintains distinctiveness, is interpretable, and ensures that each day is clearly represented as a separate binary feature. Binary encoding may be considered for memory efficiency, especially when dealing with a large number of ordinal categories, but it should be used with caution as it introduces an ordinal relationship between categories, which may or may not align with the nature of the data. Ultimately, the choice between the two methods should align with the specific needs of your analysis and the data's characteristics. B is the obvious answer Binary encoding would've been a correct answer but it is not here & Binarization is used for continuous variables. leaving w/ option B B is wrong. You do not need to one hot encode the variable in random trees. If you do so, you tree must be very deep, which is not efficient. The correct answer is C! "The Specialist want to convert the Day Of Week column in the dataset to binary values." You are misreading the question. The answer is B. Stop misleading people, the question already asked to convert the data into binary. C is not even remotely close to be correct - https://www.examtopics.com/discussions/amazon/view/46347-exam-aws-certified-machine-learning-specialty-topic-1/
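As a quick illustration of option B, here is a pandas one-hot encoding sketch; the sample rows are invented.

```python
# Sketch: one-hot encode the Day_Of_Week column into binary indicator columns.
import pandas as pd

df = pd.DataFrame({"Day_Of_Week": ["Monday", "Tuesday", "Monday", "Sunday"]})
encoded = pd.get_dummies(df, columns=["Day_Of_Week"], dtype=int)
print(encoded)
# Columns produced: Day_Of_Week_Monday, Day_Of_Week_Sunday, Day_Of_Week_Tuesday,
# each holding 0/1 values for every row.
```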
33
33 - A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users. The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns. Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory Which of the following approaches should the Data Science team take to mitigate this issue? (Choose two.) - A.. Add more deep trees to the random forest to enable the model to learn more features. B.. Include a copy of the samples in the test dataset in the training dataset. C.. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. D.. Change the cost function so that false negatives have a higher impact on the cost value than false positives. E.. Change the cost function so that false positives have a higher impact on the cost value than false negatives.
CD - I think it should be CD. C: because we need a balanced dataset. D: the number of negative samples is large, so the model tends to predict 0 (negative) for all cases, leading to a false-negative problem; we should minimize that. My opinion. Why these are the best choices: C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data - this balances the dataset by increasing the number of positive samples, and adding noise prevents overfitting and helps the model generalize better. Alternative: use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic positive examples. D. Change the cost function so that false negatives have a higher impact on the cost value than false positives - since missing a potential paying user (false negative) is more critical than misclassifying a non-paying user, adjusting the cost function to penalize false negatives more will improve recall for paid users. Methods: use weighted loss functions (e.g., weighted cross-entropy), adjust class weights in the random forest or another algorithm, and use AUC-ROC or F1-score instead of accuracy for evaluation. Think C and D. C, D is correct (the percentage of the positive class is key to deciding which case we are interested in). In this question, the positive class (pay) is 0.1% compared to 99.9% (not pay); as a result, we have to pay attention to "pay" because if we miss that 0.1%, we don't get the revenue - that is a false negative. In contrast, in a question where the positive class (pay) were 40% versus 60% negative, there would be less need to emphasize false negatives (if the model predicts payment but the customer declines, we lose only the expected revenue from a false positive). I think it is CD. C and D. Hopefully, no one honestly thinks that B is a good answer: never expose test data to the training set or vice versa. C is right because of the highly imbalanced training set. D is right because you want to minimize false negatives, maximize true positives, and maximize recall of the positive class. I'm not sure why anyone is worried about precision in this case. CD. The model has 99% accuracy because it is simply predicting that everyone is a negative; since almost everyone is a negative, it gets almost everyone right. So we need to penalize the model for predicting that someone is a negative when they are not (i.e., penalize false negatives) - that's D. Also, it would be really nice to have more positives, and one way to get them is to follow option C. CD 100%. CD. C: imbalance of the training set (1,000 positive, 999,000 negative = 0.1% positive), thus C to increase the positives. D: also to reduce over-generalization - since the model tends to say "no" for everyone, increasing the penalty of a false negative counteracts that. It is needed to diminish the FP, because these are players predicted to pay who in reality will not pay, so FP should impact the cost metric more; CE should be the answer. CD are correct for sure. It is C, E... we want to find all paying customers, which are positives, so we have to punish incorrectly predicting negatives, which is E. CD, although I am worried about the noise being introduced, as it could skew the data; nevertheless, no better answer is given. CD. We need high recall so that we do not miss many positive cases.
In that case we need to have less False Negative(FN) therefore it should have high impact on cost function. in my view, CD are answers C: of course, handle the imbalanced dataset D: right now, model accuracy is 99%, it means model predict everything is negative leading to FN problem, so we need to minimize it more in cost function CD, FN are valuable players, we should care more on FN - https://www.examtopics.com/discussions/amazon/view/10409-exam-aws-certified-machine-learning-specialty-topic-1/
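A minimal sketch of the two mitigations (C and D): oversample the minority class with a little Gaussian noise, and weight the cost function so missed payers (false negatives) are penalized more. The feature values, sample counts, noise scale, and class weights are synthetic placeholders scaled down from the question.

```python
# Sketch: noisy oversampling of positives (C) plus class weighting (D) in a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_neg = rng.normal(size=(9990, 200))          # scaled-down stand-in for 999,000 negatives
X_pos = rng.normal(loc=0.5, size=(10, 200))   # scaled-down stand-in for 1,000 positives
X = np.vstack([X_neg, X_pos])
y = np.array([0] * len(X_neg) + [1] * len(X_pos))

# C: duplicate positives and add a small amount of noise to the duplicates
dup = np.repeat(X_pos, 20, axis=0) + rng.normal(scale=0.05, size=(len(X_pos) * 20, 200))
X_bal = np.vstack([X, dup])
y_bal = np.concatenate([y, np.ones(len(dup), dtype=int)])

# D: make false negatives cost more via class weights
clf = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X_bal, y_bal)
```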
34
34 - A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age. Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population How should the Data Scientist correct this issue? - A.. Drop all records from the dataset where age has been set to 0. B.. Replace the age field value for records with a value of 0 with the mean or median value from the dataset C.. Drop the age feature from the dataset and train the model using the rest of the features. D.. Use k-means clustering to handle missing features
B - Dropping the age feature is NOT AT ALL a good idea, as age plays a critical role in this disease according to the question. Dropping 10% of the data is also not a good idea, considering that the number of observations is already low. The mean or median are potential solutions, but the question says the disease worsens after age 65, so there is a correlation between age and the other symptom-related features. That means that, using unsupervised learning, we can make a pretty good prediction of age, so the answer is D: use k-means clustering. https://www.displayr.com/5-ways-deal-missing-data-cluster-analysis/ B is correct. If it were KNN it would be more accurate, but we don't have that option. Replacing the age with the mean or median might introduce bias into the dataset; using k-means clustering to estimate the missing age based on other features might get better results, and removing 10% of the available data looks odd - so why not D? The issue arises from incorrect age values (age = 0) in a dataset where all patients are supposed to be over 65 years old. Since age is an important predictor of the disease's progression, removing or ignoring this feature may negatively impact model performance. The best approach is imputing missing or incorrect values with a reasonable estimate (e.g., the mean or median age of the dataset), ensuring that: the dataset remains intact without losing valuable patient records, the model still benefits from age as a feature, and the imputed values are realistic and do not introduce bias. This preserves data, maintains model integrity, and corrects anomalies effectively. B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset: this method allows all patient records to be retained while addressing the anomaly, and it is a standard approach for dealing with missing or incorrect values in a way that preserves the integrity of the dataset. B (GPT answer). B, per ChatGPT. The question tries to mislead by adding information about the feature correlation; k-means clustering is not meant for imputing data, hence the answer should be B - that would be the right way of handling the missing value. Using k-means clustering to handle missing features is not directly applicable to this scenario: k-means clustering is a method for grouping data points into clusters based on similarity, and it is not typically used for imputing missing values. Why B? Replacing the age field value for records with a value of 0 with the mean or median value from the dataset is generally the best approach among the given options. It preserves the dataset size and leverages the remaining correct data points, assuming age is a crucial predictor in this context. However, it is vital to perform this imputation carefully to avoid introducing bias; the median is often preferred in this scenario to mitigate the impact of outliers. The best way to handle the missing values in the patient age feature is to replace them with the mean or median value from the dataset. This is a common technique for imputing missing values that preserves the overall distribution of the data and avoids introducing bias or reducing the sample size. Dropping the records or the feature would mean losing valuable information and reducing the accuracy of the model.
Using k-means clustering would not be appropriate for handling missing values in a single feature, as it is a method for grouping similar data points based on multiple mean or median is for outliers so D Obviously B, why would you use a clustering algorithm to predict a value? D just doesn't make sense B is correct.K-means is unsupervised and used mainly for clustering. KNN would have been more accurate. It can be used to predict a value. since knn is not present i think it is mean median value B is correct or KNN, but dont K means A. NO - unless we want to loose 10% of the data B. NO - age is predictive, so using the means we would introduce a bias C. NO - age is predictive D. YES - better quality than B, it is likely that other physiological values can help predict the age k-means should give the best estimation of the age. Using mean would reduce the correlation between outcome and age for the model. How can it be when there is a labelled outcome, which means this is Supervised and K-Means is for UnSupervised. So only possible answer should be B - https://www.examtopics.com/discussions/amazon/view/10054-exam-aws-certified-machine-learning-specialty-topic-1/
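A minimal sketch of the median-imputation approach (option B); the sample frame is invented, and the median is computed from the valid records only so the invalid zeros do not skew it.

```python
# Sketch: replace the invalid age value (0) with the median of the valid ages.
import pandas as pd

df = pd.DataFrame({
    "age": [72, 68, 0, 81, 0, 77],
    "outcome": [1.2, 0.8, 1.1, 2.0, 0.9, 1.5],
})

median_age = df.loc[df["age"] > 0, "age"].median()   # median of the valid ages only
df["age"] = df["age"].replace(0, median_age)
print(df["age"].tolist())   # [72.0, 68.0, 74.5, 81.0, 74.5, 77.0]
```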
35
35 - A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL. Which storage scheme is MOST adapted to this scenario? - A.. Store datasets as files in Amazon S3. B.. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance. C.. Store datasets as tables in a multi-node Amazon Redshift cluster. D.. Store datasets as global tables in Amazon DynamoDB.
A - Ans: A (S3) is most cost effective A : S3 cost effective + athena ( not c redshift dont support unstructured data) Amazon S3 (Simple Storage Service) is the best choice because it: Scales automatically to store an arbitrary number of datasets. Is cost-effective, as S3 charges only for storage used, unlike provisioned databases. Supports querying datasets with SQL using Amazon Athena. Is highly durable (99.999999999% durability) and optimized for large datasets. How It Works in This Scenario? Store datasets in S3 as files in Parquet, ORC, or CSV format. Use AWS Glue Data Catalog to create table metadata. Query the datasets using Amazon Athena (serverless SQL querying on S3). Automatically scale without worrying about storage limits. 'cost effective' --> AWS S3 A. YES - S3 + Athena/Presto B. NO - no SQL support C. NO - expensive to scale D. NO - DynamoDB is NoSQL AWS S3 + Athena will do it The most appropriate storage scheme for this scenario is option A: Store datasets as files in Amazon S3. Amazon S3 is a highly scalable and cost-effective object storage service that can store a large amount of data. S3 can scale automatically to accommodate a large number of datasets, making it a good option for storing the training data used in machine learning models. Additionally, S3 supports SQL querying through Amazon Athena or Amazon Redshift Spectrum, allowing data scientists to easily explore the data. "store a large amount of training data commonly used in its machine learning models".. well it cannot be anything other than S3. Athena can query S3 cataloged data with SQL commands. Anwser is A Amazon Redshift is not cost-effective. I would say C https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html "For workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse storage capacity without adding and paying for additional nodes." Data warehouse is not needed. For exploring data using SQL, you can use Athena Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against petabytes of structured data using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. Most results come back in seconds. s3 is right A, S3 is most cost effective - https://www.examtopics.com/discussions/amazon/view/10056-exam-aws-certified-machine-learning-specialty-topic-1/
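A short sketch of the "explore with SQL" requirement: querying dataset files stored in S3 through Athena via boto3. The database, table, and bucket names are placeholders, and the table is assumed to have already been cataloged (for example with an AWS Glue crawler).

```python
# Sketch: run an exploratory SQL query against S3-hosted datasets with Athena.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT feature_a, COUNT(*) FROM clicks GROUP BY feature_a",
    QueryExecutionContext={"Database": "training_datasets"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results with this id
```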
36
36 - A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months, the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago. Which method should the Specialist try to improve model performance? - A.. The model needs to be completely re-engineered because it is unable to handle product inventory changes. B.. The model's hyperparameters should be periodically updated to prevent drift. C.. The model should be periodically retrained from scratch using the original data while adding a regularization term to handle product inventory changes D.. The model should be periodically retrained using the original training data plus new data as product inventory changes.
D - Ans: D The model performance degradation over time suggests concept drift—the relationship between input features and the target variable has changed. Since product recommendations depend on customer behavior, preferences, and product inventory, periodic retraining with updated data ensures the model adapts to these changes. Why Periodic Retraining? Customer preferences evolve: Buying patterns change over time due to seasons, trends, and external factors. New products get added, and old ones are discontinued: The model must learn about new items and stop recommending outdated ones. The dataset needs to reflect recent trends: Using new and historical data together ensures the model retains useful past knowledge while learning new patterns. 'retrained using the original training data plus new data' I believe it should be B 1. The model performance has diminished gradually over the past few months, indicating the data distribution may have changed since initial deployment over a year ago. This is a classic sign of concept drift. 2. The model architecture and training procedure have remained unchanged since initial deployment. Updating the hyperparameters is a lighter approach than retraining the model from scratch, and can help prevent further performance deterioration if done periodically to adapt to changes in user preferences and product inventory. Answer is D D is the answer! D is the answer. There has been a data drift resulting from new customer segment visiting the site. So, the model needs to be updated periodically with new data from the website. DDDDD. D :D Incremental training. D. Periodically Re-Fit D agree with D D is correct - https://www.examtopics.com/discussions/amazon/view/10057-exam-aws-certified-machine-learning-specialty-topic-1/
37
37 - A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake. The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of: ✑ Real-time analytics ✑ Interactive analytics of historical data ✑ Clickstream analytics ✑ Product recommendations Which services should the Specialist use? - A.. AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations B.. Amazon Athena as the data catalog: Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for near-real-time data insights; Amazon Kinesis Data Firehose for clickstream analytics; AWS Glue to generate personalized product recommendations C.. AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations D.. Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon DynamoDB streams for clickstream analytics; AWS Glue to generate personalized product recommendations
A - Ans: A seems to be reasonable A looks correct but it is missing for "Interactive analytics of historical data" AWS Glue as data catalog, then you can analyze historical data, such as running sql with Athena. Once you insert real-time data to ES, you can see historical data from Kibana dashboard. but C is missing for "real-time analythics" and also C is saying historical data analytics for Kinesis Data analytics which is real-time analytics not historical, so the answer might not C but the answer is A A. YES - Amazon Kinesis Data Analytics is for real-time data insights B. NO - Amazon Athena has no data catalog C. NO - Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics is not for historical data insights D. NO - Amazon Athena has no data catalog Athena can not be used for data catalog, so B and D are wrong. A and C are equals, but it's well known that Kinesis DS and Analytics are used together for real time solutions, which is mentioned in the question / answer, but lack on C. All are bad options, but A can do it. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to move data between data stores. It can be used as a data catalog to store metadata information about the data in the data lake. Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics can be used together to collect, process, and analyze real-time streaming data. Amazon Kinesis Data Firehose can be used to deliver streaming data to destinations such as Amazon ES for clickstream analytics. Finally, Amazon EMR can be used to run big data frameworks such as Apache Spark and Apache Hadoop to generate personalized product recommendations. A or C https://aws.amazon.com/blogs/big-data/retaining-data-streams-up-to-one-year-with-amazon-kinesis-data-streams/ Athena can do Interactive analytics on Historical data, but here its only use is "Athena as the data catalog" and this is the work of Glue data catalog using its crawlers, so it cannot be B or D. --So its either A or C -- Now Kinesis data Streams/Analytics is know for real time data analytics but if it is reading from data already stored in S3 using DMS then we can say it is getting historical data. -- Here I am not very clear if Kinesis part will happen on incoming data before S3 or After data persists to S3 and Kinesis reads it through S3-->DMS--Kinesis data stream -- Kinesis analytics-->Firehose. But still insights are always on real-time/current data based on historical data trends , so the statement in C "Analytics for historical data insights" is in-correct in general . Hence ANSWER is :A A is correct, for those asking the difference between A and D, D talks about using kinesis stream and data analytics to create historical analysis.... waste of money no? Answer = A A it is it's A, ES can perform clickstream analytics and EMR can handle spark job recomendation at scale Only C and D mention interactive analytics of historical data. Glue won't provide personalised recommendation so it is C What is the difference between the solution in A or C ???? A is real time data analytics with Kinesis Data analytics and C is saying historical data which is wrong Looks like C Amazon ES has KIbana which supports click stream A is Correct - https://www.examtopics.com/discussions/amazon/view/10058-exam-aws-certified-machine-learning-specialty-topic-1/
38
38 - A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture. Which of the following will accomplish this? (Choose two.) - A.. Customize the built-in image classification algorithm to use Inception and use this for model training. B.. Create a support case with the SageMaker team to change the default image classification algorithm to Inception. C.. Bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training. D.. Use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception network, and use this for model training. E.. Download and apt-get install the inception network code into an Amazon EC2 instance and use this instance as a Jupyter notebook in Amazon SageMaker.
CD - You could spend a lot of money on A (customizing the built-in image) or B (raising a support case). The effective options both work through the SageMaker Estimator: C, bundle your own Docker container, or D, bring your own code with the SageMaker TensorFlow Estimator. The answers are C and D.

I will go for C & D. Option A is not possible because the built-in image classification algorithm cannot be customized. Option B is not feasible because the default image classification algorithm cannot be changed through a support case. Option E is not a recommended approach because it involves manually installing software on an EC2 instance rather than using the managed capabilities SageMaker provides.

Note that this question asks for two independent ways, not a set of steps, which can be confusing.

One dissenting vote for A and D: https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html

CD - and A says something similar but in more general terms: https://aws.amazon.com/blogs/machine-learning/transfer-learning-for-custom-labels-using-a-tensorflow-container-and-bring-your-own-algorithm-in-amazon-sagemaker/

C and D are correct: https://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/your-algorithms.html https://aws.amazon.com/tw/blogs/machine-learning/transfer-learning-for-custom-labels-using-a-tensorflow-container-and-bring-your-own-algorithm-in-amazon-sagemaker/ https://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/tf.html - https://www.examtopics.com/discussions/amazon/view/8317-exam-aws-certified-machine-learning-specialty-topic-1/
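As a rough illustration of option D (bring your own training code via the SageMaker TensorFlow estimator), here is a minimal sketch; the entry-point script, IAM role, instance type, framework versions, and S3 path are placeholders, and the script itself is where the Inception network would be defined or loaded.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train_inception.py",   # your script builds/loads an Inception network
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
    hyperparameters={"epochs": 20, "batch_size": 64},
)

estimator.fit({"training": "s3://my-bucket/images/train/"})

Option C differs only in packaging: the same training code is baked into a custom Docker image and passed to a generic Estimator via its image URI.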
39
39 - A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively. How should the Specialist address this issue and what is the reason behind it? - A.. The learning rate should be increased because the optimization process was trapped at a local minimum. B.. The dropout rate at the flatten layer should be increased because the model is not generalized enough. C.. The dimensionality of dense layer next to the flatten layer should be increased because the model is not complex enough. D.. The epoch number should be increased because the optimization process was terminated before it reached the global minimum.
B - Dropout helps prevent overfitting: https://keras.io/layers/core/#dropout The answer should be B.

Agree, it should be B: https://kharshit.github.io/blog/2018/05/04/dropout-prevent-overfitting

Answer is B, 100%. Increasing the dropout rate reduces the effective complexity of the model, which in turn reduces overfitting.

This is clearly B; it is unclear why the answer is marked as D. Regularization aims to bring training and test accuracies closer together; anything else will make the overfitting worse.

A. NO - accuracy on the training set is already high. B. YES - an increased dropout rate reduces model complexity, so less overfitting. C. NO - we want to reduce model complexity, not increase it. D. NO - the model has already converged; increasing the number of epochs only makes the situation worse. Dropout is the right action to take here.

Definitely B, because overfitting comes from a model complex enough to memorize the training data. D makes the problem worse: continuing to train after overfitting has set in only deepens it, and increasing epochs is a remedy for underfitted models, not overfitted ones. Instead, regularize by introducing dropout to generalize the model.

Option B is correct because increasing the dropout rate at the flatten layer helps prevent overfitting by randomly dropping units during training, producing a more robust model that generalizes better to new data. Dropout is a regularization technique that forces the model to learn redundant representations of the data instead of relying too heavily on any single feature.

Overfitting occurs when a model is too complex and memorizes the training data instead of learning the underlying pattern, so it performs well on the training data but poorly on new, unseen data.

The model is overfitting, so go with option B; increasing epochs will only cause more overfitting. It should be answer B. Seen on the 12-Sep exam. - https://www.examtopics.com/discussions/amazon/view/8318-exam-aws-certified-machine-learning-specialty-topic-1/
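A minimal Keras sketch of what answer B describes, adding (or raising) dropout right after the flatten layer; the architecture, rates, and input shape are illustrative only.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of units during training to curb overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])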
40
40 - A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting. Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls. What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps? - A.. Implement an AWS Lambda function to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting. B.. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting. C.. Implement an AWS Lambda function to log Amazon SageMaker API calls to AWS CloudTrail. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting. D.. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Set up Amazon SNS to receive a notification when the model is overfitting
B - The answer should be B. You do not need a Lambda function to integrate with CloudTrail; logging Amazon SageMaker API calls with AWS CloudTrail is built in: https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html

Agreed, B, for the following reasons: CloudTrail logs are captured in S3 without any custom code or Lambda, and custom metrics (in this case a test for overfitting of the MXNet model) can be published to CloudWatch, which sets off an alarm that SNS subscribers receive.

Breakdown of the chosen solution (B): Use AWS CloudTrail to log SageMaker API calls to Amazon S3 - CloudTrail automatically logs all AWS API activity, including SageMaker API calls, for auditing, and S3 stores those logs securely for auditor review. Push custom metrics to Amazon CloudWatch - overfitting can be detected with a custom CloudWatch metric (for example, validation loss increasing while training loss decreases), which the SageMaker training script publishes during training. Create a CloudWatch alarm plus an SNS notification - set an alarm on the overfitting metric (for example, validation loss above a threshold) and use Amazon SNS to send a notification (email, SMS, or a Lambda trigger) when the alarm fires.

Option D uses CloudTrail to log SageMaker API calls to S3 and sets up SNS, which covers the logging requirement, but it provides no mechanism for pushing custom metrics to CloudWatch, which is needed to monitor model performance and detect overfitting. So B is correct. https://docs.aws.amazon.com/fr_fr/sagemaker/latest/dg/training-metrics.html#define-train-metrics

Counter-point for D: SageMaker Debugger detects hardware resource usage issues (such as CPU, GPU, and I/O bottlenecks) and non-convergent model issues (such as overfitting, vanishing gradients, and exploding tensors), so why couldn't the answer be D? It seems to cover all of the requirements, while B adds an extra step of writing push code even though a built-in overfitting metric exists. Reply: that would have been correct had the question not indicated the algorithm is hand-written, meaning it is not a built-in algorithm; for SageMaker to understand your custom algorithm's metrics, it needs a regex definition applied to the training logs to generate those custom metrics, which are then alerted on with CloudWatch alarms and SNS notifications. See https://docs.aws.amazon.com/sagemaker/latest/dg/define-train-metrics.html A custom metric needs to be built and pushed.

A. NO - CloudTrail has built-in SageMaker API call tracking; no Lambda is needed. B. YES - the whole chain works. C. NO - same as A, no Lambda is needed. D. NO - CloudTrail has no SageMaker integration that detects overfitting.

Option B. On "least amount of code and fewest steps": some think it's D, with less code effort, because you can set up an SNS notification triggered by a Debugger built-in action (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-actions.html), and overfitting is a built-in Debugger rule for MXNet (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html). Not that B doesn't work; the question may predate this newer capability. The loss_not_decreasing, overfit, overtraining, and stalled_training rules monitor whether your model is optimizing the loss function without those training issues.
If the rules detect training anomalies, the rule evaluation status changes to IssueFound. You can set up automated actions, such as notifying training issues and stopping training jobs using Amazon CloudWatch Events and AWS Lambda. For more information, see Action on Amazon SageMaker Debugger Rules. https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html It's B. AWS CloudTrail provides a history of AWS API calls made on the account. The Machine Learning team can use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. They can then use CloudWatch to create alarms and receive notifications when the model is overfitting. To ensure auditors can view the Amazon SageMaker log activity report, the team can add code to push a custom metric to Amazon CloudWatch. This provides a single place to view and analyze logs across all the services and resources in the environment. B. cloudwatch + metrics from sagemaker + sns https://docs.aws.amazon.com/fr_fr/sagemaker/latest/dg/training-metrics.html#define-train-metrics B requires the least amount of code and satisfies all conditions What does this line do? "Add code to push a custom metric to Amazon CloudWatch" It creates a metric for overfitting (accuracy of training data and accuracy of test data). Its not B. Why would you use CloudTrail? Having used Lambda for API calls I'm inclined to agree with the original answer, C. https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html Because that is the only job of CloudTrail - to log actions taken on your AWS account. So why need a Lambda function to trigger it? B it is B it is - https://www.examtopics.com/discussions/amazon/view/8319-exam-aws-certified-machine-learning-specialty-topic-1/
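To make the "add code to push a custom metric" step in option B concrete, here is a hedged boto3 sketch; the namespace, metric name, threshold, and SNS topic ARN are placeholders, and the put_metric_data call would sit inside the training/evaluation loop.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Inside the training loop: publish the current validation loss as a custom metric.
cloudwatch.put_metric_data(
    Namespace="MXNetTraining",
    MetricData=[{"MetricName": "ValidationLoss", "Value": 0.42, "Unit": "None"}],
)

# One-time setup: alarm that notifies an SNS topic when validation loss stays above a threshold.
cloudwatch.put_metric_alarm(
    AlarmName="mxnet-overfitting",
    Namespace="MXNetTraining",
    MetricName="ValidationLoss",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=0.5,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:overfit-alerts"],
)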
41
41 - A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis, the Specialist observes that many features are highly correlated with each other. This may make the model unstable. What should be done to reduce the impact of having such a large number of features? - A.. Perform one-hot encoding on highly correlated features. B.. Use matrix multiplication on highly correlated features. C.. Create a new feature space using principal component analysis (PCA) D.. Apply the Pearson correlation coefficient.
C - C is correct. You want to reduce the number of features/dimensions, so PCA is the answer. C is the way.

C is correct; D could only be correct if the correlation coefficient were then used to omit features. PCA and t-SNE are the standard tools for the curse of dimensionality mentioned here.

One objection: PCA is an unsupervised technique, while the scenario looks like supervised learning with data (x, y). Reply: that is not a problem - PCA is applied as preprocessing: data (x, y) -> PCA -> preprocessed data (x', y) -> learning. Why not use it for supervised learning?

Tricky. The sentence "many features are highly correlated with each other" doesn't change anything: it's C, since PCA removes exactly that correlation. Read the question carefully - "What should be done to reduce the impact of having such a large number of features?" - and the only answer that comes to mind is PCA.

Of course, it's PCA. PCA is the solution, so the answer is C - https://www.examtopics.com/discussions/amazon/view/11851-exam-aws-certified-machine-learning-specialty-topic-1/
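A minimal scikit-learn sketch of answer C, projecting the correlated features onto a smaller set of principal components before fitting the linear model; the random feature matrix is a stand-in for the real data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 40)                   # stand-in for the original, correlated features

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                  # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())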
42
42 - A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes. Which prior probability distribution should the ML Specialist use for this variable? - A.. Poisson distribution B.. Uniform distribution C.. Normal distribution D.. Binomial distribution
A - A If you have information about the average (mean) number of things that happen in some given time period / interval,Poisson distribution can give you a way to predict the odds of getting some other value on a given future day Ans: A https://brilliant.org/wiki/poisson-distribution/ ChatGPT's answer: B Explanation The problem describes a random variable representing the waiting time for a bus, where buses arrive every 10 minutes, and the mean waiting time is 3 minutes. In such a periodic arrival process, the waiting time follows a Uniform Distribution because: - Any given person’s waiting time is equally likely to be any value between 0 and 10 minutes. - There is no clustering around a particular value—every moment within the cycle is equally probable. Thus, the waiting time follows a Uniform(0, 10) distribution. Why Not the Other Options? (A) Poisson Distribution Poisson is used for counting discrete events over a fixed period (e.g., number of buses arriving per hour). Since waiting time is continuous, Poisson is not appropriate. (C) Normal Distribution Normal (Gaussian) distribution assumes values cluster around the mean and extend infinitely. Here, waiting time is evenly spread between 0–10 minutes, not forming a bell curve. (D) Binomial Distribution Binomial is used for counting successes in a fixed number of trials (e.g., flipping a coin multiple times). Waiting time is continuous, not a count of discrete occurrences. Buses cycle every 10 minutes, and waiting time can be modeled as a uniform random variable between [0, 10] minutes. The average waiting time of 3 minutes suggests that waiting is uniformly distributed, not event-based like Poisson. If buses arrive every 10 minutes and riders arrive randomly, the waiting time follows a Uniform Distribution (B) because: The arrival process is regular (every 10 minutes). There’s no stochastic randomness in the bus arrival schedule, ruling out Poisson. Poisson would apply if buses arrived randomly at an average rate rather than at fixed intervals. B Poisson is suitable for modeling the number of events (like buses arriving) in a fixed time frame, not the time between events when the events occur at regular intervals. The waiting time variable is not about the count of buses but rather the time to the next bus, which is evenly distributed. A is correct Poisson distribution is discrete, and gives the number of events that occur in a given time interval A. YES - Poisson distribution is discrete, and gives the number of events that occur in a given time interval B. NO - Uniform distribution is continuous, we want discrete C. NO - Normal distribution is continuous we want discrete D. NO - Binomial distribution give the probability that a random variable is A or B (possibly in with different weight) Option A indeed ANSWER IS A https://www.investopedia.com/terms/d/discrete-distribution.asp The Poisson distribution is commonly used for count data, which is the case here as we are interested in the number of minutes New Yorkers wait for a bus. The Poisson distribution is characterized by a single parameter, lambda, which represents the mean and variance of the distribution. In this case, the mean is 3 minutes, so we would set lambda to 3. The Poisson distribution assumes that events occur independently of each other, which is a reasonable assumption in this case since the waiting time for each individual is likely to be independent of the waiting time for others. 
The Poisson distribution is a discrete probability distribution commonly used to model the number of events that occur in a fixed interval of time, given an average rate of occurrence. Since the buses cycle every 10 minutes and the mean wait is 3 minutes, it is reasonable to model the number of minutes New Yorkers wait for a bus with a Poisson distribution.

100% A: the variable is discrete, and the binomial distribution requires binary outcomes (success or failure). A is a discrete distribution; I choose Poisson. Seen on the 12-Sep exam. Answer is A - for these kinds of count scenarios (arrivals, footfalls, etc.) the answer is almost always the Poisson distribution. - https://www.examtopics.com/discussions/amazon/view/8339-exam-aws-certified-machine-learning-specialty-topic-1/
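For illustration of the voted answer, a tiny SciPy sketch of a discrete Poisson prior with mean 3, evaluated over the 0-10 minute cycle; this simply shows what the distribution named in answer A looks like and does not settle the Poisson-versus-uniform debate above.

from scipy.stats import poisson

mu = 3                      # mean waiting time in minutes
for k in range(11):         # buses cycle every 10 minutes
    print(k, round(poisson.pmf(k, mu), 4))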
43
43 - A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy. The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network. How should the Data Science team configure the notebook instance placement to meet these requirements? - A.. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Place the Amazon SageMaker endpoint and S3 buckets within the same VPC. B.. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Use IAM policies to grant access to Amazon S3 and Amazon SageMaker. C.. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it. D.. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has a NAT gateway and an associated security group allowing only outbound connections to Amazon S3 and Amazon SageMaker.
C - A NAT gateway can still reach out to the internet and pull back something malicious, so D is not a good answer; the safe choice is C: associate the notebook with SageMaker VPC endpoints and an S3 VPC endpoint.

C is correct. We must use VPC endpoints (either gateway or interface endpoints) to satisfy the requirement that "data communication traffic must stay within the AWS network". https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html

A. NO - you don't place an S3 bucket in a VPC; buckets always live in the AWS service account. B. NO - without an S3 VPC endpoint, traffic goes over the internet. C. YES - we need endpoints for both SageMaker and S3 to keep traffic off the internet. D. NO - for the same reason, endpoints rather than a NAT gateway are required.

Option C. A is also not quite right, because two different VPCs can already communicate inside the AWS network (just not in an optimized way). With C, the notebook instance sits in a private subnet, data traffic stays within the AWS network through the S3 and SageMaker VPC endpoints, and the VPC has no internet access, which further reduces the security risk.

"...and data communication traffic must stay within the AWS network" - that phrase rules out D. The security team does not want internet access, and option D's NAT gateway still reaches the internet. Connecting S3 and the SageMaker instance via VPC endpoints is the best way to secure the resources. Using a NAT gateway is the old way; option C is the current approach: https://cloudacademy.com/blog/vpc-endpoint-for-amazon-s3/#:~:text=Accessing%20S3%20the%20old%20way%20%28without%20VPC%20Endpoint%29,has%20no%20access%20to%20any%20outside%20public%20resources With a NAT gateway you can put the instances in a private subnet and the NAT itself in a public subnet, but access to S3 still traverses the internet, so the answer cannot be D. C is the only correct option: an S3 VPC endpoint exists precisely to route traffic from the VPC to S3 without going over the internet.

D is only applicable in the case the docs describe: "If your model needs access to an AWS service that doesn't support interface VPC endpoints or to a resource outside of AWS, create a NAT gateway and configure your security groups to allow outbound connections." https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html

Minority view: D is correct; read the third paragraph of https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html - NAT is how a VPC reaches the internet and other AWS services when the VPC itself has no internet access. The rebuttal: the stated concern is that "internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy", and a NAT gateway does not mitigate that risk. A VPC endpoint and a NAT gateway look similar, but a NAT gateway lets resources in the VPC initiate connections to the internet, whereas a VPC endpoint only reaches other AWS services, which is the best fit for this question.

From the documentation: if you configure your VPC so that it doesn't have internet access, models that use that VPC cannot reach resources outside the VPC. If your model needs such access, either create an interface VPC endpoint for services that support them (see "VPC Endpoints" and "Interface VPC Endpoints (AWS PrivateLink)" in the Amazon VPC User Guide), or, for services that don't support interface endpoints or for resources outside AWS, create a NAT gateway and configure your security groups to allow outbound connections (see "Scenario 2: VPC with Public and Private Subnets (NAT)" in the Amazon Virtual Private Cloud User Guide).

What is the difference between A and C, and are both OK? No - it is not enough for SageMaker to communicate with S3 just because both sit "inside the same VPC". A SageMaker notebook inside a VPC needs an endpoint to connect to other AWS services, and those services must expose endpoints too. - https://www.examtopics.com/discussions/amazon/view/8343-exam-aws-certified-machine-learning-specialty-topic-1/
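A hedged boto3 sketch of the endpoints answer C calls for: a gateway endpoint for S3 and an interface endpoint (PrivateLink) for the SageMaker API. All IDs, the region, and the subnet/security-group values are placeholders; the SageMaker Runtime and notebook endpoints can be created the same way.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint so traffic to S3 never leaves the AWS network.
ec2.create_vpc_endpoint(
    VpcId="vpc-0abc1234",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0def5678"],
)

# Interface endpoint (PrivateLink) for the SageMaker API.
ec2.create_vpc_endpoint(
    VpcId="vpc-0abc1234",
    ServiceName="com.amazonaws.us-east-1.sagemaker.api",
    VpcEndpointType="Interface",
    SubnetIds=["subnet-0aaa1111"],
    SecurityGroupIds=["sg-0bbb2222"],
    PrivateDnsEnabled=True,
)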
44
44 - A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data. Which of the following methods should the Specialist consider using to correct this? (Choose three.) - A.. Decrease regularization. B.. Increase regularization. C.. Increase dropout. D.. Decrease dropout. E.. Increase feature combinations. F.. Decrease feature combinations.
BCF - Yes, the answer is BCF. Go for BCF.

The main point of debate is the definition of "feature combinations". If it means "combine the features to produce a smaller but more effective feature set", the feature count shrinks, which helps with overfitting. If it means "combine the features to generate additional features", the feature set grows, which makes overfitting worse. In some models the feature combinations are implemented inside the model itself (hidden layers in a feed-forward network), so increasing them increases model complexity, which is bad for overfitting. The question is poorly worded, but F is the better guess: decreasing feature combinations decreases complexity and therefore helps with the overfitting issue.

Great call-out - what exactly "feature combination" does has not been elaborated. It could mean optimizing features with PCA or t-SNE, which is good for addressing overfitting and should be done, or it could mean creating additional features via a Cartesian product, which aids overfitting and should not be done. Clearer wording would remove the ambiguity.

About option E: the goal of increasing feature combinations is not to add features indiscriminately, which could indeed lead to overfitting, but to select and combine features so they capture important patterns and relationships; done well, this can improve generalization. However, the AWS guidance is explicit: if your model is overfitting the training data, reduce model flexibility - "Feature selection: consider using fewer feature combinations, decrease n-grams size, and decrease the number of numeric attribute bins. Increase the amount of regularization used." https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

The best choices are B (increase regularization), C (increase dropout), and F (decrease feature combinations), as these techniques reduce overfitting and improve the model's ability to generalize to new data.

Dissenting vote for BCE: the model has memorized the training data, and one could either add features or remove features to increase bias; in deep learning, increasing the feature set may be more workable.

B-C-F: all three reduce model complexity and thus the overfit. Increasing regularization adds a penalty term to the loss function that discourages the model from learning the noise in the data. Increasing dropout randomly drops some neurons during training, forcing the model to learn more robust representations that do not depend on any single neuron. Decreasing the number of feature combinations simplifies the model, making it less likely to overfit.

Another reading of F: it only says decrease "feature combinations", not the features themselves. Decreasing feature combinations could mean less feature engineering and therefore more raw features, which would cause more overfitting; under that reading the answer would be BCE. The counter-arguments: too many irrelevant features can influence the model by drowning the signal in noise, and increasing the number of feature combinations can sometimes help an underfitting model but is not a solution to overfitting.

BCF - always remember: in case of overfitting, reduce features, add regularization, and increase dropout.

BCE argument: the main objective of PCA (a feature-combination technique) is to simplify the features into fewer components, which also reduces the chance of overfitting by eliminating highly correlated features. https://towardsdatascience.com/dealing-with-highly-dimensional-data-using-principal-component-analysis-pca-fea1ca817fe6 The counter-argument is that the AWS documentation explicitly lists reducing feature combinations as a way to prevent overfitting: https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

It's B C F. B/C/F, easy. BCF 100%. BCF - F is explained in the AWS document quoted above. - https://www.examtopics.com/discussions/amazon/view/8348-exam-aws-certified-machine-learning-specialty-topic-1/
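A short Keras sketch of options B and C from this question (stronger L2 regularization plus dropout); the layer sizes, coefficients, and input shape are illustrative only.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(
        256,
        activation="relu",
        kernel_regularizer=regularizers.l2(1e-3),  # B: increase regularization
        input_shape=(100,),
    ),
    layers.Dropout(0.5),                           # C: increase dropout
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])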
45
45 - A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data. The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards. Which solution should the Data Scientist build to satisfy the requirements? - A.. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector. B.. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector. C.. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database. D.. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
A - Kinesis Data Analytics cannot output Parquet, there is no need to stage the raw JSON in S3 first, and RDS is not a serverless ingestion and analytics solution, so the answer is A.

I think it should be A; please check https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

Amazon Kinesis Data Firehose ingests real-time data with automatic buffering and supports built-in transformation to Apache Parquet/ORC before writing to Amazon S3, with minimal code and infrastructure. The AWS Glue Data Catalog holds the schema for structured querying and lets Athena query the data directly in S3. Amazon Athena provides serverless SQL querying on S3-based datasets and connects to BI tools (Tableau, QuickSight) via JDBC. So A: create a schema in the Glue Data Catalog, use Firehose to buffer and transform the streaming JSON into a columnar format, deliver to S3, and have the Analysts query it with Athena and connect dashboards through the Athena JDBC connector. This solution is serverless, handles high-velocity streams, supports SQL queries, and connects to BI tools while remaining highly available.

A. YES - a catalog schema is needed to produce Parquet (https://docs.aws.amazon.com/firehose/latest/APIReference/API_SchemaConfiguration.html). B. NO - no need for an extra staging step. C. NO - no need for an extra staging step. D. NO - a catalog is still needed.

A is correct. For those selecting B: how exactly does the JSON get written to S3 in the first place? That is not specified, so the solution is incomplete.

This solution leverages AWS Glue to create a schema of the incoming format, which lets Firehose buffer and convert the records to a query-optimized, columnar format without data loss; the data is stored in S3, which is highly available, and the Analysts query it directly with Athena and connect BI tools via the JDBC connector - a serverless, scalable, and cost-effective solution for real-time streaming ingestion and analytics. Since the requirement is to buffer and convert the data, A is the only option that fulfills it.

Why is AWS Glue needed at all, given that Firehose can convert JSON to Parquet directly (https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html)? Because Firehose requires a schema to determine how to interpret the data: you create the schema in the AWS Glue Data Catalog and Firehose references it during conversion, and Athena also needs that schema to run SQL over the S3 objects. When you set up ingestion with Firehose you can have it generate the table and create the Glue schema automatically.

One objection is that Kinesis/Firehose is near-real-time rather than real-time, but the difference is largely semantic (a 60-second buffer), so the answer is still A.
The fact that the data comes through kinesis data streams (real time) is implied as the only valid input to firehose. Mind you, "the ingestion process must buffer and transform incoming records from JSON to a query-optimized, columnar format" That is exactly what kinesis firehose does. "Kinesis Data Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3." See link: https://aws.amazon.com/kinesis/data-firehose/faqs/#:~:text=Kinesis%20Data%20Firehose%20buffers%20incoming,data%20delivery%20to%20Amazon%20S3. Data Firehose is always Near Real Time not Real Time. The prompt clearly states that process must be done in Real Time. Why A? Firehose is near real-time, and not real-time which is a requirement There is no requirement for real time processing. It says the data is in real time but the processing of that data should buffer ANSWER is A -- and every statement in it is accurate. Firehose does integrate with GLue data catalog and it also "Buffers" the data . "When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply." This is achived by glue data catalog and athena and it works on real-time data ingest.See link below. https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/ - https://www.examtopics.com/discussions/amazon/view/8351-exam-aws-certified-machine-learning-specialty-topic-1/
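A hedged boto3 sketch of the Firehose piece of answer A, enabling record-format conversion from JSON to Parquet against a Glue Data Catalog schema; all ARNs, names, the database/table, and buffer settings are placeholders, and the exact request shape should be checked against the Firehose API reference.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="json-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::analytics-lake",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
                "DatabaseName": "clickstream_db",  # Glue Data Catalog database holding the schema
                "TableName": "events",
                "Region": "us-east-1",
            },
        },
    },
)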
46
46 - An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset? - A.. Listwise deletion B.. Last observation carried forward C.. Multiple imputation D.. Mean substitution
C - C looks correct, since multiple imputation can reconstruct the missing values from the related columns mentioned in the question. Multiple Imputation by Chained Equations (MICE) is, as a rule of thumb, the best answer for this kind of question.

Why not D? Mean substitution does not account for relationships between variables; in this scenario other columns are believed to help reconstruct the missing data, and using only the mean of the incomplete column ignores that inter-column relationship. Mean substitution also typically assumes the data is Missing Completely at Random (MCAR); in reality data may be missing for a reason related to other observed variables, and mean substitution then introduces bias.

A. NO - listwise deletion just drops rows. B. NO - last observation carried forward does not reconstruct the data from other fields. C. YES - by definition. D. NO - mean substitution does not reconstruct the data from other fields. MICE is the algorithm to choose here.

Option C. Multiple imputation is a statistical technique for handling missing data that generates multiple completed versions of the dataset, with missing values filled in, and then combines the results. It takes the relationships between variables into account, using statistical models to predict missing values from the information in the other columns, which preserves the integrity of the dataset by avoiding bias or systematic error.

Why isn't mean substitution the solution, given that imputation with the mean is common when the missing data is random? Because the mean is limited to the incomplete column itself, whereas the requirement is to impute from the other columns; and replacing 30% of a column's values with its mean is likely to bias that variable.

If the task is handling missing data, imputation is the tool. Answer is C, 100%. https://www.countants.com/blogs/heres-how-you-can-configure-automatic-imputation-of-missing-data/ A common strategy is to replace missing values with the mean or median, but it is important to understand your data before choosing a replacement strategy: https://docs.aws.amazon.com/machine-learning/latest/dg/feature-processing.html - https://www.examtopics.com/discussions/amazon/view/10080-exam-aws-certified-machine-learning-specialty-topic-1/
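A minimal scikit-learn sketch of the idea behind answer C: IterativeImputer models each incomplete column as a function of the other columns (MICE-style), and running it several times with sample_posterior=True approximates true multiple imputation. The toy array is a placeholder.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([
    [25.0, 3.0, 120.0],
    [40.0, np.nan, 180.0],
    [33.0, 5.0, np.nan],
    [52.0, 2.0, 210.0],
])

imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)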
47
47 - A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet. How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances? - A.. Create a NAT gateway within the corporate VPC. B.. Route Amazon SageMaker traffic through an on-premises network. C.. Create Amazon SageMaker VPC interface endpoints within the corporate VPC. D.. Create VPC peering with Amazon VPC hosting Amazon SageMaker.
C - A NAT gateway still goes out to the internet, so it cannot prevent something malicious being downloaded; the right answer is C, an interface VPC endpoint. https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf (p. 516) https://docs.aws.amazon.com/zh_tw/vpc/latest/userguide/vpc-endpoints.html

One commenter was unsure whether C fits this scenario: page 202 of the SageMaker guide (https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf) says that without internet access you can't train or host models from notebooks on the instance unless your VPC has a NAT gateway and your security group allows outbound connections. So there are two possible solutions, but the safer and easier one is VPC endpoints: you connect to the notebook instance through an interface endpoint in your VPC instead of over the internet, and communication between your VPC and the notebook instance stays entirely and securely within the AWS network. The lack of public internet is not a problem, because SageMaker notebook instances support Amazon VPC interface endpoints powered by AWS PrivateLink, where each endpoint is represented by one or more elastic network interfaces (ENIs) with private IP addresses in your VPC subnets. So the answer is C (one commenter suggested A might be right).

C is correct: "The VPC interface endpoint connects your VPC directly to the Amazon SageMaker API or Runtime without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection." https://docs.aws.amazon.com/sagemaker/latest/dg/interface-vpc-endpoint.html

Explanation: the company's data security policy does not allow internet access, so the solution must let SageMaker function privately within the VPC. VPC interface endpoints (AWS PrivateLink) for SageMaker allow services to communicate privately over the AWS network without requiring an internet gateway (IGW) or a NAT gateway.

The answer is C. If you want to allow internet access, you must use a NAT gateway with access to the internet, for example through an internet gateway. If you don't want to allow internet access, create interface VPC endpoints (AWS PrivateLink) so Studio Classic can reach the required services, and associate your VPC's security groups with those endpoints. This is exactly what the reference documentation says (see "Security and Permissions", pp. 1120-1121): https://docs.aws.amazon.com/pdfs/sagemaker/latest/dg/sagemaker-dg.pdf C.
To enable Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances, while adhering to a corporate data security policy that restricts internet communication, the company can: C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC. This option involves setting up VPC (Virtual Private Cloud) interface endpoints for Amazon SageMaker within the corporate VPC (Virtual Private Cloud). This is done using AWS PrivateLink, which allows private connectivity between AWS services using private IP addresses. By creating VPC interface endpoints, the traffic between the corporate VPC and Amazon SageMaker does not traverse the public internet, thereby meeting the corporate data security requirements. A would allow instances in a private subnet to initiate outbound internet traffic. This is against the requirement of no direct internet access. NAT means data will go to internet. C is the right choice. Option c Only C, endpoints. C is correct, NAT allow outband traffic pass through internet. To prevent SageMaker from providing internet access to your Studio notebooks, you can disable internet access by specifying the VPC only network access type when you the onboard to Studio or call CreateDomain API. As a result, you won't be able to run a Studio notebook unless your VPC has an interface endpoint to the SageMaker API and runtime, or a NAT gateway with internet access, and your security groups allow outbound connections. To disable direct internet access, under Direct Internet access, simply choose Disable – use VPC only , and select the Create notebook instance button at the bottom. You are ready to go. from: https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/#:~:text=To%20disable%20direct%20internet%20access%2C%20under%20Direct%20Internet%20access%2C%20simply,running%2C%20without%20direct%20internet%20access. If you want to allow internet access, you must use a example through an internet gateway. If you don't want to allow internet access, NAT gateway with access to the internet, for create interface VPC endpoints (AWS PrivateLink) to allow Studio to access the following services with the corresponding service names. You must also associate the security groups for your VPC with these endpoints. A VPC interface endpoint is a private connection between a VPC and Amazon SageMaker that is powered by AWS PrivateLink. With a VPC interface endpoint, traffic between the VPC and Amazon SageMaker never leaves the Amazon network. Page 3438 of https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf VPC Interface endpoints If the question just had the last sentence, the answer would be A or C, per this page:https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html. "To disable direct internet access, you can specify a VPC for your notebook instance. By doing so, you prevent SageMaker from providing internet access to your notebook instance. As a result, the notebook instance won't be able to train or host models unless your VPC has an interface endpoint (PrivateLink) or a NAT gateway, and your security groups allow outbound connections." HOWEVER, the question has more context that internet access is not allowed by the corporate policy. 
("When you use a VPC interface endpoint, communication between your VPC and the notebook instance is conducted entirely and securely within the AWS network.") Therefore, the answer must be ONLY C. Answer is C. From https://docs.aws.amazon.com/sagemaker/latest/dg/interface-vpc-endpoint.html -> "The VPC interface endpoint connects your VPC directly to the SageMaker API or Runtime without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. The instances in your VPC don't need public IP addresses to communicate with the SageMaker API or Runtime." I see a lot of people employing pretzel logic to try to explain why they should be using NAT. The question states no internet communication. Period. No internet means no NAT. Answer is C. - https://www.examtopics.com/discussions/amazon/view/8370-exam-aws-certified-machine-learning-specialty-topic-1/
48
48 - A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models. What should the Specialist do to initialize the model to re-train it with the custom data? - A.. Initialize the model with random weights in all layers including the last fully connected layer. B.. Initialize the model with pre-trained weights in all layers and replace the last fully connected layer. C.. Initialize the model with random weights in all layers and replace the last fully connected layer. D.. Initialize the model with pre-trained weights in all layers including the last fully connected layer.
B - Answer B sounds correct. In transfer learning, a pre-trained model is used as a starting point for a new task, typically with a smaller dataset. The pre-trained weights were learned from a large amount of data on a related task and can be leveraged to train the new model more efficiently. To re-train with the custom data, the Specialist should initialize the model with the pre-trained weights in all layers, since they provide a good starting point, and replace the last fully connected layer, which makes the final predictions and must be adapted to the new set of classes. Keeping the pre-trained weights in the other layers lets the model reuse the knowledge learned on the previous task and can speed up training.

Explanation: the Specialist wants to fine-tune an existing model trained on general object images for vehicle make and model classification. The best approach is to use the pre-trained weights for feature extraction, replace the last fully connected (FC) layer so it matches the number of vehicle classes, and fine-tune the new model on the vehicle dataset. Why this works: training time is lower because the model has already learned useful features from general objects (edges, shapes); accuracy improves because transfer learning leverages knowledge from large datasets such as ImageNet; and reusing the pre-trained weights preserves the learned low- and mid-level features while adapting the last layer to the new classes, avoiding catastrophic forgetting.

One comment argues differently: transfer learning accelerates training, and at this point the model has not yet learned from the new data, so all layers, including the fully connected one, should be initialized (training will update the fully connected layer anyway); since the question is about initialization, the fully connected layers should be initialized too.

A. NO - random weights do not allow transfer learning. B. YES - the last layer produces the final classes, and we want new classes. C. NO - random weights do not allow transfer learning. D. NO - the last layer produces the final classes, and we want new classes.

Option B. For transfer learning, A and C are incorrect because they restart the model from scratch. The correct answer is B: fine-tuning a model means reusing the previously trained weights and biases, and whichever transfer-learning strategy you choose (fine-tuning or feature extraction), you always replace the last layer or last few layers.

Dissenting view for C: the task is to "re-train it with the custom data", which would mean it is no longer transfer learning - "transfer learning" is just there to make the question tricky - so the weights should be randomized and the whole model retrained from scratch on the custom images only. Reply: "retraining" here refers to the training on the custom data that the Specialist intends to run on top of the pre-trained model, so transfer learning still applies.

The fully connected layer needs to be trained from scratch to capture the features of this domain (car makes and models). Seen on the 12-Sep exam.

I will go with B: we are mainly concerned with the output layer producing the desired results, hence we need to replace it. B is correct. One more nuance: strictly speaking it should be B with the exception that the top 20-40% of layers may also be retrained - that is the classic transfer learning setup - so B is the answer here. - https://www.examtopics.com/discussions/amazon/view/10081-exam-aws-certified-machine-learning-specialty-topic-1/
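A minimal Keras sketch of answer B, keeping the pre-trained weights and swapping only the final classification layer; InceptionV3/ImageNet and the class count are illustrative stand-ins for "an existing model trained on general objects".

import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                     # freeze the pre-trained feature extractor (at least initially)

num_vehicle_classes = 196                  # placeholder for the number of make/model labels
outputs = tf.keras.layers.Dense(num_vehicle_classes, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])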
49
49 - An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES. The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time Which solution should the agency consider? - A.. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees, and alert when non-employees are detected. B.. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Image to detect faces from a collection of known employees and alert when non-employees are detected. C.. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection on each stream, and alert when non-employees are detected. D.. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, run an AWS Lambda function to capture image fragments and then call Amazon Rekognition Image to detect faces from a collection of known employees, and alert when non-employees are detected.
A - Answer is "A". C and D are out as DeepLens is not offered as a commercial product. It is purely for developers to experiment with. From https://aws.amazon.com/deeplens/device-terms-of-use/ " (i) you may use the AWS DeepLens Device for personal, educational, evaluation, development, and testing purposes, and not to process your production workloads;" A is correct as it's will analyse live video streams instead of images. From https://aws.amazon.com/rekognition/video-features/ "Amazon Rekognition Video can identify known people in a video by searching against a private repository of face images. " Agree as well, besides that: (D) uses Rekognition with Image mode, which is wrong for this case. Agreed Why not A? DeepLens is for development purpose and much more expensive than just a camera. They are referring to 1000 camera in production scale? C is the correct answer. We could use A, since it is for security service, DeepLens allows to notify the security (through aws lamda) immediately when it sees non employee at the office location. So C is more appropriate for the problem than A. DeepLens is for developers only, it is not available as a commercial product. A bit off topic but yeah, how could you justify using deep lens for production. Cameras have viewing angles, weather proofing, network connectivity issues (Wifi only), infra red for low lighting conditions, no power over ethernet? Using Deeplens would be laughable for a full production system. Ans - A -- Proxy Server + Kinesis Video Streams + Rekognition Video The goal is to scale from 100 cameras to thousands and perform real-time detection of non-employees in office locations globally. The best approach is to use Amazon Kinesis Video Streams + Amazon Rekognition Video for real-time face detection. The correct answer is D. Very tricky one but re-read the 2nd sentence in the question; “Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES.” So, we have ‘images’ as training data, not videos. This is why it can not be option C - where it says to use Amazon Recognition Video. The only option mentioning Amazon Recognition Image is the option D. Also check: https://docs.aws.amazon.com/rekognition/latest/dg/what-is.html “...For example, each time a person arrives at your residence, your door camera can upload a photo of the visitor to Amazon S3. This triggers a Lambda function that uses Amazon Rekognition API operations to identify your guest. You can run analysis directly on images that are stored in Amazon S3 without having to load or move the data.” A is answer A not B: Use Amazon Rekognition Video instead of Amazon Rekognition Image in this case. A is correct! DeepLens is overkill for mass systems A. NO - thousands of cameras would choke network bandwidth B. NO - thousands of cameras would choke network bandwidth C. YES - DeepLens is made for edge computing; it might be EOL / Not commercially available, but if they did not want you to use DeepLens the question would not have come in the first place D. NO - use Amazon Rekognition Video directly instead of Amazon Rekognition Image Why A and not B? Can someone please explain it? Option A From Chat GPT The solution that the agency should consider is option A: Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. 
On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees and alert when non-employees are detected. By using a proxy server at each local office and streaming the RTSP feed to individual Amazon Kinesis Video Streams video streams, the agency can efficiently handle the large number of video cameras in different office locations. Using Amazon Rekognition Video, the agency can create a stream processor to detect faces from a collection of known employees. This allows for real-time identification of non-employees based on facial recognition. Alerts can then be generated when non-employees are detected, ensuring that the agency is able to identify and respond to potential security threats in real-time. I initially thought it is C but looks like A makes more sense here. The DeepLens Service will reach EOL at the end of Jan 2024, so more than likely that this question will not be asked in the exam D is the answer now, DeepLens is used for situations like this! Maybe, its EOL Jan 2024 Think big picture - you tested something (let say code python) and ready to implement into prod will you move python code or java code! Here in this particular case, they tested with actual video camera and they did not say deeplense so answer is A! For knowledge sake if they say in real exam it is tested with deeplense ---then ideal solution should be model inference happening at deeplense itself with search against existing employees and send back model inference when it detect new faces who are not employees back to cloud may be S3 Answer is "A". As mentioned in below user comment, DeepLens is not offered as a commercial product. https://aws.amazon.com/deeplens/device-terms-of-use/ - https://www.examtopics.com/discussions/amazon/view/8374-exam-aws-certified-machine-learning-specialty-topic-1/
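For reference, a minimal boto3 sketch of the option A pattern: a Rekognition Video stream processor that searches faces from a Kinesis Video Stream against an employee face collection and writes matches to a Kinesis Data Stream. All ARNs, the collection ID, and the threshold below are placeholders, not values from the question.

```python
import boto3

rekognition = boto3.client("rekognition")

# Placeholder ARNs / names -- one stream processor would be created per camera stream.
rekognition.create_stream_processor(
    Name="office-cam-0001",
    Input={"KinesisVideoStream": {"Arn": "arn:aws:kinesisvideo:us-east-1:123456789012:stream/cam-0001/1"}},
    Output={"KinesisDataStream": {"Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/face-matches"}},
    RoleArn="arn:aws:iam::123456789012:role/RekognitionStreamProcessorRole",
    Settings={"FaceSearch": {"CollectionId": "known-employees", "FaceMatchThreshold": 85.0}},
)
rekognition.start_stream_processor(Name="office-cam-0001")

# Downstream, a consumer of the "face-matches" data stream raises an alert whenever a
# detected face has no match in the known-employees collection.
```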
50
50 - A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. Currently, the company has the following data in Amazon Aurora: ✑ Profiles for all past and existing customers ✑ Profiles for all past and existing insured pets ✑ Policy-level information ✑ Premiums received ✑ Claims paid What steps should be taken to implement a machine learning model to identify potential new customers on social media? - A.. Use regression on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media B.. Use clustering on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media C.. Use a recommendation engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media. D.. Use a decision tree classifier engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
B - All of the questions in the preceding examples rely on having example data that includes answers. There are times that you don't need, or can't get, example data with answers. This is true for problems whose answers identify groups. For example: "I want to group current and prospective customers into 10 groups based on their attributes. How should I group them? " You might choose to send the mailing to customers in the group that has the highest percentage of current customers. That is, prospective customers that most resemble current customers based on the same set of attributes. For this type of question, Amazon SageMaker provides the K-Means Algorithm. https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html Clustering algorithms are unsupervised. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used. https://docs.aws.amazon.com/sagemaker/latest/dg/algo-kmeans-tech-notes.html THE ANSWER COULD BE B.clustering on customer profile data to understand key characteristic Yes, Clustering seems to be more appropriate in this scenario than recommender system Collaborative filtering recommendation system is also unsupervised https://towardsdatascience.com/customer-segmentation-with-machine-learning-a0ac8c3d4d84 B Option C. This is not purely unsupervised, as clustering would be, because we have current and past customer profiles to go on. We want to find new customers by finding similar profiles on social media. So it is supervised to some extent. It's not a cluster problem; it is user-user collaborative filtering. The key is to recognize that this is not clustering. You're not blindly trying to group people. You have existing profiles that you are comparing them to. 'B' is correct It is B. Recommendation Engines: Traditionally focus on suggesting products/services to existing customers based on past behavior. Clustering is right C would be an answer if wanted to send the promo to the existing customers. But we want to find potential customers. And we can do it only by comparing existing customers with potential customers. It can be done by creating clusters of existing customers and measuring the distance to those clusters for the new potential users. So my answer is B A. NO - Linear Regression not best to understand relationships between data B. NO - it is supervised (we know premiums received vs. claims paid, so can assign users to GOOD or BAD), so no clustering C. YES - A recommendation engine in AWS lingua is Amazing Recommender (https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html - "Creating a targeted marketing campaign") and can create user segments D. NO - not as good as C B for me Recommendation engines is perfect for customers we have, but for implementing a machine learning model to identify potential (new customers on social media) this requires clustering and segmentation. https://neptune.ai/blog/customer-segmentation-using-machine-learning Based on the link below, it must be C https://medium.com/voice-tech-podcast/a-simple-way-to-explain-the-recommendation-engine-in-ai-d1a609f59d97 We are divided, but I stick with B. I think it should be c recommender system would help here, as we already have details of all customers it should be C - recommender system would be better fit here. We should use recommendation system to find key characteristics only among company users (past and present). At this step we don't take any users from the web. 
After we finish processing this CF model we identify key characteristics (important features?) and only after that, we will start looking for similar users on the web. I would use clustering technique to identify which customers in my database are the target audience and get similar customer profiles from the social media dataset. Its a lot simpler recommendation engines can use either supervised or unsupervised learning. I can't find any reason to NOT use recommendation engine??? - https://www.examtopics.com/discussions/amazon/view/8376-exam-aws-certified-machine-learning-specialty-topic-1/
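As a small illustration of the clustering idea in option B, here is a scikit-learn K-Means sketch on customer-profile data; the file name and feature columns are hypothetical stand-ins for the Aurora tables.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

profiles = pd.read_csv("customer_profiles.csv")  # hypothetical export from Aurora
features = profiles[["age", "annual_premium", "claims_paid", "num_pets"]]  # assumed columns

scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(scaled)
profiles["segment"] = kmeans.labels_

# Inspect each segment's centroid to understand its key characteristics,
# then look for similar profiles on social media.
print(pd.DataFrame(kmeans.cluster_centers_, columns=features.columns))
```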
51
51 - A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem? - A.. Logistic regression B.. Random Cut Forest (RCF) C.. Principal component analysis (PCA) D.. Linear regression
D - HOW MANY/MUCH, THOSE ARE REGRESSION TOPIC, LOGISTIC FOR 0/1,YES/NO https://docs.aws.amazon.com/zh_tw/machine-learning/latest/dg/regression-model-insights.html THE ANSWER SHOULD BE D. agree. RCF is mostly used for anomaly detection or separate outliers Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set Answer is D 100% The problem involves predicting the number of units to be produced each quarter based on historical sales data. This is a continuous numerical prediction, making it a regression problem. Linear regression is ideal for forecasting when there is a linear relationship between input variables (e.g., past sales, seasonal trends) and the target variable (units to be produced). It helps model the relationship between past sales and future demand. If there are seasonal effects, a time-series model (like ARIMA or Prophet) could be considered as well. The Answer is D. Random Cut Forest is for Anomaly Detection D should be the answer How many units should give this away as Linear regression I do not see any hint of anomalies here, we are looking for a number to be predicted, this seems to be the reason of the correct answer https://docs.aws.amazon.com/quicksight/latest/user/how-does-rcf-generate-forecasts.html How can the right answer be B? That Random Cut Forest is an algorithm written for anomaly detection. option D D is the correct. B is for outlier detection only. It sounds like Linear regression problem and Random Cut is more known for anomaly detection while it can do other types of ML. The answer seems to be strange with no explanation. D is correct! D. Linear regression would be the appropriate machine learning approach to solve this problem of predicting the number of units of a particular part to be produced each quarter. Linear regression is a supervised learning algorithm used for predicting continuous variables based on input features. In this case, the historical sales data can be used as input features, and the number of units produced each quarter can be used as the continuous target variable. definitely D. This is a regression problem where the goal is to predict a continuous outcome, which in this case is the number of units of a particular part that should be produced each quarter. Linear regression is a simple and commonly used approach to solve such problems, where a linear relationship is established between the independent variables (e.g., historical sales data) and the dependent variable (e.g., number of units of a part to be produced). D, RCF answers here just link one article where RCF is implemented to find outliers in time series, or are able to deduce trends, but here they mention already labelled data, RCF is unsupervised, so that data would go to waste. Honestly, i think these are all bad answers. It should be time series modeling methods. - https://www.examtopics.com/discussions/amazon/view/8379-exam-aws-certified-machine-learning-specialty-topic-1/
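A minimal scikit-learn sketch of the linear-regression approach in option D, using made-up toy numbers purely for illustration; the hypothetical features are prior-quarter sales, same-quarter sales last year, and the quarter index.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data: [prior-quarter sales, same quarter last year, quarter index] -> units to produce
X = np.array([[1200, 1100, 1], [1350, 1180, 2], [1500, 1300, 3], [1420, 1250, 4],
              [1600, 1200, 1], [1680, 1350, 2], [1810, 1500, 3], [1750, 1420, 4]])
y = np.array([1300, 1400, 1550, 1500, 1650, 1750, 1900, 1820])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Predicted units for held-out quarters:", model.predict(X_test))
```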
52
52 - A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements: ✑ Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum. ✑ Support event-driven ETL pipelines ✑ Provide a quick and easy way to understand metadata Which approach meets these requirements? - A.. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data catalog to search and discover metadata. B.. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata. C.. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata. D.. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.
A - BOTH A AND B ARE ANSWERS. BUT external Apache Hive MIGHT BE NOT SERVERLESS SOLUTION. The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. The Data Catalog is a drop-in replacement for the Apache Hive Metastore https://docs.aws.amazon.com/zh_tw/glue/latest/dg/components-overview.html BEAUTIFUL ANSWER IS A. I am thinking about Answer C, because events can be triggered by cloudwatch w/Glue metastore you can't schedule AWS Batch with CloudWatch We can schedule batch with cloud watch events. srr, looks like you can apart from Cron, the argument should be AWS Batch aren't SERVERLESS if we use Flexible as key word ..Using Lambda might be a constraint Answer is A. Lamda is the preferred way of implementing event-driven ETL job with S3, when new data arrives in S3, it notifies lamda which can start the ETL job. agree, event-driven means Lambda, CloudWatch alarms are just to trigger alarms based on log analysis. A. YES - all integrated components B. NO - missing a component to invoke the Lambda C. NO - CloudWatch will not trigger when there is a new file to process D. NO - CloudWatch will not trigger when there is a new file to process A for me Note that the question asks for a serverless system. In this case, the letters B, C and D are wrong, as they bring options that are managed: AWS Batch (managed) and external Apache Hive (even more managed). For event-driven AWS ETL solutions that are serverless, activation through the Lambda function is recommended, so the correct alternative is Letter A. Note that CloudWatch Alarms only activates from log evaluation, which is not mentioned in the question. I will chose A, I think C & D is wrong, you can use Amazon CloudWatch Event to trigger lambda but not CloudWatch alarm. Batch is more for configurations and other kinds of things by scheduling than event driven and batch data processing with ETL, the answer is A. Found this supporting A - Lambda used to trigger ETL job after crawler completes. The crawler starts on schedules or events (files arriving). Based on Majority discussion Quite confused between A&C since they all workable solution. In below AWS Blog, even mix the CloudWatch + Lambda to use the Glue. For key word event trigger, prefer CloudWatch https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/ https://docs.aws.amazon.com/glue/latest/dg/automating-awsglue-with-cloudwatch-events.html cloudwatch and lambda function can work together to trigger event. But AWS batch cannot independently conduct ETL and require other service. when it comes to ETL, glue is much easier choice than Batch Agreed. CloudWatch could trigger event to launch Lambda. Refer to: https://docs.aws.amazon.com/lambda/latest/dg/services-cloudwatchevents.html Answer is A 100% A is preferred. Lambda can trigger ETL pipelines: https://aws.amazon.com/glue/ A is correct...Lambda is event driven and Glue is serverless as opposed to Hive - https://www.examtopics.com/discussions/amazon/view/8382-exam-aws-certified-machine-learning-specialty-topic-1/
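A sketch of the event-driven piece of option A: a Lambda function subscribed to S3 ObjectCreated notifications that starts an AWS Glue ETL job via boto3. The job name and argument keys are hypothetical.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event; start the Glue ETL job for each new object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="datalake-etl-job",  # hypothetical Glue job name
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
    return {"status": "started"}
```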
53
53 - A company's Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily. The model accuracy is acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes. What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand? - A.. Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up the training. B.. Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals. C.. Switch to using a built-in AWS SageMaker DeepAR model. Parallelize the training to as many machines as needed to achieve the business goals. D.. Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve the business goals.
B - the answer is B. using Hovord distribution results in less coding effort Answer is B. "minimize coding effort and infrastructure changes" If we use DeepAR then the code and infra has to be changed to work with DeepAR. A. NO, this will not address training dataset continuous increase B. NO, this will require code effort and infrastructure change C. YES, a built-in model ensure low code effort, so only infrastructure change needed* D. This will not work * they say current model accuracy is acceptable, we doo expect good results with DeepAR as it allows to automatically pick among 5 different models what works best for the customer DeepAR doesn't pick among 5 models. However, I still think that switching to DeepAR can assure accuracy and minimize coding effort as the model is built-in A comes with minimum changes, but it wont scale. B code changes are minimum but infrastructure still needs to be changed to achieve a distributed solution. C. Is even more significant infra and code change. D. wont work. It is really subjective and tricky. Could be A or B, depending on what change is considered "SMALL". For scalability, B seems better. for quick win A could work. I keep going back and forth. A. NO - one time shot and not scalable B. YES - best practice C. NO - DeepAR is for forecasting D. NO - code will not benefit from parallelization without change option B Note that we want to increase training speed, minimize code and infrastructure modification effort on AWS. Letter A would only delay the problem and increase costs too much. The solution that best translates the problem would be Letter B: we would keep the code in tensorflow and use Horovod to make our training faster through parallelization. Letter D is too complex and would change the execution infrastructure a lot and Letter C would be too abrupt a turn as we would throw our model away. A is better option even though B helps. Firstly, you only have One GPU, in this case distributed training Horovod doesn't help much; Secondly, the question is about minimize "coding effort" not minimize budget. adding distributed framework require much more coding, but increase gpu instance only require single click. Horovod distribution is accepted by sagemaker, making easy to implement! Hovord distribution will allow the Machine Learning Specialist to take advantage of Amazon SageMaker's built-in support for Horovod, which is a popular, open-source distributed deep learning framework. Implementing Horovod in TensorFlow will allow the Specialist to parallelize the training across multiple GPUs or instances, which can significantly reduce the time it takes to train the model. This will allow the company to meet its requirement to update the model on an hourly basis, and minimize coding effort and infrastructure changes as it leverages the existing TensorFlow code and infrastructure, along with the scalability and ease of use of Amazon SageMaker. Are is there a 23X differential between the weakest and strongest GPU in AWS? (and allow for future growth). I don' tthink so. Answer:C- built-in sagemaker DeepAR model. minimize coding & infra changes. But they are happy with it - just want it to go faster. Not throw the whole thing out. the answer is B. 
Most likely it is A because it is based on AWS technology; why would we use open source when we are taking the AWS ML exam? The answer should be relevant to AWS technology. https://aws.amazon.com/sagemaker/distributed-training/ This one reminds me of an old saying by Yogi Berra: "When you come to a fork in the road, take it." If you see Horovod as an option in a question about scaling TensorFlow, take it. Answer is B. I think it's B https://aws.amazon.com/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/ & https://aws.amazon.com/blogs/machine-learning/multi-gpu-and-distributed-training-using-horovod-in-amazon-sagemaker-pipe-mode/ Seen a similar question on Udemy/Whizlabs; it is always Horovod when TensorFlow needs scaling. Answer is B - https://www.examtopics.com/discussions/amazon/view/10082-exam-aws-certified-machine-learning-specialty-topic-1/
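For option B, a hedged sketch of what the SageMaker side can look like with the SageMaker Python SDK's Horovod (MPI) support for script-mode TensorFlow; the role ARN, script name, instance settings, and framework/Python versions are assumptions to adapt.

```python
from sagemaker.tensorflow import TensorFlow

# train.py must itself contain the Horovod changes (hvd.init(), wrapped optimizer, etc.).
estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role ARN
    instance_count=4,                                      # scale out as demand grows
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",                              # assumed version pairing
    py_version="py39",
    distribution={"mpi": {"enabled": True, "processes_per_host": 1}},
)
estimator.fit({"training": "s3://my-bucket/timeseries/train/"})
```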
54
54 - Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other? - A.. Recall B.. Misclassification rate C.. Mean absolute percentage error (MAPE) D.. Area Under the ROC Curve (AUC)
D - RECALL IS ONE OF FACTOR IN CLASSIFY, AUC IS MORE FACTORS TO COMPREHENSIVE JUDGEMENT https://docs.aws.amazon.com/zh_tw/machine-learning/latest/dg/cross-validation.html ANSWER MIGHT BE D. AUC is to determine hyperparams in a single model, not compare different models. Not might be, but should be D Area Under the ROC Curve (AUC) is a commonly used metric to compare and evaluate machine learning classification models against each other. The AUC measures the model's ability to distinguish between positive and negative classes, and its performance across different classification thresholds. The AUC ranges from 0 to 1, with a score of 1 representing a perfect classifier and a score of 0.5 representing a classifier that is no better than random. While recall is an important evaluation metric for classification models, it alone is not sufficient to compare and evaluate different models against each other. Recall measures the proportion of actual positive cases that are correctly identified as positive, but does not take into account the false positive rate. chatgpt answers, all your answers are from chatgpt why not C? it's a classification problem, mape is for regression option D AUC is the best metric. D. AUC is always used to compare ML classification models. The others can all be misleading. Consider the cases where classes are highly imbalanced. In those cases accuracy, misclassification rate and the like are useless. Recall is only useful if used in combination with precision or specificity, which what AUC does. AUC/ROC work well with special case of Binary Classification not in general AUC is to compare different models in terms of their separation power. 0.5 is useless as it's the diagonal line. 1 is perfect. I would go with F1 Score if it was an option. However, taking Recall only as a metric for comparing between models, would be misleading. Its Accuracy,Precision,Recall and F1 score , there is no metion of AUC/ROC for comparing models in many articles , so ANSWER is A When you draw the ROC graph, you're considering True and False Positive Rate. The first one is also called Recall ;) D. AUC is scale- and threshold-invariant, enabling it compare models. https://towardsdatascience.com/how-to-evaluate-a-classification-machine-learning-model-d81901d491b1 Actually A, B and D seem to be correct Probably D https://towardsdatascience.com/metrics-for-evaluating-machine-learning-classification-models-python-example-59b905e079a5 why not B? Answer should be D..ROC is used to determine the diagnostic capability of classification model varying on threshold Should be A. A is the only one that generally works for classifcation. AUC only works with binary classification. Actually AUC could be generalized for multi-class problem. https://www.datascienceblog.net/post/machine-learning/performance-measures-multi-class-problems/ Could be, you mean in a multiclass clasification problem. But in that con context recall directly can't be compare because first you have to decide recall of what of the classes, in a 3 classes problem we have 3 recalls or you suppose a weighted recall or average recall ?. Do you think in that ? Also in multi-class classification, if you follow an One-vs_Rest strategy you can still use AUC. https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py Correct Answer is D. Another benefit of using AUC is that it is classification-threshold-invariant like log loss. 
https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226 - https://www.examtopics.com/discussions/amazon/view/8384-exam-aws-certified-machine-learning-specialty-topic-1/
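A small scikit-learn sketch showing why AUC (option D) is convenient for comparing classifiers: it is computed from predicted scores, so it is threshold-invariant and can be evaluated on the same held-out set for any model. The synthetic dataset is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, where accuracy alone would be misleading.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("forest", RandomForestClassifier(random_state=0))]:
    proba = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(name, "AUC =", round(roc_auc_score(y_te, proba), 3))
```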
55
55 - A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team. Which solution requires the LEAST coding effort? - A.. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3. Give the Business team read-only access to S3. B.. Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team. C.. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3. Visualize the arrays in Amazon QuickSight, and publish them in a dashboard shared with the Business team. D.. Generate daily precision-recall data in Amazon ES, and publish the results in a dashboard shared with the Business team.
C - Ans C is reasonable Agree with C. Quicksight cannot handle 100TB each day. Amazon QuickSight, particularly when using its SPICE (Super-fast, Parallel, In-memory Calculation Engine) feature, has specific data capacity limits. For the Enterprise Edition, SPICE can handle up to 1 billion rows or 1 TB per dataset1. This means that while QuickSight is highly capable, handling 100 TB of data per day would exceed its current capacity limits. The limit of QuickSight for 1TB is soft limit which can be increased to unlimited number of TBs. Ans C Because Quicksight Can't handle 100 TB even in Entiripise Quotas for SPICE are as follows: 2,047 Unicode characters for each field 127 Unicode characters for each column name 2,000 columns for each file 1,000 files for each manifest For Standard edition, 25 million (25,000,000) rows or 25 GB for each dataset For Enterprise edition, 1 billion (1,000,000,000) rows or 1 TB for each dataset https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html QuickSight can handle large volumes of data for analytics and visualizations. Some key points: QuickSight scales seamlessly from hundreds of megabytes to many terabytes of data without needing to manage infrastructure. It uses an in-memory engine called SPICE to enable high performance analytics on large datasets. so the choice is B B. Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team. This solution leverages QuickSight's managed service capabilities for both data processing and visualization, which should minimize the coding effort required to provide the Business team with the necessary insights. However, it's important to note that QuickSight's ability to calculate the precision-recall data depends on its support for the necessary statistical functions or the availability of such calculations in the dataset. If QuickSight cannot perform these calculations directly, option C might be necessary, despite the increased effort. The question does not ask for processing of 1Tb data. it asks for visuals/predications of that data. So B C. Considering the large volume of data (100 TB daily), Option C seems to be the most appropriate solution B it's not correct because of 100tb data size. C is the answer: https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html ANs c is correct A. NO - we want a dashboard for business B. NO - 100TB is very large, it will not fit in memory (1TB max for SPICE dataset) or return within the 2min limit if delegated to a DB (https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html) C. YES - best combination; EMR can distribute the computation of precision-recall for each slice of data D. NO - ES cannot help to generate precision-recall although C is tempting but goes with B due to less effort it is not about the least effort only, since the least effort solution here will not get your job done, look at the quick sight max data it can deal with when it compared to EMR which is built to deal with Big data. 
Using QuickSight to create the precision-recall data on 100 TB every day can't be done, since the maximum dataset size QuickSight can deal with is: for Standard edition, 25 million (25,000,000) rows or 25 GB per dataset; for Enterprise edition, 1 billion (1,000,000,000) rows or 1 TB per dataset, according to the AWS documentation: https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html. But we can do it with EMR and later use QuickSight to visualize the results. Looking at the QuickSight documentation: it has a limit of 1 TB per dataset, so a previous layer is necessary. Letter C is the correct one. It's 100 TB daily, we need EMR to reduce it, so option C is correct. QuickSight can handle a maximum of 1 TB per dataset. We have a 100 TB dataset, so we need EMR. https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html - https://www.examtopics.com/discussions/amazon/view/10083-exam-aws-certified-machine-learning-specialty-topic-1/
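For the visualization half of option C, a minimal sketch of computing the daily precision-recall points that would be saved for QuickSight; the toy DataFrame stands in for the real predictions, which at ~100 TB/day would be aggregated on EMR (e.g., with Spark) rather than with pandas.

```python
import pandas as pd
from sklearn.metrics import precision_recall_curve

# Toy stand-in for one day's scored predictions (real data would be reduced on EMR).
preds = pd.DataFrame({
    "score": [0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.20, 0.75],
    "label": [0, 0, 1, 1, 0, 1, 0, 1],
})

precision, recall, thresholds = precision_recall_curve(preds["label"], preds["score"])
pr = pd.DataFrame({"precision": precision[:-1], "recall": recall[:-1], "threshold": thresholds})

# Save the small summary table; uploading this file to S3 lets QuickSight plot the daily
# curve and publish a read-only dashboard for the Business team.
pr.to_csv("daily_precision_recall.csv", index=False)
```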
56
56 - A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training. What should the Specialist do to optimize the data for training on SageMaker? - A.. Use the SageMaker batch transform feature to transform the training data into a DataFrame. B.. Use AWS Glue to compress the data into the Apache Parquet format. C.. Transform the dataset into the RecordIO protobuf format. D.. Use the SageMaker hyperparameter optimization feature to automatically optimize the data.
C - C is okay. Answer is C. Most Amazon SageMaker algorithms work best when you use the optimized protobuf recordIO format for the training data. https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html Option C. The Specialist should transform the dataset into the RecordIO protobuf format. This format is optimized for use with SageMaker and has been shown to improve the speed and efficiency of training algorithms. Using the RecordIO protobuf format is a best practice for preparing data for use with Amazon SageMaker, and it is specifically recommended for use with the built-in algorithms. I would assume the issue is the transformation. It can be painfully slow between pandas / CSV / numpy. Go to protobuf. C is the best. Agree with C - https://www.examtopics.com/discussions/amazon/view/10084-exam-aws-certified-machine-learning-specialty-topic-1/
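A minimal sketch of option C's conversion step using the SageMaker Python SDK helper that writes numpy arrays as RecordIO-protobuf; the .npy file names and the bucket/key are hypothetical.

```python
import io
import numpy as np
import boto3
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Hypothetical arrays: a 2-D float32 feature matrix and a 1-D label vector.
features = np.load("train_features.npy").astype("float32")
labels = np.load("train_labels.npy").astype("float32")

# Serialize to the protobuf recordIO format that most built-in algorithms train on fastest.
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

boto3.Session().resource("s3").Bucket("my-bucket").Object(
    "train/data.recordio-protobuf").upload_fileobj(buf)
```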
57
57 - A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier: Total number of images available = 1,000 Test set images = 100 (constant test set) The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which techniques can be used by the ML Specialist to improve this specific test error? - A.. Increase the training data by adding variation in rotation for training images. B.. Increase the number of epochs for model training C.. Increase the number of layers for the neural network. D.. Increase the dropout rate for the second-to-last layer.
A - NO CORRECT TRAINING DATA, MORE WORKS JUST WASTE TIME. ONE OF THE REASONS FOR POOR ACCURACY COULD BE INSUFFICIENT DATA. THIS CAN BE OVERCOME BY IMAGE AUGMENTATION. IMAGE AUGMENTATION IS A TECHNIQUE OF INCREASING THE DATASET SIZE BY PROCESSING (MIRRORING, FLIPPING, ROTATING, INCREASING/DECREASING BRIGHTNESS, CONTRAST, COLOR) THE IMAGES. HTTPS://MEDIUM.COM/DATADRIVENINVESTOR/AUTO-MODEL-TUNING-FOR-KERAS-ON-AMAZON-SAGEMAKER-PLANT-SEEDLING-DATASET-7B591334501E ANSWER A. ADD MORE TRAINING DATA FOR ROTATION IMAGES COULD BE A WAY TO DEAL WITH ISSUE Donald, your caps lock is on. Okay, was funny LOL :D is it possible no using MAYUS? it is annoying The key phrase might be "constant test set", so you can't increase training set by shrinking the size of test set. Thus the only feasible choice is to increase training time by increasing the number of epochs => answer B. The problem is images are upside down and misclassified. If right side up then the model would classify correctly. This can only be fixed ba rotating not by trying to recognise upside down cat more times. What's your answer B? A . Increase the training data by adding variation in rotation for training images. It never says to move the images from Test data set (because it is constant)... only variations are added to the images..so, A is correct. agree with A A is answer Data Augmentation would fix the missing conditional data ChatGPT says the answer is A. Trust a model to answer an ML question correctly! ;) how come more epochs it better than augmentation? option A The question is clear and the answer is clear as well should be A More epochs is not a good approach to fundamental data issues the Specialist can apply data augmentation techniques to increase the training data by adding variation in rotation for training images. This technique will allow the model to learn to recognize cats in various orientations, including upside down. Adding more variation in rotation to the training data can help the model to learn how to classify cats in different orientations, including when they are held upside down. This can improve the model's ability to identify cats in this position and reduce the misclassification rate for images in which the cats are upside down. By adding more rotation to the training data, the model can be trained to generalize better to new images, including those with cats in different orientations. This can help to reduce overfitting and improve the model's overall performance. Only logical answer 100% A. More data is a good answer. A Answer is ”A”” Answer is A This is a clear case of Data Augumentation solution. Common step in CNN, Image augmentation. A. - https://www.examtopics.com/discussions/amazon/view/8386-exam-aws-certified-machine-learning-specialty-topic-1/
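A short Keras sketch of the rotation augmentation described in option A; the directory layout, image size, and rotation/flip settings are assumptions for illustration.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images with rotations (and flips) so the classifier also sees
# upside-down cats; the constant 100-image test set stays untouched.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=180,    # rotate up to 180 degrees in either direction
    horizontal_flip=True,
    vertical_flip=True,
)

train_flow = train_gen.flow_from_directory(
    "data/train",          # hypothetical layout: one subfolder per class (cat / not_cat)
    target_size=(224, 224),
    batch_size=32,
    class_mode="binary",
)
# model.fit(train_flow, epochs=..., validation_data=...)  # existing classifier unchanged
```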
58
58 - A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis. Which of the following services would both ingest and store this data in the correct format? - A.. AWS DMS B.. Amazon Kinesis Data Streams C.. Amazon Kinesis Data Firehose D.. Amazon Kinesis Data Analytics
C - the answer is C. as the main point of the question is data transformation to Parquet format which is done by Kinesis Data Firehose not Data Stream. Coming to the data store the data store in Kinesis Data Stream is only for couple of days so it does not serve the purpose here The storage part will be taken care of by S3 anyway. Firehose would just transform to Parquet on the fly. Firehose Not sure Firehose can store the data .... Data Stream can store the data. Someone please explain the answer Firehose is to Store the data. Stream requires other service to do that. Kinesis Data Streams can Store for up to 365 days, While Firehouse sends it to S3. Which is correct? Firehose can do it if the data is in JSON or ORC format initially! It should be KDS Amazon Kinesis Data Firehose is a fully managed service that can automatically load streaming data into data stores and analytics tools. It can ingest real-time streaming data such as application logs, website clickstreams, and IoT telemetry data, and then store it in the correct format, such as Apache Parquet files, for exploration and analysis. This makes it a suitable option for the requirement described in the question. B https://github.com/ravsau/aws-exam-prep/issues/10 B) Only Amazon Kinesis Data Streams can store and Ingest data. We don't need to apply any transformation; the question asks to ingest and store data in Apache Parquet format, There is no assumption that the data coming in a different format than parquet. KDS cant store to s3 https://stackoverflow.com/questions/66097886/writing-to-s3-via-kinesis-stream-or-firehose It is C with no doubt https://aws.amazon.com/about-aws/whats-new/2018/05/stream_real_time_data_in_apache_parquet_or_orc_format_using_firehose/ It appears all agree that the answer is between Firehose and Analytics. Kinesis Firehose is used for ingestion. Both firehose and analytics can store, only firehose can ingest. https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html shows firehose can store parquet to S3 It appears all agree that the answer is between Firehose and Analytics. Data Streams handle stuff like event data, clickstream etc. Its not interested in special format, the focus is speed. The question did not talk of transformation, only ingestion. Kinesis Firehose is used for ingestion. Both firehose and analytics can store, only firehose can ingest. https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html shows firehose can store parquet to S3 Think just like this -- batch process Glue ETL and Streaming process Firehose ETL ......covert to parquet or any other format. C for Firehose Just in case https://acloud.guru/forums/aws-certified-big-data-specialty/discussion/-KhI3MgPEo-FY5rfgl3J/what_is_difference_between_kin Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. https://github.com/awsdocs/amazon-kinesis-data-firehose-developer-guide/blob/master/doc_source/record-format-conversion.md I would go with B. Kinesis data streams stores data, while Firehose not. It's the other way around. Firehouses stores data; data streams does not. - https://www.examtopics.com/discussions/amazon/view/10085-exam-aws-certified-machine-learning-specialty-topic-1/
59
59 - A data scientist has explored and sanitized a dataset in preparation for the modeling phase of a supervised learning task. The statistical dispersion can vary widely between features, sometimes by several orders of magnitude. Before moving on to the modeling phase, the data scientist wants to ensure that the prediction performance on the production data is as accurate as possible. Which sequence of steps should the data scientist take to meet these requirements? - A.. Apply random sampling to the dataset. Then split the dataset into training, validation, and test sets. B.. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets. C.. Rescale the dataset. Then split the dataset into training, validation, and test sets. D.. Split the dataset into training, validation, and test sets. Then rescale the training set, the validation set, and the test set independently.
B - C would be my answer here. Rescaling each set independently could lead to strange skews. Training set, Test set and Evaluation set should be on the same scale You're right. test set and val set should be rescaled on the same scale. But the scale value should be extracted by only statistical value from training data. I think C means that the rescaling stage is affected by the values from the whole data (with val, test set) So, I think B is correct https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data C also leads to data leakage. You are using the test data to scale everything. So part of the data in the test set is used to scale for when you build the model on the training and check against the validation set. If you Rescale all the data first you are going to do data leakage by showing all the variance of data with in training. The rescaling needs to be after splitting the data and not before it The best practice is --> to split the dataset into training, validation, and test sets first, and then rescale the training set and apply the SAME scaling to the validation and test sets. This ensures that the scaling parameters (e.g., mean and standard deviation for standardization or min and max values for min-max scaling) are calculated only based on the training set to prevent data leakage and maintain the integrity of the evaluation process. By following this approach, you prevent information from the validation and test sets from influencing the scaling parameters, which could lead to data leakage and overestimation of model performance. Keeping the scaling consistent across all subsets ensures a fair evaluation of the model's generalization performance on new, unseen data. Answer is B. The other options have shortcomings: A: Random sampling is a good practice, but it doesn't address the issue of feature scaling. Also, rescaling should occur after splitting the data. C: Rescaling the entire dataset before splitting could lead to data leakage, where information from the validation/test sets inadvertently influences the training process. D: Rescaling the sets independently would lead to inconsistencies in scale across the training, validation, and test sets, which could negatively impact model performance and evaluation. OPTION C. Rescale the dataset. Then split the dataset into training, validation, and test sets. Explanation: Rescaling the dataset: This is the first step to address the varying statistical dispersion among features. By rescaling, you ensure that all features are on a similar scale, which is important for many machine learning algorithms. Splitting into training, validation, and test sets: After rescaling, the dataset is split into training, validation, and test sets. This ensures that the model is trained on one set, validated on another set, and tested on a third set. This separation helps evaluate the model's performance on unseen data. Option C ensures that the rescaling is applied before splitting the data, ensuring consistency in the scaling across different sets. This approach prevents data leakage and provides a more accurate representation of how the model will perform on new, unseen data. Validation and test set should be scaled as per parameters used for scaling of training set. Independent scaling of test set would mean that drift of model in production will be way quicker and is not recommended in data science B is correct, scale on train and apply the others. 
prevent to data leakage Answer B, C is not a good data science practise. We need firstly split the data to avoid data leakage from test/eval sets, then rescale data in all sets using statistics from training set I think the right answer here is B. We need to split the dataset into Training, Validation and Test set. Then we can only scale (by using some technique) data contained in the Training set. Data that belong to Validation and Test set must be scaled by using the parameters used on the training. For example, if we want to apply a standardization, we can do that only on the Training set as we should not be allowed to use mean and standard deviation computed on Validation/Test set. We must act as we don't own those data! option B Data Science 101: (A) Given the question, doesn't solve the magnitude problem. (B) Correct (C) Data Leakage (D) It's not correct, still data leakage. Tricky question, but, D, definitely! B: You can't apply the same scaling to the validation and test sets 'cause you may suffer data leakage! C: You shouldn't rescale the whole dataset then split into training, validation and test, it's not a good practice and may suffer data leakage as well. D: You're first splitting the whole dataset and applying rescaling individually, preventing any data leakage and each set is rescaled based in your own statistics. Theoretically, you should not have Test set data at Training time (when you're doing the scaling), so how do you think to do that? What if you will not have an entire Test set, but you will receive each new row at a time? but you are leaking information from validation samples between themselves. From Bing chat (and it makes complete sense) "Based on the search results, I think the best sequence of steps for the data scientist to take is B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets. This sequence of steps ensures that the data scientist can evaluate the model performance on different subsets of data that have not been used for training or tuning. It also ensures that the data scientist can rescale the features to have a common scale without introducing any data leakage from the validation or test sets. Rescaling the features can help improve the accuracy of some machine learning algorithms that are sensitive to the magnitude or distribution of the data, such as distance-based methods or gradient-based methods 1. You want to measure how the model performs on new data. Scaling with the test set is a no-no. B or D, I dont understand the semantics of "independently" and the effect it would have. It's most def not done before because of data leakage. https://www.linkedin.com/pulse/feature-scaling-dataset-spliting-arnab-mukherjee/ - https://www.examtopics.com/discussions/amazon/view/74274-exam-aws-certified-machine-learning-specialty-topic-1/
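A compact scikit-learn sketch of the option B sequence on synthetic data: split first, then fit the scaler on the training set only and apply that same scaler to the validation and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with widely varying dispersion, plus binary labels.
X = np.random.RandomState(0).lognormal(size=(1000, 15))
y = np.random.RandomState(1).randint(0, 2, size=1000)

# 1) Split first ...
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2) ... then fit the scaler on the training set only and reuse it everywhere,
#    so no statistics leak from the validation/test sets into training.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)
```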
60
60 - A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access. Which approach should the Specialist use to continue working? - A.. Install Python 3 and boto3 on their laptop and continue the code development using that environment. B.. Download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local environment, and use the Amazon SageMaker Python SDK to test the code. C.. Download TensorFlow from tensorflow.org to emulate the TensorFlow kernel in the SageMaker environment. D.. Download the SageMaker notebook to their local environment, then install Jupyter Notebooks on their laptop and continue the development in a local notebook.
B - ANSWER B. YOU COULD INSTALL DOCKER-COMPOSE (AND NVIDIA-DOCKER IF TRAINING WITH A GPU) FOR LOCAL TRAINING HTTPS://SAGEMAKER.READTHEDOCS.IO/EN/STABLE/OVERVIEW.HTML#LOCAL-MODE HTTPS://GITHUB.COM/AWSLABS/AMAZON-SAGEMAKER-EXAMPLES/BLOB/MASTER/SAGEMAKER-PYTHON-SDK/TENSORFLOW_DISTRIBUTED_MNIST/TENSORFLOW_LOCAL_MODE_MNIST.IPYNB None of these links are working https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/ B https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/ stop using gpt.... GPT said SageMaker Python SDK is less suitable for offline Correction it will be B, while D is possible, it cannot exactly mimic the sagemaker env, with docker all the configuration and libs will be available to the user which would be an ideal working setup for the DS to work with. You can easily download the notebook instance, and work locally using jupyter notebook configured on your laptop which is one the advantages of using sagemaker, and that is what Amazon also promotes imo. Both Amazon Q (AWS Expert) and ChatGPT insist on D. Plus all the links that I see here about Docker/Git and stuff, they either not working or deprecated so far. Not to mention their complexity to my eyes. Thus, I will go for D. the local mode of sagemaker SDK: https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode B Option B B, https://github.com/aws/sagemaker-tensorflow-serving-container It's B Answer : D why not D? My assumption is that D there is no way to test the code. You need the Sagemaker SDK in order to utilize dockerized container of Tensorflow from Sagemaker is my best guess. Cannot be D. If you used Jupyter notebook, you are unable to use it without internet access. That is incorrect, once jupyter notebook is configured you can use it offline. Agreed for B - https://www.examtopics.com/discussions/amazon/view/8392-exam-aws-certified-machine-learning-specialty-topic-1/
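A hedged sketch of option B using SageMaker local mode, assuming the TensorFlow training container has already been pulled to the laptop; the role ARN, script, versions, and data path are placeholders.

```python
from sagemaker.tensorflow import TensorFlow

# With the TensorFlow image available locally, instance_type="local" makes the SageMaker
# Python SDK run training in a container on the laptop -- no Wi-Fi needed.
estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder; not used for local training
    instance_count=1,
    instance_type="local",         # "local_gpu" if the laptop has a CUDA GPU
    framework_version="2.11",      # assumed version; match the downloaded container
    py_version="py39",
)
estimator.fit("file://./data/train")  # local training data, no S3 required
```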
61
61 - A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis. What is the MOST efficient way to accomplish these tasks? - A.. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3. B.. Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake. C.. Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3. D.. Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand to transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.
A - I WOULD LIKE TO CHOOSE ANSWER A. https://aws.amazon.com/tw/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/ Donald, do you know your CAPS LOCK has been on the whole time? I know why his caps lock has been on :D to enter the "I am not robot" code easier :D yes, but it works with minus also... Answer is A. As the word anamoly talks about Random Cut Forest in the exam and that can be done in a cost effective manner using Kinesis Data Analytics I think it would have been more accurate if the options were kinetic data stream -> kinesis data analytics -> kinesis firehose -> S3 The question says REAL TIME events doesn't that eliminate Data Firehose as it is technically NEAR real time but not real time like Data Stream? Though Random Cut Forest seems like the best option for anomaly detection. I'm torn between A and B Kinesis Firehose and Data Analytics with random cut forest should do it. A. Based on these considerations, Option A is the most efficient way to accomplish the tasks. It provides a seamless, real-time data ingestion and processing pipeline, leverages machine learning for anomaly detection, and efficiently stores data in a data lake, meeting all the key requirements of the cybersecurity company. ONLY A B not as efficient for real-time processing and storing results as using Kinesis services. At least B is a possible solution, but A will not work as KDF doesn't support KDA as a destination service https://docs.aws.amazon.com/firehose/latest/dev/create-name.html . In my opinion, KDF should always be the latest Kinesis Service in a streaming pipeline KDF does support KDA as destination A has all the required steps A. YES - Firehose can pipe into KDA, and KDA supports RCF B. NO - RCF best for anomality detection C. NO - no need for intermediary S3 storage D. NO - no need for intermediary S3 storage option A A is the correct. One tip for the exam: When you see Data Streaming, possibly the solution should contains a Kinesis Service. B is too much complex! Makes sense to select A here. I strongly believe A is the right answer. At a minimum there should be some justification provided for your answer. Amazon Kinesis Data Firehose is a fully managed service for streaming real-time data to Amazon S3 and can handle the ingestion of large amounts of data in real time. Kinesis Data Analytics Random Cut Forest (RCF) is a fully managed service that can be used to perform anomaly detection on streaming data, making it well suited for this use case. The results of the anomaly detection can then be streamed to Amazon S3 using Kinesis Data Firehose, providing a scalable and cost-effective data lake for later processing and analysis. The problem with A, is that there is that KDF doesn't support KDA as a destination service https://docs.aws.amazon.com/firehose/latest/dev/create-name.html . In my opinion, KDF should always be the latest Kinesis Service in a streaming pipeline I would select A B is too resource intensive for that use case. I choose A, but I think the data should be better ingested using Kinesis streams - https://www.examtopics.com/discussions/amazon/view/8394-exam-aws-certified-machine-learning-specialty-topic-1/
62
62 - A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency? - A.. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. B.. AWS Glue with a custom ETL script to transform the data. C.. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster. D.. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.
A - A is correct. Kinesis Data Analytics can use lamda to convert GZIP and can run SQL on the converted data. https://aws.amazon.com/about-aws/whats-new/2017/10/amazon-kinesis-analytics-can-now-pre-process-data-prior-to-running-sql-queries/ A is correct: https://aws.amazon.com/about-aws/whats-new/2017/10/amazon-kinesis-analytics-can-now-pre-process-data-prior-to-running-sql-queries/ "To get started, simply select an AWS Lambda function from the Kinesis Analytics application source page in the AWS Management console. Your Kinesis Analytics application will automatically process your raw data records using the Lambda function, and send transformed data to your SQL code for further processing. Kinesis Analytics provides Lambda blueprints for common use cases like converting GZIP ..." Use Amazon Kinesis Data Analytics if you need SQL-based processing and advanced analytics capabilities for streaming data. Use Amazon Kinesis Data Firehose if your primary requirement is to deliver, transform, and load streaming data into various AWS destinations with simplified configurations, but not for SQL-based processing. If gaining real-time insights involves complex analytics or custom processing, Amazon Kinesis Data Analytics with AWS Lambda is likely a more suitable choice. If the requirements can be met with simpler data transformations, Amazon Kinesis Data Firehose might provide a more straightforward and potentially lower-latency solution. In other words, if this data is in GZIP files and the processing requirements are relatively simple, Amazon Kinesis Data Firehose might be a more straightforward and efficient choice. GZIP files typically contain compressed data, and if our primary objective is to ingest, transform, and load this data into other AWS services for real-time insights, Kinesis Data Firehose provides a managed and streamlined solution that can handle GZIP compression. The answer can be A , please comment if you have more clarity. After searching more, I also found out the following: (I have missed the SQL requirement in the question) Use Amazon Kinesis Data Analytics if you need SQL-based processing and advanced analytics capabilities for streaming data. Use Amazon Kinesis Data Firehose if your primary requirement is to deliver, transform, and load streaming data into various AWS destinations with simplified configurations, but not for SQL-based processing. A is correct, why D xiyarsan sen? A is correct "allow the use ohttps://www.examtopics.com/exams/amazon/aws-certified-machine-learning-specialty/view/13/#f SQL to query the stream with the LEAST latency?" Well, the only solution that presents SQL query is (A). It's a description of KDA. the term "lease latency" is the the hidden point. with Glue we can have near real-time but Kinesis data analytics will give you real-time transformation with internal lambda A is correct, with KDA you can run sql queries in the data during the streaming (real-time SQL queries). D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket would be the best solution for allowing the use of SQL to query the stream with the least latency. Amazon Kinesis Data Firehose can be configured to transform the data before writing it to Amazon S3 in real-time. Once the data is in S3, it can be queried using SQL with Amazon Athena, which is a serverless query service that allows running standard SQL queries against data stored in Amazon S3. 
This approach provides the lowest latency compared to other options and requires minimal setup and maintenance. Query has to be run on stream so firehose not possible. A is correct. And somehow "transformation" is added to the answer as a requirement when it clearly was not part of the requirement from the question. AAAAAAA what about "LEAST latency"? A is correct. you can pre-process data prior to running SQL queries with Kinesis Data Analytics and Lambda (more or less) is always a best practice :) Answer is B. Kinesis Data Analytics does not do any transformation, it is only for querying. Glue ETL can have scripts that can transform the data so you need lambda But we need to run SQL on real time stream data. - https://www.examtopics.com/discussions/amazon/view/11384-exam-aws-certified-machine-learning-specialty-topic-1/
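A sketch of the Lambda preprocessor in option A, written against the record-transformation contract used by the Kinesis Data Analytics "compressed record" preprocessing blueprint (the field names follow that contract, so treat them as an assumption to verify): each incoming base64-encoded GZIP record is decompressed and handed back so the downstream SQL application can query it.

```python
import base64
import gzip

def lambda_handler(event, context):
    """Preprocess records for a Kinesis Data Analytics application by decompressing GZIP payloads."""
    output = []
    for record in event["records"]:
        try:
            payload = gzip.decompress(base64.b64decode(record["data"]))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        except Exception:
            # Pass the record back unchanged and mark it as failed so it is not silently lost.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```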
63
63 - A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product, such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories, such as books, games, electronics, and movies. Which model should be used for categorizing new products using the provided dataset for training? - A.. An XGBoost model where the objective parameter is set to multi:softmax B.. A deep convolutional neural network (CNN) with a softmax activation function for the last layer C.. A regression forest where the number of trees is set equal to the number of product categories D.. A DeepAR forecasting model based on a recurrent neural network (RNN)
A - Ans: A, XGBoost multi-class classification. https://medium.com/@gabrielziegler3/multiclass-multilabel-classification-with-xgboost-66195e4d9f2d CNN is used for image classification problems. Answer is A. This is a classification problem, thus XGBoost, and since there are six categories, softmax is the right activation function. Deep convolutional neural networks (CNNs) are primarily used for image processing tasks. Given that the dataset provided is structured/tabular in nature (with features like dimensions, weight, and price) and does not mention image data, a CNN is not the most appropriate choice. A. YES - a perfect fit; with multi:softmax the highest-probability class is assigned. B. NO - CNN is for imaging. C. NO - a regression forest is for continuous variables; we need discrete classification. D. NO - this is classification, not forecasting. Option A: XGBoost multi-class classification. A is the answer. The XGBoost algorithm is a popular and effective technique for multi-class classification. The objective parameter can be set to multi:softmax, which uses a softmax objective function for multi-class classification. This trains the model to predict the probability of each product belonging to each category, and the most probable category is chosen as the final prediction. A deep convolutional neural network (CNN) (B) is a powerful technique commonly used for image recognition tasks, but it is less appropriate for tabular data like the dataset provided. A; CNN is used for image classification and would be suitable if we were classifying products using pictures of them. https://xgboost.readthedocs.io/en/stable/parameter.html B - CNN is used for datasets that have "local intermediate features" (e.g., images, or TextCNN). C - We need a classification model, not a regression model. D - RNN is used for datasets that have sequential features. A is correct. A is the best option here: only 1,200 items and 6 classes is not enough data to justify a deep neural architecture for classification. Ans: A. For multi-class classification: multi:softmax. That is a classification problem, so A is the answer. Easy one. A is correct. Definitely A. 100% it is A; the others are clearly wrong. A Convolutional Neural Network (ConvNet or CNN) is a special type of neural network used effectively for image recognition and classification. Recurrent neural networks (RNN) are a class of neural networks that is powerful for modeling sequence data such as time series or natural language. - https://www.examtopics.com/discussions/amazon/view/10089-exam-aws-certified-machine-learning-specialty-topic-1/
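A short sketch of what option A looks like in practice with the open-source XGBoost library, using randomly generated stand-ins for the 1,200-product, 15-feature, 6-category dataset (the data here is synthetic, not from the question).

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the labeled product dataset: 1,200 rows x 15 features, 6 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 15))
y = rng.integers(0, 6, size=1200)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "multi:softmax",  # predicts the class index directly
    "num_class": 6,
    "max_depth": 6,
    "eta": 0.3,
}
booster = xgb.train(params, dtrain, num_boost_round=50)
predicted_categories = booster.predict(dtrain)  # values in 0..5
```

Using multi:softprob instead of multi:softmax would return one probability per class rather than the winning class index.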
64
64 - A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset. Which tool should be used to improve the validation accuracy? - A.. Amazon Comprehend syntax analysis and entity detection B.. Amazon SageMaker BlazingText cbow mode C.. Natural Language Toolkit (NLTK) stemming and stop word removal D.. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer
D - D is correct. Amazon Comprehend syntax analysis =/= Amazon Comprehend sentiment analysis; you need to read the choices very carefully. We're looking only to improve the validation accuracy, and Comprehend syntax analysis would help because the word set is rich and the sentiment-carrying words are infrequent; we're not looking to replace the sentiment analysis tool with Comprehend. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. It provides key phrase extraction, sentiment analysis, entity recognition, topic modeling, and language detection APIs so you can easily integrate natural language processing into your applications (https://aws.amazon.com/comprehend/features/?nc1=h_ls). Simply going through Amazon Comprehend is much easier than the other options, so the more convenient answer is A. Agree; also, the keyword is TOOL, and the rest are frameworks. Both Amazon Comprehend and the TF-IDF-with-a-classifier solution are valid: if ease of use and pre-trained capabilities are high priorities, Comprehend is a solid option; if customization and dataset-specific nuances are crucial, building a custom model with TF-IDF may be needed. Since Comprehend is a tool, I am going with A. D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer. Here's why: this tool from Scikit-learn is effective in handling issues of rich vocabularies and low-frequency words. TF-IDF down-weights words that appear frequently across documents (and thus might be less informative) and gives more weight to words that appear less frequently but might be more indicative of the sentiment. This approach can enhance the model's ability to focus on more relevant features, potentially improving validation accuracy. I think C is correct: stemming involves reducing words to their root or base form, and stop word removal involves removing common words (e.g., "the," "and," "is") that may not contribute much to sentiment analysis. By using NLTK for stemming and stop word removal, you can simplify the vocabulary and potentially improve the model's ability to capture sentiment from the remaining meaningful words. A - syntax and entity recognition won't solve the scenario. B - BlazingText is for words. D - captures the importance of words in a document collection based on the frequency of a word in a document. D is correct, guys. Amazon Comprehend's syntax analysis and entity detection are more about understanding the structure of sentences and identifying entities within the text than tackling the problem of a rich vocabulary with a low average frequency of words. TF-IDF vectorization is a technique that can help reduce the impact of common, low-information words in the dataset while emphasizing the importance of more informative, less frequent words. This could potentially improve the validation accuracy by addressing the identified problem. A. YES - he works on an application and not a model; Amazon Comprehend is the ready-to-use tool he wants, and TF-IDF is built in. B. NO - word2vec will be challenged by low-frequency terms; GloVe and fastText are better for that. C. NO - the vocabulary is rich, so stemming and stop word removal will not address the core issue. D. NO - right approach, but that is not "a tool". Option D: this approach can help reduce the impact of words that occur frequently in the dataset and increase the impact of words that occur less frequently, which can help improve the accuracy of the model. The answer is B.
BlazingText can handle out-of-vocabulary (OOV) words as explained here: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html This is an AWS exam, so why would you choose anything other than A or B? Based on the link, it looks like B most likely. The phrase "low average frequency of words" points directly to the use of TF-IDF. Option A deviates from what the question proposes and is discarded; option B proposes a radical change, in my point of view; option C does not address the quoted passage; option D is correct. The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. I would say that since the buzzword "low average frequency" comes up, the safe choice is the TF-IDF vectorizer. I go for D. The Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer is a widely used tool to mitigate the high dimensionality of text data. Option A, Amazon Comprehend syntax analysis and entity detection, can help in extracting useful features from the text, but it does not address the issue of high dimensionality. Option B, Amazon SageMaker BlazingText cbow mode, is a tool for training word embeddings, which can help to represent words in a lower-dimensional space; however, it does not directly address the issue of high dimensionality and low frequency of words. Option C, Natural Language Toolkit (NLTK) stemming and stop word removal, can reduce the dimensionality of the feature space, but it does not address the issue of low-frequency words that are important for sentiment analysis. The emphasis is on the rich vocabulary, so stemming can help reduce rare word forms to more common ones. BlazingText in cbow mode doesn't seem relevant; it is about predicting words given a context. And TF-IDF, I'm not sure it would do anything except highlight the problem you already have. D. The Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer would be the best tool to use in this scenario. The TF-IDF vectorizer down-weights words that appear across many documents and gives the rarer, more informative words a greater impact on the sentiment analysis, which can help to improve the validation accuracy of the model. - https://www.examtopics.com/discussions/amazon/view/8395-exam-aws-certified-machine-learning-specialty-topic-1/
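For illustration, a minimal scikit-learn pipeline along the lines of option D; the documents and labels are placeholders, not exam data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder reviews and sentiment labels (1 = positive, 0 = negative).
docs = [
    "the outlook for this product is excellent",
    "terrible experience, would not recommend",
    "solid quality and fair price",
    "disappointing build and poor support",
]
labels = [1, 0, 1, 0]

# sublinear_tf dampens very frequent terms; rare, informative words keep more weight.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["excellent quality"]))
```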
65
65 - A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training? - A.. Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution. B.. Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude. C.. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude. D.. Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.
C - Ans: C; normalization is correct. Objection: the answer is not C — what is listed there is the definition of STANDARDIZATION; normalization just scales and is not useful for reducing the effect of outliers. Never mind, ignore this. Guys, I passed the exam today. It is a tough one, but many of the questions are here. Good luck everyone, and thanks ExamTopics! Hi Phong! Please add my Skype: haison8x. Ans: C; normalization is correct. C (yep, STANDARDIZATION is the correct name for it). That's an odd question for me, but C is correct. The answer should be C, as normalization works best in the case of amplitude differences. Hi guys, first, thanks to this website for the information it provides. However, the ML exam has updated most of the questions; only 20+ questions from here were included in today's test. Anyway, it is still helpful. GOOD LUCK EVERYONE! So there are 40+ other questions on the exam that aren't included in ExamTopics? QUESTION 69 A large consumer goods manufacturer has the following products on sale: • 34 different toothpaste variants • 48 different toothbrush variants • 43 different mouthwash variants The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched. Which solution should a Machine Learning Specialist apply? A. Train a custom ARIMA model to forecast demand for the new product. B. Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product. C. Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product. D. Train a custom XGBoost model to forecast demand for the new product. Correct Answer: B https://aws.amazon.com/blogs/machine-learning/forecasting-time-series-with-dynamic-deep-learning-on-aws/ Answer: B QUESTION 68 An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Select TWO.) A. The factorization machines (FM) algorithm B. The Latent Dirichlet Allocation (LDA) algorithm C. The principal component analysis (PCA) algorithm D. The k-means algorithm E. The Random Cut Forest (RCF) algorithm Correct Answer: CD https://aws.amazon.com/blogs/machine-learning/analyze-us-census-data-for-population-segmentation-using-amazon-sagemaker/ Answer: C and D I think the answer is A and B. The census questions and answers will be in text. Use LDA (an unsupervised algorithm), which takes the census questions/answers and groups them into categories; use the categorization to group the people and identify similar people. Use the factorization machine to group the people: for each person, identify whether they answered a question or not; the total number of questions they answered becomes the target variable. Now the problem is similar to movie recommendation (consider each question a movie, and the total number of questions answered the rating). Based on the questions a person answered, the factorization machine groups the people. Findings from both algorithms can be compared to identify the people for the social programs. It's CD; FM is mainly used in recommendation systems to find hidden variables between two known variables, i.e., to find correlations between variables. QUESTION 67 A.
Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. B. Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance. Use a lifecycle configuration script to join the datasets and persist the results in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. C. Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to Amazon S3. Use AWS Glue to join the datasets in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. D. Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as the data is uploaded to Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. Correct Answer: A QUESTION 67 A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes: • Start the workflow as soon as data is uploaded to Amazon S3. • When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3. • Store the results of joining datasets in Amazon S3. • If one of the jobs fails, send a notification to the Administrator. Which configuration will meet these requirements? QUESTION 66 A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only. How should the Machine Learning Specialist transform the dataset to minimize query runtime? A. Convert the records to Apache Parquet format. B. Convert the records to JSON format. C. Convert the records to GZIP CSV format. D. Convert the records to XML format. Correct Answer: A - https://www.examtopics.com/discussions/amazon/view/10090-exam-aws-certified-machine-learning-specialty-topic-1/
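Although option C calls it normalization, what it describes (mean 0, variance 1) is standardization, which looks like this with scikit-learn; the feature values below are made up to show the magnitude difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up economic features with very different magnitudes,
# e.g. GDP in billions next to an interest rate in percent.
X = np.array([
    [21000.0, 2.5],
    [19000.0, 1.8],
    [23000.0, 3.1],
])

X_scaled = StandardScaler().fit_transform(X)  # each column: mean 0, variance 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1
```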
66
66 - A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only. How should the Machine Learning Specialist transform the dataset to minimize query runtime? - A.. Convert the records to Apache Parquet format. B.. Convert the records to JSON format. C.. Convert the records to GZIP CSV format. D.. Convert the records to XML format.
A - Answer A seems correct. Sorry, here is the link: https://aws.amazon.com/blogs/big-data/prepare-data-for-model-training-and-invoke-machine-learning-models-with-amazon-athena/ A (most queries will span 5 to 10 columns only). Option A. The clue is that most queries will span only 5 to 10 of the 200 columns, which indicates a data-warehouse-style columnar storage format. Option A is correct. A. See https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/ - https://www.examtopics.com/discussions/amazon/view/19369-exam-aws-certified-machine-learning-specialty-topic-1/
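A small sketch of the CSV-to-Parquet conversion behind option A using pandas (with pyarrow installed); the file names are placeholders. Once the data is in Parquet, Athena scans only the 5 to 10 columns a query touches instead of all 200.

```python
import pandas as pd

# Placeholder paths; pandas writes Parquet via pyarrow (or fastparquet).
df = pd.read_csv("records.csv")
df.to_parquet("records.parquet", index=False)  # columnar, compressed output for Athena
```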
67
67 - A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes: * Start the workflow as soon as data is uploaded to Amazon S3. * When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3. * Store the results of joining datasets in Amazon S3. * If one of the jobs fails, send a notification to the Administrator. Which configuration will meet these requirements? - A.. Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. B.. Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance. Use a lifecycle configuration script to join the datasets and persist the results in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. C.. Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to Amazon S3. Use AWS Glue to join the datasets in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure. D.. Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as the data is uploaded to Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
A - A: Correct. S3 events can trigger an AWS Lambda function. B: Wrong; there is nothing to do with SageMaker in the provided context. C: Wrong; AWS Batch cannot receive events from S3 directly. D: Wrong; it will not meet the requirement "When all the datasets are available in Amazon S3...". https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-cloudwatch-events-s3.html I agree. Step Functions can be used to implement a workflow — in this case, wait for all the datasets to be loaded before triggering the Glue job. Actually, I think that D does meet the requirement of waiting until all datasets are in S3, BUT you do need Glue to join the datasets; the answer is still A. Option A. Batch isn't event-driven; the answer is A. If EMR were present I would have chosen that because of the size of the dataset, otherwise it is Glue. Exactly — this is also where I got confused, since Glue is not great at handling data this large: multiple terabyte-sized datasets, multiple ETL jobs, daily. A. The answer option glosses over pieces like the Lambda functions and EventBridge. https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/ https://d1.awsstatic.com/r2018/a/product-page-diagram-aws-step-functions-use-case-aws-glue.bc69d97a332c2dd29abb724dd747fd82ae110352.png - https://www.examtopics.com/discussions/amazon/view/26038-exam-aws-certified-machine-learning-specialty-topic-1/
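A rough sketch of the Lambda side of option A: an S3 upload event starts the Step Functions workflow that later runs the Glue join. The state machine ARN and the input fields are placeholders.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder state machine ARN for the ETL workflow.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:etl-workflow"


def lambda_handler(event, context):
    """Triggered by an S3 object-created event; kicks off the Step Functions
    workflow that waits for all datasets and then runs the AWS Glue join job."""
    s3_info = event["Records"][0]["s3"]
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({
            "bucket": s3_info["bucket"]["name"],
            "key": s3_info["object"]["key"],
        }),
    )
    return execution["executionArn"]
```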
68
68 - An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Choose two.) - A.. The factorization machines (FM) algorithm B.. The Latent Dirichlet Allocation (LDA) algorithm C.. The principal component analysis (PCA) algorithm D.. The k-means algorithm E.. The Random Cut Forest (RCF) algorithm
CD - C: (OK) Use PCA to reduce the number of variables; each citizen's response has answers for 500 questions, so there are 500 variables. D: (OK) Use k-means clustering. A: (Not OK) The factorization machines algorithm is usually used for tasks dealing with high-dimensional sparse datasets. B: (Not OK) The Latent Dirichlet Allocation (LDA) algorithm is for tasks dealing with topic modeling in NLP. E: (Not OK) Random Cut Forest is for detecting anomalies in data. https://aws.amazon.com/blogs/machine-learning/analyze-us-census-data-for-population-segmentation-using-amazon-sagemaker/ Answer: C and D. If the form contains free-text answers, it would be interesting to apply LDA to identify the most frequent/relevant topics in the answers. Option C and D. The answer depends on the type of question: if it is open-ended, you would need LDA, hence B and D; but if each question is a feature, then PCA should work. C and D are the way. CD: C for reducing the number of columns, D for clustering the data. C. The principal component analysis (PCA) algorithm. D. The k-means algorithm. PCA is a dimensionality reduction technique that can be used to identify the underlying structure of the census data. This algorithm can help to identify the most important questions and provide an overview of the relationship between the questions and the responses. K-means is an unsupervised learning algorithm that can be used to segment the population into different groups based on their responses to the census questions. This algorithm can help to determine the healthcare and social program needs by province and city based on the responses collected from each citizen. These algorithms can help to provide insights into the patterns and relationships within the census data, which can inform decision making for healthcare and social program planning. Reduce dimensionality and cluster subjects. This is the same question as Topic 2 Q3. How do I reach Topic 2? Every question here seems to belong to Topic 1. - https://www.examtopics.com/discussions/amazon/view/17281-exam-aws-certified-machine-learning-specialty-topic-1/
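For intuition, a minimal scikit-learn version of the PCA-then-k-means approach (C and D); the census responses here are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random stand-in for census responses: one row per citizen, ~500 numeric answers.
rng = np.random.default_rng(0)
responses = rng.normal(size=(10_000, 500))

components = PCA(n_components=25).fit_transform(responses)           # 500 -> 25 dimensions
segments = KMeans(n_clusters=8, n_init=10).fit_predict(components)   # population segments
print(np.bincount(segments))
```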
69
69 - A large consumer goods manufacturer has the following products on sale: * 34 different toothpaste variants * 48 different toothbrush variants * 43 different mouthwash variants The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched. Which solution should a Machine Learning Specialist apply? - A.. Train a custom ARIMA model to forecast demand for the new product. B.. Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product. C.. Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product. D.. Train a custom XGBoost model to forecast demand for the new product.
B - B. https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html "...When your dataset contains hundreds of related time series, DeepAR outperforms the standard ARIMA and ETS methods. You can also use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on." "You can also use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on" https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html 'autoregressive integrated moving average (ARIMA)' <--> DeepAR. https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html B - DeepAR is based on GluonTS and can use multiple time series for learning. Option B. DeepAR for new products, forever! The DeepAR algorithm is a powerful time-series forecasting algorithm designed to handle multiple time series; it can handle irregularly spaced time series and missing values, making it a good fit for this task. Additionally, the large amount of sales history data available in Amazon S3 makes a deep learning algorithm like DeepAR more appropriate. B https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html It is B. This is the same question as Topic 2 Q4. - https://www.examtopics.com/discussions/amazon/view/17280-exam-aws-certified-machine-learning-specialty-topic-1/
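A sketch of launching the built-in DeepAR algorithm with the SageMaker Python SDK (v2). The role ARN, S3 paths, and hyperparameter values are placeholders; the channel data would be JSON Lines time series derived from the S3 sales history.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    time_freq="D", context_length=30, prediction_length=30, epochs=100
)
estimator.fit({
    "train": "s3://my-bucket/deepar/train/",  # placeholder paths
    "test": "s3://my-bucket/deepar/test/",
})
```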
70
70 - A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS. How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3? - A.. Define security group(s) to allow all HTTP inbound/outbound traffic and assign those security group(s) to the Amazon SageMaker notebook instance. B.. Configure the Amazon SageMaker notebook instance to have access to the VPC. Grant permission in the KMS key policy to the notebook's KMS role. C.. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role. D.. Assign the same KMS key used to encrypt data in Amazon S3 to the Amazon SageMaker notebook instance.
C - Should be C. "You don’t need to specify the AWS KMS key ID when you download an SSE-KMS-encrypted object from an S3 bucket. Instead, you need the permission to decrypt the AWS KMS key. When a user sends a GET request, Amazon S3 checks if the AWS Identity and Access Management (IAM) user or role that sent the request is authorized to decrypt the key associated with the object. If the IAM user or role belongs to the same AWS account as the key, then the permission to decrypt must be granted on the AWS KMS key’s policy." https://aws.amazon.com/premiumsupport/knowledge-center/decrypt-kms-encrypted-objects-s3/?nc1=h_ls Should be C. I think it is not possible to assign a key directly to a Sagemaker notebook instance like D suggests. Normally in AWS in general, IAM roles are used to do so. So C. 'IAM role' principle of least privilege (PoLP) IAM roles securely provide temporary AWS credentials that services (like SageMaker notebooks) can assume to access other resources. This avoids using long-lived access keys or directly embedding API keys into code. KMS Key Policy: This policy controls access to your KMS key. Granting the notebook's role permission within this policy lets SageMaker decrypt the data when reading from S3. Seems to follow the best cloud authorization practice IAM role associated with the SageMaker notebook instance must be given permissions in the KMS key policy to decrypt the data using the KMS key that was used for encryption. answer is C Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role. To read data from Amazon S3 that is encrypted with AWS KMS, the Amazon SageMaker notebook instance needs to have both S3 read access and KMS decrypt permissions. This can be achieved by assigning an IAM role to the notebook instance that has the necessary policies attached, and by granting permission in the KMS key policy to that role. C only. Should be C. The reference doc provided did not have any information about assigning keys to the notebook. Doing so become very cumbersome as you can have 100's of notebooks and its not scalable. Someone needs to moderate these answers. To allow an Amazon SageMaker notebook instance to read a dataset stored in an Amazon S3 bucket that is protected with server-side encryption using AWS KMS, the ML Specialist should assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. The IAM role should have permissions to access the S3 bucket and the KMS key that was used to encrypt the data. This role should be granted permission in the KMS key policy to allow it to decrypt the data. To encrypt the machine learning (ML) storage volume that is attached to notebooks, processing jobs, training jobs, hyperparameter tuning jobs, batch transform jobs, and endpoints, you can pass a AWS KMS key to SageMaker. If you don't specify a KMS key, SageMaker encrypts storage volumes with a transient key and discards it immediately after encrypting the storage volume. For notebook instances, if you don't specify a KMS key, SageMaker encrypts both OS volumes and ML data volumes with a system-managed KMS key. I correct myself- Option C is correct: Background AWS Key Management Service (AWS KMS) enables Server-side encryption to protect your data at rest. Amazon SageMaker training works with KMS encrypted data if the IAM role used for S3 access has permissions to encrypt and decrypt data with the KMS key. 
Further, a KMS key can also be used to encrypt the model artifacts at rest using Amazon S3 server-side encryption. Additionally, a KMS key can also be used to encrypt the storage volume attached to training, endpoint, and transform instances. In this notebook, we demonstrate SageMaker encryption capabilities using KMS-managed keys. resource: https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/handling_kms_encrypted_data/handling_kms_encrypted_data.ipynb Option D is correct if sagemaker does the encryption, if you are dealing with encrypted data then C is 100% correct. C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role. To access the encrypted dataset in Amazon S3, the Amazon SageMaker notebook instance must have the appropriate permissions. This can be achieved by assigning an IAM role to the notebook with read access to the dataset in Amazon S3 and granting permission in the KMS key policy to that role. This ensures that the notebook has the necessary permissions to access the encrypted data in Amazon S3, while adhering to best practices for securing sensitive data. agreed with C Answer is C : Open the IAM console. Add a policy to the IAM user that grants the permissions to upload and download from the bucket. You can use a policy that's similar to the following: https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-access-default-encryption/ (number 2) Seems to be D https://docs.aws.amazon.com/sagemaker/latest/dg/encryption-at-rest-nbi.html Not D as if you assign the key in the notebook, that's not secure, it will make the encryption ineffective. Instead, you assign the access permission by using IAM. - https://www.examtopics.com/discussions/amazon/view/43708-exam-aws-certified-machine-learning-specialty-topic-1/
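To make option C concrete, here is an illustrative KMS key policy statement granting the notebook's execution role decrypt access; the role ARN is a placeholder, and S3 read permissions would be attached to the role separately.

```python
import json

# Illustrative KMS key policy statement: the notebook's execution role
# (placeholder ARN) may decrypt objects that S3 encrypted with this key.
key_policy_statement = {
    "Sid": "AllowSageMakerNotebookDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/SageMakerNotebookRole"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}
print(json.dumps(key_policy_statement, indent=2))
```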
71
71 - A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing. The Data Scientist has been given the following requirements to the cloud solution: ✑ Combine multiple data sources. ✑ Reuse existing PySpark logic. ✑ Run the solution on the existing schedule. ✑ Minimize the number of servers that will need to be managed. Which architecture should the Data Scientist use to build this solution? - A.. Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use. B.. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use. C.. Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use. D.. Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.
B - B it is. I agree, B is serverless and reuses PySpark. A similar example is shown here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-medicaid.html A is not correct because of "Minimize the number of servers that will need to be managed": EMR is not serverless. B is correct: AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load jobs. C is not correct because using Lambda for ETL you will not be able to reuse the existing PySpark logic. D is not correct because Kinesis Data Analytics cannot reuse the existing PySpark logic and targets streaming rather than scheduled batch processing. Option B (using AWS Glue for the ETL process) is the best solution for the described requirements. A: This solution requires managing an Amazon EMR cluster, which would involve more server management than AWS Glue, violating the requirement to minimize the number of servers to be managed. C: AWS Lambda is not ideal for this use case because it has resource limitations, including memory and execution time limits (15 minutes max), which might not be suitable for large-scale ETL operations involving PySpark logic. D: Amazon Kinesis Data Analytics is focused on real-time stream processing, which doesn't fit the described scheduled batch processing scenario. Dissent: the answer is A, as B clearly mentions that the PySpark code is written by leveraging the already existing code; also, the current on-premises architecture will have more servers than solution A. Amazon Kinesis Data Analytics is more suited to real-time processing and streaming data; the given use case does not indicate a need for real-time processing, so this is not the best fit, and it doesn't support PySpark natively. Voted B based on the serverless (minimum servers) requirement and https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming.html Indeed B, using Glue. B is correct. A: you have to manage EMR, so it's wrong. D: you don't use Spark, so it's wrong. C: you will not be using Spark, so it's wrong. B ticks all boxes. Minimize servers -> AWS managed services -> Glue. Dissenting view: Option A would be the best response for this scenario. This solution allows the Data Scientist to reuse the existing PySpark logic while migrating the ETL process to the cloud. The raw data is written to Amazon S3, and a Lambda function is scheduled to trigger a Spark step on a persistent EMR cluster based on the existing schedule. The PySpark logic is used to run the ETL job on the EMR cluster, and the results are output to a processed location in Amazon S3 that is accessible for downstream use. This solution minimizes the number of servers that need to be managed and allows a seamless migration of the existing ETL process to the cloud. Option D is wrong; it should be B. D cannot be the answer, as there is no streaming data or real-time processing. The answer is B. The answer should be B: serverless, on a regular schedule (no real-time requirement), and it reuses the PySpark code in the Glue ETL script. The answer is B, as they specifically ask about reusing existing PySpark, which can be done with Glue: https://docs.aws.amazon.com/glue/latest/dg/creating_running_workflows.html It is B! "Minimize number of servers to be managed": B is a serverless solution which fulfils the other requirements. - https://www.examtopics.com/discussions/amazon/view/43716-exam-aws-certified-machine-learning-specialty-topic-1/
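A skeleton of what option B's Glue job script could look like; the dataset paths and join key are placeholders, with the existing PySpark logic dropped in between the read and the write.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Placeholder sources; this is where the existing PySpark logic is reused.
orders = spark.read.parquet("s3://my-bucket/raw/orders/")
customers = spark.read.parquet("s3://my-bucket/raw/customers/")
combined = orders.join(customers, on="customer_id", how="inner")
combined.write.mode("overwrite").parquet("s3://my-bucket/processed/")

job.commit()
```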
72
72 - A Data Scientist is building a model to predict customer churn using a dataset of 100 continuous numerical features. The Marketing team has not provided any insight about which features are relevant for churn prediction. The Marketing team wants to interpret the model and see the direct impact of relevant features on the model outcome. While training a logistic regression model, the Data Scientist observes that there is a wide gap between the training and validation set accuracy. Which methods can the Data Scientist use to improve the model performance and satisfy the Marketing team's needs? (Choose two.) - A.. Add L1 regularization to the classifier B.. Add features to the dataset C.. Perform recursive feature elimination D.. Perform t-distributed stochastic neighbor embedding (t-SNE) E.. Perform linear discriminant analysis
AC - AC is the correct answer. I think A, C, and E are all correct. A. YES - standard for overfitting. B. NO - we already have too much overfitting. C. YES - feature elimination can reduce model complexity and thus overfitting. D. NO - that does dimensionality reduction to 2D or 3D, for visualization; we want more than a few features. E. NO - LDA is an alternative to logistic regression; it may not address overfitting. A, due to overfitting. C: Recursive feature elimination (RFE) is a wrapper method that iteratively removes features based on their importance scores from a classifier. RFE starts with all features and then eliminates the least important ones until a desired number of features is reached. This can help to reduce the dimensionality of the dataset and improve the model performance by removing irrelevant or redundant features. The Marketing team can then interpret the model by looking at the remaining features and their importance scores. AC are correct. How can we add features to the dataset provided? We can't make them up from thin air. Hopefully the moderators can provide some insight on this; I was thinking of paying for this site, but the answers are all over the place. A. Add L1 regularization to the classifier and C. Perform recursive feature elimination are the methods that can be used to improve the model performance and satisfy the Marketing team's needs. Explanation: A. Adding L1 regularization to the logistic regression classifier can help to improve the model performance and reduce overfitting. It can also help to highlight the relevant features for churn prediction, as L1 regularization can shrink the coefficients of irrelevant features to zero. C. Recursive feature elimination can be used to select the most relevant features for the model. This can help to improve the model performance and highlight the relevant features for churn prediction. A. Adding L1 regularization can help to reduce overfitting by shrinking the coefficients of less important features towards zero, which can improve the model's generalization performance on the validation set. C. Recursive feature elimination is a feature selection technique that removes the least important feature at each iteration and trains the model on the remaining features until a desired number of features is reached. This method can be used to identify the most relevant features for the prediction task and reduce the dimensionality of the dataset, leading to improved model performance and interpretability for the Marketing team. AC. Key: a logistic regression model is non-linear in terms of odds and probability, but linear in terms of log odds. Key: a large gap between training and validation accuracy means overfitting => 5 techniques to prevent overfitting: 1. Simplify the model | 2. Early stopping | 3. Use data augmentation | 4. Use regularization | 5. Use dropout. A - yes, to avoid overfitting (although I am thinking it is talking about a regressor). Not B - adding features will lead to more overfitting. C - feature elimination prevents overfitting. Not D - t-SNE is a nonlinear dimensionality reduction technique. Not E - it finds feature correlations only; linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events. Won't L1 naturally do the feature elimination? I guess AB. Why not A & D, or C & D?
does not t-SNE grant the marketing team's wish for visualization of relationships? or are we to presume that A&C are best as C (recursive feature elimination) grants us some visualization of feature importance. AC is correct overfitting: add regularization, remove features - https://www.examtopics.com/discussions/amazon/view/74983-exam-aws-certified-machine-learning-specialty-topic-1/
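A compact scikit-learn sketch of the two chosen techniques (L1 regularization and recursive feature elimination) on synthetic data standing in for the 100 churn features.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the churn dataset: 100 continuous features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 100))
y = rng.integers(0, 2, size=2000)

# A: L1 regularization drives uninformative coefficients to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept_by_l1 = np.flatnonzero(l1_model.coef_[0])

# C: recursive feature elimination keeps the most predictive features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20, step=10).fit(X, y)
kept_by_rfe = np.flatnonzero(rfe.support_)
print(len(kept_by_l1), len(kept_by_rfe))
```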
73
73 - An aircraft engine manufacturing company is measuring 200 performance metrics in a time-series. Engineers want to detect critical manufacturing defects in near-real time during testing. All of the data needs to be stored for offline analysis. What approach would be the MOST effective to perform near-real time defect detection? - A.. Use AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from within AWS IoT Analytics to carry out analysis for anomalies. B.. Use Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry out Apache Spark ML k-means clustering to determine anomalies. C.. Use Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random Cut Forest (RCF) algorithm to determine anomalies. D.. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for further analysis.
D - D: near-real time. The main problem with D is that Amazon Kinesis Data Firehose cannot be a source service for Amazon Kinesis Data Analytics. The answer would be correct if it said "Use Amazon Kinesis Data Streams to ingest data, Amazon Kinesis Data Analytics for defect detection, and Amazon Kinesis Data Firehose for storing data for further analysis." https://docs.aws.amazon.com/firehose/latest/dev/create-name.html Actually, Kinesis Data Firehose can be used for data ingestion, so the correct option is still D. Glad we are all in agreement; D is the correct answer. Amazon Kinesis Data Firehose is a fully managed service for real-time data ingestion, which fits the requirement for near-real-time defect detection. It can ingest large volumes of data from various sources and reliably load the data into other AWS services like Amazon S3 for storage. Amazon Kinesis Data Analytics with Random Cut Forest (RCF) is highly efficient for detecting anomalies in streaming data in near real time, which is what the engineers need to catch manufacturing defects during testing. After detecting anomalies, the data can be stored in Amazon S3 via Kinesis Data Firehose for offline analysis. D - Firehose for near real time. Kinesis Data Firehose is a fully managed service that can ingest streaming data and load it into destinations like S3, Redshift, and Elasticsearch; combined with Kinesis Data Analytics and RCF, and then Data Firehose again to store on S3, D is the best choice. https://docs.aws.amazon.com/managed-flink/latest/java/get-started-exercise-fh.html Kinesis seems like the only viable option. The answer is D: since data is continuously coming in, Kinesis Data Firehose is our streaming service (we also need near-real-time defect detection and storage in S3), and anomaly detection can be done by the Kinesis Data Analytics application (RCF algorithm). D; near-real-time ingestion is the key. A. NO - AWS IoT Analytics will first store the data, then make it available for analytics/Jupyter (https://docs.aws.amazon.com/iotanalytics/latest/userguide/welcome.html), so not real time. B. NO - not real time, since the data is stored before analytics. C. NO - not real time, since the data is stored before analytics. D. YES - a real-time pipe, and RCF is best for anomalies. How can someone use S3 for ingestion? Firehose is the right answer. This option meets the requirements of performing near-real-time defect detection, storing all the data for offline analysis, and handling 200 performance metrics in a time series. Amazon Kinesis Data Firehose is a fully managed service that can ingest streaming data from various sources and deliver it to destinations such as Amazon S3, Amazon OpenSearch Service, and Amazon Redshift. Amazon Kinesis Data Analytics is a service that can process streaming data using SQL or Apache Flink applications. Amazon Kinesis Data Analytics provides a built-in RANDOM_CUT_FOREST function, a machine learning algorithm that can detect anomalies in streaming data. This function can handle high-dimensional data and assign an anomaly score to each record based on how distant it is from other records. The anomaly scores can then be delivered to another destination using Kinesis Data Firehose or consumed by other applications using Kinesis Data Streams. D is correct. If the question says "data streaming", "real-time data", or "near real time", you should look for Kinesis services. B and C are totally wrong: it's not possible to use S3 for ingestion, only for storage.
D, https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html At a minimum the moderators should put some explanation when the community vote overwhelmingly for a different option. Option D is not necessarily incorrect, but it may not be the most effective approach to perform near-real time defect detection in this scenario. Here are some potential drawbacks of this approach: Amazon Kinesis Data Firehose is primarily used for data ingestion and delivery to other services, and may not be the best choice for real-time analysis. Using Amazon Kinesis Data Analytics for anomaly detection may be less flexible than using Amazon SageMaker, which provides a wide range of algorithms and models for anomaly detection. Random Cut Forest (RCF) is a popular anomaly detection algorithm used for time-series data, and Amazon SageMaker provides an RCF implementation that can be used for anomaly detection in real-time or offline. While Amazon Kinesis Data Analytics also provides RCF, using Amazon SageMaker may be a better choice for scalability and flexibility. Yes, option C can provide near real-time defect detection. Amazon SageMaker's Random Cut Forest (RCF) algorithm is designed to work with streaming data and can detect anomalies in near real-time. It can process data in batches as small as a single data point, making it well-suited for real-time anomaly detection. In this scenario, if the manufacturing process is generating data in real-time, it can be ingested into Amazon S3 and processed by Amazon SageMaker's RCF algorithm, allowing for near real-time detection of critical manufacturing defects during testing. this is ridiculous. How can you store in s3 and then conduct real-time analysis? - https://www.examtopics.com/discussions/amazon/view/43717-exam-aws-certified-machine-learning-specialty-topic-1/
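On the ingestion side of option D, producers simply put records onto the Firehose delivery stream; a minimal boto3 sketch with a placeholder stream name and metric payload.

```python
import json

import boto3

firehose = boto3.client("firehose")

# Placeholder delivery stream name; each record carries one time step of engine metrics.
payload = {"engine_id": "E-42", "vibration": 0.83, "temp_c": 412.5}
firehose.put_record(
    DeliveryStreamName="engine-test-metrics",
    Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
)
```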
74
74 - A Machine Learning team runs its own training algorithm on Amazon SageMaker. The training algorithm requires external assets. The team needs to submit both its own algorithm code and algorithm-specific parameters to Amazon SageMaker. What combination of services should the team use to build a custom algorithm in Amazon SageMaker? (Choose two.) - A.. AWS Secrets Manager B.. AWS CodeStar C.. Amazon ECR D.. Amazon ECS E.. Amazon S3
CE - CE is the right answer. ECR uses ECS internally when used with SageMaker. CE, based on the criteria and this documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-mkt-create-model-package.html "For Location of inference image, type the path to the image that contains your inference code. The image must be stored as a Docker container in Amazon ECR. For Location of model data artifacts, type the location in S3 where your model artifacts are stored." The answer is correct, but the explanation is off: the question is about how to bring your own algorithm in a container, not how to publish it on the Marketplace (which is what that link covers). The right citation would be "Adapting your own training container": create an S3 location to store model artifacts, and push the code image to ECR. CE IS THE CORRECT ANSWER, 100%. Amazon ECR (Option C): Amazon Elastic Container Registry (ECR) is used to store, manage, and deploy Docker container images. The team can package their custom algorithm code into a Docker container and store it in Amazon ECR. Amazon S3 (Option E): Amazon Simple Storage Service (S3) is used to store external assets and data. The team can store the algorithm-specific parameters and any other required data in Amazon S3. Amazon ECR is a fully managed container registry service that allows users to store, manage, and deploy Docker container images. Amazon SageMaker supports using custom Docker images for training and inference, which can contain the user's own training algorithm and any external assets or dependencies. The user can push their Docker image to Amazon ECR and then reference it in their Amazon SageMaker training job configuration. CE is correct: ECR for the code, S3 for the parameters! C contains the algorithm's image and E contains the algorithm's parameters. The location of the model artifacts: model artifacts can either be packaged in the same Docker container as the inference code or stored in Amazon S3. Not so sure: https://aws.amazon.com/blogs/machine-learning/bringing-your-own-custom-container-image-to-amazon-sagemaker-studio-notebooks/ If you wish to use your private VPC to securely bring your custom container, you also need the following: a VPC with a private subnet; VPC endpoints for the following services: Amazon Simple Storage Service (Amazon S3), Amazon SageMaker, Amazon ECR, AWS Security Token Service (AWS STS); and CodeBuild for building Docker containers. Answer C+E. For me, CD: it needs storage, and you create a custom Docker image using ECR to store it. Sorry, CE is correct: SageMaker will spin up the instances needed with the right image, so there is no need to use ECS. CE is right. - https://www.examtopics.com/discussions/amazon/view/43916-exam-aws-certified-machine-learning-specialty-topic-1/
75
75 - A Machine Learning Specialist wants to determine the appropriate SageMakerVariantInvocationsPerInstance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5. Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMakerVariantInvocationsPerInstance setting? - A.. 10 B.. 30 C.. 600 D.. 2,400
C - C is correct. SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60; AWS recommends a safety factor of 0.5. Answer C: SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60. https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-scaling-loadtest.html To calculate the SageMakerVariantInvocationsPerInstance setting, we can use that equation, where MAX_RPS is the maximum RPS that the variant can handle, SAFETY_FACTOR is the safety factor we choose to ensure we don't exceed the maximum RPS, and 60 converts from RPS to invocations per minute. Plugging in the given values: SageMakerVariantInvocationsPerInstance = (20 * 0.5) * 60 = 10 * 60 = 600. Therefore, the Specialist should set SageMakerVariantInvocationsPerInstance to 600. Answer C. Maximum requests at peak time = 20 RPS = 20 x 60 = 1,200 requests per minute; applying the safety factor of 0.5 gives 1,200 * 0.5 = 600, so the basic setting of the parameter is 600 (requests per minute). - https://www.examtopics.com/discussions/amazon/view/43915-exam-aws-certified-machine-learning-specialty-topic-1/
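The same calculation expressed in code, together with an illustrative target-tracking policy via Application Auto Scaling; the endpoint and variant names are placeholders.

```python
import boto3

max_rps = 20          # peak sustained requests/second from the load test
safety_factor = 0.5   # conservative factor for a first deployment
target_invocations = float(max_rps * safety_factor * 60)  # 600 invocations/minute/instance

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder endpoint/variant

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": target_invocations,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```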
76
76 - A company uses a long short-term memory (LSTM) model to evaluate the risk factors of a particular energy sector. The model reviews multi-page text documents to analyze each sentence of the text and categorize it as either a potential risk or no risk. The model is not performing well, even though the Data Scientist has experimented with many different network structures and tuned the corresponding hyperparameters. Which approach will provide the MAXIMUM performance boost? - A.. Initialize the words by term frequency-inverse document frequency (TF-IDF) vectors pretrained on a large collection of news articles related to the energy sector. B.. Use gated recurrent units (GRUs) instead of LSTM and run the training process until the validation loss stops decreasing. C.. Reduce the learning rate and run the training process until the training loss stops decreasing. D.. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the energy sector.
D - I think the right answer is D. D is correct; C is not the best answer because the question states that tuning hyperparameters didn't help much. Transfer learning would be the better solution: using word2vec embeddings would give the model more accurate representations of words from the start, potentially leading to a significant performance boost for text classification tasks. A. NO - transfer learning helps, but word2vec > TF-IDF, as the former takes part of the word context into account (there is a hyperparameter for this). B. NO - LSTM delivers better results than GRU, which is in turn a compromise architecture that trades accuracy for training time/cost. C. NO - hyperparameter tuning has already been applied; this will not help. D. YES - transfer learning will help, and word2vec is the better option in this scenario. How are the 'correct' answers being provided? I'm seeing so many answers that seem to be wrong, and usually the community vote seems to be correct. This is kind of frustrating. Word2vec is a technique that can learn distributed representations of words, also known as word embeddings, from large amounts of text data. Word embeddings can capture the semantic and syntactic similarities and relationships between words, and can be used as input features for neural network models. Word2vec can be trained on domain-specific corpora to obtain more relevant and accurate word embeddings for a particular task. From my perspective, B and C are wrong because the Data Scientist already tried something close to this; D is correct. I don't think high dimensionality is taken care of by C2V; TF-IDF is required. A. Transfer learning, in my experience, has been a good way to boost performance when hyperparameter tuning did not work. The case asks for predicting labels for sentences, so the appropriate algorithm is text classification, which, just like word2vec, is part of BlazingText. The answer should be D. My reasoning is that by using a word embedding trained on domain-specific material, the embeddings between words are more domain-specific, which means that relations (good or bad) are represented better, and the model should be able to predict results more accurately. Both A & D "seem" correct, but word2vec takes the ORDER of words into account (to some extent) while TF-IDF does not, so the maximum boost is from D. B and C are wrong because the Data Scientist has already tried several network architectures (i.e., option B) and hyperparameter tuning (i.e., option C). I think the answer is A, as the model reviews multi-page text documents. I think general TF-IDF vectors cannot be directly fed to the deep learning model because of the large dimension of the vectors. I think it should be B. A/D are false flags because the question doesn't specify what kind of data engineering is currently done on the inputs as a baseline. Per Wikipedia, for GRUs, "GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets", which fits the context of a particular energy sector. Why not B? Generally, LSTM has better performance than GRU on large datasets such as multi-page documents; GRU has advantages in memory allocation and training time. Early stopping can give the model better performance, but I think the model needs more conditions, like a patience value for early stopping, because the model doesn't always reach its maximum performance exactly when the validation loss stops decreasing. It cannot be C, because hyperparameter tuning didn't work, as given in the question.
Also, A and D are same, however, word2vec model internally implements tf-idf much more efficiently. So answer got to be D but they need to classify the whole sentence i think for such a case we use object2vec not word2vec, but since it's not available in the answers, B is the only answer left. I go for C - https://www.examtopics.com/discussions/amazon/view/43874-exam-aws-certified-machine-learning-specialty-topic-1/
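To make the consensus concrete, here is a minimal sketch of option D: seeding an LSTM classifier's embedding layer with pretrained word2vec vectors before fine-tuning on the risk-labeling task. It assumes gensim and TensorFlow/Keras are available; the vector file name, the tiny vocabulary, and the layer sizes are illustrative assumptions, not part of the question.

    # Sketch: initialize the embedding layer from pretrained word2vec vectors.
    import numpy as np
    from gensim.models import KeyedVectors
    from tensorflow.keras import layers, models, initializers

    # Hypothetical vectors pretrained on energy-sector news articles.
    w2v = KeyedVectors.load_word2vec_format("energy_news_w2v.bin", binary=True)

    vocab = ["drilling", "outage", "regulation"]          # hypothetical vocabulary
    word_index = {w: i + 1 for i, w in enumerate(vocab)}  # index 0 reserved for padding

    embedding_dim = w2v.vector_size
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, idx in word_index.items():
        if word in w2v:
            embedding_matrix[idx] = w2v[word]

    model = models.Sequential([
        layers.Embedding(
            input_dim=embedding_matrix.shape[0],
            output_dim=embedding_dim,
            embeddings_initializer=initializers.Constant(embedding_matrix),
            trainable=True,                               # fine-tune on the labeled sentences
        ),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),            # risk / no risk per sentence
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])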
77
77 - A Machine Learning Specialist needs to move and transform data in preparation for training. Some of the data needs to be processed in near-real time, and other data can be moved hourly. There are existing Amazon EMR MapReduce jobs that clean the data and perform feature engineering on it. Which of the following services can feed data to the MapReduce jobs? (Choose two.) - A.. AWS DMS B.. Amazon Kinesis C.. AWS Data Pipeline D.. Amazon Athena E.. Amazon ES
BC - should be BC Agreed, AWS Example: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html It is obviously B and C, I am frustrated with the number of wrong answers. Why the moderator's answers keep being super weird? B for near real-time C for hourly The right answer is BC AWS Data Pipeline (Option C) can be used to move the hourly data, as it provides a way to move data from various sources to Amazon EMR for processing. Amazon Kinesis (Option B) can be used to process data in near-real time, as it is a real-time data streaming service that can handle large amounts of incoming data from multiple sources. The data can be fed to Amazon EMR MapReduce jobs for processing. should be BC Kinesis for near realtime data and pipeline for the other data moved hourly. AWS ES is an elastic search , it is nothing to do with this question. Kinesis data into EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-kinesis.html BC. easy I believe the answer is BC BD. Data Pipeline is to orchestrate the workflow, how can that feed data to the MR jobs? Answer is B and C Answer is for sure BC Ans is BC. (https://aws.amazon.com/jp/emr/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) - https://www.examtopics.com/discussions/amazon/view/43721-exam-aws-certified-machine-learning-specialty-topic-1/
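For the near-real-time half of the pipeline, a hedged boto3 sketch of the producer side is below: records are pushed into a Kinesis stream, which an EMR cluster can then read via the EMR Kinesis connector, while AWS Data Pipeline handles the hourly batch path. The stream name, region, and payload are assumptions.

    # Sketch: feed near-real-time records to a Kinesis stream consumed by EMR.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"event_id": "evt-0001", "value": 42.0}   # hypothetical payload
    kinesis.put_record(
        StreamName="ml-feature-stream",                # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["event_id"],
    )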
78
78 - A Machine Learning Specialist previously trained a logistic regression model using scikit-learn on a local machine, and the Specialist now wants to deploy it to production for inference only. What steps should be taken to ensure Amazon SageMaker can host a model that was trained locally? - A.. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR. B.. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3. C.. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub. D.. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.
A - Ans : A Refer the below : https://sagemaker-workshop.com/custom/containers.html A https://sagemaker-workshop.com/custom/containers.html You need the container to be hosted on ECR. A. YES - the inference code is built after inspecting the coefficient of the Linear Model (or, alternatively, the model can be serialized via pickle and the inference code is simply to unserialized the mode); ECR is only registry supported by SageMaer; tagging the Docker image with the registry hostname (eg. docker tag image1 public.ecr.aws/g6h7x5m5/image1) is required so that the docker push command knows where to push the image B. NO - no need to compress; image must be on ECR C. NO - no need to compress; image must be on ECR D. NO - image must be on ECR A is the right answer For SageMaker to run a container for training or hosting, it needs to be able to find the image hosted in the image repository, Amazon Elastic Container Registry (Amazon ECR). The three main steps to this process are building locally, tagging with the repository location, and pushing the image to the repository. A for sure. Answer is A. Docker Hub is a repository so ANS D makes no sense. Option A is the way to go. - https://www.examtopics.com/discussions/amazon/view/43938-exam-aws-certified-machine-learning-specialty-topic-1/
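As a follow-on to option A, once the inference image has been built, tagged with the ECR registry hostname, and pushed, the model can be registered and hosted. The boto3 sketch below assumes the image and the serialized model artifact already exist; every name, URI, and ARN is a placeholder.

    # Sketch: point SageMaker hosting at the custom inference image in ECR.
    import boto3

    sm = boto3.client("sagemaker")

    sm.create_model(
        ModelName="sklearn-logreg",
        PrimaryContainer={
            "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sklearn-inference:latest",
            "ModelDataUrl": "s3://example-bucket/models/logreg/model.tar.gz",
        },
        ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    )
    sm.create_endpoint_config(
        EndpointConfigName="sklearn-logreg-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "sklearn-logreg",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }],
    )
    sm.create_endpoint(
        EndpointName="sklearn-logreg-endpoint",
        EndpointConfigName="sklearn-logreg-config",
    )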
79
79 - A trucking company is collecting live image data from its fleet of trucks across the globe. The data is growing rapidly and approximately 100 GB of new data is generated every day. The company wants to explore machine learning use cases while ensuring the data is only accessible to specific IAM users. Which storage option provides the most processing flexibility and will allow access control with IAM? - A.. Use a database, such as Amazon DynamoDB, to store the images, and set the IAM policies to restrict access to only the desired IAM users. B.. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket policies. C.. Set up Amazon EMR with Hadoop Distributed File System (HDFS) to store the files, and restrict access to the EMR instances using IAM policies. D.. Configure Amazon EFS with IAM policies to make the data available to Amazon EC2 instances owned by the IAM users.
B - B is the right answer B to use as storage with policies - Amazon S3-backed data lake: S3 is the best storage option for large and rapidly growing datasets like images from trucks. S3 scales easily, handles large volumes of data, and is cost-effective for long-term storage, making it a natural choice for this scenario. - IAM access control: You can use bucket policies in S3 to set very specific access controls, ensuring that only certain IAM users have permission to access or modify the data. This satisfies the requirement for access control using IAM. - Processing flexibility: Storing the images in S3 offers flexibility for future machine learning use cases. The data stored in S3 can easily be integrated with other AWS services like SageMaker, Athena, EMR, and more for processing and analysis. EMR/HDFS is not more 'flexible' than S3 A. NO - volume too big for a DB B. YES C. NO - instance access will not control HDFS access D. NO - EFS does not use IAM policies (it is unix) S3 indeed S3 always I would say the answer is B not because of the cost on EMR,. that is also a current answer. however: "most processing flexibility" indicates that S3 is a better option. because all ML solutions and work flows integrate with S3. it hasn't spoken what the ML solution and which services so I take the safe side and go with S3 C is not affordable because it is ephemeral storage. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html "HDFS is used by the master and core nodes. One advantage is that it's fast; a disadvantage is that it's ephemeral storage which is reclaimed when the cluster ends. It's best used for caching the results produced by intermediate job-flow steps." the question does not require long-term storage. C is correct. it says real time data and to be used for ml process so EMR more suitable. also S3 bucket policies not same as IAM users so B is not correct. Why will you need to spin up servers (EMR) just to store visual data for ML? I think Amazon EMR is more appropriate, as the data scheme stated is a big data scheme. https://aws.amazon.com/emr/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc IAM support is required for storage feature , that is not possible as per options described as IAM is supported for HDFS for the instance running on top of it, hence B should be correct B is the right answer S3 is the easy, scalable and secure option to store the image data. B is the right answer B is an appropriate choice - https://www.examtopics.com/discussions/amazon/view/43940-exam-aws-certified-machine-learning-specialty-topic-1/
80
80 - A credit card company wants to build a credit scoring model to help predict whether a new credit card applicant will default on a credit card payment. The company has collected data from a large number of sources with thousands of raw attributes. Early experiments to train a classification model revealed that many attributes are highly correlated, the large number of features slows down the training speed significantly, and that there are some overfitting issues. The Data Scientist on this project would like to speed up the model training time without losing a lot of information from the original dataset. Which feature engineering technique should the Data Scientist use to meet the objectives? - A.. Run self-correlation on all features and remove highly correlated features B.. Normalize all numerical values to be between 0 and 1 C.. Use an autoencoder or principal component analysis (PCA) to replace original features with new features D.. Cluster raw data using k-means and use sample data from each cluster to build a new dataset
C - Answer C. Need reduce the features preserving the information on it this is achieve using PCA. without losing a lot of information from the original dataset since when PCA retains information? PCA helps to speed up the training Answer is A, because one must avoid information loss that PCA or autoencoders introduce through new features (https://www.i2tutorials.com/what-are-the-pros-and-cons-of-the-pca/). Otherwise, I would perform C. If you REMOVE highly correlated features(that means in pairs), the model lost a lot of information. A doesn't have sense. Self-correlation is for times series data, not for pair correlation This question can be misleading. I would choose A if self-correlation in the dataset is meaning pair-wise correlation, this is the most typical approach in real life. But if self-correlation means auto-correlation as in the time-series treatment, then it is wrong. Issues with answer C: Autoencoders are notorious for being hard to interpret. With PCA it is possible, but definitely not easy if you have a large dataset. In real life with this scenario, you would always go with pairwise correlation as the most simple yet effective approach. Answer is C Answer C PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables. Therefore, PCA can effectively eliminate multicollinearity between features. https://towardsdatascience.com/how-do-you-apply-pca-to-logistic-regression-to-remove-multicollinearity-10b7f8e89f9b#:~:text=PCA%20(Principal%20Component%20Analysis)%20takes,effectively%20eliminate%20multicollinearity%20between%20features. Option C An autoencoder is a type of neural network that can learn a compressed representation of the input data, called the latent space, by encoding and decoding the data through multiple hidden layers1. PCA is a statistical technique that can reduce the dimensionality of the data by finding a set of orthogonal axes, called the principal components, that capture the most variance in the data2. Both methods can transform the original features into new features that are lower-dimensional, uncorrelated, and informative. C is the correct. Self-correlation is for time series, which is not mention here. Besides that, even if was correlation only, try to do this in thousand features... A . run correlation matrix and remove highly correlated features. PCA for feature reduction is it just me or is every 15th answer here PCA? Using an autoencoder or PCA can help reduce the dimensionality of the dataset by creating new features that capture the most important information in the original dataset while discarding some of the noise and highly correlated features. This can help speed up the training time and reduce overfitting issues without losing a lot of information from the original dataset. Option A may remove too many features and may not capture all the important information in the dataset, while option B only rescales the data and does not address the issue of highly correlated features. Option D is not a feature engineering technique and may not be an effective way to reduce the dimensionality of the dataset. PCA builds new features starting from high correlated ones. So it matches the question It's C. The Data Scientist should use principal component analysis (PCA) to replace the original features with new features. 
PCA is a technique that reduces the dimensionality of a dataset by projecting it onto a lower-dimensional space, while preserving as much of the original variation as possible. This can help to speed up the training time of the model and reduce overfitting issues, without losing a significant amount of information from the original dataset. C: PCA is the solution Correction to C. Removing correlated features from hundreds of columns will be tedious and time consuming. PCA is the way to go here. Apologies for the flip Answer is A. Eliminate features that are highly correlated. This will not compromise the quality of the feature space as much as PCA would. - https://www.examtopics.com/discussions/amazon/view/43919-exam-aws-certified-machine-learning-specialty-topic-1/
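A minimal scikit-learn sketch of option C follows: standardize the raw attributes, then let PCA keep just enough components to retain about 95% of the variance. The synthetic, correlated matrix stands in for the real credit data; the 95% threshold is an assumption to tune.

    # Sketch: replace correlated raw features with principal components.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(1000, 20))
    # 200 highly correlated columns driven by 20 latent factors (placeholder data).
    X = latent @ rng.normal(size=(20, 200)) + 0.1 * rng.normal(size=(1000, 200))

    pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
    X_reduced = pipeline.fit_transform(X)

    print(X.shape, "->", X_reduced.shape)   # far fewer, uncorrelated features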
81
81 - A Data Scientist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve an acceptable recall metric. The Data Scientist has already tried varying the number and size of the MLP's hidden layers, which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible. Which techniques should be used to meet these requirements? - A.. Gather more data using Amazon Mechanical Turk and then retrain B.. Train an anomaly detection model instead of an MLP C.. Train an XGBoost model instead of an MLP D.. Add class weights to the MLP's loss function and then retrain
D - For me answer is D, adjust to higher weight for class of interest: https://androidkt.com/set-class-weight-for-imbalance-dataset-in-keras/. More data may/may not be available and a data labeling job will take time. I believe is C, because we already made all changes possible in MLP hidden layers and the results have not improved then we must change model so XGBoot seems the best option In this case, the data scientist is training a multilayer perceptron (MLP), which is a type of neural network, on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve an acceptable recall metric. Recall is a measure of how well the model can identify the relevant examples from the minority class. The data scientist has already tried varying the number and size of the MLP’s hidden layers, which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible. The fastest one is D "quickly as possible" mean do not change to new stuff, so it's D. Not C, as the question ask for a quick solution. I accept D. Answer C : https://towardsdatascience.com/boosting-techniques-in-python-predicting-hotel-cancellations-62b7a76ffa6c Adding class weights to the MLP's loss function balances the class frequencies in the cost function during training, so the optimization process focuses more on the underrepresented class, improving recall. I have done this before, class weights help with unbalanced data. Only logical solution that would help if not done, XGBoost could be different, but who knows, both NNs and XGBoost have comparable performance. Answer D! In this example, it is necessary to improve recall as soon as possible, so instead of creating additional datasets, it is effective to change the weight of each class during learning. C: 'distinct' indicates we can simplify this as a binary classification problem; then, NN is just overkill. plus, retraining a NN is much slower than training an XGboost model I feel answer is B. Question says Target is different than the input data which is hint for anomaly detection. stop overthink I believe the answer is C because we need to use hyperparameters to improve model performance. In case of the quickest possible way, D seems fine. For XGBoost, it will take a bit of time to code again For me Answer A. Why no other model instead xgBoost, the model need more labeled data to be trained and learn more positive examples. A is incorrect. Even if you hire Amazon Mechanical Turk, you won't have more data. This question is NOT asking about "labeling". - https://www.examtopics.com/discussions/amazon/view/43921-exam-aws-certified-machine-learning-specialty-topic-1/
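A quick Keras sketch of option D is below, simplified to the one-vs-rest view of the target class. The synthetic data, layer sizes, and the exact weight values are assumptions; inverse class frequency is a common starting point for the weights.

    # Sketch: re-weight the rare class in the loss, then retrain.
    import numpy as np
    from tensorflow import keras

    X = np.random.rand(1000, 20)
    y = (np.random.rand(1000) < 0.05).astype(int)        # rare class of interest

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.Recall()])

    # Penalize mistakes on the rare class roughly in inverse proportion to its frequency.
    class_weight = {0: 1.0, 1: float(len(y)) / max(int(y.sum()), 1)}
    model.fit(X, y, epochs=5, class_weight=class_weight, verbose=0)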
82
82 - A Machine Learning Specialist works for a credit card processing company and needs to predict which transactions may be fraudulent in near-real time. Specifically, the Specialist must train a model that returns the probability that a given transaction may be fraudulent. How should the Specialist frame this business problem? - A.. Streaming classification B.. Binary classification C.. Multi-category classification D.. Regression classification
B - Answer B. B IS NOT CORRECT! Return the probability. Not the 1 or 0. D IS THE CORRECT ANSWER. Regression Classification is a made-up term, any binary classifier makes decisions based on probability score. there is nothing like regression classification. (instead it should have said logistic regression). It should be Binary. i.e., either fraud or non fraud. Even with probabilities, we have a threshold to decide the class. Logistic regression will give the probability, and logistic regression is a binary classification algorithm. https://machinelearningmastery.com/logistic-regression-for-machine-learning/ Streaming classification: is the process of organizing and categorizing large amounts of data that are continuously flowing. This data can include medical records, banking transactions, and internet records Binary Classification: Logistic Regression Multiclass Classification: Softmax regression Regression Classification is a made-up term Random forest is the most suitable model for predicting fraudulent transactions. Answer is A I always see that the community voting is more appropriate and the moderator answer looks out to be on wrong side. I see this for almost in 1 out of 5 questions. Which answer should we consider here as right one ?? Its definitely a classification problem, and between Binary and Streaming classification. Binary classification makes more sence Binary classification B, easy. B, obviously! from sklearn.linear_model import LogisticRegression log_reg = LogisticRegression() log_reg.fit(X_train, y_train) log_reg.predict_proba(X_test) =) The correct solution obviously is binary classification. For the comment above that says that binary classfication doesn't returns a probablity (for example SVM(classification) only returns a class and logistic, RFClassifier, XGBoostClassifier gives a probability and also a class given a threshold), you should ask yourself if that a regressor model returns always a probability, that is, if there is a restriction in a regressor model to predict values only in [0,1]. The Specialist is trying to determine whether a given transaction is fraudulent or not, which is a binary outcome (yes or no). Therefore, the problem should be framed as binary classification. The goal is to predict the probability of a transaction being fraudulent or not, and based on that, the Specialist can make a binary decision (fraudulent or not). This is just binary classification, I don't understand how it could be anything else It's B. This business problem can be framed as a binary classification problem, where the goal is to predict whether a given transaction is fraudulent (positive class) or not fraudulent (negative class). The model should output a probability for each transaction, indicating the likelihood that it is fraudulent. should be D Logistic regression models the probability of the default class (e.g. the first class). For example, if we are modeling people’s sex as male or female from their height, then the first class could be male and the logistic regression model could be written as the probability of male given a person’s height. I think the answer is B, fraud has various cases which hard to define. So, Classification result will be fraud or not fraud. 
If it were multi-category classification, each case of fraud would have to be defined in detail; more specifically, an anomaly detection model would be needed.

I believe the answer is B: it is a binary classification problem because we are classifying an observation into one of two categories, and the target variable in this problem is limited to two options: fraudulent or not fraudulent.

Well, "regression classification" is a made-up term; I hope they formulate their questions better on the real exam.

Binary classification gives a probability between 0 and 1. - https://www.examtopics.com/discussions/amazon/view/43922-exam-aws-certified-machine-learning-specialty-topic-1/
83
83 - A real estate company wants to create a machine learning model for predicting housing prices based on a historical dataset. The dataset contains 32 features. Which model will meet the business requirement? - A.. Logistic regression B.. Linear regression C.. K-means D.. Principal component analysis (PCA)
B - B is the correct answer. Linear regression B, the only model for regression in the options. Answer B Answer B. - https://www.examtopics.com/discussions/amazon/view/43923-exam-aws-certified-machine-learning-specialty-topic-1/
84
84 - A Machine Learning Specialist is applying a linear least squares regression model to a dataset with 1,000 records and 50 features. Prior to training, the ML Specialist notices that two features are perfectly linearly dependent. Why could this be an issue for the linear least squares regression model? - A.. It could cause the backpropagation algorithm to fail during training B.. It could create a singular matrix during optimization, which fails to define a unique solution C.. It could modify the loss function during optimization, causing it to fail during training D.. It could introduce non-linear dependencies within the data, which could invalidate the linear assumptions of the model
B - B is correct answer . Agree, B. why B is the correct answer and not C? A square matrix is singular, that is, its determinant is zero, if it contains rows or columns which are proportionally interrelated; in other words, one or more of its rows (columns) is exactly expressible as a linear combination of all or some other its rows (columns), the combination being without a constant term. For example. If you have two variables, X and Y, and you have two data points. You want to solve the problem: aX1+bY1 = Z1, aX2 + bY2 = Z2. However, if Y=2X -> Y1 = 2X1, Y2 = 2X2, then problem becomes: aX1+bY1 = Z1, a*2X1 + b*2Y1 = Z2 = 2*Z1. So you end up with only one function: aX1+bY1=Z1, meaning there will be more than one answer for (a, b). If you are familiar with linear algebra, it's easier to express the concept. B: If two features in the dataset are perfectly linearly dependent, it means that one feature can be expressed as a linear combination of the other. This can create a singular matrix during optimization, as the linear model would be trying to fit a linear equation to a dataset where one variable is fully determined by the other. This would lead to an ill-defined optimization problem, as there would be no unique solution that minimizes the sum of the squares of the residuals. This could lead to problems during training, as the model would not be able to find appropriate parameter values to fit the data. Option B The presence of linearly dependent features means that they are redundant, and provide no additional information to the model. This can result in a matrix that is not invertible, which is a requirement for solving a linear least squares regression problem. The presence of a singular matrix can also cause numerical instability and make it impossible to find an optimal solution to the optimization problem. linera dependence creates singular matrix that causes problems at the moment we fit the modle https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea B - two features are perfectly linearly dependent = singular matrix during optimization Not D - Not100% correct (as Multicollinearity happens when independent variables in the regression model are highly correlated to each other) they can still be independent variables Consider one of the 5 assumptions of linear regression. This situation violates the assumption of "No multicollinearity between feature variables" Hence, D B. See the multicollinearity problem in wikipedia https://en.wikipedia.org/wiki/Multicollinearity (second paragraph) This issue is overfitting. - https://www.examtopics.com/discussions/amazon/view/43942-exam-aws-certified-machine-learning-specialty-topic-1/
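The numpy sketch below illustrates why option B is the accepted answer: with a perfectly collinear column, the normal-equations matrix X^T X is rank deficient, so no unique least squares solution exists. The data is synthetic.

    # Sketch: perfect collinearity makes X^T X singular.
    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=1000)
    x2 = 3.0 * x1                                  # perfectly linearly dependent on x1
    X = np.column_stack([np.ones(1000), x1, x2])

    gram = X.T @ X
    print(np.linalg.matrix_rank(gram))             # 2, not 3 -> rank deficient
    try:
        np.linalg.inv(gram)                        # may raise or return meaningless values
    except np.linalg.LinAlgError:
        print("Singular matrix: no unique least squares solution")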
85
85 - Given the following confusion matrix for a movie classification model, what is the true class frequency for Romance and the predicted class frequency for Adventure? [https://www.examtopics.com/assets/media/exam-media/04145/0004900001.png] - A.. The true class frequency for Romance is 77.56% and the predicted class frequency for Adventure is 20.85% B.. The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is 13.12% C.. The true class frequency for Romance is 0.78 and the predicted class frequency for Adventure is (0.47-0.32) D.. The true class frequency for Romance is 77.56% × 0.78 and the predicted class frequency for Adventure is 20.85% × 0.32
B - B is the correct answer. Straightforward!

See https://docs.aws.amazon.com/machine-learning/latest/dg/multiclass-model-insights.html to understand Multiclass Model Insights and answer this question. True class frequencies in the evaluation data: the second-to-last column shows that in the evaluation dataset, 57.92% of the observations are Romance, 21.23% are Thriller, and 20.85% are Adventure. Predicted class frequencies for the evaluation data: the last row shows the frequency of each class in the predictions; 77.56% of the observations are predicted as Romance, 9.33% as Thriller, and 13.12% as Adventure. REF: https://docs.aws.amazon.com/machine-learning/latest/dg/multiclass-model-insights.html

Appeared on the 12-Sep exam.

The image can be found here: https://vceguide.com/what-is-the-true-class-frequency-for-romance-and-the-predicted-class-frequency-for-adventure/

B is correct.

A seems to be correct.
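The arithmetic behind the accepted answer is just row and column totals divided by the grand total; a small sketch with made-up counts (not the values from the exam's image) is below.

    # Sketch: true class frequency = row total / grand total,
    #         predicted class frequency = column total / grand total.
    import numpy as np

    classes = ["Romance", "Thriller", "Adventure"]
    cm = np.array([[80,  5, 15],      # rows = true class (hypothetical counts)
                   [10, 20,  5],      # columns = predicted class
                   [12,  3, 20]])

    total = cm.sum()
    true_freq = cm.sum(axis=1) / total
    pred_freq = cm.sum(axis=0) / total
    for name, t, p in zip(classes, true_freq, pred_freq):
        print(f"{name}: true {t:.2%}, predicted {p:.2%}")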
86
86 - A Machine Learning Specialist wants to bring a custom algorithm to Amazon SageMaker. The Specialist implements the algorithm in a Docker container supported by Amazon SageMaker. How should the Specialist package the Docker container so that Amazon SageMaker can launch the training correctly? - A.. Modify the bash_profile file in the container and add a bash command to start the training program B.. Use CMD config in the Dockerfile to add the training program as a CMD of the image C.. Configure the training program as an ENTRYPOINT named train D.. Copy the training program to directory /opt/ml/train
C - C seems correct as per documentations. I would answer C: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html "To configure a Docker container to run as an executable, use an ENTRYPOINT instruction in a Dockerfile. SageMaker overrides any default CMD statement in a container by specifying the train argument after the image name" I thought it was D, but it is C. It's not D because we copy the TRAINING code into /opt/ml/code/train.py In Docker, the ENTRYPOINT instruction is used to specify the executable that should be run when the container starts. However, Amazon SageMaker expects the training script to be launched by specific commands provided by SageMaker itself, rather than relying solely on the Docker container's ENTRYPOINT. The convention for using Docker containers with Amazon SageMaker is to copy the training script and associated resources to specific directories within the container, such as /opt/ml/code, and let SageMaker manage the execution of the training process. I would go with D C is correct C is correct Amazon SageMaker requires that a custom algorithm container has an executable named train that runs your training program. This executable can be configured as an ENTRYPOINT in the Dockerfile, which specifies the default command to run when the container is launched. Amazon SageMaker requires that a custom algorithm container has an executable named train that runs your training program. This executable can be configured as an ENTRYPOINT in the Dockerfile, which specifies the default command to run when the container is launched. you are all wrong, it is D based on,https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html To package a Docker container for use with Amazon SageMaker, the training program should be configured as an ENTRYPOINT named train in the Dockerfile. This means that the training program will be automatically executed when the container is launched by Amazon SageMaker, and it can be passed command-line arguments to specify hyperparameters or other training settings. The recommended option to package the Docker container for Amazon SageMaker is to configure the training program as an ENTRYPOINT named train. This is because ENTRYPOINT allows you to specify a command that will always be executed when the Docker container is run, ensuring that the training program will always run when the container is launched by Amazon SageMaker. Additionally, naming the ENTRYPOINT "train" is a convention used by Amazon SageMaker to identify the main training script. It's C C for sure as per AWS docs: > In your Dockerfile, use the exec form of the ENTRYPOINT instruction: > ENTRYPOINT ["python", "k-means-algorithm.py"] C is correct option C https://github.com/awsdocs/amazon-sagemaker-developer-guide/blob/master/doc_source/your-algorithms-training-algo-dockerfile.md - https://www.examtopics.com/discussions/amazon/view/43943-exam-aws-certified-machine-learning-specialty-topic-1/
87
87 - A Data Scientist needs to analyze employment data. The dataset contains approximately 10 million observations on people across 10 different features. During the preliminary analysis, the Data Scientist notices that income and age distributions are not normal. While income levels show a right skew as expected, with fewer individuals having a higher income, the age distribution also shows a right skew, with fewer older individuals participating in the workforce. Which feature transformations can the Data Scientist apply to fix the incorrectly skewed data? (Choose two.) - A.. Cross-validation B.. Numerical value binning C.. High-degree polynomial transformation D.. Logarithmic transformation E.. One hot encoding
BD - I would go with B,D. Refer to quantile binning and log transform below. https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b Agree with B &D B binning for age D for make income in normal dist agree B&D. both are strategies to eliminate the effect of skewing SHOULD BE C,D Binning involves grouping numerical values into discrete intervals or bins. While it can simplify the representation of a feature and potentially make the distribution appear less skewed in a histogram, it doesn't fundamentally change the underlying skewness of the continuous data. It discretizes the data rather than transforming its distribution. D. Logarithmic Transformation: Addresses the right-skewed income and age distributions. The log function compresses large values, reducing the impact of outliers and making the distributions closer to normal. B. Numerical Value Binning: Useful for the age distribution. By grouping ages into bins (e.g., 20-29, 30-39, etc.), you reduce the impact of the right skew caused by fewer older individuals. While it doesn't achieve a perfectly normal distribution, it often makes the feature more interpretable and manageable for modeling. B and D Agree with B &D B binning for age D for make income in normal dist BD is correct A and E, it asks incorrectly B & D. Reasonable explanation in below discussion. BD With age, always do quantile binning With skewed data, always use log. B because we have skewed data with few exeptions D log transform can change distribution of data not C - because there is no indicaiton in the text, that data is following any of the HIGH DEGREE polynomial distribution like x^ 10 should be c and d polynomial transformations can also be used for skewed data. https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/ It seems the ans are C,D https://anshikaaxena.medium.com/how-skewed-data-can-skrew-your-linear-regression-model-accuracy-and-transfromation-can-help-62c6d3fe4c53 - https://www.examtopics.com/discussions/amazon/view/43728-exam-aws-certified-machine-learning-specialty-topic-1/
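A short pandas/numpy sketch of the two accepted transformations follows: a log transform for the right-skewed income and binning of age into fixed ranges. The column names, synthetic distributions, and bin edges are assumptions.

    # Sketch: log-transform income, bin age into ranges.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "income": rng.lognormal(mean=10, sigma=1, size=1000),   # right-skewed
        "age": rng.gamma(shape=2, scale=15, size=1000) + 18,    # right-skewed
    })

    df["log_income"] = np.log1p(df["income"])                   # compresses the long tail
    df["age_bin"] = pd.cut(df["age"],
                           bins=[18, 25, 35, 45, 55, 65, np.inf],
                           labels=False, include_lowest=True)   # numerical value binning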
88
88 - A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows 70% accuracy only. The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases. Which action is recommended to provide the HIGHEST accuracy model for the company's test and validation data? - A.. Increase the randomization of training data in the mini-batches used in training B.. Allocate a higher proportion of the overall data to the training dataset C.. Apply L1 or L2 regularization and dropouts to the training D.. Reduce the number of layers and units (or neurons) from the deep learning network
C - I think C will be answer, because we even don't know how many layers now, so apply L1,L2 and dropouts layer will be first resort to solve overfitting. If it still does not work, then to reduce layers D: D is the correct answer. C could be the answer only if it is a regression problem. You cannot apply L1 (Lasso regression) and L2 (Ridge regression) to classification problems. However, you can use dropout here. Why do you think it works only for regression problems? L1/L2 regularizations are just adding penalties to loss functions. I don't see any problems with applying it to DL model C Regulization if you see overfit think regularization. C is the correct answer: The overfitting problem can be addressed by applying regularization techniques such as L1 or L2 regularization and dropouts. Regularization techniques add a penalty term to the cost function of the model, which helps to reduce the complexity of the model and prevent it from overfitting to the training data. Dropouts randomly turn off some of the neurons during training, which also helps to prevent overfitting. D can work, but C is a better answer! C and D both seems to be correct but, seems like removing layer is first step in to optimization https://www.kaggle.com/general/175912 d C. Apply L1 or L2 regularization and dropouts to the training" because regularization can help reduce overfitting by adding a penalty to the loss function for large weights, preventing the model from memorizing the training data. Dropout is a regularization technique that randomly drops out neurons during the training process, further reducing the risk of overfitting. "The first step when dealing with overfitting is to decrease the complexity of the model. To decrease the complexity, we can simply remove layers or reduce the number of neurons to make the network smaller." https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html Deep learning tuning order: 1. Number of layers 2. Number of neurons (indirectly implements dropout) 3. L1/L2 regularization 4. Dropout the problem is overfitting, not HP Tuning. Can be used for overfitting as well, but the problem does not say it is a deep learning algorithm being used so C would be more appropriate. Here we are looking to reduce the Overfitting to improve the generalization. In order to do so, L1(or Lasso) regression has always been a good aide. This is not a regression problem at all. C, Regularization and dropouts should be the first attempt Yes, C is right here. Regularization and Dropouts C is the answer - https://www.examtopics.com/discussions/amazon/view/73881-exam-aws-certified-machine-learning-specialty-topic-1/
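A Keras sketch of option C is below: L2 weight penalties plus dropout between the dense layers. The layer sizes, penalty strength, and dropout rate are assumptions to be tuned against the validation data.

    # Sketch: add L2 regularization and dropout to curb overfitting.
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    num_features, num_classes = 50, 4                 # hypothetical problem size
    model = keras.Sequential([
        layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4),
                     input_shape=(num_features,)),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])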
89
89 - A Machine Learning Specialist is given a structured dataset on the shopping habits of a company's customer base. The dataset contains thousands of columns of data and hundreds of numerical columns for each customer. The Specialist wants to identify whether there are natural groupings for these columns across all customers and visualize the results as quickly as possible. What approach should the Specialist take to accomplish these tasks? - A.. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot. B.. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot. C.. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a line graph. D.. Run k-means using the Euclidean distance measure for different values of k and create box plots for each numerical column within each cluster.
A - A is correct. tSNE can do segmentation or grouping as well. Refer: https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1 A is definitely the correct answer. Pay attention to what the question is asking: "whether there are natural groupings for these columns across all customers and visualize the results as quickly as possible" The key point is to visualize the "groupings"(exactly what t-SNE scatter plot does, it visualize high-dimensional data points on 2D space). The question does not ask to visualize how many groups you would classify (K-Means Elbow Plot does not visualize the groupings, it is used to determine the optimal # of groups=K). option A B doesn't even answer the question: how are you going to see your customer groups in an elbow plot Elbow plot helps you identify the correct number of clusters during K-Means clustering. The clustering happens basis of all the features and thus group employees. This is to help your understanding. And the correct answer however is still tSNE becuase the question focuses on identifying relationships/similarities between the features / columns in the dataset. The correct answer is A Euclidean Distance suffers for high dimensional data. tSNE can suffers as well, but from my perspective is the correct one. Elbow plot will not help visualize groups, only try to predict an optimal number of clusters. I think A is a better choice here A. The t-SNE algorithm is a popular tool for visualizing high-dimensional datasets, as it can transform high-dimensional data into a 2D scatter plot, which makes it easier to visualize and understand the relationships between data points. The scatter plot produced by t-SNE can be interpreted as a map that reveals the structure of the data, showing whether there are natural groupings or clusters within the data. Option A is the quickest and simplest way to visualize the data in a meaningful way, allowing the Specialist to gain insights into the data more efficiently. A is correct 12-sep exam A as k-means elbow is erroneous. It does not helping here. Scatter plot and t-sne is the right answer An elbow plot (B) will not give you what the question is asking for. A scatter plot will, and t-SNE is first for visualizing before dimensionality reduction. A is correct as k means suffer from curse of dimensionality and t-she will be a better option. The B,C,D plots are meaningless wrt the problem —> A t-SNE suffers curse of dimensionality and is indicated for small datasets Additionally the numeric features don't require "embedding". I think they meant to write "standardize" Rooting for A B & D are wrong--because data contains "thousands of columns" and using k-means with euclidean suffers from "curse of dimensionality" Thus leaving A & C, you CANNOT viz clusters/groups/segments in a line graph so correct answer is A - https://www.examtopics.com/discussions/amazon/view/43864-exam-aws-certified-machine-learning-specialty-topic-1/
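A minimal sketch of the accepted approach: scale the numerical columns, project them to two dimensions with t-SNE, and inspect the scatter plot for natural groupings. The placeholder matrix and the t-SNE settings are assumptions; on hundreds of columns, a PCA pre-step is often used to speed t-SNE up.

    # Sketch: t-SNE embedding of the numerical features, visualized as a scatter plot.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.manifold import TSNE

    X = np.random.rand(2000, 100)                     # placeholder for the customer columns
    X_scaled = StandardScaler().fit_transform(X)

    X_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X_scaled)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5, alpha=0.5)
    plt.title("t-SNE projection of customer features")
    plt.show()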
90
90 - A Machine Learning Specialist is planning to create a long-running Amazon EMR cluster. The EMR cluster will have 1 master node, 10 core nodes, and 20 task nodes. To save on costs, the Specialist will use Spot Instances in the EMR cluster. Which nodes should the Specialist launch on Spot Instances? - A.. Master node B.. Any of the core nodes C.. Any of the task nodes D.. Both core and task nodes
C - Answer is C. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html It's definitely C. The fact that this site indicates A is a clear sign that answers are just randomly selected, it would make zero sense to spot-instance the master node for an EMR cluster. Make sure you look at discussions for all of these questions. C is the correct answer. "Long-Running Clusters and Data Warehouses If you are running a persistent Amazon EMR cluster that has a predictable variation in computational capacity, such as a data warehouse, you can handle peak demand at lower cost with Spot Instances. You can launch your master and core instance groups as On-Demand Instances to handle the normal capacity and launch task instance groups as Spot Instances to handle your peak load requirements." According to :https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal. When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible. The correct answer is C I don't get why the wrong answer are still not updated after more than 1 year of everyone showing docs proving answer C.. 1 and a half year and still wrong.. Incredible! Long-running clusters and data warehouses If you are running a persistent Amazon EMR cluster that has a predictable variation in computational capacity, such as a data warehouse, you can handle peak demand at lower cost with Spot Instances. You can launch your primary and core instance groups as On-Demand Instances to handle the normal capacity and launch the task instance group as Spot Instances to handle your peak load requirements. Only task nodes can be deleted without losing data. C, If you want to cut cost on an EMR cluster in the most efficient way, use spot instances on the task nodes because it, task nodes do not store data so no risk of data loss For Long running jobs, you do not want to compromise the Master node(sudden termination) or the core nodes (HDFS data loss). Spot Instances on 20 task nodes are enough cost savings without compromising the job. Hence, C If your primary concern is the cost, then you can run the master node on spot instances. Adding the related reference from the AWS documentation: Master node on a Spot Instance The master node controls and directs the cluster. When it terminates, the cluster ends, so you should only launch the master node as a Spot Instance if you are running a cluster where sudden termination is acceptable. This might be the case if you are testing a new application, have a cluster that periodically persists data to an external store such as Amazon S3, or are running a cluster where cost is more important than ensuring the cluster's completion. In the question , there are no specific conditions mentioned except the concern with the COST, thus I think the answer should be A. Answer: C. 
https://aws.amazon.com/getting-started/hands-on/optimize-amazon-emr-clusters-with-ec2-spot/ Amazon recommends using On-Demand instances for Master and Core nodes unless you are launching highly ephemeral workloads. Answer should be C. Only master node is incorrect. Either use all on spot or only task or core on spot. As per: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html Better to use only task node on spot for long running tasks/jobs Answer is C you should only run core nodes on Spot Instances /*when partial HDFS data loss is tolerable*/ Question is what "Should" be launched as spot instance - https://www.examtopics.com/discussions/amazon/view/43866-exam-aws-certified-machine-learning-specialty-topic-1/
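A hedged boto3 sketch of the recommended layout is below: on-demand master and core instance groups, Spot for the task group. The release label, instance types, and IAM roles are assumptions taken from common defaults, not from the question.

    # Sketch: long-running EMR cluster with Spot Instances only on the task nodes.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.run_job_flow(
        Name="long-running-cluster",
        ReleaseLabel="emr-6.10.0",
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 10},
                {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": 20},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,       # long-running cluster
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )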
91
91 - A manufacturer of car engines collects data from cars as they are being driven. The data collected includes timestamp, engine temperature, rotations per minute (RPM), and other sensor readings. The company wants to predict when an engine is going to have a problem, so it can notify drivers in advance to get engine maintenance. The engine data is loaded into a data lake for training. Which is the MOST suitable predictive model that can be deployed into production? - A.. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a recurrent neural network (RNN) to train the model to recognize when an engine might need maintenance for a certain fault. B.. This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the data. C.. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a convolutional neural network (CNN) to train the model to recognize when an engine might need maintenance for a certain fault. D.. This data is already formulated as a time series. Use Amazon SageMaker seq2seq to model the time series.
A - This is a supervised problem and needs labels. Can't use clustering to find when faults can happen. CNN is for images, not for the time-series data here. Hence, A seems appropriate.

Agree with you.

Agree, the answer is A.

A. YES - RNNs are good for time series, as we want to use previous inputs. B. NO - we know the class (fault) ahead of time, so it is supervised. C. NO - CNN is for images. D. NO - seq2seq is for word generation.

Answer is A. A recurrent neural network (RNN) is a more suitable choice than a convolutional neural network (CNN) because the data collected from the engines is a sequence of values over time, and the goal is to predict a future event (an engine fault). RNNs are designed to handle sequential data and can learn patterns and dependencies over time, making them well-suited for time-series data like this. On the other hand, CNNs are designed for image processing and are not ideal for sequential data.

Answer should be A.

It can only be A. Agree with the comments before.

Obviously A.

A - obviously. Seq2Seq also uses an RNN under the hood, BUT option D did not mention anything about "adding labels", which is required here; hence A.

A is correct. CNN is for images and RNN is for time series. https://towardsdatascience.com/how-to-implement-machine-learning-for-predictive-maintenance-4633cdbe4860

I think A is correct.

It is A. - https://www.examtopics.com/discussions/amazon/view/43867-exam-aws-certified-machine-learning-specialty-topic-1/
92
92 - A company wants to predict the sale prices of houses based on available historical sales data. The target variable in the company's dataset is the sale price. The features include parameters such as the lot size, living area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built, and postal code. The company wants to use multi-variable linear regression to predict house sale prices. Which step should a machine learning specialist take to remove features that are irrelevant for the analysis and reduce the model's complexity? - A.. Plot a histogram of the features and compute their standard deviation. Remove features with high variance. B.. Plot a histogram of the features and compute their standard deviation. Remove features with low variance. C.. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D.. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
D - D should be the more comprehensive answer. If it's not correlated, you can't make use of it in a linear regression A lot of others say B, but low variance can also be due to the nature/typical magnitudes of the variable itself I think the problem with B is that what is considered "low variance"? The features are on different scales. Correlation indicates only linear relation, but, there might be non linear as well. To exploit it in the Linear Regression, you can take the variables to some power or run some non linear preprocessing on it, and you don't have to change the algorithm for it. So, answer B seem much more solid for me. Answer B. Is not the best solucion prior can use other analysis. https://community.dataquest.io/t/feature-selection-features-with-low-variance/2418 If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed. Or if only a handful of observations differ from a constant value, the variance will also be very low. Low variance does not mean the feature is not important, right? If variance of target true value is also small and the correlation between above feature and target, the feature can be important feature. it does. If feature and target are correlated and you expect the target to change, the feature must have some sort of variance. Otherwise it means feature is almost constant so does target. D is the best answer as it is mentioned multivariable linear regression applied where correlation is strong between dependent and independent variables. D: We should remove features that are strongly correlated with each other and weakly correlated with the target: https://androidkt.com/find-correlation-between-features-and-target-using-the-correlation-matrix/ You can evaluate the relationship between each feature and target using a correlation and selecting those features that have the strongest relationship with the target variable. I think D is the correct answer. If I remember correctly, Benjamini-Hochberg Method is essentially answer D if you consider the Hypothesis to be: the feature is powerfully influential to the target. My problem with B is that the variance can be easily affected by the scale. In the question, the number of bedroom's variance is very low, while the sqrt of the house has a high variance, both of these could be very useful. Furthermore, zip codes are included, and it is safe to assume the variance of zip codes can be high, but the information is very limited, especially if you use them as numerical instead of categorical features. B is correct but the answer in D is better. D is preferred over C because the goal is to predict the sale price of houses, which is the target variable. By checking the correlation of each feature against the target variable, the machine learning specialist can identify which features are most relevant to the prediction of the sale price and which are less relevant. Removing features with low correlation to the target variable helps reduce the complexity of the model and potentially improve its accuracy. On the other hand, a heatmap showing the correlation of the dataset against itself (C) doesn't directly address the relevance of the features to the target variable, and so it's not as effective in reducing the complexity of the model. Answer should be D, THIS is feature elimination /selection during feature Enginering. Choice c is so close just to confuse test takers to pick the wrong choice! 
See below C and D answers -- C should have been correct if the question asked about how to visualize correlation among independent variables! PROVIDED second sentence in C needs to be removed or to say which feature you will eliminate in such case then the one with low correlation against target out of those two. C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores. The multiple regression model is based on the following assumptions: There is a linear relationship between the dependent variables and the independent variables The independent variables are not too highly correlated with each other yi observations are selected independently and randomly from the population Residuals should be normally distributed with a mean of 0 and variance σ I think the answer is D. If the model is a decision tree or something like that, I don't think it is possible to make a decision based only on the direct correlation with the target variable. But in multiple linear regression, the only thing that matters is the relationship between the target variable and the feature variable. B, if the standard deviation is small but not zero, then we have information. B is correct. To eliminate extraneous information. So, the answer is D. Correct answer is D. The reason B is wrong because it is difficult to reason out why would you plot a histogram? Absolutely unnecessary step and distraction choice. D is not the proper answer. Here is why: It says that it is comparing with the target variable (dependent variable), which implies it is comparing the correlation between the dependent and independent variables. This type of comparison is usually done after a model is constructed in order to prevent assessing the predictive strength of the model. To compare the target label, the label you wish to predict, with the other variables before - is premature and will likely result in weakening your model. Variables with low variance has very less information and the inclusion of which will likely weaken the model performance. Hence, B. Answer is D. https://deep-r.medium.com/difference-between-variance-co-variance-and-correlation-ea0b7ddbaa1 Answer C. Heatmaps is used to visualize for correlation matrix https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec but is mentioned, "Remove features with low mutual correlation scores." which is wrong you should drop features with high correlation scores. so Answer is D The problem with correlation tasks is it capture linear relations only. So, I would go with B - https://www.examtopics.com/discussions/amazon/view/43930-exam-aws-certified-machine-learning-specialty-topic-1/
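A pandas sketch of option D is below: rank each feature by its absolute correlation with the sale price and drop the weakest ones. The file name, the target column name, and the 0.1 cutoff are assumptions; in practice you would also check multicollinearity among the features that remain.

    # Sketch: correlation of every feature against the target, then prune.
    import pandas as pd

    df = pd.read_csv("housing.csv")                    # hypothetical dataset

    corr_with_target = (
        df.corr(numeric_only=True)["sale_price"]       # hypothetical target column
          .drop("sale_price")
          .abs()
          .sort_values(ascending=False)
    )
    weak_features = corr_with_target[corr_with_target < 0.1].index.tolist()
    df_reduced = df.drop(columns=weak_features)
    print("Dropping:", weak_features)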
93
93 - A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a machine learning specialist will build a binary classifier based on two features: age of account, denoted by x, and transaction month, denoted by y. The class distributions are illustrated in the provided figure. The positive class is portrayed in red, while the negative class is portrayed in black. Which model would have the HIGHEST accuracy? [https://www.examtopics.com/assets/media/exam-media/04145/0005400001.png] - A.. Linear support vector machine (SVM) B.. Decision tree C.. Support vector machine (SVM) with a radial basis function kernel D.. Single perceptron with a Tanh activation function
C - Due to straight angles, I would choose Decision tree. See https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py From your link it is obvious that the best answer is still SVM with RBF kernel. In your link the SVM-RBF got 88% accuracy on the 'square-like' dataset whereas the Decision tree achieved only 80%. Answer is SVM with RBF kernel note the data from sklearn link is shaped as a ball of mass not a square. the RBF kernel would be better but the question shows a square. Decision tree should be better fit for this problem. B - Decision tree - is not the best answer. If you use decision tree to do clustering, every time you need to partition the space into 2 parts. Hence you will split the space into 3*3. The red points in the center box and the black points will fall into the 8 boxes around it. The black points will be identified as 8 different classes. C is the correct answer. SVM with non-linear kernel is appropriate for non-linear clustering. Even if the shape is close to rectangular. SVM with non-linear kernel will be ale to approximate the rectangular boundary shape. Your statement "The black points will be identified as 8 different classes" does not make a lot of sense because the leaf node in a tree with be 1 of 2 classes, not 8 different classes just because they are visually in one place or the other The tree works like this with this branch with 4 nodes: Age > 49? Y Age > 51? N Transaction > 28? Y Transaction > 31? N Positive Correct answer is B. Tip: When details are missing, assume ideal conditions, so assume no overfitting issues. Therefore is B is better than C. If there are overfitting issues or a posibility of overfitting then C is the right answer. Decision tree makes more sense, this decision boundary isn't complex at all and there is no risk of overfitting, all the points are inside the square This is because the RBF kernel can handle non-linear relationships between features, which is often necessary for complex classification tasks. Decision Tree can treat the training data well but will have a risk of overfitting. the SVM with RBF kernel will be more robust. As the positive cases can be interpreted and separated from non positive ones by decision tree easily. SVM would have made sense if the two classes were inseparable or had complex relationship in data. It is C SVM with RBF Kernel can classify this image. For decision tree, it will be more difficult From the visual information provided, an SVM with an RBF kernel (Option C) would likely be the best choice because it can handle the circular class distribution. The RBF kernel is especially good at dealing with such scenarios where the boundary between classes is not linear. Answer C B. Decision Tree: Decision trees can capture non-linear patterns and are capable of splitting the feature space in complex ways. They can be very effective if the decision boundary is not linear, but they might also overfit if the decision boundary is too complex. C. SVM with RBF Kernel: An SVM with a radial basis function (RBF) kernel is designed to handle non-linear boundaries by mapping input features into higher-dimensional spaces where the classes are more likely to be separated by a hyperplane. Given the clustered nature of the classes in the image, an SVM with an RBF kernel would likely be able to separate the classes with a higher degree of accuracy. 
SVM-RBF is the correct solution. A support vector machine (SVM) with a radial basis function kernel would likely have the highest accuracy for this task because it can handle the non-linear separation required by the data. I will lean toward C. Answer is B, as a decision tree can attain 100% accuracy in this case. An SVM with RBF and proper C and gamma values can accommodate this square shape (https://vitalflux.com/svm-rbf-kernel-parameters-code-sample/). Torn between SVM and decision tree; leaning towards C. Answer C. In general, SVMs are a good choice for tasks where accuracy is critical, such as fraud detection and medical diagnosis. Decision trees are a good choice for tasks where interpretability is important, such as customer segmentation and product recommendation. - https://www.examtopics.com/discussions/amazon/view/43870-exam-aws-certified-machine-learning-specialty-topic-1/
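The argument above can be checked empirically. Below is a minimal scikit-learn sketch on synthetic data approximating the square-shaped class layout in the figure (the data, thresholds, and feature ranges are assumptions, not taken from the exam image); it trains both candidates and prints their test accuracies.

```python
# Compare a decision tree and an RBF-kernel SVM on a "square inside a square" dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(2000, 2))            # (age of account, transaction month), hypothetical ranges
y = ((X[:, 0] > 40) & (X[:, 0] < 60) &
     (X[:, 1] > 40) & (X[:, 1] < 60)).astype(int)  # positive class inside an inner square

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_tr, y_tr)

print("decision tree accuracy :", tree.score(X_te, y_te))
print("SVM (RBF kernel) accuracy:", rbf.score(X_te, y_te))
```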
94
94 - A health care company is planning to use neural networks to classify their X-ray images into normal and abnormal classes. The labeled data is divided into a training set of 1,000 images and a test set of 200 images. The initial training of a neural network model with 50 hidden layers yielded 99% accuracy on the training set, but only 55% accuracy on the test set. What changes should the Specialist consider to solve this issue? (Choose three.) - A.. Choose a higher number of layers B.. Choose a lower number of layers C.. Choose a smaller learning rate D.. Enable dropout E.. Include all the images from the test set in the training set F.. Enable early stopping
BDF - When looking at an overfitting issue: https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html 1. Simplify the model (reduce the number of layers) 2. Early stopping 3. Use data augmentation 4. Use regularization (L1 + L2) 5. Use dropout. So looking at the options: B, D, F. BDF!!! Looking at the last 100 questions, many answers were wrong; thanks to the discussion forum for providing the correct answers. I would say BCD or BDF. Agree with BDF. Overfitting problem: all of the options B, D, F reduce overfitting. In what world is ACE the answer? BDF is the answer. BDF is correct; ADE is absolutely wrong. 50 layers is already overfitting the model, so we cannot increase the number of layers further. BDF is the correct answer. Should be BDF. One of the correct answers is showing as A; I wanted to understand how A (choose a higher number of layers) could be the correct answer. I believe the answer is BCE because the model is overfitting; C (choose a smaller learning rate) might not be, because the model yielded 99% accuracy on the training set. C, D, F. Ignore that, the BDF answer is correct. BDF. BDF!!! It is supposed to be BDF - https://www.examtopics.com/discussions/amazon/view/43931-exam-aws-certified-machine-learning-specialty-topic-1/
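A minimal Keras sketch (an assumption: TensorFlow/Keras is not named in the question) showing the three chosen fixes in code: a much shallower network, a dropout layer, and an early-stopping callback. The input shape and the commented-out training data are illustrative placeholders.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 1)),      # hypothetical X-ray image size
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),   # far fewer layers than the original 50
    tf.keras.layers.Dropout(0.5),                    # dropout for regularization (option D)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # normal vs abnormal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(       # option F
    monitor="val_loss", patience=5, restore_best_weights=True)

# model.fit(train_images, train_labels, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])      # placeholders for the 1,000-image training set
```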
95
95 - This graph shows the training and validation loss against the epochs for a neural network. The network being trained is as follows: ✑ Two dense layers, one output neuron ✑ 100 neurons in each layer ✑ 100 epochs Random initialization of weights Which technique can be used to improve model performance in terms of accuracy in the validation set? [https://www.examtopics.com/assets/media/exam-media/04145/0005500004.png, https://www.examtopics.com/assets/media/exam-media/04145/0005600001.jpg] - A.. Early stopping B.. Random initialization of weights with appropriate seed C.. Increasing the number of epochs D.. Adding another layer with the 100 neurons
A - Answer A. The answer is Early Stopping: stop the training before accuracy starts to decrease. Appreciate your explanation. Cheers. I will go with A. Early stopping is a powerful technique to prevent overfitting. It involves monitoring the model’s performance on a validation dataset during training. If the validation loss starts increasing or plateaus, early stopping stops further training. This ensures that the model doesn’t overfit to the training data. Based on the graph, if the validation loss begins to stagnate or increase after a certain number of epochs, enabling early stopping could lead to better generalization. Early stopping before the error increases. Early stopping. Early stopping. A: stop the training process of a neural network before it reaches the maximum number of epochs or iterations; in this case, stop close to 64 epochs. Early stopping, and not increasing epochs. Answer is ”A”. Can early stopping improve the model? A is the answer. I would go for A. I would choose A. - https://www.examtopics.com/discussions/amazon/view/43932-exam-aws-certified-machine-learning-specialty-topic-1/
96
96 - A Machine Learning Specialist is attempting to build a linear regression model. Given the displayed residual plot only, what is the MOST likely problem with the model? [https://www.examtopics.com/assets/media/exam-media/04145/0005700001.jpg] - A.. Linear regression is inappropriate. The residuals do not have constant variance. B.. Linear regression is inappropriate. The underlying data has outliers. C.. Linear regression is appropriate. The residuals have a zero mean. D.. Linear regression is appropriate. The residuals have constant variance.
A - I would choose A. See: https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd442.htm and https://blog.minitab.com/blog/the-statistics-game/checking-the-assumption-of-constant-variance-in-regression-analyses Ans. is A; high-degree polynomial transformation. I think so too. Answer A. One of the key assumptions of linear regression is that the residuals have constant variance at every level of the predictor variable(s). If this assumption is not met, the residuals are said to suffer from heteroscedasticity. When this occurs, the estimates for the model coefficients become unreliable. https://www.statology.org/constant-variance-assumption/ Agree with A. https://blog.minitab.com/en/the-statistics-game/checking-the-assumption-of-constant-variance-in-regression-analyses Kind of like heteroskedasticity; anyway, it is A. Answer is A: it does not have constant variance! A is the correct answer. These images are broken; I cannot review the question properly! D is the best answer: all x values are scattering as a whole, no matter what x is. https://www.statisticshowto.com/residual-plot/ If you take all x values and plot a histogram, it will be a bell curve, and it does NOT mean linear regression is not appropriate; it means your linear regression model is biased due to several reasons. Yes, it does: one of the main assumptions is homoscedasticity. 100% A. I will choose A, because the data is heteroscedastic; it violates a key assumption of linear regression. A. https://www.originlab.com/doc/origin-help/residual-plot-analysis Does not have constant variance. https://stats.stackexchange.com/questions/52089/what-does-having-constant-variance-in-a-linear-regression-model-mean Answer is A. As x rises, the residuals become higher and higher... Some good reading: https://www.andrew.cmu.edu/user/achoulde/94842/homework/regression_diagnostics.html Ans is A. Thank you for sharing. - https://www.examtopics.com/discussions/amazon/view/43948-exam-aws-certified-machine-learning-specialty-topic-1/
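A short sketch (synthetic data, an illustration rather than the exam's plot) of the pattern being discussed: residuals whose spread grows with the predictor, i.e. non-constant variance, which violates the linear-regression assumption behind answer A.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(0, 0.2 + 0.5 * x.ravel())  # noise standard deviation grows with x

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# Compare residual spread for small vs large x; a large gap signals heteroscedasticity.
print("std of residuals, x < 5 :", residuals[x.ravel() < 5].std())
print("std of residuals, x >= 5:", residuals[x.ravel() >= 5].std())
```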
97
97 - A large company has developed a BI application that generates reports and dashboards using data collected from various operational metrics. The company wants to provide executives with an enhanced experience so they can use natural language to get data from the reports. The company wants the executives to be able to ask questions using written and spoken interfaces. Which combination of services can be used to build this conversational interface? (Choose three.) - A.. Alexa for Business B.. Amazon Connect C.. Amazon Lex D.. Amazon Polly E.. Amazon Comprehend F.. Amazon Transcribe
CDF - C - voice and text interface, E - understanding, F - speech to text. Why would I need Transcribe when I have Lex that does the NLU part? It would be more reasonable to select either Connect (B) or Polly (D) if the spec is to generate output speech. E is more about expressing the "feeling" or "mood"; we would rather need something that can speak to the customer. So my suggestion is C, D, F. The question states that the company wants to "provide executives with an enhanced experience so they can use natural language to get data from the reports." The key phrase here is "use natural language," which implies that the executives will be interacting with the system using human-like language, either written or spoken. To understand and interpret natural language inputs from users, whether written or spoken, the system needs to have natural language understanding (NLU) or natural language processing (NLP) capabilities. Without NLU/NLP capabilities, the system would not be able to make sense of the executives' natural language queries and extract the relevant information to retrieve data from the reports and dashboards. Services like Amazon Lex and Amazon Comprehend are specifically designed to provide NLU and NLP functionalities, respectively. Amazon Lex uses NLU models to understand the intent and extract relevant information from user inputs, while Amazon Comprehend provides NLP capabilities to analyze and extract insights from text data. If we need to build written and spoken interfaces we need: F - Transcribe (speech to text), D - Polly (text to speech), and for the chatbot: C - Lex (not E). So C, D, F. I second that; the keyword here is "conversational interface", so there is no conversation without Amazon Lex. Alexa for Business: handles the voice interaction, converting spoken queries into text and providing the voice interface that executives use to interact with the BI application. Amazon Lex: processes the text input (converted by Alexa) and understands the intent behind the queries, enabling the conversational interface. Amazon Polly: optional but useful if you want to convert the textual responses from the BI application back into spoken responses, providing a complete voice-based interaction. Lex for the bot service, Polly for text-to-speech (answer), and Transcribe for speech-to-text (question). I believe the answer should be CDF: C: Lex, D: Polly, F: Transcribe. For a BI application where executives can ask questions using written and spoken interfaces, the following combination of services would be suitable: Amazon Lex (Option C): to build the core conversational interface that understands and processes natural language queries. Amazon Polly (Option D): to provide spoken responses to written queries, giving a more interactive experience for users who are not using the voice interface. Amazon Transcribe (Option F): to convert spoken queries into text that can be understood by Amazon Lex. These three services would work together to provide a comprehensive conversational interface that allows for both text and voice interactions, meeting the requirements of the scenario provided. C. Amazon Lex: it provides advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, enabling you to build applications with highly engaging user experiences and lifelike conversational interactions. D. Amazon Polly: this service turns text into lifelike speech using deep learning.
It would enable the BI application to deliver the answers to the executives' questions in a spoken format. F. Amazon Transcribe: this is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. This would be necessary for the BI application to interpret spoken questions from the executives. CDF -> CEF; you don't need Comprehend in this scenario. Amazon Lex (C): this service is crucial for building conversational interfaces. It provides the capabilities to understand and interpret user input in natural language, which is essential for understanding the questions asked by executives. Amazon Transcribe (F): for a spoken interface, you need a service that can convert speech into text. Amazon Transcribe does exactly this, allowing the system to process spoken questions by converting them into text that can then be interpreted by Amazon Lex. Amazon Polly (D): to enhance the user experience by responding to inquiries not only in text but also in spoken form, Amazon Polly is ideal. It converts text responses into lifelike speech, allowing the system to verbally communicate with the executives. Together, these three services (Amazon Lex, Amazon Transcribe, and Amazon Polly) will enable a comprehensive conversational interface for the BI application, catering to both written and spoken queries and responses. Why does AWS use multiple services for TTS and STT? No, we don't need E (Comprehend) because the report has already been generated. Answer is CEF --> input can be speech but the output to the user will be text (as nothing specific is mentioned), using Lex for the conversational interface, Transcribe to convert speech to text (if the input is speech), and Comprehend for insights from text. CEF is correct. I will go with: Lex for the chat interface, Comprehend for getting insights from reports, Polly for text-to-speech transformation. https://aws.amazon.com/blogs/machine-learning/deriving-conversational-insights-from-invoices-with-amazon-textract-amazon-comprehend-and-amazon-lex/ Amazon Polly is essential for providing spoken responses in a conversational interface, but it doesn't directly handle the natural language understanding and processing aspect, which is why it wasn't included as one of the top three services for building the conversational interface in this scenario. Correct is C, E, F. A. NO - Alexa for Business B. NO - Amazon Connect is for call centers C. YES - Amazon Lex for chatbots D. YES - Polly Text-to-Speech E. NO - Amazon Comprehend is for topic extraction and sentiment analysis, Transcribe already does it F. YES - Transcribe Speech-to-Text. Transcribe does not do sentiment analysis and topic extraction; it just generates a transcript from speech, so we would need Amazon Comprehend for that. Agree with CDF - https://www.examtopics.com/discussions/amazon/view/43962-exam-aws-certified-machine-learning-specialty-topic-1/
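A rough boto3 sketch of how the three chosen services (Transcribe, Lex, Polly) could fit together. All IDs, bucket names, and URIs are hypothetical placeholders; error handling and the polling needed for the asynchronous Transcribe job are omitted.

```python
import boto3

transcribe = boto3.client("transcribe")
lex = boto3.client("lexv2-runtime")
polly = boto3.client("polly")

# 1) Spoken question -> text (Transcribe jobs are asynchronous; poll for completion in practice).
transcribe.start_transcription_job(
    TranscriptionJobName="exec-question-001",                       # placeholder name
    Media={"MediaFileUri": "s3://example-bucket/question.wav"},     # placeholder URI
    MediaFormat="wav",
    LanguageCode="en-US",
)

# 2) Text question -> intent / answer from a Lex bot built over the BI reports.
lex_response = lex.recognize_text(
    botId="EXAMPLEBOT", botAliasId="EXAMPLEALIAS", localeId="en_US",  # placeholder bot identifiers
    sessionId="exec-session-1",
    text="What was revenue last quarter?",
)
answer_text = lex_response["messages"][0]["content"]

# 3) Text answer -> speech for the spoken interface.
speech = polly.synthesize_speech(Text=answer_text, OutputFormat="mp3", VoiceId="Joanna")
```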
98
98 - A machine learning specialist works for a fruit processing company and needs to build a system that categorizes apples into three types. The specialist has collected a dataset that contains 150 images for each type of apple and applied transfer learning on a neural network that was pretrained on ImageNet with this dataset. The company requires at least 85% accuracy to make use of the model. After an exhaustive grid search, the optimal hyperparameters produced the following: ✑ 68% accuracy on the training set ✑ 67% accuracy on the validation set What can the machine learning specialist do to improve the system's accuracy? - A.. Upload the model to an Amazon SageMaker notebook instance and use the Amazon SageMaker HPO feature to optimize the model's hyperparameters. B.. Add more data to the training set and retrain the model using transfer learning to reduce the bias. C.. Use a neural network model with more layers that are pretrained on ImageNet and apply transfer learning to increase the variance. D.. Train a new model using the current neural network architecture.
B - The answer is B: the model is underfitting = high bias, so we want to reduce it. C is wrong because the intention is not to increase variance, which equals overfitting (using a more complex model would be good, but to reduce bias, not to increase variance). The 68% accuracy on the training set and 67% accuracy on the validation set suggest that the model is biased - underfitting - and does not have enough capacity or relevant information to learn the underlying patterns in the data. I would think the ImageNet network is good enough already, so more data. A. NO - HPO has already been done through grid search B. YES - 150 images is very small; need 10x that C. NO - need a bigger training set D. NO - what would the new model be? More data to the training set. Letter B is the correct one. We can add more data with data augmentation. Letter A would be a repetition of what has already been done. Letter C is impractical. Letter D is starting from scratch without need. I think it should be D: "Train a new model using the current neural network architecture", because apple data is very specific and ImageNet weights will be too generic there. We can still keep the ImageNet weights as an initial configuration, but the model should be retrained from scratch; 450 images should be fine. HPO for me. Both the validation set and the training set are performing equally, but the performance is not good. So the basic problem here is high bias (train error) and high variance (test error). Ideally we want both low, but there is a trade-off, and we need to be cautious to avoid overfitting. So this problem needs a solution for low bias first (so training performance improves), and later we can figure out whether that leads to overfitting when we test it. Answer choice B. Why not A? https://aws.amazon.com/about-aws/whats-new/2022/07/amazon-sagemaker-automatic-model-tuning-supports-increased-limits-improve-accuracy-models/ Not B, C is correct: given that the model can't even fit the training set properly, it would be convenient to amplify the layers that are trained. If I understood the phrasing correctly, I would go with C. C: accuracy on the training set is low, so the model is not complex enough. B is more accurate; adding more complexity to the model is viable, but you don't want to increase variance. It only has 150 photos for training; a more complex neural network won't help - https://www.examtopics.com/discussions/amazon/view/73977-exam-aws-certified-machine-learning-specialty-topic-1/
99
99 - A company uses camera images of the tops of items displayed on store shelves to determine which items were removed and which ones still remain. After several hours of data labeling, the company has a total of 1,000 hand-labeled images covering 10 distinct items. The training results were poor. Which machine learning approach fulfills the company's long-term needs? - A.. Convert the images to grayscale and retrain the model B.. Reduce the number of distinct items from 10 to 2, build the model, and iterate C.. Attach different colored labels to each item, take the images again, and build the model D.. Augment training data for each item using image variants like inversions and translations, build the model, and iterate.
D - Data augmentation is the way to go here. How does converting to grayscale help? What if the colors of the items are relevant in object identification??? Data augmentation. D is correct. How can I make the decision to use gray images if the question doesn't even indicate whether the images are colored or not? And even so, colored images are important to ensure more accuracy in training compared to gray images. Since the model is underfitting, more data, as indicated in option D, is the correct action. C ("Attach different colored labels to each item, take the images again, and build the model") is also a kind of augmentation; it is even better than just inverting and translating existing samples, but it has to be done in real life and your manual labeling work would be lost. Shouldn't it be reduced to 2 categories, taking images of empty and non-empty shelves, and that should do it? D is of course the right answer; grayscale alone won't help anything. D is the CORRECT ANSWER https://research.aimultiple.com/data-augmentation/ Data augmentation is correct; we need more samples. D is correct. D is my answer for this. A can help, but it'll need more than that. D, I guess - https://www.examtopics.com/discussions/amazon/view/73978-exam-aws-certified-machine-learning-specialty-topic-1/
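A small sketch (an assumption: TensorFlow/Keras preprocessing layers, which the question does not specify) of the augmentation idea in option D: generate flipped, translated, and rotated variants of the 1,000 labeled shelf images so the model sees more variety per item.

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # inversions
    tf.keras.layers.RandomTranslation(0.1, 0.1),            # translations
    tf.keras.layers.RandomRotation(0.1),                    # small rotations
])

# images: a batch of shelf photos, e.g. a tensor of shape (batch, height, width, 3)
# augmented = augment(images, training=True)   # produces new variants on every call
```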
100
100 - A Data Scientist is developing a binary classifier to predict whether a patient has a particular disease on a series of test results. The Data Scientist has data on 400 patients randomly selected from the population. The disease is seen in 3% of the population. Which cross-validation strategy should the Data Scientist adopt? - A.. A k-fold cross-validation strategy with k=5 B.. A stratified k-fold cross-validation strategy with k=5 C.. A k-fold cross-validation strategy with k=5 and 3 repeats D.. An 80/20 stratified split between training and validation
B - B - stratified k-fold cross-validation will enforce the class distribution in each split of the data to match the distribution in the complete training dataset. B is the correct answer. Use Stratified k-Fold Cross-Validation for Imbalanced Classification. Stratified train/test splits is an option too. But the question is specifically asking "cross-validation" strategy. In summary, Option B is the most appropriate strategy for handling the imbalanced dataset and ensuring reliable performance metrics for the binary classifier. for imbalanced data. Stratified k-fold cross-validation ensures that the distribution of the target variable is the same in each fold. This is important for binary classification problems, where the target variable is imbalanced. In this case, the disease is seen in only 3% of the population. This means that if we do not use stratified k-fold cross-validation, then there is a risk that the training and validation sets will not be representative of the actual population. B https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d Stratified cross validation is for unbalanced data like this! Why K=5? K=5 is just standard Yes, B... - https://www.examtopics.com/discussions/amazon/view/45178-exam-aws-certified-machine-learning-specialty-topic-1/
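A minimal scikit-learn sketch (synthetic data, not the actual patient records) of the stratified 5-fold strategy chosen above: every fold keeps roughly the same ~3% positive rate as the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))             # 400 patients, 10 hypothetical test results
y = (rng.random(400) < 0.03).astype(int)   # ~3% of patients have the disease

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold contains a proportional share of the rare positive class.
    print(f"fold {fold}: positives in validation = {y[val_idx].sum()}")
```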
101
101 - A technology startup is using complex deep neural networks and GPU compute to recommend the company's products to its existing customers based upon each customer's habits and interactions. The solution currently pulls each dataset from an Amazon S3 bucket before loading the data into a TensorFlow model pulled from the company's Git repository that runs locally. This job then runs for several hours while continually outputting its progress to the same S3 bucket. The job can be paused, restarted, and continued at any time in the event of a failure, and is run from a central queue. Senior managers are concerned about the complexity of the solution's resource management and the costs involved in repeating the process regularly. They ask for the workload to be automated so it runs once a week, starting Monday and completing by the close of business Friday. Which architecture should be used to scale the solution at the lowest cost? - A.. Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS Batch on a GPU-compatible Spot Instance B.. Implement the solution using a low-cost GPU-compatible Amazon EC2 instance and use the AWS Instance Scheduler to schedule the task C.. Implement the solution using AWS Deep Learning Containers, run the workload using AWS Fargate running on Spot Instances, and then schedule the task using the built-in task scheduler D.. Implement the solution using Amazon ECS running on Spot Instances and schedule the task using the ECS service scheduler
A - Answer is A https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/ Makes the most sense. I would go for D; as far as I know, Fargate does not support GPU computing. It does support GPU: https://docs.aws.amazon.com/batch/latest/userguide/fargate.html That is wrong information; it does not support GPU. The problem is that Fargate is serverless, which means you can't control its compute capabilities. To scale the solution at the lowest cost, the best architecture would be Option A: implement the solution using AWS Deep Learning Containers and run the container as a job using AWS Batch on a GPU-compatible Spot Instance. This approach leverages AWS Batch to manage the job scheduling and execution, while using Spot Instances to significantly reduce costs. Would you like more details on how to set this up or any other aspect of optimizing your architecture? Fargate doesn't support GPU: https://github.com/aws/containers-roadmap/issues/88 AWS Batch will easily satisfy the requirements. Fargate doesn't support GPU, so go with AWS Batch and DLC (Deep Learning Containers). A. NO - Fargate provides batch functionality already fully integrated with ECS B. NO - too low level C. YES - AWS Deep Learning Containers are optimized; AWS Fargate is serverless (so less ops complexity); Spot is best for cost D. NO - ECS service scheduler is not serverless. Answer is A. C is not correct: GPU resources aren't supported for jobs that run on Fargate resources. A and C are both great answers, but when it comes to cost I believe A is the more cost-effective solution, so A is my answer. Automate the workload by scheduling the job to run once a week using AWS Batch's built-in scheduler or a cron expression. Optimize the performance by using AWS Deep Learning Containers that are tailored for GPU acceleration and deep learning frameworks. Reduce the cost by using Spot Instances that offer significant savings compared to On-Demand Instances. Handle failures by using AWS Batch's retry strategies that can automatically restart the job on a different instance if the Spot Instance is interrupted. Answer is A (but the question is tricky). A and D are both workable solutions, but pay attention to the words: "Senior managers are concerned about the complexity of the solution's resource management and the costs". With cost, everything is simple - use Spot Instances; with resource management - use a higher-abstraction service. AWS Batch is a management/abstraction layer on top of ECS and EC2 (and some other AWS resources). It does some things for you, like cost optimization, that can be difficult to do yourself. Think of it like Elastic Beanstalk for batch operations. It provides a management layer on top of lower-level AWS resources, but if you are comfortable managing those lower-level resources yourself and want more control over them, it is certainly an option to use those lower-level resources directly. Why not C? AWS Fargate is oriented toward managing resources as you need. A looks good to me: https://aws.amazon.com/blogs/compute/deep-learning-on-aws-batch/ For those who think it's B because of Spot Instance interruption, read the question phrase "The job can be paused, restarted, and continued at any time in the event of a failure, and is run from a central queue." Between A and C: at the time of this question I doubt Fargate supported GPU; even if it did, I choose AWS Batch for jobs and Fargate for services/apps that need to run all the time.
A is the answer. Option A is the most cost-effective architecture, as it uses a GPU-compatible Spot Instance, which is the lowest-cost compute option for GPU instances in the AWS cloud. AWS Batch is a fully managed service that schedules, runs, and manages the processing and analysis of batch workloads. The use of AWS Deep Learning Containers enables the technology startup to use pre-built, optimized Docker containers for deep learning, which reduces the complexity of the solution's resource management and eliminates the need for repeated processing. Answer is A. Option B is similar to A, but it uses a low-cost GPU-compatible EC2 instance rather than a container, which may not be as flexible or scalable as using containers. Answer is A https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/ There you have it - https://www.examtopics.com/discussions/amazon/view/44886-exam-aws-certified-machine-learning-specialty-topic-1/
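A rough boto3 sketch of option A (all names, the queue, and the container image tag are hypothetical placeholders): register the deep learning container as an AWS Batch job definition with a GPU requirement and submit it to a queue backed by a Spot, GPU-capable compute environment. The weekly Monday start would come from a separate schedule (for example, an EventBridge rule) invoking the job submission.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="weekly-recommender-training",        # placeholder name
    type="container",
    containerProperties={
        # Placeholder image: in practice, an AWS Deep Learning Container from ECR.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dl-training:latest",
        "command": ["python", "train.py"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "61440"},
            {"type": "GPU", "value": "1"},                   # request a GPU for the job
        ],
    },
)

batch.submit_job(
    jobName="recommender-weekly-run",
    jobQueue="gpu-spot-queue",                               # queue on a Spot GPU compute environment
    jobDefinition="weekly-recommender-training",
)
```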
102
102 - A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1..10]: Considering the graph, what is a reasonable selection for the optimal choice of k? [https://www.examtopics.com/assets/media/exam-media/04145/0006100001.png] - A.. 1 B.. 4 C.. 7 D.. 10
B - B seems correct based on the elbow method. I agree, most likely B. Hi all, I am not able to see the image; it is broken for me. Is it possible for someone to share the image? The closest number to the elbow is clearly 4. B is correct. Elbow method. B: https://en.wikipedia.org/wiki/Elbow_method_(clustering) The elbow is most visible at 4. B seems correct https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/ Elbow method. B seems correct. B seems better; number 4 is at the elbow of the curve. B is correct, but because the elbow method is a heuristic it's open to debate as to where the correct bend in the curve is. It's a good tool to use, with lower computation cost than computing the silhouette score. In https://www.youtube.com/watch?v=qs8nfzUsW5U, instead of eyeballing where the bend is, he calculates where the difference between scores is smaller than the 90th percentile - https://www.examtopics.com/discussions/amazon/view/44105-exam-aws-certified-machine-learning-specialty-topic-1/
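A short scikit-learn sketch of the elbow method discussed above (synthetic data with four blobs is an assumption, chosen so the elbow lands at k=4): inspect the within-cluster sum of squares (inertia) for k = 1..10 and pick the bend.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # synthetic data with 4 clusters

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia drops sharply until k=4, then flattens out
```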
103
103 - A media company with a very large archive of unlabeled images, text, audio, and video footage wishes to index its assets to allow rapid identification of relevant content by the Research team. The company wants to use machine learning to accelerate the efforts of its in-house researchers who have limited machine learning expertise. Which is the FASTEST route to index the assets? - A.. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes. B.. Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage. C.. Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes. D.. Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio transcription and topic modeling, and use object detection to tag data into distinct categories/classes.
A - A. The fastest route must use managed Amazon services. (Amazon Mechanical Turk is also an Amazon service.) A. YES - Rekognition with built-in labels for images & video, Transcribe to convert sound to text, and Comprehend for topic modeling B. NO - complicated C. NO - complicated D. NO - complicated. Managed AWS services for the fastest route. Why not C ("Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes")? Because this takes time and you need to have at least some technical ML expertise. I will choose B: https://aws.amazon.com/cn/getting-started/hands-on/machine-learning-tutorial-label-training-data/ Mechanical Turk is the most accurate, but the three services in letter A are the fastest! A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes is the fastest route to index the assets. These AWS services provide pre-built machine learning models that can be used to tag the content in the archive without the need for building custom models from scratch. This option would be faster than using custom models with the AWS Deep Learning AMI and Amazon EC2 GPU instances, or using Amazon Mechanical Turk for human labeling. Additionally, the use of pre-built models reduces the need for machine learning expertise, aligning with the company's goal of accelerating efforts by its in-house researchers. A. Option B is for those without ML experience, but the question says "researchers who have limited machine learning expertise", so A is better. A. The most straightforward use of services. I would have said B, but B says "label footage", which means it ignores the rest of the data, so I'd go with A. The question said "a very large archive", meaning a lot of money to pay for labor; B won't be as fast as a machine, plus you only label the footage and ignore the other assets. Would go for A. I will go with B. Correct answer is A. B, as no one in-house is an expert and it probably is the fastest way to get there. Take into consideration that it is "a very large archive" - https://www.examtopics.com/discussions/amazon/view/44106-exam-aws-certified-machine-learning-specialty-topic-1/
104
104 - A Machine Learning Specialist is working for an online retailer that wants to run analytics on every customer visit, processed through a machine learning pipeline. The data needs to be ingested by Amazon Kinesis Data Streams at up to 100 transactions per second, and the JSON data blob is 100 KB in size. What is the MINIMUM number of shards in Kinesis Data Streams the Specialist should use to successfully ingest this data? - A.. 1 shard B.. 10 shards C.. 100 shards D.. 1,000 shards
B - Agreed, B it is. See https://medium.com/slalom-data-analytics/amazon-kinesis-data-streams-auto-scaling-the-number-of-shards-105dc967bed5 One shard can ingest 1 MB/second or 1,000 records/second, so 100 KB * 100 = 10 MB/second, and 10 / 1 = 10 shards are required. Each shard in Amazon Kinesis Data Streams can support up to 1,000 transactions per second; the data needs to be ingested at up to 100 transactions per second, so on that basis alone one shard would suffice. However, we also need to consider the size of the JSON data blob: each blob is 100 KB, and each shard can only ingest up to 1 MB per second. This means we need at least 10 shards, so that the total of 10 MB/second of incoming data stays within the 1 MB/second per-shard limit. B - Max ingestion per shard = 1,000 KB/s --> 100 records * 100 KB = 10,000 KB/s --> 10,000 KB/s / 1,000 KB/s per shard = 10 shards. 10 should be correct. B is correct: 100 KB * 100 transactions/second = 10,000 KB = 10 MB; 10 MB / max threshold per shard (1 MB) = 10 shards. Reference: https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html - https://www.examtopics.com/discussions/amazon/view/43964-exam-aws-certified-machine-learning-specialty-topic-1/
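The shard arithmetic above as a one-liner, for reference (using the 1 MB/s per-shard ingest limit cited in the discussion):

```python
import math

records_per_sec = 100      # transactions per second
record_size_kb = 100       # JSON blob size in KB
per_shard_kb_per_sec = 1000

shards = math.ceil(records_per_sec * record_size_kb / per_shard_kb_per_sec)
print(shards)  # 10
```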
105
105 - A Machine Learning Specialist is deciding between building a naive Bayesian model or a full Bayesian network for a classification problem. The Specialist computes the Pearson correlation coefficients between each feature and finds that their absolute values range between 0.1 to 0.95. Which model describes the underlying data in this situation? - A.. A naive Bayesian model, since the features are all conditionally independent. B.. A full Bayesian network, since the features are all conditionally independent. C.. A naive Bayesian model, since some of the features are statistically dependent. D.. A full Bayesian network, since some of the features are statistically dependent.
D - I would say D, because of the correlations and dependencies between features. See https://towardsdatascience.com/basics-of-bayesian-network-79435e11ae7b and https://www.quora.com/Whats-the-difference-between-a-naive-Bayes-classifier-and-a-Bayesian-network?share=1 I agree, makes the most sense. It should be D. Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; however, the technique is very effective on a large range of complex problems. In this case, the absolute values of the Pearson correlation coefficients range from 0.1 to 0.95. This means that some of the features are statistically dependent. Therefore, a full Bayesian network is a better model for the underlying data than a naive Bayesian model. In a full Bayesian network, features are connected to each other by edges that represent their conditional dependence relationships. A full Bayesian network is useful when the relationships between the features are complex, non-linear, or when they are not conditionally independent. In this situation, where the Pearson correlation coefficients range between 0.1 and 0.95, this suggests that there are dependencies between the features, indicating that a full Bayesian network would be appropriate to capture the relationships between the features and model the data. The distinction between Bayes' theorem and Naive Bayes is that Naive Bayes assumes conditional independence where Bayes' theorem does not; this means the relationships between all input features are assumed to be independent. The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation. It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables. A naive Bayesian model, since some of the features are statistically dependent. D. Naive Bayes: features are independent given the class. I would say B. Naive Bayes assumes conditional independence, not statistical independence. You mean (A) naive Bayes, not (B). This is also a good source of information to help build your understanding: https://www.simplypsychology.org/correlation.html - https://www.examtopics.com/discussions/amazon/view/43965-exam-aws-certified-machine-learning-specialty-topic-1/
106
106 - A Data Scientist is building a linear regression model and will use resulting p-values to evaluate the statistical significance of each coefficient. Upon inspection of the dataset, the Data Scientist discovers that most of the features are normally distributed. The plot of one feature in the dataset is shown in the graphic. What transformation should the Data Scientist apply to satisfy the statistical assumptions of the linear regression model? [https://www.examtopics.com/assets/media/exam-media/04145/0006400001.jpg] - A.. Exponential transformation B.. Logarithmic transformation C.. Polynomial transformation D.. Sinusoidal transformation
B - I would say B. A logarithmic transformation converts skewed distributions towards normal. I would go with B. For right-skewed distributions -> logarithmic transformation; for left-skewed distributions -> exponential transformation. The linear regression model assumes that the errors are normally distributed. The plot of the feature shows that the errors are not normally distributed. The logarithmic transformation can be used to transform the errors to be normally distributed; the exponential transformation, polynomial transformation, and sinusoidal transformation cannot. B: when the feature data is not normally distributed, applying a logarithmic transformation can help to normalize the data and satisfy the assumptions of the linear regression model. 'A' would make it considerably worse. An exponential transformation would make it exponentially worse. :D Log-normal distribution => log() => normal distribution. B is the correct answer. This is B, as this feature seems skewed while the others have a regular distribution according to the question; the log transformation will reduce this feature's skewness. I think it's B. Reference: https://corporatefinanceinstitute.com/resources/knowledge/other/positively-skewed-distribution/#:~:text=For%20positively%20skewed%20distributions%2C%20the,each%20value%20in%20the%20dataset. "For positively skewed distributions, the most popular transformation is the log transformation. The log transformation implies the calculations of the natural logarithm for each value in the dataset. The method reduces the skew of a distribution. Statistical tests are usually run only when the transformation of the data is complete." I would also go for B, as the log transformation is often mentioned when we are talking about right (positive) skewness. - https://www.examtopics.com/discussions/amazon/view/44249-exam-aws-certified-machine-learning-specialty-topic-1/
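A small sketch of option B on synthetic data (a log-normal sample is assumed as a stand-in for the right-skewed feature in the plot): the log transform pulls the distribution toward a symmetric, roughly normal shape.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # right-skewed, like the plotted feature

print("skewness before:", round(skew(feature), 2))            # strongly positive
print("skewness after :", round(skew(np.log1p(feature)), 2))  # close to 0 after log transform
```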
107
107 - A Machine Learning Specialist is assigned to a Fraud Detection team and must tune an XGBoost model, which is working appropriately for test data. However, with unknown data, it is not working as expected. The existing parameters are provided as follows. Which parameter tuning guidelines should the Specialist follow to avoid overfitting? [https://www.examtopics.com/assets/media/exam-media/04145/0006500001.png] - A.. Increase the max_depth parameter value. B.. Lower the max_depth parameter value. C.. Update the objective to binary:logistic. D.. Lower the min_child_weight parameter value.
B - B (lower max_depth) is the correct answer. D: min_child_weight means something like "stop trying to split once your sample size in a node goes below a given threshold". Lower min_child_weight and the tree becomes deeper and more complex; increase min_child_weight and the tree will have fewer branches and less complexity. The max_depth parameter controls the maximum depth of the decision trees in the XGBoost model. A higher max_depth value will result in more complex decision trees, which can lead to overfitting. Overfitting occurs when a model performs well on the training data but poorly on unseen or test data. In the context of XGBoost, reducing the max_depth parameter helps prevent overfitting. The max_depth parameter controls the maximum depth of the trees in the ensemble; a smaller max_depth value limits the complexity of the trees, making them less likely to memorize the noise in the training data, and improves generalization to unseen data. It is B. B: overfitting problem. 12-Sep exam. When a model overfits, the solutions are: 1. Reduce model flexibility and complexity 2. Reduce the number of feature combinations 3. Decrease n-gram size 4. Decrease the number of numeric attribute bins 5. Increase the amount of regularization 6. Add dropout. B: a 30-deep tree is crazy; normally it's 6-7, no more. A. Increase the max_depth parameter value (this would increase the complexity, resulting in overfitting). B. Lower the max_depth parameter value (this would reduce the complexity and minimize overfitting). C. Update the objective to binary:logistic: it depends on the target(s); generally you would have binary classification for fraud detection, but there is nothing to say you can't have multi-class, so there is not enough information given. D. Lower the min_child_weight parameter value (this would reduce the complexity and minimize overfitting). I find that there are 2 correct answers to this question, which does not help: B & D. Ans: B. Lower max_depth values avoid over-fitting. No for D - larger min_child_weight values avoid over-fitting. Thus, those parameters can be used to control the complexity of the trees, and it is important to tune them together in order to find a good trade-off between model bias and variance. min_child_weight is the minimum weight (or number of samples, if all samples have a weight of 1) required in order to create a new node in the tree. A smaller min_child_weight allows the algorithm to create children that correspond to fewer samples, thus allowing for more complex trees, but again, more likely to overfit. max_depth is the maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes only due to noise, causing the model to overfit. - https://www.examtopics.com/discussions/amazon/view/44289-exam-aws-certified-machine-learning-specialty-topic-1/
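A minimal XGBoost sketch of the tuning direction chosen above; the parameter values are illustrative assumptions, not the model's actual configuration. Lower max_depth is the chosen fix, and raising min_child_weight plus subsampling and L2 regularization are complementary ways to curb overfitting.

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    objective="binary:logistic",
    max_depth=4,            # lowered from a deep setting (e.g. 30) to reduce tree complexity
    min_child_weight=5,     # larger values also make splits more conservative
    subsample=0.8,          # row subsampling adds regularization
    reg_lambda=1.0,         # L2 regularization on leaf weights
    n_estimators=200,
)
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)])  # placeholders for the fraud dataset
```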
108
108 - A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed. The solution needs to do the following: ✑ Calculate an anomaly score for each web traffic entry. Adapt unusual event identification to changing web patterns over time. Which approach should the data scientist implement to meet these requirements? [https://www.examtopics.com/assets/media/exam-media/04145/0006500003.png] - A.. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random Cut Forest (RCF) built-in model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the RCF model to calculate the anomaly score for each record. B.. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker built-in XGBoost model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the XGBoost model to calculate the anomaly score for each record. C.. Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the k-Nearest Neighbors (kNN) SQL extension to calculate anomaly scores for each record using a tumbling window. D.. Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the Amazon Random Cut Forest (RCF) SQL extension to calculate anomaly scores for each record using a sliding window.
D - I think the answer is D: RCF works together with Kinesis Data Analytics, and the sliding window helps adapt to new information. Better to say "RCF is a built-in algorithm/function in Kinesis Data Analytics". D uses the built-in RCF algorithm, which is designed for anomaly detection on streaming data and can adapt to changing patterns over time. It does not require any training data or preprocessing steps, as the RCF algorithm can learn from the streaming data directly. It uses a sliding window, which allows for continuous updating of the anomaly scores based on the most recent data points. It leverages the Amazon Kinesis Data Analytics service, which provides a scalable and managed platform for running SQL queries on streaming data. Option A requires training an RCF model on historic data, which may not reflect the current web traffic patterns. It also adds complexity and latency by invoking a Lambda function for each record. Answer D. Option B is ruled out, since it applies a supervised classification model to an unsupervised problem. Option C proposes another model that is also not recommended compared to RCF. The easiest solution to implement that meets the stated criteria is option D. Option A is wrong, because we use Kinesis Data Streams for ingestion only. The data scientist needs to identify unusual web traffic patterns in real time and adapt to changing web patterns over time. Amazon Kinesis Data Analytics provides real-time analytics capabilities on streaming data. The Amazon Random Cut Forest (RCF) SQL extension is designed for anomaly detection in streaming data, which fits the requirement to calculate an anomaly score for each web traffic entry. Answer is D: "The algorithm starts developing the machine learning model using current records in the stream when you start the application. The algorithm does not use older records in the stream for machine learning, nor does it use statistics from previous executions of the application." https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html D it is; easy one. RCF is dynamic and adapts over time. D seems more appropriate. It is A: the only way to handle the historic data is using SageMaker, and you can preprocess a data stream using a Lambda. But the question does not require using historical data; by the way, it only has unlabeled historic data, and unlabeled data is not really useful for training a detection model. Definitely D: Data Analytics is using RCF, using a window for selecting data with SQL. One more reason to select D, not A, is that there is no Lambda function to preprocess records in Kinesis Data Streams. That's not true: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/lambda-preprocessing.html "Adapt unusual event identification to changing web patterns over time." -> option A does not satisfy this; it only mentions building the model once. The data scientist has access to unlabeled historic data to use, if needed; D makes no mention of this. Also, A says the Lambda function provides data enrichment. For me it's A. A and D both seem to work, but A does not satisfy requirement 2 (adapt to patterns over time), since the model is only trained on old data, so D may be better. It is definitely D - https://www.examtopics.com/discussions/amazon/view/44530-exam-aws-certified-machine-learning-specialty-topic-1/
109
109 - A Data Scientist received a set of insurance records, each consisting of a record ID, the final outcome among 200 categories, and the date of the final outcome. Some partial information on claim contents is also provided, but only for a few of the 200 categories. For each outcome category, there are hundreds of records distributed over the past 3 years. The Data Scientist wants to predict how many claims to expect in each category from month to month, a few months in advance. What type of machine learning model should be used? - A.. Classification month-to-month using supervised learning of the 200 categories based on claim contents. B.. Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in each category to expect from month to month. C.. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month. D.. Classification with supervised learning of the categories for which partial information on claim contents is provided, and forecasting using claim IDs and timestamps for all other categories.
C - I think it should be C, as the final outcome among the 200 categories is already known. No need to build a classification model; it's a pure forecasting problem. He said "for a few"; what about the unclassified many? I think we need to do classification for the rest first, as it will help us with the month-to-month forecasting later. C is my answer. No need to do classification, because you already know each record's final outcome in the dataset; the claim contents do not provide additional information. Forecasting. It's a pure forecasting problem. I would say no machine learning model is needed at all; just a count grouped by category in SQL is enough. The data is: FinalOutcome categories 1..200; records 1..100,000 with columns RecordID, FinalOutcome, Date, ClaimContents. Note: claim contents contain partial information, only for a few of the 200 categories. To predict how many claims to expect in each category from month to month, a few months in advance, we don't need the claim contents; we have all we need from the first 3 columns to train a forecast model. C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month. The problem requires the prediction of the number of claims in each category for each month, which is a time series forecasting problem. The timestamps and record IDs can be used to model the underlying patterns in the data, and the model can be trained to predict the number of claims in each category for future months based on these patterns. While the claim contents might provide additional information, the fact that partial information is only available for a few categories suggests that this information might not be enough to build a robust model, and that it might not be possible to apply supervised learning to all 200 categories. Instead, the model should be trained on the time series data (claim IDs and timestamps) for all categories, and the claim contents can be used to improve the accuracy of the model only for the categories for which such information is available. How can a forecasting/classification model be based on the claim ID (which should be unique)? It's a forecasting problem, not a classification one: predict how many claims to expect in each category from month to month, a few months in advance, and C is the only option mentioning forecasting. D is correct: multi-label classification to impute the missing claim contents, then forecasting what we want; C is missing the imputation part. The question is whether we can get something useful out of the handful of categories with claim contents and whether this would impact the forecast, as we could forecast the numbers without it... It is true that the final outcome is known, but C does not use the partial information from the 200 categories. Reinforcement learning currently is state of the art in stock prediction and other time series; why waste valuable information? For me it's B. This is a supervised learning approach: supervised learning problems can be further grouped into regression and classification problems. Classification: a classification problem is when the output variable is a category, such as “red” and “blue” or “disease” and “no disease.” Regression: a regression problem is when the output variable is a real value, such as “dollars” or “weight.” - https://www.examtopics.com/discussions/amazon/view/44290-exam-aws-certified-machine-learning-specialty-topic-1/
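A short pandas sketch (column names are hypothetical) of how the records described above would be reshaped for forecasting: one time series of monthly claim counts per category, which a forecasting model could then extrapolate a few months ahead.

```python
import pandas as pd

# Toy stand-in for the insurance records (record ID, outcome category, outcome date).
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "category": ["A", "A", "B", "A"],
    "outcome_date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-11", "2021-02-14"]),
})

# Monthly claim counts per category: the target series for the forecasting model.
monthly_counts = (
    df.groupby(["category", pd.Grouper(key="outcome_date", freq="MS")])
      .size()
      .rename("claims")
      .reset_index()
)
print(monthly_counts)
```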
110
110 - A company that promotes healthy sleep patterns by providing cloud-connected devices currently hosts a sleep tracking application on AWS. The application collects device usage information from device users. The company's Data Science team is building a machine learning model to predict if and when a user will stop utilizing the company's devices. Predictions from this model are used by a downstream application that determines the best approach for contacting users. The Data Science team is building multiple versions of the machine learning model to evaluate each version against the company's business goals. To measure long-term effectiveness, the team wants to run multiple versions of the model in parallel for long periods of time, with the ability to control the portion of inferences served by the models. Which solution satisfies these requirements with MINIMAL effort? - A.. Build and host multiple models in Amazon SageMaker. Create multiple Amazon SageMaker endpoints, one for each model. Programmatically control invoking different models for inference at the application layer. B.. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint configuration with multiple production variants. Programmatically control the portion of the inferences served by the multiple models by updating the endpoint configuration. C.. Build and host multiple models in Amazon SageMaker Neo to take into account different types of medical devices. Programmatically control which model is invoked for inference based on the medical device type. D.. Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple models. Use Amazon SageMaker batch transform to control invoking the different models through the single endpoint.
B - B is the correct answer. A/B testing with Amazon SageMaker is required in the exam. In A/B testing, you test different variants of your models and compare how each variant performs. Amazon SageMaker enables you to test multiple models or model versions behind the `same endpoint` using `production variants`. Each production variant identifies a machine learning (ML) model and the resources deployed for hosting the model. To test multiple models by `distributing traffic` between them, specify the `percentage of the traffic` that gets routed to each model by specifying the `weight` for each `production variant` in the endpoint configuration. I would answer B; it seems similar to this AWS example: https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html#model-testing-target-variant Option B. This solution allows the Data Science team to build and host multiple models in Amazon SageMaker, which is a fully managed service for training, deploying, and managing machine learning models. The team can then create an endpoint configuration with multiple production variants, which are different versions of the models. By programmatically updating the endpoint configuration, the team can control the portion of inferences served by the different models. This allows them to evaluate the models against their business goals and measure their long-term effectiveness without having to make changes at the application layer. Answer D, as it is said "the team intends to run numerous versions in parallel for extended periods of time," so batch transform. How can you create a single endpoint for batch transforms? This answer is nonsensical. It is possible to create a single endpoint for AWS Batch transforms. Here are the key steps: create an interface endpoint for AWS Batch in your VPC using the AWS CLI or console. The endpoint service name will be in the format of com.amazonaws..batch . When creating the endpoint, assign an IAM role with the necessary permissions to make calls to the Batch API. You can then submit batch transform jobs to AWS Batch referencing resources in both public and private subnets of the VPC. The endpoint ensures private connectivity to Batch. The single endpoint allows chaining multiple transforms together in a pipeline efficiently without needing internet access. New transforms can be added without redeploying the endpoint. AWS Batch will automatically provision the required compute environments like EC2 instances or containers to run the transforms and scale as needed based on job requirements. It says it "hosts a sleep tracking application"; it is hosted, which means online, not batch. B is correct. The possibility to alter the percentage of inferences supplied by the models: which method achieves these criteria with the LEAST amount of effort? B. Easy. I think the answer is D; below is from the SageMaker doc ("https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html"): Use Batch Transform to Test Production Variants. To test different models or various hyperparameter settings, create a separate transform job for each new model variant and use a validation dataset. For each transform job, specify a unique model name and location in Amazon S3 for the output file. To analyze the results, use Inference Pipeline Logs and Metrics. The question talks about the LEAST amount of effort; in this case, there would be as many transform jobs to build as there are variants, which may not be the least amount of effort.
- https://www.examtopics.com/discussions/amazon/view/44917-exam-aws-certified-machine-learning-specialty-topic-1/
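As a rough illustration of answer B, the sketch below (Python, boto3) creates one endpoint configuration with two production variants and later shifts the traffic split; all names, instance types, and weights are placeholder assumptions rather than values from the question.

import boto3

sm = boto3.client("sagemaker")

# One endpoint config, two production variants sharing the same endpoint.
sm.create_endpoint_config(
    EndpointConfigName="churn-ab-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "churn-model-a",          # model already created in SageMaker
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,           # 90% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "churn-model-b",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,           # 10% of traffic
        },
    ],
)
sm.create_endpoint(EndpointName="churn-ab", EndpointConfigName="churn-ab-config")

# Later, shift traffic between the variants without touching the application layer.
sm.update_endpoint_weights_and_capacities(
    EndpointName="churn-ab",
    DesiredWeightsAndCapacities=[
        {"VariantName": "model-a", "DesiredWeight": 0.5},
        {"VariantName": "model-b", "DesiredWeight": 0.5},
    ],
)

Because the traffic split lives in the endpoint configuration, the downstream application keeps calling a single endpoint name and never needs to know which variant served a given request.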
111
111 - An agricultural company is interested in using machine learning to detect specific types of weeds in a 100-acre grassland field. Currently, the company uses tractor-mounted cameras to capture multiple images of the field as 10 × 10 grids. The company also has a large training dataset that consists of annotated images of popular weed classes like broadleaf and non-broadleaf docks. The company wants to build a weed detection model that will detect specific types of weeds and the location of each type within the field. Once the model is ready, it will be hosted on Amazon SageMaker endpoints. The model will perform real-time inferencing using the images captured by the cameras. Which approach should a Machine Learning Specialist take to obtain accurate predictions? - A.. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes. B.. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. C.. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) algorithm. D.. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.
C - C is my answer. Note that the question is asking for two things: (1) detect specific types of weeds and (2) detect the location of each type within the field. Image classification can only classify whole images. An object detection algorithm identifies all instances of objects within the image and indicates the location and scale of each one with a rectangular bounding box. For the computer vision algorithms in SageMaker, the recommended data format is RecordIO.

RecordIO format is efficient for storing and processing large datasets, which is beneficial for training deep learning models. The object detection SSD algorithm is designed to detect and locate multiple objects within an image, making it ideal for identifying and pinpointing various types of weeds in the field.

RecordIO is preferred, and object detection is needed because there are several types of weeds to locate.

The goal is to detect specific types of weeds and their locations within a field, which requires object detection rather than image classification. Object detection algorithms identify objects and their locations within an image, whereas image classification algorithms only categorize an entire image into classes. Single-shot multibox detectors (SSD) are a type of object detection algorithm well suited for real-time inferencing and effective across a variety of detection tasks. By preparing the images in RecordIO format and using Amazon SageMaker, the company can easily train, test, and validate the model and deploy it in a scalable and secure environment.

C is the right answer; you need to detect location.

C. You can detect the type of weeds and their location within the field. If one of the options had offered Faster R-CNN it would be a different discussion. This is a good article comparing SSD, Faster R-CNN, R-FCN, and others: https://jonathan-hui.medium.com/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359

I would go with answer C. SSD is a newer architecture that is faster than older CNN-based detectors: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab

I would select answer A; the situation is very similar to this one: https://aws.amazon.com/blogs/machine-learning/building-a-lawn-monitor-and-weed-detection-solution-with-aws-machine-learning-and-iot-services/

I think it is better to go with C, since the question also asks for the location of the weeds within the field, while the example you posted is just a classifier.

Since the field is divided into a 10 × 10 grid, I felt A was more suitable.

So you expect precisely one weed per grid cell (100 in total) across an entire field? If the field were a hectare, each grid cell would still cover 100 m². Also, the linked blog post only covers classification and does not address object detection. The answer should be C.
- https://www.examtopics.com/discussions/amazon/view/44063-exam-aws-certified-machine-learning-specialty-topic-1/
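A minimal sketch of the approach in answer C with the SageMaker Python SDK, assuming the annotated images have already been packed into RecordIO files in S3; the bucket, role ARN, and hyperparameter values are illustrative assumptions.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"   # assumed execution role

# Built-in object detection (SSD) container for the current region.
container = image_uris.retrieve("object-detection", session.boto_region_name)

od = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://weed-detection/output/",
    sagemaker_session=session,
)
od.set_hyperparameters(
    base_network="resnet-50",      # SSD backbone
    num_classes=2,                 # e.g. broadleaf vs. non-broadleaf dock
    num_training_samples=25000,    # placeholder
    mini_batch_size=16,
    epochs=30,
)

# RecordIO channels prepared in S3.
od.fit({
    "train": TrainingInput("s3://weed-detection/train/", content_type="application/x-recordio"),
    "validation": TrainingInput("s3://weed-detection/validation/", content_type="application/x-recordio"),
})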
112
112 - A manufacturer is operating a large number of factories with a complex supply chain relationship where unexpected downtime of a machine can cause production to stop at several factories. A data scientist wants to analyze sensor data from the factories to identify equipment in need of preemptive maintenance and then dispatch a service team to prevent unplanned downtime. The sensor readings from a single machine can include up to 200 data points including temperatures, voltages, vibrations, RPMs, and pressure readings. To collect this sensor data, the manufacturer deployed Wi-Fi and LANs across the factories. Even though many factory locations do not have reliable or high- speed internet connectivity, the manufacturer would like to maintain near-real-time inference capabilities. Which deployment architecture for the model will address these business requirements? - A.. Deploy the model in Amazon SageMaker. Run sensor data through this model to predict which machines need maintenance. B.. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer which machines need maintenance. C.. Deploy the model to an Amazon SageMaker batch transformation job. Generate inferences in a daily batch report to identify machines that need maintenance. D.. Deploy the model in Amazon SageMaker and use an IoT rule to write data to an Amazon DynamoDB table. Consume a DynamoDB stream from the table with an AWS Lambda function to invoke the endpoint.
B - I would select B, based on the following AWS examples: https://aws.amazon.com/blogs/iot/industrial-iot-from-condition-based-monitoring-to-predictive-quality-to-digitize-your-factory-with-aws-iot-services/ and https://aws.amazon.com/blogs/iot/using-aws-iot-for-predictive-maintenance/

B is my answer. For latency-sensitive use cases and for use cases that require analyzing large amounts of streaming data, it may not be possible to run ML inference in the cloud; cloud connectivity may also not be available all the time. For these use cases, you need to deploy the ML model close to the data source: SageMaker Neo + AWS IoT Greengrass. To design and push a model to the edge: (1) build a model that does the job, for example a TensorFlow model; (2) compile it for the edge device using SageMaker Neo, for example for an NVIDIA Jetson; (3) run it on the edge using IoT Greengrass without relying on internet connectivity.

The requirement calls for an edge solution because internet reliability is low, and IoT Greengrass is the best option for edge inference.

This is an edge solution with as little traffic to in-Region AWS resources as possible. Start thinking IoT Greengrass and SageMaker Neo and you are halfway there. Answer is B, no doubt.

B is the answer. This scenario requires edge capabilities and the ability to run the inference models in near real time. A model compiled with SageMaker Neo is a deployable unit on the edge architecture (IoT Greengrass) that can host the runtime inference model. A is not a complete solution; many details are missing. C has a major defect: a daily batch report is not near real time. D, writing to DynamoDB and invoking the endpoint via Lambda, makes the solution slower than using IoT Greengrass. Answer: B.

I would choose B because IoT Greengrass reduces latency by running inference on the local machine.

I would choose B.
- https://www.examtopics.com/discussions/amazon/view/44064-exam-aws-certified-machine-learning-specialty-topic-1/
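For the edge workflow described above (train, compile with SageMaker Neo, run under IoT Greengrass), here is a hedged sketch of the Neo compilation step with boto3; the job name, role, S3 paths, framework, input-shape key, and target device are all assumptions for illustration.

import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="predictive-maintenance-neo",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    InputConfig={
        "S3Uri": "s3://factory-models/model.tar.gz",
        # Key is the model's input tensor name (assumed); 200 sensor readings per sample.
        "DataInputConfig": '{"inputs": [1, 200]}',
        "Framework": "TENSORFLOW",
    },
    OutputConfig={
        "S3OutputLocation": "s3://factory-models/compiled/",
        "TargetDevice": "jetson_xavier",             # example edge target
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
# The compiled artifact would then be packaged and deployed to each factory's
# Greengrass core device so inference runs locally, without internet access.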
113
113 - A Machine Learning Specialist is designing a scalable data storage solution for Amazon SageMaker. There is an existing TensorFlow-based model implemented as a train.py script that relies on static training data that is currently stored as TFRecords. Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead? - A.. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data. B.. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data. C.. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords. D.. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
B - I would select B. Based on the following AWS documentation, this appears to be the right approach: https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html and https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-horovod-inference-pipeline/train.py

B is my answer. Reading the data is as simple as:
filenames = ["s3://bucketname/path/to/file1.tfrecord", "s3://bucketname/path/to/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
This approach leverages the existing TFRecord format and minimizes changes to the current setup, ensuring a smooth transition to Amazon SageMaker with minimal development effort.

Option B. Options C and D need code development and are therefore discarded. As we want scalable data storage, B is recommended, since S3 is scalable. Option A is wrong because local storage is not scalable.

Where has the Capslock Donald gone? I kind of miss his answers.

Internet connectivity issue: then how can IoT be a solution? (The correct answer should be A.)

Amazon SageMaker script mode enables training a machine learning model using a script that you provide. By using the unchanged train.py script and putting the TFRecord data into an Amazon S3 bucket, you can point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data. This option avoids the need to rewrite the train.py script or to prepare the data in a different format. It also leverages the scalability and cost-effectiveness of Amazon S3 for storing large amounts of data, which is important for training machine learning models. Thank you, ChatGPT.

B, obviously.

I like answer B.

Why not A? Why can't we train from local storage? SageMaker, to my understanding, requires the data to be in S3.

B. https://aws.amazon.com/about-aws/whats-new/2019/01/amazon-sagemaker-batch-transform-now-supports-tfrecord-format/

Unfortunately, you cannot always use the script completely unchanged; there are some things that may need to be added: (1) make sure your script can handle --model_dir as an additional command line argument (if you did not specify a location when you created the TensorFlow estimator, an S3 location under the default training job bucket is used; distributed training with parameter servers requires the tf.estimator.train_and_evaluate API and an S3 location as the model directory during training); (2) load input data from the input channels, which are defined when fit is called. See https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html

Because of those prerequisites, A and B are an easy disqualification. There is no need to change the training format, so option C is a red herring. The answer is D.

Not the most obvious conclusion given your explanation; the correct answer should be B. The question mentions using SageMaker in "script mode", which is different from working in SageMaker with the Python SDK's framework defaults.
- https://www.examtopics.com/discussions/amazon/view/44065-exam-aws-certified-machine-learning-specialty-topic-1/
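A minimal sketch of answer B using the SageMaker Python SDK's TensorFlow estimator in script mode; the role ARN, S3 prefix, and framework/Python versions are placeholder assumptions and should be matched to a combination supported by the SDK.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",              # the existing script, unchanged
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",            # placeholder version
    py_version="py39",
)

# SageMaker copies the TFRecord files from S3 into the training container;
# train.py reads them from the channel directory (SM_CHANNEL_TRAINING).
estimator.fit({"training": "s3://my-bucket/tfrecords/"})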
114
114 - The chief editor for a product catalog wants the research and development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company's retail brand. The team has a set of training data. Which machine learning algorithm should the researchers use that BEST meets their requirements? - A.. Latent Dirichlet Allocation (LDA) B.. Recurrent neural network (RNN) C.. K-means D.. Convolutional neural network (CNN)
D - D. CNN, because this is an image task.

Yeah, nothing creepy about a company wanting to do this :)

D - images mean CNN. D is correct.

Well, I have been doing these ExamTopics questions for two weeks and they still seem easier than the Maarek practice exams that simulate the AWS MLS exam. Is that right? Can I expect the real exam to be as easy as these questions?

Convolutional neural networks (CNNs) are specifically designed for image recognition tasks and have been highly successful in detecting patterns and features within images. CNNs are particularly effective at capturing spatial patterns and visual features from images.

The answer is "D".
- https://www.examtopics.com/discussions/amazon/view/45463-exam-aws-certified-machine-learning-specialty-topic-1/
115
115 - A retail company is using Amazon Personalize to provide personalized product recommendations for its customers during a marketing campaign. The company sees a significant increase in sales of recommended items to existing customers immediately after deploying a new solution version, but these sales decrease a short time after deployment. Only historical data from before the marketing campaign is available for training. How should a data scientist adjust the solution? - A.. Use the event tracker in Amazon Personalize to include real-time user interactions. B.. Add user metadata and use the HRNN-Metadata recipe in Amazon Personalize. C.. Implement a new solution using the built-in factorization machines (FM) algorithm in Amazon SageMaker. D.. Add event type and event value fields to the interactions dataset in Amazon Personalize.
A - A is the correct answer. In this case the problem is not with the existing historical data (event value, event type such as click or not); the issue is that sales stop growing after deployment, so the model needs more recent interaction data. An event tracker specifies a destination dataset group for new event data.

I agree, A is the right choice. The model needs real-time data to adjust its recommendations; here is the relevant guidance: https://docs.aws.amazon.com/personalize/latest/dg/maintaining-relevance.html

A, real-time data. A is the right choice.

A. Use the event tracker in Amazon Personalize to include real-time user interactions. By using the event tracker, the data scientist can collect real-time user interactions, including clicks, views, and purchases, and use them to update the model and generate more accurate recommendations. This addresses the decrease in sales after deploying a new solution version, as the model can be updated to reflect the latest customer behavior. Including real-time interactions also helps the model respond to changes in customer behavior and provide more relevant, personalized recommendations, which can help increase sales.

Easy one. A.

A. https://docs.aws.amazon.com/personalize/latest/dg/recording-events.html
- https://www.examtopics.com/discussions/amazon/view/45464-exam-aws-certified-machine-learning-specialty-topic-1/
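A sketch of answer A with boto3: create an event tracker once, then stream interactions from the application as they happen. The dataset group ARN, IDs, and event fields are placeholder assumptions.

import datetime
import boto3

personalize = boto3.client("personalize")
personalize_events = boto3.client("personalize-events")

# One-time setup: an event tracker that feeds new interactions into the dataset group.
tracker = personalize.create_event_tracker(
    name="retail-event-tracker",
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/retail",
)
tracking_id = tracker["trackingId"]

# Called by the application whenever a customer interacts with a recommended item.
personalize_events.put_events(
    trackingId=tracking_id,
    userId="user-123",
    sessionId="session-456",
    eventList=[{
        "eventType": "purchase",
        "itemId": "item-789",
        "sentAt": datetime.datetime.now(),
    }],
)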
116
116 - A machine learning (ML) specialist wants to secure calls to the Amazon SageMaker Service API. The specialist has configured Amazon VPC with a VPC interface endpoint for the Amazon SageMaker Service API and is attempting to secure traffic from specific sets of instances and IAM users. The VPC is configured with a single public subnet. Which combination of steps should the ML specialist take to secure the traffic? (Choose two.) - A.. Add a VPC endpoint policy to allow access to the IAM users. B.. Modify the users' IAM policy to allow access to Amazon SageMaker Service API calls only. C.. Modify the security group on the endpoint network interface to restrict access to the instances. D.. Modify the ACL on the endpoint network interface to restrict access to the instances. E.. Add a SageMaker Runtime VPC endpoint interface to the VPC.
AC - A & C: https://aws.amazon.com/blogs/machine-learning/securing-all-amazon-sagemaker-api-calls-with-aws-privatelink/

A - a VPC endpoint policy can limit access to a specific group of users/roles. Not B - setting an IAM user policy can limit which other AWS services the user can access, but it does not secure the traffic. C - "specific" sets of instances means security rules at the instance level. Not D - a network ACL allows or denies specific inbound or outbound traffic at the subnet level. Not E - the VPC is configured with a public subnet; adding another interface endpoint without limiting the traffic does not make it secure.

A. YES - for users. B. NO - the users should access more than just SageMaker. C. YES - for instances. D. NO - network ACLs operate at the subnet level, not on the endpoint network interface. E. NO - the endpoint is already there.

A. Add a VPC endpoint policy to allow access to the IAM users: this specifies the permissions for the IAM users to access the Amazon SageMaker Service API through the VPC endpoint. C. Modify the security group on the endpoint network interface to restrict access to the instances: by configuring the security group, the specialist can control which instances are allowed to communicate with the SageMaker Service API through the VPC endpoint.

Should it be A & D instead? We want to configure the endpoint, first to allow the IAM users and second to control access from the instances. Since security groups are attached to instances (not VPCs) and only support allow rules, it should be D.

Yes, A & D are correct. A: this limits access to only the named IAM users; it is like allowing everything for the given principals, for example a policy of the form { "Statement": [ { "Effect": "Allow", "Principal": "*", "Action": "*", "Resource": "*" } ] } but with specific principals. D: to restrict access to certain instances or IP addresses, you define deny rules at the network ACL level. Here the VPC interface endpoint is in the only subnet in the VPC, so you modify the NACL configuration at that subnet level. Security groups can only allow traffic, not deny it, so C is incorrect.

A security group cannot restrict access explicitly, so C? I mean A, D.

A security group controls instance-level access, and the question requires instance-level access. The VPC endpoint is already set up; it needs a policy attached for particular IAM users. I would have preferred the question to use IAM roles instead of users, but nevertheless the answer is A & C.

A says allow access TO the IAM users? That is weird - why "to" the IAM users? How do you access them?

The VPC endpoint is already available, waiting to be configured; there is no need to add one, so A and E are out. Furthermore, if an endpoint policy is not set, a default one is provided, and you cannot have more than one policy, but you can modify the one that is available. Restrict access to only calls coming from the VPC, then modify the security group to give access to the user groups or roles that need access to that notebook. I think the answer is B and C.

A says add a VPC endpoint policy, not add an endpoint. https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-access.html https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html#nbi-private-link-policy https://docs.aws.amazon.com/vpc/latest/userguide/integrated-services-vpce-list.html
- https://www.examtopics.com/discussions/amazon/view/44301-exam-aws-certified-machine-learning-specialty-topic-1/
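A hedged sketch of what A and C could look like with boto3: an endpoint policy scoped to named IAM principals, and an HTTPS-only ingress rule on the endpoint's security group. The endpoint ID, security group IDs, and principal ARN are placeholder assumptions.

import json
import boto3

ec2 = boto3.client("ec2")

# (A) Endpoint policy restricted to the intended IAM users.
policy = {
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["arn:aws:iam::123456789012:user/data-scientist-1"]},
        "Action": "sagemaker:*",
        "Resource": "*",
    }]
}
ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0abc123",            # existing SageMaker API interface endpoint
    PolicyDocument=json.dumps(policy),
)

# (C) Security group on the endpoint network interface: allow HTTPS only from the
# security group used by the approved instances.
ec2.authorize_security_group_ingress(
    GroupId="sg-endpoint",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": "sg-approved-instances"}],
    }],
)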
117
117 - An e commerce company wants to launch a new cloud-based product recommendation feature for its web application. Due to data localization regulations, any sensitive data must not leave its on-premises data center, and the product recommendation model must be trained and tested using nonsensitive data only. Data transfer to the cloud must use IPsec. The web application is hosted on premises with a PostgreSQL database that contains all the data. The company wants the data to be uploaded securely to Amazon S3 each day for model retraining. How should a machine learning specialist meet these requirements? - A.. Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest tables without sensitive data through an AWS Site-to-Site VPN connection directly into Amazon S3. B.. Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest all data through an AWS Site-to-Site VPN connection into Amazon S3 while removing sensitive data using a PySpark job. C.. Use AWS Database Migration Service (AWS DMS) with table mapping to select PostgreSQL tables with no sensitive data through an SSL connection. Replicate data directly into Amazon S3. D.. Use PostgreSQL logical replication to replicate all data to PostgreSQL in Amazon EC2 through AWS Direct Connect with a VPN connection. Use AWS Glue to move data from Amazon EC2 to Amazon S3.
A - Ask: extract data over IPsec. So we need an ETL service plus a Site-to-Site VPN. Glue is an ETL service, but can it connect to PostgreSQL? Yes: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc. How to connect Glue to an on-premises database: https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/. My answer would be A. Answer C only makes a 443 (SSL) connection, so it does not meet the IPsec requirement.

A? IPsec needs to be covered as well. Yes: https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/. 'A' is the correct answer, because IPsec is required.

A: https://medium.com/awsblackbelt/loading-on-prem-postgres-data-into-amazon-s3-with-server-side-filtering-c13bcee8b769. B does not ensure that only nonsensitive data leaves the on-premises data center. C uses SSL, not IPsec. D, like B, transfers all data. Hence the correct answer is A.

I will go with B: Site-to-Site VPN covers the IPsec requirement, AWS Glue connects to and catalogs PostgreSQL, and a PySpark job (which Glue supports) removes the sensitive data.

The best option is to use AWS Database Migration Service (AWS DMS) with table mapping to select PostgreSQL tables with no sensitive data through an SSL connection and replicate the data directly into Amazon S3. This option ensures that only nonsensitive data is transferred to the cloud by using table mapping to filter out the tables that contain sensitive data, it secures the data transfer by enabling SSL encryption for the AWS DMS endpoint, and it uploads the data to Amazon S3 each day for model retraining by using the ongoing replication feature of AWS DMS.

IPsec and SSL are two different things. Using SSL does not mean option C implements IPsec, which is required.

But Glue cannot filter out the data during ingestion, so option A would not be right; I would go for B. I think A is saying to ingest only tables that do not contain sensitive data, meaning that while configuring Glue the specialist selects only the tables without sensitive data for ingestion.

B. Both A and C are not correct, because the question does not say there are tables with no sensitive data, and DMS typically acts on the data on the AWS side; the right answer is B. AWS Glue connects to the PostgreSQL database, allowing the removal of sensitive data using a PySpark job BEFORE securely ingesting the data into Amazon S3, thus aligning with the requirements.

I think the issue with that answer is that the data actually leaves the data center and enters the Glue service before the sensitive data is redacted, which makes me lean A.

Option C.

A: https://aws.amazon.com/blogs/big-data/doing-data-preparation-using-on-premises-postgresql-databases-with-aws-glue-databrew/

A. Create an AWS Glue job to connect to the PostgreSQL DB instance and ingest tables without sensitive data through an AWS Site-to-Site VPN connection directly into Amazon S3. This solution meets the data localization and secure transfer requirements. By creating an AWS Glue job to connect to the PostgreSQL DB instance, the machine learning specialist can extract only the tables without sensitive data, and by using a Site-to-Site VPN connection the data can be securely transferred from the on-premises data center to Amazon S3, where it can be used for model retraining. This ensures that any sensitive data remains in the on-premises data center and that only nonsensitive data is uploaded to the cloud.

IPsec means VPN. The answer is A. IPsec is not the same as SSL; Site-to-Site VPN is what provides IPsec: https://aws.amazon.com/vpn/site-to-site-vpn/. Also, Glue can connect directly to PostgreSQL and upload to S3: https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/

A is the answer. Between A and C, I pick A because IPsec requires a VPN; otherwise DMS would be a better option.
- https://www.examtopics.com/discussions/amazon/view/44073-exam-aws-certified-machine-learning-specialty-topic-1/
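A sketch of the Glue ETL script behind answer A, reading a single non-sensitive table from the on-premises PostgreSQL database through a pre-created Glue JDBC connection (reached over the Site-to-Site VPN) and writing it to S3. The connection, table, and bucket names are placeholder assumptions.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that contains no sensitive columns, using a Glue connection
# ("onprem-postgres") that points at the on-premises PostgreSQL instance.
product_views = glue_context.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "onprem-postgres",
        "dbtable": "public.product_views",
    },
)

# Land the non-sensitive data in S3 for daily model retraining.
glue_context.write_dynamic_frame.from_options(
    frame=product_views,
    connection_type="s3",
    connection_options={"path": "s3://recs-training-data/product_views/"},
    format="parquet",
)

job.commit()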
118
118 - A logistics company needs a forecast model to predict next month's inventory requirements for a single item in 10 warehouses. A machine learning specialist uses Amazon Forecast to develop a forecast model from 3 years of monthly data. There is no missing data. The specialist selects the DeepAR+ algorithm to train a predictor. The predictor's mean absolute percentage error (MAPE) is much larger than the MAPE produced by the current human forecasters. Which changes to the CreatePredictor API call could improve the MAPE? (Choose two.) - A.. Set PerformAutoML to true. B.. Set ForecastHorizon to 4. C.. Set ForecastFrequency to W for weekly. D.. Set PerformHPO to true. E.. Set FeaturizationMethodName to filling.
AD - I would choose A and D; however, both of them are not possible at the same time, so the question is ambiguous - it probably means "which two options", not necessarily both together. A - if you want Amazon Forecast to evaluate each algorithm and choose the one that minimizes the objective function, set PerformAutoML to true. D - DeepAR+ is among the algorithms that support HPO. If custom forecast types are specified, Forecast evaluates metrics at those forecast types and takes the averages of those metrics to determine the optimal outcomes during HPO and AutoML. For both AutoML and HPO, Forecast chooses the option that minimizes the average losses over the forecast types. During HPO, Forecast uses the first backtest window to find the optimal hyperparameter values; during AutoML, Forecast uses the averages across all backtest windows and the optimal hyperparameter values from HPO to find the optimal algorithm. https://docs.aws.amazon.com/forecast/latest/dg/metrics.html

It is A and D. There is no weekly data, only monthly data, so the frequency cannot be switched, and changing the horizon to 4 does not help.

A. YES - DeepAR+ is most likely to be chosen anyway, but worth a try. B. NO - increasing the forecast horizon is not likely to improve the forecast we want. C. NO - we want monthly, not weekly. D. YES. E. NO - there are no missing values.

The changes to the CreatePredictor API call that could improve the MAPE are options A and D. By setting PerformAutoML to true, you let Amazon Forecast automatically explore different algorithms and choose the best one for your data and business problem. By setting PerformHPO to true, you let Amazon Forecast perform hyperparameter optimization (HPO) and tune the algorithm parameters to improve the accuracy of the predictor. These options help you find the optimal configuration for your forecast model without manually specifying the algorithm or the hyperparameters.

A. Set PerformAutoML to true. D. Set PerformHPO to true. PerformAutoML enables Amazon Forecast to automatically select the best algorithm for your data and problem, which can improve the MAPE by finding the combination of algorithm and hyperparameters that minimizes prediction error. PerformHPO enables a hyperparameter optimization search for the combination of hyperparameters that gives the best prediction performance.

A - looking for better algorithm performance; D - hyperparameter optimization. Appeared on the 12-Sep exam.

Why not B and C? Those changes would increase the MAPE (which is bad): B - if the forecast horizon is larger, the error will increase; C - the data is monthly, so changing the frequency will introduce forecasting errors; E - there is no data gap, so filling is useless. A - selecting the best among all algorithms should DECREASE the MAPE; D - tuning hyperparameters will DECREASE the MAPE.

A & D: by default, Amazon Forecast uses the 0.1 (P10), 0.5 (P50), and 0.9 (P90) quantiles for hyperparameter tuning during HPO and for model selection during AutoML. If you specify custom forecast types when creating a predictor, Forecast uses those forecast types during HPO and AutoML, evaluates metrics at those forecast types, and takes the averages of those metrics to determine the optimal outcomes. For both AutoML and HPO, Forecast chooses the option that minimizes the average losses over the forecast types. During HPO, Forecast uses the first backtest window to find the optimal hyperparameter values; during AutoML, Forecast uses the averages across all backtest windows and the optimal hyperparameter values from HPO to find the optimal algorithm.

C. ForecastFrequency: M for monthly, W for weekly. D. PerformHPO: whether to perform hyperparameter optimization (HPO). HPO finds optimal hyperparameter values for your training data; the process is known as running a hyperparameter tuning job. The default value is false, in which case Amazon Forecast uses the chosen algorithm's default hyperparameter values. E. FeaturizationMethodName: the name of the method; "filling" is the only supported method. But for option C, according to the Developer Guide, the forecast frequency must be greater than or equal to the TARGET_TIME_SERIES dataset frequency, and the training data is monthly, so ForecastFrequency cannot be finer than monthly. A, B, and E can be excluded; C and D is my answer.

A. PerformAutoML: if you want Amazon Forecast to evaluate each algorithm and choose the one that minimizes the objective function, set PerformAutoML to true. The objective function is defined as the mean of the weighted losses over the forecast types; by default these are the p10, p50, and p90 quantile losses. When AutoML is enabled, the following properties are disallowed: AlgorithmArn, HPOConfig, PerformHPO, TrainingParameters. B. ForecastHorizon: specifies the number of time steps that the model is trained to predict, also called the prediction length. For example, if you configure a dataset for daily data collection (using the DataFrequency parameter of the CreateDataset operation) and set the forecast horizon to 10, the model returns predictions for 10 days. The maximum forecast horizon is the lesser of 500 time steps or 1/3 of the TARGET_TIME_SERIES dataset length.
- https://www.examtopics.com/discussions/amazon/view/45611-exam-aws-certified-machine-learning-specialty-topic-1/
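The two CreatePredictor variations discussed above, sketched with boto3; the ARNs and names are placeholders, and note that PerformAutoML and PerformHPO cannot both be set on the same call.

import boto3

forecast = boto3.client("forecast")

# Option D: keep DeepAR+ but let Forecast tune its hyperparameters.
forecast.create_predictor(
    PredictorName="inventory-deepar-hpo",
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
    PerformHPO=True,
    ForecastHorizon=1,                               # one month ahead
    FeaturizationConfig={"ForecastFrequency": "M"},  # monthly data
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/inventory"
    },
)

# Option A: let Forecast evaluate every supported algorithm and pick the best.
forecast.create_predictor(
    PredictorName="inventory-automl",
    PerformAutoML=True,
    ForecastHorizon=1,
    FeaturizationConfig={"ForecastFrequency": "M"},
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/inventory"
    },
)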
119
119 - A data scientist wants to use Amazon Forecast to build a forecasting model for inventory demand for a retail company. The company has provided a dataset of historic inventory demand for its products as a .csv file stored in an Amazon S3 bucket. The table below shows a sample of the dataset. How should the data scientist transform the data? [https://www.examtopics.com/assets/media/exam-media/04145/0007200001.png] - A.. Use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. Upload both datasets as .csv files to Amazon S3. B.. Use a Jupyter notebook in Amazon SageMaker to separate the dataset into a related time series dataset and an item metadata dataset. Upload both datasets as tables in Amazon Aurora. C.. Use AWS Batch jobs to separate the dataset into a target time series dataset, a related time series dataset, and an item metadata dataset. Upload them directly to Forecast from a local machine. D.. Use a Jupyter notebook in Amazon SageMaker to transform the data into the optimized protobuf recordIO format. Upload the dataset in this format to Amazon S3.
A - I would answer A. Target and metadata must be in two files and loaded from S3, based on the documentation: https://docs.aws.amazon.com/forecast/latest/dg/dataset-import-guidelines-troubleshooting.html

1. I cannot find any evidence supporting the separate-file definition. 2. A, B, and C all separate the datasets, so this explanation is weak.

Amazon Forecast requires the input data to be separated into a target time series dataset and an item metadata dataset. The target time series dataset should include the time series data you want to forecast, in this case inventory demand. The item metadata dataset should include the metadata that describes the items in the time series, such as product IDs, categories, and attributes. Therefore, the data scientist should use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset, and upload both as .csv files to Amazon S3, which is the expected storage location for Amazon Forecast input data. Thank you, ChatGPT.

I would go with A. Input formats for Forecast are JSON, CSV, and Parquet (this selects A and eliminates B, C, and D), and the data needs to be split into a target time series dataset and an item metadata dataset - target and metadata must be in two files.

A is correct, as it uses a dedicated transformation service (AWS Glue) and saves the output in storage that AWS Forecast can access. By default in ML, the storage option is Amazon S3 (unless the question specifies otherwise), so we discard B and C. D is discarded because the format Forecast expects is CSV, not protobuf recordIO. I would vote for A.

The answer is A. According to https://docs.aws.amazon.com/forecast/latest/dg/forecast.dg.pdf (page 51): target time series dataset - required: timestamp, item_id, demand; additional: lead_time. Item metadata dataset: item_id, category. You can find the same question with the picture at https://ccnav7.net/a-data-scientist-wants-to-use-amazon-forecast-to-build-a-forecasting-model-for-inventory-demand-for-a-retail-company/

The correct answer is A. "Forecast supports only the comma-separated values (CSV) file format. You can't separate values using tabs, spaces, colons, or any other characters. Guideline: Convert your dataset to CSV format (using only commas as your delimiter) and try importing the file again."

Lead time belongs in a related time series dataset, as it is not a target variable.
- https://www.examtopics.com/discussions/amazon/view/44077-exam-aws-certified-machine-learning-specialty-topic-1/
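A minimal sketch of the split described in answer A, shown with pandas for clarity (a Glue ETL job would do the same at scale); the column names follow the sample table, the S3 paths are assumptions, and reading/writing s3:// paths from pandas requires s3fs.

import pandas as pd

df = pd.read_csv("s3://retail-demand/raw/historic_demand.csv")

# Target time series: timestamp, item_id, and the value to forecast.
target_ts = df[["timestamp", "item_id", "demand"]]
target_ts.to_csv("s3://retail-demand/forecast/target_time_series.csv", index=False)

# Item metadata: static attributes describing each item.
# (Where lead_time goes - target vs. related time series - depends on the chosen schema.)
item_metadata = df[["item_id", "category"]].drop_duplicates("item_id")
item_metadata.to_csv("s3://retail-demand/forecast/item_metadata.csv", index=False)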
120
120 - A machine learning specialist is running an Amazon SageMaker endpoint using the built-in object detection algorithm on a P3 instance for real-time predictions in a company's production application. When evaluating the model's resource utilization, the specialist notices that the model is using only a fraction of the GPU. Which architecture changes would ensure that provisioned resources are being utilized effectively? - A.. Redeploy the model as a batch transform job on an M5 instance. B.. Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance. C.. Redeploy the model on a P3dn instance. D.. Deploy the model onto an Amazon Elastic Container Service (Amazon ECS) cluster using a P3 instance.
B - B is correct. Redeploy on a CPU instance and add Elastic Inference to reduce costs. See: https://aws.amazon.com/machine-learning/elastic-inference/

The Amazon EC2 M5 instance family is designed for general-purpose workloads and is CPU-only, so M5 instances do not come with GPUs. The only option that talks about optimizing the use of already provisioned resources is option D, so that must be the answer.

This solution (B) lets you use a more cost-effective instance type while leveraging Elastic Inference to provide the necessary GPU acceleration.

Redeploying the model on a P3dn instance is the best approach to ensure the provisioned GPU resources are utilized effectively.

My vote is B. Elastic Inference provides cheaper acceleration than a full GPU, works with M-class instances, and works with TensorFlow, MXNet, PyTorch, and the image classification and object detection algorithms. (Note that Elastic Inference has been deprecated since April 2023.)

B can reduce the cost and improve the resource utilization of your model, as Amazon Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to run inference workloads with a fraction of the compute resources. You can choose the right amount of inference acceleration that suits your needs and scale it up or down as needed.

Amazon Elastic Inference allows you to attach low-cost GPU-powered acceleration to EC2 and SageMaker instances to reduce the cost of running deep learning inference. You can choose any CPU instance that best suits the overall compute and memory needs of your application, and then separately configure the right amount of GPU-powered inference acceleration. This allows you to utilize resources efficiently and reduce costs.

Since the model uses only a fraction of the GPU, there is no need to keep a P3 as the base inference machine; a CPU instance with Elastic Inference is enough. Agreed with B. Appeared on the 12-Sep exam.

Answer: B. Explanation: https://aws.amazon.com/machine-learning/elastic-inference/

B: production inference mostly needs CPU instances with EI rather than full GPU machines.

B: Amazon Elastic Inference (EI) is a resource you can attach to your Amazon EC2 CPU instances to accelerate your deep learning (DL) inference workloads. Amazon EI accelerators come in multiple sizes and are a cost-effective way to build intelligent capabilities into applications running on Amazon EC2 instances.

B is correct.
- https://www.examtopics.com/discussions/amazon/view/44078-exam-aws-certified-machine-learning-specialty-topic-1/
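A sketch of answer B with the SageMaker Python SDK: redeploying on a CPU instance with an Elastic Inference accelerator attached. The model artifact, role, and instance/accelerator sizes are placeholder assumptions, and since Elastic Inference has been deprecated, on current accounts a right-sized GPU or Inferentia instance plays the same role.

import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()
container = image_uris.retrieve("object-detection", session.boto_region_name)

model = Model(
    image_uri=container,
    model_data="s3://my-bucket/model/model.tar.gz",   # existing trained artifact (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    sagemaker_session=session,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",          # CPU host instance
    accelerator_type="ml.eia2.medium",     # fractional GPU acceleration via Elastic Inference
    endpoint_name="object-detection-ei",
)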
121
121 - A data scientist uses an Amazon SageMaker notebook instance to conduct data exploration and analysis. This requires certain Python packages that are not natively available on Amazon SageMaker to be installed on the notebook instance. How can a machine learning specialist ensure that required packages are automatically available on the notebook instance for the data scientist to use? - A.. Install AWS Systems Manager Agent on the underlying Amazon EC2 instance and use Systems Manager Automation to execute the package installation commands. B.. Create a Jupyter notebook file (.ipynb) with cells containing the package installation commands to execute and place the file under the /etc/init directory of each Amazon SageMaker notebook instance. C.. Use the conda package manager from within the Jupyter notebook console to apply the necessary conda packages to the default kernel of the notebook. D.. Create an Amazon SageMaker lifecycle configuration with package installation commands and assign the lifecycle configuration to the notebook instance.
D - I would select D. See the AWS documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html

By excluding the wrong options: A - you might not have access to the underlying EC2 instance, so it is out; B - no automation, out; C - only the default kernel, which limits the data scientist, out. That leaves D.

Option D. https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html - install custom environments and kernels on the notebook instance's Amazon EBS volume. This ensures that they persist when you stop and restart the notebook instance, and that any external libraries you install are not updated by SageMaker. To do that, use a lifecycle configuration that includes both a script that runs when you create the notebook instance (on-create) and a script that runs each time you restart the notebook instance (on-start).

Even the link given suggests option D. Please ignore my previous comment; the answer is D.

The key word here is: how can the developer "guarantee" it? He guarantees it by including the install commands as part of the notebook. So, against the grain, I stand with B. Scratch that - the answer is D.

D should be the answer: https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html

D. "Automatically" is the key here, and that means using a lifecycle configuration: https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html

It is D, although it is not even the best approach in my opinion. Although by default conda packages are installed in ephemeral storage, you can change that default behaviour. I did that in my last project; we created our own conda environment that persisted between shutdowns.

Based on the reference given under the answer, it is D, not B. You can install packages using the following methods: (1) lifecycle configuration scripts; (2) notebooks - the %conda install and %pip install commands are supported; (3) the Jupyter terminal - you can install packages using pip and conda directly.

NOT B - /etc/init contains configuration files used by Upstart, not Jupyter notebooks. The answer is D; it is for sure D.

D. https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html
- https://www.examtopics.com/discussions/amazon/view/44080-exam-aws-certified-machine-learning-specialty-topic-1/
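A sketch of answer D with boto3: a lifecycle configuration whose on-start script installs the extra packages every time the notebook instance starts. The package names, configuration name, and instance settings are placeholder assumptions.

import base64
import boto3

sm = boto3.client("sagemaker")

on_start = """#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --quiet lifelines shap   # placeholder packages the data scientist needs
source /home/ec2-user/anaconda3/bin/deactivate
EOF
"""

sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-extra-packages",
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)

sm.create_notebook_instance(
    NotebookInstanceName="exploration-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    LifecycleConfigName="install-extra-packages",
)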
122
122 - A data scientist needs to identify fraudulent user accounts for a company's ecommerce platform. The company wants the ability to determine if a newly created account is associated with a previously known fraudulent user. The data scientist is using AWS Glue to cleanse the company's application logs during ingestion. Which strategy will allow the data scientist to identify fraudulent accounts? - A.. Execute the built-in FindDuplicates Amazon Athena query. B.. Create a FindMatches machine learning transform in AWS Glue. C.. Create an AWS Glue crawler to infer duplicate accounts in the source data. D.. Search for duplicate accounts in the AWS Glue Data Catalog.
B - B. You can use the FindMatches transform to find duplicate records in the source data. A labeling file is generated or provided to help teach the transform.

B it is. Reasonable explanation.

Option B, FindMatches.

It's B. Since Glue is already being used to cleanse the data, the easiest approach is to use Glue's FindMatches ML transform for this as well.

It is B. Agree. Please refer to: https://aws.amazon.com/blogs/big-data/integrate-and-deduplicate-datasets-using-aws-lake-formation-findmatches/
- https://www.examtopics.com/discussions/amazon/view/44079-exam-aws-certified-machine-learning-specialty-topic-1/
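A hedged sketch of answer B with boto3, creating a FindMatches ML transform over a catalog table produced during ingestion; the database, table, primary key column, and role are placeholder assumptions.

import boto3

glue = boto3.client("glue")

glue.create_ml_transform(
    Name="fraud-account-findmatches",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    InputRecordTables=[{
        "DatabaseName": "ecommerce_logs",
        "TableName": "user_accounts",
    }],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "account_id",
            "PrecisionRecallTradeoff": 0.9,   # lean toward recall to surface more fraud links
        },
    },
    GlueVersion="2.0",
    MaxCapacity=10.0,
)
# After labeling a sample of candidate matches, the trained transform can score
# newly created accounts against previously known fraudulent ones.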
123
123 - A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations. The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist needs to reduce the number of false negatives. Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.) [https://www.examtopics.com/assets/media/exam-media/04145/0007400001.png] - A.. Change the XGBoost eval_metric parameter to optimize based on Root Mean Square Error (RMSE). B.. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights. C.. Increase the XGBoost max_depth parameter because the model is currently underfitting the data. D.. Change the XGBoost eval_metric parameter to optimize based on Area Under the ROC Curve (AUC). E.. Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.
BD - B and D. Compensate for imbalance and optimize on AUC. This is a class imbalance problem, not an overfitting problem.

Totally right; overfitting has nothing to do with it, so there is no need to reduce tree depth. The question did not show model performance on the training data, so the overfitting reasoning does not hold.

A. NO - that will not address FN specifically; it affects FP as well. B. YES - changing the weight is best practice for class imbalance. C. NO - there is no sign of underfitting at 99.1% accuracy. D. YES - AUC will address recall, which takes the FN rate into account. E. NO - there is no sign of overfitting at 99.1% accuracy.

Step B: increasing the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights helps the model deal with the imbalanced dataset. According to the XGBoost documentation, this parameter controls the balance of positive and negative weights and is useful for unbalanced classes; a typical value to consider is sum(negative instances) / sum(positive instances). In this case, since there are 100 times more non-fraudulent transactions than fraudulent ones, setting scale_pos_weight to 100 makes the model more sensitive to the minority class and reduces false negatives. Step D: changing the XGBoost eval_metric parameter to optimize based on Area Under the ROC Curve (AUC) helps the model focus on improving the true positive rate and the true negative rate, which are both important for fraud detection.

I have some doubts about D and E. Precision-recall AUC is better than the ROC AUC for imbalanced classes. So I choose E.

Options A and E are unlikely to help reduce false negatives. Option C, increasing max_depth, may lead to overfitting, which could make the model worse. Option D, changing the eval_metric to optimize based on AUC, could help improve the model's ability to discriminate between the two classes. Option B, increasing the scale_pos_weight parameter to adjust the balance of positive and negative weights, helps the model handle imbalanced datasets, which is the case here. By increasing the weight of positive examples, the model learns to prioritize correctly classifying them, which should reduce the number of false negatives.

BD. I have done this before; it would be even better to use average precision (AP) instead of ROC AUC, but D is still better than the other answers. Appeared on the 12-Sep exam.

Compensate for imbalance and overfitting: B and E.

B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights is the correct answer. According to https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html, scale_pos_weight controls the balance of positive and negative weights and is useful for unbalanced classes.
- https://www.examtopics.com/discussions/amazon/view/75336-exam-aws-certified-machine-learning-specialty-topic-1/
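A sketch of answers B and D on the SageMaker built-in XGBoost container; the role, S3 paths, and container version are placeholder assumptions, and scale_pos_weight follows the usual negatives/positives heuristic (100,000 / 1,000 = 100).

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://fraud-models/output/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",        # option D: evaluate on AUC instead of the default error metric
    scale_pos_weight=100,     # option B: up-weight the rare fraudulent class
    num_round=200,
)
xgb.fit({
    "train": "s3://fraud-models/train/",
    "validation": "s3://fraud-models/validation/",
})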
124
124 - A data scientist has developed a machine learning translation model for English to Japanese by using Amazon SageMaker's built-in seq2seq algorithm with 500,000 aligned sentence pairs. While testing with sample sentences, the data scientist finds that the translation quality is reasonable for an example as short as five words. However, the quality becomes unacceptable if the sentence is 100 words long. Which action will resolve the problem? - A.. Change preprocessing to use n-grams. B.. Add more nodes to the recurrent neural network (RNN) than the largest sentence's word count. C.. Adjust hyperparameters related to the attention mechanism. D.. Choose a different weight initialization type.
C - I agree with an answer of C, the attention mechanism. The disadvantage of an encoder-decoder framework is that model performance decreases as the length of the source sequence increases, because of the limit on how much information the fixed-length encoded feature vector can contain. To tackle this problem, in 2015 Bahdanau et al. proposed the attention mechanism. With attention, the decoder tries to find the location in the encoder sequence where the most important information is located and uses that information, along with previously decoded words, to predict the next token in the sequence.

C. By tuning attention-related hyperparameters (such as attention type, attention layer size, and dropout), the model can focus on relevant parts of the input sequence during translation.

A. NO - n-grams are closer to the opposite; they capture local information. B. NO - it could help, but it is not the best option. C. YES - best practice (https://docs.aws.amazon.com/sagemaker/latest/dg/seq-2-seq-hyperparameters.html). D. NO - weight initialization does not address long sequences.

Action C: adjusting hyperparameters related to the attention mechanism can help improve translation quality for long sentences, because the attention mechanism allows the decoder to focus on the most relevant parts of the source sentence at each time step. The SageMaker seq2seq algorithm supports several attention variants, and the data scientist can experiment with the attention-related hyperparameters to find the optimal configuration for the translation task.

C. Adjust hyperparameters related to the attention mechanism. The seq2seq algorithm uses an attention mechanism to dynamically focus on relevant parts of the input sequence for each output element. Increasing the attention mechanism's ability to learn dependencies between long input and output sequences might improve translation quality for long sentences. The data scientist could try adjusting relevant hyperparameters such as attention depth or attention type, or try a different attention mechanism such as scaled dot-product attention, to see whether that improves translation quality for long sentences.

I go with C. Answer: C. Explanation: https://docs.aws.amazon.com/sagemaker/latest/dg/seq-2-seq-howitworks.html

This is such a niche question for a niche market, geared towards someone who specializes in NLP.

C is correct: https://docs.aws.amazon.com/sagemaker/latest/dg/seq-2-seq-howitworks.html

I believe the answer is C: https://docs.aws.amazon.com/sagemaker/latest/dg/seq-2-seq-howitworks.html
- https://www.examtopics.com/discussions/amazon/view/44186-exam-aws-certified-machine-learning-specialty-topic-1/
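A hedged sketch of option C with the built-in seq2seq container; the role, paths, and values are placeholders, and the attention-related hyperparameter names shown here are assumptions that should be checked against the seq2seq hyperparameter reference linked above.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    output_path="s3://translation-models/output/",
    sagemaker_session=session,
)
seq2seq.set_hyperparameters(
    rnn_attention_type="mlp",     # attention variant (assumed name/value)
    rnn_num_hidden=512,           # encoder/decoder hidden size
    max_seq_len_source=150,       # allow sentences well beyond 100 tokens
    max_seq_len_target=150,
)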
125
125 - A financial company is trying to detect credit card fraud. The company observed that, on average, 2% of credit card transactions were fraudulent. A data scientist trained a classifier on a year's worth of credit card transactions data. The model needs to identify the fraudulent transactions (positives) from the regular ones (negatives). The company's goal is to accurately capture as many positives as possible. Which metrics should the data scientist use to optimize the model? (Choose two.) - A.. Specificity B.. False positive rate C.. Accuracy D.. Area under the precision-recall curve E.. True positive rate
DE - D and E is the answer. We need to make the recall rate (not precision) high. To maximize detection of fraud in real-world, imbalanced datasets, D and E should always be applied. https://en.wikipedia.org/wiki/Sensitivity_and_specificity https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/ Note: true positive rate = sensitivity = recall.

That is not correct, unfortunately. Recall = sensitivity relates to false negatives, which are Type II errors; precision and specificity relate to false positives, which are Type I errors. I agree that in the real world you would focus on recall/sensitivity, i.e. reducing Type II errors. However, if the question wanted to reduce false positives, you would need to focus on precision and specificity, minimizing Type I errors.

Recall = sensitivity = TRUE POSITIVE RATE: https://www.split.io/glossary/false-positive-rate/

That reading is incorrect. The goal is to capture as many positives as possible, so false positives are not the concern. Suppose we have 100 samples, 2 positive and 98 negative, and two models: A has TP = 2, TN = 48, FP = 50, FN = 0; B has TP = 1, TN = 88, FP = 10, FN = 1. Model A has the higher false positive rate (50/98 vs. 10/98), yet we would choose A, since it captures all true positives.

I will go with D, E. "Accurately capture positives" means maximize TPR.

A. NO - specificity = TN / (TN + FP) is a measure of the negative cases. B. NO - FPR = FP / (FP + TN) measures false alarms, not captured positives. C. NO - given the class imbalance, overall accuracy would not help. D. YES - not sure we need it on top of E, but the other options are eliminated anyway. E. YES - TPR = TP / (TP + FN), which is what we want.

The data scientist should use the true positive rate and the area under the precision-recall curve to optimize the model. The true positive rate (TPR) is the proportion of actual positives that are correctly identified as such; it is also known as sensitivity or recall. In this case it is important to capture as many fraudulent transactions as possible, so the TPR should be maximized. The area under the precision-recall curve (AUPRC) measures how well the model distinguishes between positive and negative classes and is a good metric when the classes are imbalanced, as here where only 2% of transactions are fraudulent; it summarizes the trade-off between precision and recall across all possible thresholds. Accuracy and specificity are not good metrics when the classes are imbalanced because they can be misleading, and the false positive rate is also not a good choice because it does not reflect how many positives are captured.

Metric D: area under the precision-recall curve (AUPRC) is a good metric for imbalanced classification problems, where the positive class is much less frequent than the negative class. Precision is the proportion of positive predictions that are correct, and recall (or true positive rate) is the proportion of positive cases that are detected. AUPRC summarizes the trade-off between precision and recall for different decision thresholds, and a higher AUPRC means the model can achieve both high precision and high recall. Since the company's goal is to accurately capture as many positives as possible, AUPRC helps evaluate how well the model performs on the minority class. Metric E: true positive rate (TPR) is another good metric for imbalanced classification, as it measures the sensitivity of the model to the positive class. TPR is the same as recall: the proportion of positive cases detected by the model. A higher TPR means the model identifies more fraudulent transactions, which is the company's goal.

The goal is to accurately capture as many fraudulent transactions (positives) as possible. To optimize the model towards this goal, the data scientist should focus on metrics that emphasize the true positive rate and the area under the precision-recall curve. True positive rate (TPR or sensitivity) is the proportion of actual positive cases correctly identified as positive by the model; a higher TPR means more fraudulent transactions are being captured. The precision-recall curve shows the trade-off between precision and recall for different thresholds: precision is the fraction of correctly identified positives among all instances the model classified as positive, and recall (the true positive rate) is the fraction of positive instances correctly identified. A higher area under the precision-recall curve indicates fewer false positives and more true positives, which aligns with the company's goal.

Agreed with DE.

Why not A and D? Specificity shows us how the negatives are handled, and the PR AUC includes precision and recall, which cover TP / (TP + FP) and TP / (TP + FN). Answer: D & E (I meant to say D & E, not B & D).

I believe the answer is B & D, which together resemble F1; F1 combines precision and sensitivity.

D & E are the only choices that take false negatives into consideration. TPR is already included in the PR AUC, and TNR is not included in any option besides A. Recall and TPR: D and E.

AB is the answer.
- https://www.examtopics.com/discussions/amazon/view/44081-exam-aws-certified-machine-learning-specialty-topic-1/
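A small worked example of the two chosen metrics with scikit-learn; the labels and scores are made-up toy values, with recall_score standing in for the true positive rate and average_precision_score approximating the area under the precision-recall curve.

from sklearn.metrics import average_precision_score, recall_score

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                        # imbalanced toy labels
y_score = [0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.2, 0.6, 0.7, 0.35]  # model scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]                # thresholded predictions

recall = recall_score(y_true, y_pred)                 # true positive rate (option E)
pr_auc = average_precision_score(y_true, y_score)     # area under the PR curve (option D)
print(f"TPR/recall = {recall:.2f}, PR AUC = {pr_auc:.2f}")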
126
126 - A machine learning specialist is developing a proof of concept for government users whose primary concern is security. The specialist is using Amazon SageMaker to train a convolutional neural network (CNN) model for a photo classifier application. The specialist wants to protect the data so that it cannot be accessed and transferred to a remote host by malicious code accidentally installed on the training container. Which action will provide the MOST secure protection? - A.. Remove Amazon S3 access permissions from the SageMaker execution role. B.. Encrypt the weights of the CNN model. C.. Encrypt the training and validation dataset. D.. Enable network isolation for training jobs.
D - I will go with D, "cannot be accessed and transferred to a remote host by malicious code accidentally installed on the training container". Based on the following link: https://aws.amazon.com/blogs/security/secure-deployment-of-amazon-sagemaker-resources/ "EnableNetworkIsolation – Set this to true when creating training, hyperparameter tuning, and inference jobs to prevent situations like malicious code being accidentally installed and transferring data to a remote host." If you enable network isolation, the containers can't make any outbound network calls, even to other AWS services such as Amazon S3. Additionally, no AWS credentials are made available to the container runtime environment. In the case of a training job with multiple instances, network inbound and outbound traffic is limited to the peers of each training container. SageMaker still performs download and upload operations against Amazon S3 using your SageMaker execution role in isolation from the training or inference container. Ahaha, this link literally contains the answer: "For example, a malicious user or code that you accidentally install on the container (in the form of a publicly available source code library) could access your data and transfer it to a remote host." 'Network isolation' makes sense. A. NO - Removing Amazon S3 access permissions from the SageMaker execution role would also block legitimate access to the training data. B. NO - Encrypting the weights has nothing to do with protecting the training data C. NO - If the dataset is encrypted, one may still hack the SageMaker instance and get access to unencrypted data D. YES - Enable network isolation for training jobs, data is protected end-to-end Network isolation It's D, not C, because encrypted data can still be stolen. I choose D. More documentation about it: https://docs.aws.amazon.com/sagemaker/latest/dg/mkt-algo-model-internet-free.html Answer is D. https://aws.amazon.com/blogs/security/secure-deployment-of-amazon-sagemaker-resources/ search for 'isolation' and there is a security parameter, EnableNetworkIsolation, talking about this. I would choose C. Most likely it is C. https://docs.aws.amazon.com/sagemaker/latest/dg/data-protection.html Incorrect; you CAN transfer encrypted files even w/o a key. D is a better option - https://www.examtopics.com/discussions/amazon/view/46341-exam-aws-certified-machine-learning-specialty-topic-1/
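For reference, a minimal sketch (boto3; the image URI, role ARN, and bucket names are placeholders) of how the EnableNetworkIsolation flag discussed above is set when creating a training job:

```python
# Training job with network isolation: the container cannot make outbound calls.
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="cnn-photo-classifier-isolated",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/cnn:latest",  # placeholder
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    InputDataConfig=[{
        "ChannelName": "training",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/train/",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/output/"},
    ResourceConfig={"InstanceType": "ml.p3.2xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    EnableNetworkIsolation=True,  # the setting discussed above
)
```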
127
127 - A medical imaging company wants to train a computer vision model to detect areas of concern on patients' CT scans. The company has a large collection of unlabeled CT scans that are linked to each patient and stored in an Amazon S3 bucket. The scans must be accessible to authorized users only. A machine learning engineer needs to build a labeling pipeline. Which set of steps should the engineer take to build the labeling pipeline with the LEAST effort? - A.. Create a workforce with AWS Identity and Access Management (IAM). Build a labeling tool on Amazon EC2. Queue images for labeling by using Amazon Simple Queue Service (Amazon SQS). Write the labeling instructions. B.. Create an Amazon Mechanical Turk workforce and manifest file. Create a labeling job by using the built-in image classification task type in Amazon SageMaker Ground Truth. Write the labeling instructions. C.. Create a private workforce and manifest file. Create a labeling job by using the built-in bounding box task type in Amazon SageMaker Ground Truth. Write the labeling instructions. D.. Create a workforce with Amazon Cognito. Build a labeling web application with AWS Amplify. Build a labeling workflow backend using AWS Lambda. Write the labeling instructions.
C - I would answer C, because of the requirement that authorized users should only have access. These users will comprise the private workforce of AWS Ground Truth. See documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private.html Agree C Yes it is C agree C Answer is C. The question mentions that "to detect *areas* of concern on patients' CT scans", that can be achieved by bounding box instead of image classification. bounding box: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-bounding-box.html image classification: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-image-classification.html The concern here is not about "object detection" or "image classification". it is about using "ground truth" and "private workforce" The principal key is that Mechanical Turk workforce does not ensure privacy of the CT Scans, and Ground Truth does. This option would allow the medical imaging company to create a private workforce, which can ensure that only authorized users have access to the scans, and to use Amazon SageMaker Ground Truth to create a labeling job, which would simplify the labeling pipeline process. 12-sep exam C - GroundTruth and privacy concerns - https://www.examtopics.com/discussions/amazon/view/44091-exam-aws-certified-machine-learning-specialty-topic-1/
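As an illustration of answer C, a minimal sketch (boto3; every ARN, bucket, and template path is a placeholder, and the pre/post-annotation Lambda ARNs for the built-in bounding box task type are Region-specific values documented by AWS) of a Ground Truth labeling job backed by a private workteam:

```python
# Ground Truth labeling job: private workforce + bounding box task.
import boto3

sm = boto3.client("sagemaker")
sm.create_labeling_job(
    LabelingJobName="ct-scan-areas-of-concern",
    LabelAttributeName="areas-of-concern",
    InputConfig={"DataSource": {"S3DataSource": {
        # Manifest is JSON Lines, one {"source-ref": "s3://..."} object per image.
        "ManifestS3Uri": "s3://example-bucket/manifests/ct-scans.manifest",
    }}},
    OutputConfig={"S3OutputPath": "s3://example-bucket/labels/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",  # placeholder
    LabelCategoryConfigS3Uri="s3://example-bucket/labels/categories.json",
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/radiologists",  # private workforce
        "UiConfig": {"UiTemplateS3Uri": "s3://example-bucket/templates/bounding-box.liquid.html"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox",       # Region-specific
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-BoundingBox",  # Region-specific
        },
        "TaskTitle": "Draw boxes around areas of concern",
        "TaskDescription": "Draw a bounding box around each area of concern in the CT scan.",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 600,
    },
)
```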
128
128 - A company is using Amazon Textract to extract textual data from thousands of scanned text-heavy legal documents daily. The company uses this information to process loan applications automatically. Some of the documents fail business validation and are returned to human reviewers, who investigate the errors. This activity increases the time to process the loan applications. What should the company do to reduce the processing time of loan applications? - A.. Configure Amazon Textract to route low-confidence predictions to Amazon SageMaker Ground Truth. Perform a manual review on those words before performing a business validation. B.. Use an Amazon Textract synchronous operation instead of an asynchronous operation. C.. Configure Amazon Textract to route low-confidence predictions to Amazon Augmented AI (Amazon A2I). Perform a manual review on those words before performing a business validation. D.. Use Amazon Rekognition's feature to detect text in an image to extract the data from scanned images. Use this information to process the loan applications.
C - I agree with C, given we are evaluating model inferences (predictions). See https://aws.amazon.com/augmented-ai/ and https://aws.amazon.com/blogs/machine-learning/automated-monitoring-of-your-machine-learning-models-with-amazon-sagemaker-model-monitor-and-sending-predictions-to-human-review-workflows-using-amazon-a2i/ Yeap, it literally says it there: "Loan or mortgage applications, tax forms, and many other financial documents contain millions of data points which need to be processed and extracted quickly and effectively. Using Amazon Textract and Amazon A2I you can extract critical data from these forms." Why not A? The difference lies in the function of these two services. Ground Truth is typically used for "labeling" text or images: if the service cannot automatically label the data, it sends the data to Ground Truth and waits for a human to label it. But A2I is for validating predictions: the model has already predicted the results, and humans then review them. https://docs.aws.amazon.com/textract/latest/dg/a2i-textract.html https://aws.amazon.com/blogs/machine-learning/using-amazon-textract-with-amazon-augmented-ai-for-processing-critical-documents/ I think Ground Truth can do the same task instead of Augmented AI. The answer is C, Augmented AI is made for reviewing ML predictions! By routing the low-confidence predictions to Amazon Augmented AI, the company can reduce the time to process the loan applications by leveraging human intelligence to review and validate the predictions. This way, the company can quickly address any errors or mistakes that Amazon Textract might make, reducing the time to process loan applications. Correct is C - https://www.examtopics.com/discussions/amazon/view/44092-exam-aws-certified-machine-learning-specialty-topic-1/
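A minimal sketch of answer C (boto3; the flow definition ARN and bucket/key are placeholders): calling Textract with an Amazon A2I human-loop configuration so that low-confidence extractions can be routed to human reviewers before business validation:

```python
# Textract + A2I: low-confidence results trigger a human review loop.
import boto3

textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "example-bucket", "Name": "loan-app-001.png"}},
    FeatureTypes=["FORMS"],
    HumanLoopConfig={
        "HumanLoopName": "loan-app-001-review",
        "FlowDefinitionArn": "arn:aws:sagemaker:us-east-1:123456789012:flow-definition/loan-review",  # placeholder
        "DataAttributes": {"ContentClassifiers": ["FreeOfPersonallyIdentifiableInformation"]},
    },
)
# If the flow definition's activation conditions are met (e.g. low confidence),
# response["HumanLoopActivationOutput"] describes the human loop that was started.
```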
129
129 - A company ingests machine learning (ML) data from web advertising clicks into an Amazon S3 data lake. Click data is added to an Amazon Kinesis data stream by using the Kinesis Producer Library (KPL). The data is loaded into the S3 data lake from the data stream by using an Amazon Kinesis Data Firehose delivery stream. As the data volume increases, an ML specialist notices that the rate of data ingested into Amazon S3 is relatively constant. There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest. Which next step is MOST likely to improve the data ingestion rate into Amazon S3? - A.. Increase the number of S3 prefixes for the delivery stream to write to. B.. Decrease the retention period for the data stream. C.. Increase the number of shards for the data stream. D.. Add more consumers using the Kinesis Client Library (KCL).
C - C is the correct answer. The # of shards is determined by: 1. # of transactions per second, times 2. data blob size, e.g. 100 KB, 3. one shard can ingest 1 MB/second. The answer should be A - the reason why shards are not the right answer is the lack of ProvisionedThroughputExceeded exceptions that occur when a KDS has too few shards. The scenario talks about a consistent pace of delivery into S3 and a rising backlog of data (which indicates the KDS stream is still able to ingest data) in the stream, hence the S3 write limit per prefix is at fault: https://www.amazonaws.cn/en/kinesis/data-streams/faqs/ https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html From https://aws.amazon.com/kinesis/data-firehose/faqs/?nc1=h_ls Q: How often does Kinesis Data Firehose read data from my Kinesis stream? A: Kinesis Data Firehose calls Kinesis Data Streams GetRecords() once every second for each Kinesis shard. // and the number of records per GetRecords() is at most 10,000 => having n shards you will get at most 10,000n records to Firehose per sec => hence Firehose instead of S3 could be the limiting factor => I'd also go with increasing shards as the first choice (so as not to have to change the S3 consumers). Shards are a concept in Kinesis Data Streams. But here the question mentions "There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest". So even Firehose has a large backlog, which means the limit comes from S3. So A. The bottleneck is not at data ingestion (i.e., Kinesis shards), but in writes to S3, whose throughput is bound by the prefixes used. A is not solving the issue; the bottleneck is not in S3 but in the KDS, so we should solve the problem at the KDS, i.e., the shards. I think the question is very ambiguous. "There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest" suggests the backlog is on the client side (even before reaching KDS). Any component down the chain can be a bottleneck (KDS shard, Firehose, S3). There is just no way to know in my opinion, but increasing shards is certainly the easiest to try without impacting the storage structure in S3 and possibly breaking the app. This is the key phrase to solve this problem: "There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest", so increasing the shards to ingest is the solution. No. of shards. A is not correct, because "There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest"; the backlog is not caused by S3 performance but by the shard issue. To increase ingest. The increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose indicates that the ingestion rate is slower than the data production rate. Therefore, the next step to improve the data ingestion rate into Amazon S3 is to increase the capacity of Kinesis Data Streams by increasing the number of shards. This will increase the parallelism of data processing, allowing for a higher throughput rate. Option C is the correct answer. Option A is incorrect because increasing the number of S3 prefixes for the delivery stream will not directly affect the ingestion rate into S3. 
To improve the data ingestion rate into Amazon S3, the ML specialist should consider increasing the number of shards for the Kinesis data stream. A Kinesis data stream is made up of one or more shards, and each shard provides a fixed amount of capacity for ingesting and storing data. By increasing the number of shards, the specialist can increase the overall capacity of the data stream and improve the rate at which data is ingested. C is the correct answer. Clearly S3 is a bottleneck. S3 scales performance in parallel across prefixes, thus increasing throughput. It seems S3 is the bottleneck. Adding more prefixes will help: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html 12-sep exam The question seems to indicate the problem is in the ability of S3 to load the data. Therefore, I think the answer is A. https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html - https://www.examtopics.com/discussions/amazon/view/45615-exam-aws-certified-machine-learning-specialty-topic-1/
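If the shard count is indeed the bottleneck (answer C), the resharding itself is a single call. A minimal sketch (boto3; the stream name and target count are placeholders; Kinesis limits how far the shard count can change in one operation):

```python
# Increase the shard count of the Kinesis data stream.
import boto3

kinesis = boto3.client("kinesis")
kinesis.update_shard_count(
    StreamName="click-data-stream",   # placeholder
    TargetShardCount=8,               # must stay within Kinesis scaling limits for a single call
    ScalingType="UNIFORM_SCALING",
)
```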
130
130 - A data scientist must build a custom recommendation model in Amazon SageMaker for an online retail company. Due to the nature of the company's products, customers buy only 4-5 products every 5-10 years. So, the company relies on a steady stream of new customers. When a new customer signs up, the company collects data on the customer's preferences. Below is a sample of the data available to the data scientist. How should the data scientist split the dataset into a training and test set for this use case? [https://www.examtopics.com/assets/media/exam-media/04145/0007800001.png] - A.. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set. B.. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set. C.. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set. D.. Randomly select 10% of the users. Split off all interaction data from these users for the test set.
B - I would select B, straight from this AWS example: https://aws.amazon.com/blogs/machine-learning/building-a-customized-recommender-system-in-amazon-sagemaker/ The blog didn't mention anything about sample selection. How was B arrived at? I think the answer is D, because customers buy only 4-5 products every 5-10 years, so it doesn't make sense to get 10% of interactions for each user as a test set. Yes, agree. Answer should be D. B. Recommendation should use the historical data to predict future actions. B is using the older records to predict the newer records. D is using 90% of users to predict the other 10%; the 90% are irrelevant to the other 10%. There is no difference between A and D, so I prefer B as the answer. The best way to split the dataset into a training and test set for this use case is to randomly select 10% of the users and split off all interaction data from these users for the test set. This is because the company relies on a steady stream of new customers, so the test set should reflect the behavior of new customers who have not been seen by the model before. The other options are not suitable because they either mix old and new customers in the test set (A and B), or they bias the test set towards users with less interaction data. References: Amazon SageMaker Developer Guide: Train and Test Datasets Amazon Personalize Developer Guide: Preparing and Importing Data If the primary concern is to evaluate the model's performance on completely new users, then option D would be more appropriate. I'd also take time into consideration, since even for such long-lived products there might be trends or regulations or whatever that make customers prefer one over the other. => A, D are out. C will not give you a test set of the desired size => out => B. If the primary concern is to evaluate the model's performance on completely new users (which seems to be the case for the company in question), then option D would be more appropriate. I would choose D. According to the question, because of the product nature, the company doesn't rely on customer-product historical interactions for recommendations. It relies on customer explicit preferences, which are gathered on the first sign-up. The company wants to make recommendations for these new users. It is the main source of revenue for the company. To conduct thorough testing the company needs to simulate the new users, not existing ones. To do it we need to randomly choose some percentage of users and remove all of their transactions from the train set. And use their transactions only in test. By selecting the most recent interactions for each user, you are simulating the scenario of having new customers in your test set. This method allows you to assess how well the model generalizes to both existing and new users. A. NO - the data is denormalized and users' preferences are present in multiple rows in the interactions; if we split off interactions, we introduce leakage as the same user will be present in train & test B. NO - the data is denormalized and users' preferences are present in multiple rows in the interactions; if we split off based on the interaction, we introduce leakage as the same user will be present in train & test C. NO - bias D. YES - no bias and user based A NO introduces a bias in the training set (old interactions) vs. 
test set (new interactions) C NO will have a very sparse test set B NO the same user will be present in the training and test set; we want a user-based model, not an interaction-based one, so a user should belong to only one set D YES - last remaining option. Changing to B Between B and D but the issue is 4-5 transaction every 5-10 years. Hence last 10% transaction is difficult. So going for D I would select B as it is time series data. Order might be important. So for each user, last 10% of transactions ordered by date could be a good answer. You want different users in training and in testing datasets, which is C or D. In addition, B is wrong since you cannot take 10% of 4-5 transactions per customer. Actually, between B, C and D, only in D you can get exactly 10%. This method is appropriate because it takes into account the unique buying behavior of each customer and is likely to reflect the latest preferences of the customer. It ensures that the test set contains a representative sample of the most recent customer preferences, which is important in this use case where customer preferences change infrequently over time. B makes the most business sense. Since customers buy products every 4-5 years, it makes sense to be able to predict future sales from really old data. splitting the test set to be only recent interactions is the best way to test model performance from historically 'recent' data - https://www.examtopics.com/discussions/amazon/view/44093-exam-aws-certified-machine-learning-specialty-topic-1/
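A minimal sketch of the user-based split in answer D (pandas; the file and column names are assumptions based on the sample data): hold out all interactions for a random 10% of users so the test set simulates unseen new customers:

```python
# Hold out 10% of users (all of their interactions) as the test set.
import pandas as pd

df = pd.read_csv("interactions.csv")                 # assumed file with a user_id column
users = df["user_id"].drop_duplicates().sample(frac=1.0, random_state=42)  # shuffled unique users
n_test_users = max(1, int(0.10 * len(users)))
test_users = set(users.iloc[:n_test_users])

test_df = df[df["user_id"].isin(test_users)]         # unseen "new" users
train_df = df[~df["user_id"].isin(test_users)]       # everyone else
```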
131
131 - A financial services company wants to adopt Amazon SageMaker as its default data science environment. The company's data scientists run machine learning (ML) models on confidential financial data. The company is worried about data egress and wants an ML engineer to secure the environment. Which mechanisms can the ML engineer use to control data egress from SageMaker? (Choose three.) - A.. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink. B.. Use SCPs to restrict access to SageMaker. C.. Disable root access on the SageMaker notebook instances. D.. Enable network isolation for training jobs and models. E.. Restrict notebook presigned URLs to specific IPs used by the company. F.. Protect data with encryption at rest and in transit. Use AWS Key Management Service (AWS KMS) to manage encryption keys.
ADE - ADF - the concepts in ADF are explained in detail in the official Amazon Exam Readiness course: AWS Certified Machine Learning - Specialty. Amazon's official materials do not mention the other concepts in BCE. ADE for sure. F is for encryption and not data egress. I agree with ADF. SCP is to control access to a service, it's not related to securing data. As per the official document, there are only 4 ways to control data egress: enforcing deployment in a VPC, enforcing network isolation, restricting notebook pre-signed URLs to IPs, and disabling internet access. Correct Ans - ADE. Read the Controlling data egress section. Link - https://aws.amazon.com/blogs/machine-learning/millennium-management-secure-machine-learning-using-amazon-sagemaker/ ADF: correct; BCE: incorrect. https://aws.amazon.com/blogs/machine-learning/millennium-management-secure-machine-learning-using-amazon-sagemaker/ Correct Choices and Reasoning: A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink: Keeps traffic within the VPC. B. Use SCPs to restrict access to SageMaker: Limits authorized actions and services. D. Enable network isolation for training jobs and models: Prevents network access during training and inference. Therefore, the three mechanisms that the ML engineer can use to control data egress from SageMaker are A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink, B. Use SCPs to restrict access to SageMaker, and D. Enable network isolation for training jobs and models. A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink: PrivateLink ensures that communication between SageMaker and other AWS services happens entirely within the AWS network, avoiding exposure to the public internet. This reduces the risk of unintended data egress. D. Enable network isolation for training jobs and models: Enabling network isolation ensures that containers used for training jobs and models cannot make outbound network connections. This prevents accidental or malicious data egress. F. Protect data with encryption at rest and in transit. Use AWS Key Management Service (AWS KMS) to manage encryption keys: Encrypting data ensures its security even if it is inadvertently accessed or stored improperly. KMS allows centralized and secure management of encryption keys. F - it takes care of data sitting in the SageMaker environment, which is encrypted, but E ensures that the services or their resources cannot be accessed outside of the allowed IPs. My vote for ADF. I changed my selection; it is truly ADE. I read the link provided by rahulw230: A = VPC endpoints are a well-known safety mechanism in SageMaker so traffic doesn’t leave AWS B = a service control policy can restrict access at org level D = Network isolation limits training model access only to S3 To control data egress from SageMaker, the ML engineer can use the following mechanisms: Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink. This allows the ML engineer to access SageMaker services and resources without exposing the traffic to the public internet. This reduces the risk of data leakage and unauthorized access. Enable network isolation for training jobs and models. The question is flawed; A, B, E and D are all valid to a point. 
The more I see it, the more likely I will go with ABD, the only answers that address the data egress issue. For those who are sure that it is E, please explain how you can use pre-signed URLs to restrict IPs; from my understanding it is time-based access to your S3 objects. You can use policies to control access, like an SCP (Service Control Policy). Isolation is definitely one option, so that leaves F (encrypting in transit and encrypting objects) as the only possible remaining option, i.e. BDF. A and D are for sure. The challenge is between E and F. E restricts access to the notebook, hence indirectly controls who accesses it and can access data, but encrypting the data is a more direct way to protect against egress of the data; hence leaning more towards F. Not F, because the question is "to control data egress", and F (encryption) is not egress control. A, D, F are the mechanisms that the ML engineer can use to control data egress from SageMaker. B, C, and E do not directly control data egress from SageMaker. SCPs restrict access to AWS services, disabling root access on the SageMaker notebook instances improves security, and restricting notebook presigned URLs to specific IPs used by the company adds another layer of security, but none of these mechanisms control data egress from SageMaker. https://aws.amazon.com/blogs/machine-learning/millennium-management-secure-machine-learning-using-amazon-sagemaker/ These are correct - https://www.examtopics.com/discussions/amazon/view/44155-exam-aws-certified-machine-learning-specialty-topic-1/
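As an illustration of mechanism E debated above, a minimal sketch (boto3; the role name and CIDR range are placeholders) of an IAM policy that denies creation of notebook presigned URLs from outside the company's IP ranges:

```python
# Deny sagemaker:CreatePresignedNotebookInstanceUrl unless the request
# comes from the company's IP range.
import json
import boto3

deny_outside_ips = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "sagemaker:CreatePresignedNotebookInstanceUrl",
        "Resource": "*",
        "Condition": {"NotIpAddress": {"aws:SourceIp": ["203.0.113.0/24"]}},  # placeholder CIDR
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="DataScientistRole",                      # placeholder
    PolicyName="RestrictNotebookPresignedUrlByIp",
    PolicyDocument=json.dumps(deny_outside_ips),
)
```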
132
132 - A company needs to quickly make sense of a large amount of data and gain insight from it. The data is in different formats, the schemas change frequently, and new data sources are added regularly. The company wants to use AWS services to explore multiple data sources, suggest schemas, and enrich and transform the data. The solution should require the least possible coding effort for the data flows and the least possible infrastructure management. Which combination of AWS services will meet these requirements? A. ✑ Amazon EMR for data discovery, enrichment, and transformation ✑ Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL ✑ Amazon QuickSight for reporting and getting insights B. ✑ Amazon Kinesis Data Analytics for data ingestion ✑ Amazon EMR for data discovery, enrichment, and transformation ✑ Amazon Redshift for querying and analyzing the results in Amazon S3 C. ✑ AWS Glue for data discovery, enrichment, and transformation ✑ Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL ✑ Amazon QuickSight for reporting and getting insights D. ✑ AWS Data Pipeline for data transfer ✑ AWS Step Functions for orchestrating AWS Lambda jobs for data discovery, enrichment, and transformation ✑ Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL ✑ Amazon QuickSight for reporting and getting insights -
A - I would choose C. Answer here is C. Glue, Athena and QuickSight are serverless and need little code (only SQL). C is the right answer. Answer is C, all serverless. I would choose C as well. In the presence of AWS Glue, with a goal to minimise coding effort, C is the correct answer. I also chose C. A has the code/infra overhead of EMR. B is wrong b/c you don't query S3 with Redshift. D has the overhead of orchestrating Lambda jobs with Step Functions. AWS Glue Crawler for data discovery. C is correct. The correct answer is C. Why no voting option? It is option C. C is correct. I think it's C. Answer is C, all serverless. The answer is C, as the data sources vary a lot, so it requires a Glue crawler. C is correct, you use Glue for ingestion. Option C is the most suitable choice to meet the given requirements. AWS Glue is a fully managed extract, transform, and load (ETL) service that allows users to discover, enrich, and transform data easily, without the need for extensive coding. It supports different data sources, schema detection, and schema evolution, which makes it an ideal choice for the given scenario. Amazon Athena, a serverless interactive query service, allows users to run standard SQL queries against data stored in Amazon S3, which makes it easy to analyze the enriched and transformed data. Amazon QuickSight is a cloud-based business intelligence service that can connect to various data sources, including Amazon Athena, to create interactive dashboards and reports, which makes it a suitable choice for gaining insights from the data. Option A is not an ideal choice because Amazon EMR is a heavy-weight service and requires more infrastructure management than AWS Glue. - https://www.examtopics.com/discussions/amazon/view/74067-exam-aws-certified-machine-learning-specialty-topic-1/
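A minimal sketch of the serverless flow in answer C (boto3; database, role, bucket, and table names are placeholders): a Glue crawler discovers and suggests the schema, then Athena queries the resulting Data Catalog table (QuickSight can then connect to Athena for reporting):

```python
# Glue crawler for schema discovery, then an Athena query over the catalog table.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-datalake/raw/"}]},
)
glue.start_crawler(Name="clickstream-crawler")

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT source, COUNT(*) AS events FROM analytics.raw GROUP BY source",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-datalake/athena-results/"},
)
```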
133
133 - A company is converting a large number of unstructured paper receipts into images. The company wants to create a model based on natural language processing (NLP) to find relevant entities such as date, location, and notes, as well as some custom entities such as receipt numbers. The company is using optical character recognition (OCR) to extract text for data labeling. However, documents are in different structures and formats, and the company is facing challenges with setting up the manual workflows for each document type. Additionally, the company trained a named entity recognition (NER) model for custom entity detection using a small sample size. This model has a very low confidence score and will require retraining with a large dataset. Which solution for text extraction and entity detection will require the LEAST amount of effort? - A.. Extract text from receipt images by using Amazon Textract. Use the Amazon SageMaker BlazingText algorithm to train on the text for entities and custom entities. B.. Extract text from receipt images by using a deep learning OCR model from the AWS Marketplace. Use the NER deep learning model to extract entities. C.. Extract text from receipt images by using Amazon Textract. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection. D.. Extract text from receipt images by using a deep learning OCR model from the AWS Marketplace. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection.
C - C is the correct answer. You definitely need the Amazon Textract service, which eliminates options B & D. Between A & C, Comprehend will be quicker. Textract and Comprehend will do the job. Keywords: 'Amazon Textract' and 'Amazon Comprehend'. C indeed, due to least effort. C is correct. I think C. I go for C. C is the best answer; Textract is to extract data from documents and Comprehend to understand the content, objective, or origin of a file. C is correct, you can extract Entity information easily with Comprehend. https://aws.amazon.com/comprehend/features/ - https://www.examtopics.com/discussions/amazon/view/76354-exam-aws-certified-machine-learning-specialty-topic-1/
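A minimal sketch of answer C (boto3; the bucket/key and the custom entity recognizer endpoint ARN are placeholders): Textract for OCR, Comprehend for built-in entities, and a Comprehend custom entity recognizer endpoint for custom entities such as receipt numbers:

```python
# OCR with Textract, then entity detection with Comprehend (built-in + custom).
import boto3

textract = boto3.client("textract")
ocr = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "example-bucket", "Name": "receipts/r-0001.png"}}
)
text = " ".join(b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE")

comprehend = boto3.client("comprehend")
builtin = comprehend.detect_entities(Text=text, LanguageCode="en")   # DATE, LOCATION, ...
custom = comprehend.detect_entities(
    Text=text,
    EndpointArn="arn:aws:comprehend:us-east-1:123456789012:entity-recognizer-endpoint/receipts",  # placeholder
)
```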
134
134 - A company is building a predictive maintenance model based on machine learning (ML). The data is stored in a fully private Amazon S3 bucket that is encrypted at rest with AWS Key Management Service (AWS KMS) CMKs. An ML specialist must run data preprocessing by using an Amazon SageMaker Processing job that is triggered from code in an Amazon SageMaker notebook. The job should read data from Amazon S3, process it, and upload it back to the same S3 bucket. The preprocessing code is stored in a container image in Amazon Elastic Container Registry (Amazon ECR). The ML specialist needs to grant permissions to ensure a smooth data preprocessing workflow. Which set of actions should the ML specialist take to meet these requirements? - A.. Create an IAM role that has permissions to create Amazon SageMaker Processing jobs, S3 read and write access to the relevant S3 bucket, and appropriate KMS and ECR permissions. Attach the role to the SageMaker notebook instance. Create an Amazon SageMaker Processing job from the notebook. B.. Create an IAM role that has permissions to create Amazon SageMaker Processing jobs. Attach the role to the SageMaker notebook instance. Create an Amazon SageMaker Processing job with an IAM role that has read and write permissions to the relevant S3 bucket, and appropriate KMS and ECR permissions. C.. Create an IAM role that has permissions to create Amazon SageMaker Processing jobs and to access Amazon ECR. Attach the role to the SageMaker notebook instance. Set up both an S3 endpoint and a KMS endpoint in the default VPC. Create Amazon SageMaker Processing jobs from the notebook. D.. Create an IAM role that has permissions to create Amazon SageMaker Processing jobs. Attach the role to the SageMaker notebook instance. Set up an S3 endpoint in the default VPC. Create Amazon SageMaker Processing jobs with the access key and secret key of the IAM user with appropriate KMS and ECR permissions.
B - A; the IAM role assigned to the SageMaker notebook instance can be passed to other SageMaker jobs like training, processing, AutoML, etc. Why should the IAM permission be assigned to create S3, when the data is already stored in S3? It only requires permission to read and write data in S3. I believe A is incorrect. Checked this on Copilot. I changed my mind, it's 'B'. Option B is generally better because it provides a more secure and controlled approach to managing permissions. By separating the roles, you can ensure that the SageMaker notebook instance has only the permissions it needs to create processing jobs, while the processing job itself has the specific permissions required to access the S3 bucket, KMS, and ECR. This separation of duties enhances security and minimizes the risk of over-permissioning any single role. A is not correct; for safety and the principle of least privilege, you should decouple the role of each service. Based on answers from here, Option A: The IAM role is created with the necessary permissions to create Amazon SageMaker Processing jobs, read and write data to the relevant S3 bucket, and access the KMS CMKs and ECR container image. The IAM role is attached to the SageMaker notebook instance, which allows the notebook to assume the role and create the Amazon SageMaker Processing job with the necessary permissions. The Amazon SageMaker Processing job is created from the notebook, which ensures that the job has the necessary permissions to read data from S3, process it, and upload it back to the same S3 bucket. Option B is close, but it's not entirely correct. It mentions creating an IAM role with permissions to create Amazon SageMaker Processing jobs, but it doesn't mention attaching the role to the SageMaker notebook instance. This is a crucial step, as it allows the notebook to assume the role and create the Amazon SageMaker Processing job with the necessary permissions. The correct solution for granting permissions for data preprocessing is to use the following steps: Create an IAM role that has permissions to create Amazon SageMaker Processing jobs. Attach the role to the SageMaker notebook instance. This role allows the ML specialist to run Processing jobs from the notebook code. Create an Amazon SageMaker Processing job with an IAM role that has read and write permissions to the relevant S3 bucket, and appropriate KMS and ECR permissions. This role allows the Processing job to access the data in the encrypted S3 bucket, decrypt it with the KMS CMK, and pull the container image from ECR. The other options are incorrect because they either miss some permissions or use unnecessary steps. For example: least privilege. A. Create an IAM role with S3, KMS, ECR permissions and SageMaker Processing job creation permissions. Attach it to the SageMaker notebook instance: This option seems comprehensive as it includes all necessary permissions. However, attaching this role directly to the SageMaker notebook instance would not be sufficient for the Processing job itself. The Processing job needs its own role with appropriate permissions. B. Create two IAM roles: one for the SageMaker notebook with permissions to create Processing jobs, and another for the Processing job itself with S3, KMS, and ECR permissions: This option is more aligned with best practices. The notebook instance and the Processing job have different roles tailored to their specific needs. This separation ensures that each service has only the permissions necessary for its operation, following the principle of least privilege. 
The processing job may not run on the notebook instance. AWS will provide resources to execute the job. So A is wrong. B. If we follow the principle of Least Privilege, B is correct. The notebook instance does not need access to S3 and KMS given that it is only needed to trigger the Processing job. Not A b/c it does not indicate perms given to the Job via IAM role. => I went with B. My answer is B. The notebook instance doesn't need access to S3 and ECR. This access is needed for the Processing job only. And as a best practice of least privilege I'll choose B, where permissions are granted to the SageMaker Processing job itself and not to the notebook instance. This approach offers better security and control over permissions, making it the preferred choice for running SageMaker Processing jobs with the required access to S3, KMS, and ECR. It follows the principle of least privilege and gives more control over permissions. It says "Amazon SageMaker Processing job that is triggered from code in an Amazon SageMaker notebook." - so A or C. There is no need to create an S3 endpoint (C), that is only to allow traffic over the internet. So A. I'm confused between A and B. Leaning toward B. The main difference between A and B is the IAM role that is attached to the SageMaker notebook instance. In A, the role has permissions to access the data, the container image, and the KMS CMK. In B, the role only has permissions to create SageMaker Processing jobs. This means that in A, the notebook instance can potentially access or modify the data or the image without using a Processing job, which is not desirable. In B, the notebook instance can only create Processing jobs, and the Processing jobs themselves have a separate IAM role that grants them access to the data, the image, and the KMS CMK. This way, the data and the image are only accessed by the Processing jobs, which are more secure and controlled than the notebook instance. Letters C and D are wrong, as they bring in a VPC, something that is not mentioned in the problem. Letter A is correct, since Letter B asks for the creation of two different IAM roles. What is the problem with creating two different IAM roles? Option A ensures that the role has the necessary permissions to access the required resources (S3, KMS, ECR) and that the notebook has the ability to create a processing job in SageMaker seamlessly. It also follows the principle of "least privilege" by granting only the necessary permissions to perform the task without exposing more access than required. - https://www.examtopics.com/discussions/amazon/view/74974-exam-aws-certified-machine-learning-specialty-topic-1/
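To illustrate the separation of roles argued for in answer B, a minimal sketch (SageMaker Python SDK; the image URI, role ARN, KMS key ARNs, and bucket names are placeholders) of launching the Processing job from notebook code while passing the job its own IAM role:

```python
# The notebook only needs permission to create the job; the job runs with its own role.
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

processor = ScriptProcessor(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest",   # ECR image (placeholder)
    command=["python3"],
    role="arn:aws:iam::123456789012:role/ProcessingJobRole",   # job role with S3/KMS/ECR access (placeholder)
    instance_type="ml.m5.xlarge",
    instance_count=1,
    volume_kms_key="arn:aws:kms:us-east-1:123456789012:key/example",   # encrypt job volumes (placeholder)
    output_kms_key="arn:aws:kms:us-east-1:123456789012:key/example",   # encrypt outputs written to S3 (placeholder)
)
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://example-bucket/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://example-bucket/processed/")],
)
```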
135
135 - A data scientist has been running an Amazon SageMaker notebook instance for a few weeks. During this time, a new version of Jupyter Notebook was released along with additional software updates. The security team mandates that all running SageMaker notebook instances use the latest security and software updates provided by SageMaker. How can the data scientist meet these requirements? - A.. Call the CreateNotebookInstanceLifecycleConfig API operation B.. Create a new SageMaker notebook instance and mount the Amazon Elastic Block Store (Amazon EBS) volume from the original instance C.. Stop and then restart the SageMaker notebook instance D.. Call the UpdateNotebookInstanceLifecycleConfig API operation
C - This is correct according to official documentation. https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-software-updates.html Amazon SageMaker periodically tests and releases software that is installed on notebook instances, such as Jupyter Notebook, security patches, AWS SDK updates, and so on. To ensure that you have the most recent software updates, you need to stop and restart your notebook instance, either in the SageMaker console or by calling StopNotebookInstance. By stopping and restarting the SageMaker notebook instance, it will automatically apply the latest security and software updates provided by SageMaker. This process refreshes the underlying infrastructure, ensuring that the notebook instance is running with the most up-to-date software and security patches. It is a simple and effective way to comply with the security team's mandate for using the latest updates. C per Developer Documentation https://gmoein.github.io/files/Amazon%20SageMaker.pdf Page44 - https://www.examtopics.com/discussions/amazon/view/74279-exam-aws-certified-machine-learning-specialty-topic-1/
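A minimal sketch of answer C (boto3; the instance name is a placeholder): stop the notebook instance, wait for it to stop, and start it again so it picks up the latest SageMaker-provided software and security updates:

```python
# Stop and restart a notebook instance to apply the latest updates.
import boto3

sm = boto3.client("sagemaker")
name = "ds-notebook"  # placeholder

sm.stop_notebook_instance(NotebookInstanceName=name)
sm.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)

sm.start_notebook_instance(NotebookInstanceName=name)
sm.get_waiter("notebook_instance_in_service").wait(NotebookInstanceName=name)
```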
136
136 - A library is developing an automatic book-borrowing system that uses Amazon Rekognition. Images of library members' faces are stored in an Amazon S3 bucket. When members borrow books, the Amazon Rekognition CompareFaces API operation compares real faces against the stored faces in Amazon S3. The library needs to improve security by making sure that images are encrypted at rest. Also, when the images are used with Amazon Rekognition, they need to be encrypted in transit. The library also must ensure that the images are not used to improve Amazon Rekognition as a service. How should a machine learning specialist architect the solution to satisfy these requirements? - A.. Enable server-side encryption on the S3 bucket. Submit an AWS Support ticket to opt out of allowing images to be used for improving the service, and follow the process provided by AWS Support. B.. Switch to using an Amazon Rekognition collection to store the images. Use the IndexFaces and SearchFacesByImage API operations instead of the CompareFaces API operation. C.. Switch to using the AWS GovCloud (US) Region for Amazon S3 to store images and for Amazon Rekognition to compare faces. Set up a VPN connection and only call the Amazon Rekognition API operations through the VPN. D.. Enable client-side encryption on the S3 bucket. Set up a VPN connection and only call the Amazon Rekognition API operations through the VPN.
A - A. Images passed to Amazon Rekognition API operations may be stored and used to improve the service unless you have opted out by visiting the AI services opt-out policy page and following the process explained there https://docs.aws.amazon.com/rekognition/latest/dg/security-data-encryption.html So the answer is A https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_ai-opt-out.html Yes, but server-side encryption doesn't protect in transit. Only client-side encryption can do it. Ok, I see: "encryption in transit" means HTTPS: Amazon Rekognition API endpoints only support secure connections over HTTPS. All communication is encrypted with Transport Layer Security (TLS). Absolutely A. Rekognition API endpoints only support secure connections over HTTPS and all communication is encrypted in transit with TLS https://aws.amazon.com/rekognition/faqs/?nc1=h_ls B is the correct one. Client-side encryption requires you to manage the encryption and decryption of your data yourself and is overkill. Will go with server-side encryption. Rekognition already encrypts data in transit. B https://docs.aws.amazon.com/rekognition/latest/dg/collections.html You can opt out of AI data usage in AWS through Organizations settings. Option A is correct. "Also, when the images are used with Amazon Rekognition, they need to be encrypted in transit." Server-side encryption doesn't encrypt images in transit, only when they are already uploaded to S3. Only client-side encryption can encrypt the images before they move to the AWS cloud. You forgot about removing the possibility of Rekognition training. Client-side encryption means the key is stored on the client side. AWS has no key, how can they train? According to the Rekognition FAQs, you may opt out of having your image and video inputs used to improve or develop the quality of Amazon Rekognition and other Amazon machine-learning/artificial-intelligence technologies by using an AWS Organizations opt-out policy. https://aws.amazon.com/rekognition/faqs/ How is it A??? - https://www.examtopics.com/discussions/amazon/view/74070-exam-aws-certified-machine-learning-specialty-topic-1/
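A minimal sketch of the pieces behind answer A (boto3; bucket, key, and file names are placeholders): default server-side encryption on the member-photo bucket plus a CompareFaces call. Rekognition endpoints are HTTPS-only, so the call is encrypted in transit; the opt-out itself is an AWS Organizations AI-services opt-out policy rather than an API parameter.

```python
# Server-side encryption at rest + CompareFaces over HTTPS (encrypted in transit).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="library-member-faces",   # placeholder
    ServerSideEncryptionConfiguration={"Rules": [
        {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
    ]},
)

rekognition = boto3.client("rekognition")
match = rekognition.compare_faces(
    SourceImage={"S3Object": {"Bucket": "library-member-faces", "Name": "members/1234.jpg"}},
    TargetImage={"Bytes": open("capture.jpg", "rb").read()},   # live capture at the desk (placeholder file)
    SimilarityThreshold=90,
)
```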
137
137 - A company is building a line-counting application for use in a quick-service restaurant. The company wants to use video cameras pointed at the line of customers at a given register to measure how many people are in line and deliver notifications to managers if the line grows too long. The restaurant locations have limited bandwidth for connections to external services and cannot accommodate multiple video streams without impacting other operations. Which solution should a machine learning specialist implement to meet these requirements? - A.. Install cameras compatible with Amazon Kinesis Video Streams to stream the data to AWS over the restaurant's existing internet connection. Write an AWS Lambda function to take an image and send it to Amazon Rekognition to count the number of faces in the image. Send an Amazon Simple Notification Service (Amazon SNS) notification if the line is too long. B.. Deploy AWS DeepLens cameras in the restaurant to capture video. Enable Amazon Rekognition on the AWS DeepLens device, and use it to trigger a local AWS Lambda function when a person is recognized. Use the Lambda function to send an Amazon Simple Notification Service (Amazon SNS) notification if the line is too long. C.. Build a custom model in Amazon SageMaker to recognize the number of people in an image. Install cameras compatible with Amazon Kinesis Video Streams in the restaurant. Write an AWS Lambda function to take an image. Use the SageMaker endpoint to call the model to count people. Send an Amazon Simple Notification Service (Amazon SNS) notification if the line is too long. D.. Build a custom model in Amazon SageMaker to recognize the number of people in an image. Deploy AWS DeepLens cameras in the restaurant. Deploy the model to the cameras. Deploy an AWS Lambda function to the cameras to use the model to count people and send an Amazon Simple Notification Service (Amazon SNS) notification if the line is too long.
D - Answer is D: A is not correct since the restaurant has limited bandwidth. B is not correct since you cannot enable the Rekognition service on DeepLens. C is not correct for the same reason as A. B is correct, with Rekognition integrated with DeepLens and no extra configuration needed. (https://aws.amazon.com/blogs/machine-learning/building-a-smart-garage-door-opener-with-aws-deeplens-and-amazon-rekognition/) In this blog, the Rekognition service is not running on DeepLens. It said "After you deploy the sample object detection project into AWS DeepLens, you need to change the inference (edge) Lambda function to upload image frames to Amazon S3. ... and Rekognition would do its work from the Cloud to image frames on S3."... It would still consume lots of bandwidth. So B is NOT correct. I also agree with D. B is incorrect because there is no need to recognize a specific person; it just needs to count the number of people. AWS does not recommend using DeepLens in production. From https://aws.amazon.com/deeplens/device-terms-of-use/ AWS doesn't allow use in production, only in evaluation. Can we accept counting the number of people as an evaluation? https://aws.amazon.com/deeplens/device-terms-of-use/ Sorry guys, not B, C or D. Reasons? DeepLens is a deprecated product, not suitable for being used in a real production environment (as clearly stated in its T&C), thus options B & D are out. Between A & C, the clear option is A. C implies the creation of a custom ML model. Making a custom model is very expensive, time-consuming, error-prone and a highly specialized task. Option A uses a well-known, turnkey service such as Amazon Rekognition, which implies very little effort in comparison with using a custom-made one. I know, this option does not follow the flock, but I think that I am right. Yes, Amazon Rekognition can be integrated with AWS DeepLens. You can use AWS DeepLens to capture video and perform initial processing on the device. For more advanced image and video analysis, you can send frames from DeepLens to Amazon Rekognition. The best solution for building a line-counting application for use in a quick-service restaurant is to use the following steps: Build a custom model in Amazon SageMaker to recognize the number of people in an image. Amazon SageMaker is a fully managed service that provides tools and workflows for building, training, and deploying machine learning models. A custom model can be tailored to the specific use case of line counting and achieve higher accuracy than a generic model. Deploy AWS DeepLens cameras in the restaurant to capture video. B. AWS DeepLens with Local Amazon Rekognition and AWS Lambda: AWS DeepLens is designed for local processing and can run models at the edge (i.e., on the device itself). This setup would enable local analysis of the video feed without the need to stream the video to the cloud, thus conserving bandwidth. Amazon Rekognition and Lambda can then be used to analyze the footage and send notifications. This option aligns well with the bandwidth limitations. D. Custom Model on AWS DeepLens with AWS Lambda: Deploying a custom model built in SageMaker to AWS DeepLens allows for local processing of video data. This option also avoids the bandwidth issue by processing data on the device. However, developing a custom model might be more complex than using pre-built solutions like Amazon Rekognition. Rekognition is a managed service. It uses APIs and can't be deployed locally on devices. What we need here is local inference on the camera. 
AWS DeepLens comes pre-installed with a high-performance, efficient, optimized inference engine for deep learning using Apache MXNet. I would go with A. As DeepLens is not for production workloads, we are left with A or C, and A requires less effort. B: https://aws.amazon.com/ko/blogs/machine-learning/building-a-smart-garage-door-opener-with-aws-deeplens-and-amazon-rekognition/ Based on the requirements, the best solution is option D. This option uses AWS DeepLens cameras to capture video and process it locally on the device, without sending any video streams to external services. This reduces the bandwidth consumption and avoids impacting other operations in the restaurant. The option also uses a custom model built in Amazon SageMaker to recognize the number of people in an image, which can be more accurate and tailored to the specific use case than a generic face detection model. The option also deploys an AWS Lambda function to the cameras to use the model to count people and send an Amazon Simple Notification Service (Amazon SNS) notification if the line is too long. It is D. "The restaurant locations have limited bandwidth for connections to external services and cannot accommodate multiple video streams without impacting other operations." So, using Amazon Kinesis Video Streams is not a solution here. Ok, DeepLens disappears in 2024... but this question is from 2022... In the real world, the restaurant would buy a good internet connection and use answer C, which is a better solution. C is the answer. AWS DeepLens will reach end-of-life on 31/01/2024, so I don't think this question will even appear in the exam. DeepLens + Lambda + model inference. After giving this some thought, I am thinking D. Tricky, my initial answer was C. But D is a better solution - given DeepLens and counting the number of people. https://aws.amazon.com/ko/blogs/machine-learning/building-a-smart-garage-door-opener-with-aws-deeplens-and-amazon-rekognition/ According to this link, the answer should be D, because we can directly deploy the model on DeepLens to count the number of people instead of using Rekognition. https://aws.amazon.com/blogs/machine-learning/optimize-workforce-in-your-store-using-amazon-rekognition/ B - https://www.examtopics.com/discussions/amazon/view/74926-exam-aws-certified-machine-learning-specialty-topic-1/
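A minimal sketch of the notification side of answer D (boto3 for SNS; the detection output format, threshold, and topic ARN are assumptions, since the exact output depends on the model deployed to the camera): count "person" detections from the local model and alert managers when the line is too long:

```python
# Count people from local inference results and publish an SNS alert if needed.
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:long-line-alerts"  # placeholder
LINE_LIMIT = 6  # assumed threshold for "too long"

def notify_if_line_too_long(detections):
    """detections: list of dicts like {"label": "person", "confidence": 0.93} from the local model (assumed format)."""
    people = sum(1 for d in detections if d["label"] == "person" and d["confidence"] > 0.5)
    if people > LINE_LIMIT:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Message=f"{people} customers in line at register 3 -- please open another register.",
        )
    return people
```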
138
138 - A company has set up and deployed its machine learning (ML) model into production with an endpoint using Amazon SageMaker hosting services. The ML team has configured automatic scaling for its SageMaker instances to support workload changes. During testing, the team notices that additional instances are being launched before the new instances are ready. This behavior needs to change as soon as possible. How can the ML team solve this issue? - A.. Decrease the cooldown period for the scale-in activity. Increase the configured maximum capacity of instances. B.. Replace the current endpoint with a multi-model endpoint using SageMaker. C.. Set up Amazon API Gateway and AWS Lambda to trigger the SageMaker inference endpoint. D.. Increase the cooldown period for the scale-out activity.
D - I believe this is a problem to do with scaling out (increasing the number of instances), cooldown period should be increased. https://docs.aws.amazon.com/autoscaling/ec2/userguide/Cooldown.html https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/ Option D The issue is related to scaling out, specifically the fact that new instances are being launched before the existing ones are ready. To address this issue, the ML team could consider increasing the minimum number of instances, reducing the target value for CPU utilization, or increasing the warm-up time for the instances. These actions can help to ensure that new instances are not launched until the existing ones have reached a stable state, which can prevent performance issues and ensure the reliability of the service. Option D, which suggests increasing the cooldown period for the scale-out activity, could potentially help to address this issue by ensuring that the new instances are not launched too quickly. Option A, which suggests decreasing the cooldown period for the scale-in activity and increasing the maximum capacity of instances, is not an appropriate solution to the problem described. Decreasing the cooldown period for scale-in activity would result in instances being terminated too quickly, and increasing the maximum capacity of instances would not necessarily prevent new instances from being launched too quickly. Agreed with D. should be increased not decreased Answer is "D" Definitely D. - https://www.examtopics.com/discussions/amazon/view/74280-exam-aws-certified-machine-learning-specialty-topic-1/
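A minimal sketch of answer D (boto3; the endpoint and variant names are placeholders): raising ScaleOutCooldown on the endpoint's target-tracking scaling policy so new instances are not launched before the previous ones are in service:

```python
# Target-tracking autoscaling for a SageMaker endpoint variant with a longer scale-out cooldown.
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"   # placeholder

aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"},
        "ScaleOutCooldown": 600,   # increased cooldown for scale-out (the fix discussed above)
        "ScaleInCooldown": 300,
    },
)
```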
139
139 - A telecommunications company is developing a mobile app for its customers. The company is using an Amazon SageMaker hosted endpoint for machine learning model inferences. Developers want to introduce a new version of the model for a limited number of users who subscribed to a preview feature of the app. After the new version of the model is tested as a preview, developers will evaluate its accuracy. If a new version of the model has better accuracy, developers need to be able to gradually release the new version for all users over a fixed period of time. How can the company implement the testing model with the LEAST amount of operational overhead? - A.. Update the ProductionVariant data type with the new version of the model by using the CreateEndpointConfig operation with the InitialVariantWeight parameter set to 0. Specify the TargetVariant parameter for InvokeEndpoint calls for users who subscribed to the preview feature. When the new version of the model is ready for release, gradually increase InitialVariantWeight until all users have the updated version. B.. Configure two SageMaker hosted endpoints that serve the different versions of the model. Create an Application Load Balancer (ALB) to route traffic to both endpoints based on the TargetVariant query string parameter. Reconfigure the app to send the TargetVariant query string parameter for users who subscribed to the preview feature. When the new version of the model is ready for release, change the ALB's routing algorithm to weighted until all users have the updated version. C.. Update the DesiredWeightsAndCapacity data type with the new version of the model by using the UpdateEndpointWeightsAndCapacities operation with the DesiredWeight parameter set to 0. Specify the TargetVariant parameter for InvokeEndpoint calls for users who subscribed to the preview feature. When the new version of the model is ready for release, gradually increase DesiredWeight until all users have the updated version. D.. Configure two SageMaker hosted endpoints that serve the different versions of the model. Create an Amazon Route 53 record that is configured with a simple routing policy and that points to the current version of the model. Configure the mobile app to use the endpoint URL for users who subscribed to the preview feature and to use the Route 53 record for other users. When the new version of the model is ready for release, add a new model version endpoint to Route 53, and switch the policy to weighted until all users have the updated version.
C - https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html after reviewing it maybe C not A https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_endpoints/a_b_testing/a_b_testing.html Should be A Answer is C, hosting two models under single endpoint has less operational overheads than two hosting endpoints in option A it's mentioned that we set initial_weight to 0 which isn't true as the value should be 1 -> C While Option C is a viable method, Option A is generally more straightforward and aligns well with common practices for deploying and managing model versions in SageMaker. Supported by Copilot The Answer is A. The question says "Developers want to introduce a new version of the model for a limited number of users who subscribed to a..." In order to introduce a new production version with least overhead you have to create a production variant by using CreateEndpointConfig operation and set the InitialVariantWeight to 0. You then specify the TargetVariant parameter for InvokeEndpoint calls for users who subscribed to the preview feature and gradually update the weight. https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html preview feature of the app -CreateEndPointConfig with initial weight to 0 prohibits any traffic to new variant - TargetVariant Parameter in the endpoint calls made by selected users ensures new variant be used - Change of InitialWeight causes gradual release of new variant Obviously C https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-best-practices.html You can modify an endpoint without taking models that are already deployed into production out of service. For example, you can add new model variants, update the ML Compute instance configurations of existing model variants, or change the distribution of traffic among model variants. To modify an endpoint, you provide a new endpoint configuration. SageMaker implements the changes without any downtime. For more information see, UpdateEndpoint and UpdateEndpointWeightsAndCapacities. According to this doc, new variants can be deployed with UpdateEndpoint, and weights can be updated with UpdateEndpointWeightsAndCapacities. Though for using UpdateEndpoint we need to create an endpoint config. I will go with C The company can implement the testing model with the least amount of operational overhead by using Option A. The developers can update the ProductionVariant data type with the new version of the model by using the CreateEndpointConfig operation with the InitialVariantWeight parameter set to 0. They can specify the TargetVariant parameter for InvokeEndpoint calls for users who subscribed to the preview feature. When the new version of the model is ready for release, they can gradually increase InitialVariantWeight until all users have the updated version The best option for the company to implement the testing model with the least amount of operational overhead is option C. Option C uses the SageMaker feature of production variants, which allows the company to test multiple models on a single endpoint and control the traffic distribution between them. By setting the DesiredWeight parameter to 0 for the new version of the model, the company can ensure that only users who subscribed to the preview feature will invoke the new version by specifying the TargetVariant parameter. When the new version of the model is ready for release, the company can gradually increase the DesiredWeight parameter until all users have the updated version. 
This option minimizes the operational overhead by avoiding the need to create and manage additional endpoints, load balancers, or DNS records. C is correct. The existing model will be updated using parameter DesiredWeightAndCapacity for new production variant and lead to less operational effort. This one is tricky, but I think it is testing the difference between UpdateEndpointWeightsAndCapacities and ProductionVariant UpdateEndpointWeightsAndCapacities: Updates variant weight of one or more variants associated with an existing endpoint, or capacity of one variant associated with an existing endpoint https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html ProductionVariant: Identifies a model that you want to host and the resources chosen to deploy for hosting it. If you are deploying multiple models, tell SageMaker how to distribute traffic among the models by specifying variant weights. https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html So it must be A, because the variant must exist before it is updated This link gave me confidence to choose A https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html I agree with C. Please see step 4: https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html & in option A it's mentioned that we set initial_weight to 0 which isn't true as the value should be 1. I did not found the InitialVariantWeight, only DesiredWeight, therefore is C: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DesiredWeightAndCapacity.html https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html Step 4: Increase traffic to the best model Now that we have determined that Variant2 performs better than Variant1, we shift more traffic to it. We can continue to use TargetVariant to invoke a specific model variant, but a simpler approach is to update the weights assigned to each variant by calling UpdateEndpointWeightsAndCapacities. Update should be the correct action to this change. - https://www.examtopics.com/discussions/amazon/view/74921-exam-aws-certified-machine-learning-specialty-topic-1/
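To make the mechanics of option C concrete, here is a minimal sketch (assumed endpoint and variant names) of invoking the zero-weight preview variant with TargetVariant and then shifting traffic with UpdateEndpointWeightsAndCapacities:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")
sm = boto3.client("sagemaker")

# Preview users: route the request explicitly to the new variant,
# even while its traffic weight is 0.
response = runtime.invoke_endpoint(
    EndpointName="book-recs-endpoint",   # hypothetical endpoint name
    TargetVariant="VariantB",            # the new model version
    ContentType="text/csv",
    Body=b"1.0,2.0,3.0",
)

# Gradual rollout: shift weights on the existing endpoint, no redeploy needed.
sm.update_endpoint_weights_and_capacities(
    EndpointName="book-recs-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "VariantA", "DesiredWeight": 0.5},
        {"VariantName": "VariantB", "DesiredWeight": 0.5},
    ],
)
```

Because both variants live behind one endpoint, the rollout needs no extra load balancers or DNS records, which is why this approach carries the least operational overhead.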
140
140 - A company offers an online shopping service to its customers. The company wants to enhance the site's security by requesting additional information when customers access the site from locations that are different from their normal location. The company wants to update the process to call a machine learning (ML) model to determine when additional information should be requested. The company has several terabytes of data from its existing ecommerce web servers containing the source IP addresses for each request made to the web server. For authenticated requests, the records also contain the login name of the requesting user. Which approach should an ML specialist take to implement the new security feature in the web application? - A.. Use Amazon SageMaker Ground Truth to label each record as either a successful or failed access attempt. Use Amazon SageMaker to train a binary classification model using the factorization machines (FM) algorithm. B.. Use Amazon SageMaker to train a model using the IP Insights algorithm. Schedule updates and retraining of the model using new log data nightly. C.. Use Amazon SageMaker Ground Truth to label each record as either a successful or failed access attempt. Use Amazon SageMaker to train a binary classification model using the IP Insights algorithm. D.. Use Amazon SageMaker to train a model using the Object2Vec algorithm. Schedule updates and retraining of the model using new log data nightly.
B - B? Because the IP Insights algorithm is unsupervised learning, it doesn't need labels. Answer should be B. Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insights models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms. Agree with the comments below: B. https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html B; IP Insights for IP address anomaly detection - https://www.examtopics.com/discussions/amazon/view/74071-exam-aws-certified-machine-learning-specialty-topic-1/
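As a rough illustration of option B, a sketch of training the built-in IP Insights algorithm with the SageMaker Python SDK; the role ARN, bucket paths, and hyperparameter values are placeholder assumptions:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical

image = image_uris.retrieve("ipinsights", session.boto_region_name)
estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/ipinsights/output/",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(num_entity_vectors=20000, vector_dim=128)

# Training data: headerless CSV rows of <entity, e.g. login name>,<IPv4 address>
estimator.fit(
    {"train": TrainingInput("s3://my-bucket/ipinsights/train.csv", content_type="text/csv")}
)
```

Nightly retraining, as option B describes, amounts to re-running this job on the latest web server logs.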
141
141 - A retail company wants to combine its customer orders with the product description data from its product catalog. The structure and format of the records in each dataset is different. A data analyst tried to use a spreadsheet to combine the datasets, but the effort resulted in duplicate records and records that were not properly combined. The company needs a solution that it can use to combine similar records from the two datasets and remove any duplicates. Which solution will meet these requirements? - A.. Use an AWS Lambda function to process the data. Use two arrays to compare equal strings in the fields from the two datasets and remove any duplicates. B.. Create AWS Glue crawlers for reading and populating the AWS Glue Data Catalog. Call the AWS Glue SearchTables API operation to perform a fuzzy- matching search on the two datasets, and cleanse the data accordingly. C.. Create AWS Glue crawlers for reading and populating the AWS Glue Data Catalog. Use the FindMatches transform to cleanse the data. D.. Create an AWS Lake Formation custom transform. Run a transformation for matching products from the Lake Formation console to cleanse the data automatically.
C - C; Glue can use FindMatches transformation to find duplicates It says "Each dataset contains records with a unique structure and format.", so C would not be correct. but thats exactly the use of FindMatches: The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly It is C as described in the tutorial - https://docs.aws.amazon.com/glue/latest/dg/machine-learning-transform-tutorial.html LakeFormation can also invoke a FindMatches algorithm (because it manages Data Ingestion through Glue), but we don't have a data lake in this example. No one would build a whole Data Lake - a process that takes days - only to find some matching records. Option C Lake Formation helps clean and prepare your data for analysis by providing a Machine Learning (ML) Transform called FindMatches for deduplication and finding matching records. For example, use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows “Joseph's Pizzeria” at “121 Main.” You don't need to know anything about ML to do this. FindMatches will simply ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a match and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases. https://aws.amazon.com/lake-formation/features/ AWS Lake Formation FindMatches is a new machine learning (ML) transform that enables you to match records across different datasets as well as identify and remove duplicate records, with little to no human intervention Ans is D Thing is, FindMatches is not a custom transformation in LakeFormation. And LakeFormation transforms are actually Glue jobs D is correct - https://www.examtopics.com/discussions/amazon/view/74993-exam-aws-certified-machine-learning-specialty-topic-1/
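For option C, a minimal boto3 sketch of creating a FindMatches ML transform over a crawled Data Catalog table; the database, table, role, and primary-key column names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

glue.create_ml_transform(
    Name="dedupe-orders-vs-catalog",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",   # hypothetical role
    InputRecordTables=[
        {"DatabaseName": "retail_db", "TableName": "combined_records"},
    ],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "record_id",   # assumed key column
            "PrecisionRecallTradeoff": 0.9,        # favor precision when merging
            "EnforceProvidedLabels": False,
        },
    },
)
```

After labeling a sample of matching/non-matching pairs, the trained transform is applied in a Glue job to merge similar records and drop duplicates.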
142
142 - A company provisions Amazon SageMaker notebook instances for its data science team and creates Amazon VPC interface endpoints to ensure communication between the VPC and the notebook instances. All connections to the Amazon SageMaker API are contained entirely and securely using the AWS network. However, the data science team realizes that individuals outside the VPC can still connect to the notebook instances across the internet. Which set of actions should the data science team take to fix the issue? - A.. Modify the notebook instances' security group to allow traffic only from the CIDR ranges of the VPC. Apply this security group to all of the notebook instances' VPC interfaces. B.. Create an IAM policy that allows the sagemaker:CreatePresignedNotebooklnstanceUrl and sagemaker:DescribeNotebooklnstance actions from only the VPC endpoints. Apply this policy to all IAM users, groups, and roles used to access the notebook instances. C.. Add a NAT gateway to the VPC. Convert all of the subnets where the Amazon SageMaker notebook instances are hosted to private subnets. Stop and start all of the notebook instances to reassign only private IP addresses. D.. Change the network ACL of the subnet the notebook is hosted in to restrict access to anyone outside the VPC.
B - B appears to be correct according to the official source. https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html#notebook-private-link-restrict https://docs.aws.amazon.com/sagemaker/latest/dg/security.html Security group is suffice https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html describes this scenario -To restrict access to only connections made from within your VPC, create an AWS Identity and Access Management policy that restricts access to only calls that come from within your VPC. Then add that policy to every AWS Identity and Access Management user, group, or role used to access the notebook instance. Its A. This solutions works for all users, no more configurations needed. Going with B because - underlying notebook instances are managed by aws and can’t apply security groups - updating IAM policy only restricts connection only from VPC endpoints The issue is that the notebook instances' security group allows inbound traffic from any source IP address, which means that anyone with the authorized URL can access the notebook instances over the internet. To fix this issue, the data science team should modify the security group to allow traffic only from the CIDR ranges of the VPC, which are the IP addresses assigned to the resources within the VPC. This way, only the VPC interface endpoints and the resources within the VPC can communicate with the notebook instances. The data science team should apply this security group to all of the notebook instances' VPC interfaces, which are the network interfaces that connect the notebook instances to the VPC. B. notebook instances are controlled by AWS service accounts and hence no access to those instances A. Modify the notebook instances' security group: This approach involves adjusting the security group settings to only allow traffic from the VPC's CIDR ranges. By applying this security group to all of the notebook instances' VPC interfaces, it ensures that only traffic originating from within the VPC can access the notebook instances. This is a viable solution because it directly restricts access based on the source of the traffic. B. Create an IAM policy for VPC endpoint access: This solution involves crafting an IAM policy that restricts certain SageMaker actions to only the VPC endpoints. However, this approach might not fully address the issue of external access to the notebook instances themselves. It's more about controlling who can create or describe notebook instances, rather than restricting network access. BUT according to here, it should be A: https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html should be B* B is talking about a policy to allow. It doesn't ban anything, it's only about allow.... So the answer can't be B. A A. NO - it is not possible the security group of the instances, they are managed by SageMaker and will not appear in the console B. YES - https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html#notebook-private-link-restrict C. NO - subnets cannot be converted from public to private D. NO - ACL are for the notebooks, not the network Based on my search, the answer is A. Modifying the notebook instances’ security group to allow traffic only from the CIDR ranges of the VPC is a way to restrict access to anyone outside the VPC1. 
Amazon VPC interface endpoints enable you to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection2. However, they do not prevent users from accessing the notebook instances using presigned URLs3. Therefore, options B, C and D are not correct. guys the right answer is B according to this reference: https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html#notebook-private-link-restrict To restrict access to only connections made from within your VPC, create an AWS Identity and Access Management policy that restricts access to only calls that come from within your VPC. Then add that policy to every AWS Identity and Access Management user, group, or role used to access the notebook instance. This question may be old based on this https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/ but you can still remove all other allowed access and just add the VPC ciders to the SGs as there is an explicit Deny for anything not explicitly allowed. Option B creates an IAM policy that allows the sagemaker:CreatePresignedNotebookInstanceUrl and sagemaker:DescribeNotebookInstance actions from only the VPC endpoints. These actions are required to access the notebook instances through the Amazon SageMaker console or the AWS CLI1. By applying this policy to all IAM users, groups, and roles used to access the notebook instances, the data science team can ensure that only authorized users within the VPC can connect to the notebook instances across the internet. Modifying the notebook instances' security group to allow traffic only from the CIDR ranges of the VPC ensures that only connections from within the VPC are permitted. This restricts access to the notebook instances from individuals outside the VPC, effectively securing the communication and preventing unauthorized access. On the other hand, Option B, creating an IAM policy for sagemaker:CreatePresignedNotebookInstanceUrl and sagemaker:DescribeNotebookInstance actions from VPC endpoints, does not address the issue of restricting direct internet access to the notebook instances. IAM policies manage permissions for AWS service actions and resources, but they do not control network-level access. "...the data science team realizes that individuals outside the VPC can still connect to the notebook instances across the internet.." B states - "Create an IAM policy that allows the sagemaker:CreatePresignedNotebooklnstanceUrl and sagemaker:DescribeNotebooklnstance actions from only the VPC endpoints" Ok, so now the individuals outside the VPC can't create a CreatePresignedNotebooklnstanceUrl or DescribeNotebooklnstance, but does that stop them from StopNotebookInstance or DeleteNotebookInstance operations? For option A, we only allow traffic from the VPC The problem about a is that "You can specify allow rules, but not deny rules." https://docs.aws.amazon.com/vpc/latest/userguide/security-group-rules.html#security-group-rule-characteristics Therefore, you cannot restrict the unauthorized access Should be security group thing - https://www.examtopics.com/discussions/amazon/view/74388-exam-aws-certified-machine-learning-specialty-topic-1/
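A minimal sketch of the identity-based policy referenced in the SageMaker docs for option B, created with boto3; the VPC endpoint ID is a placeholder:

```python
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnableNotebookAccessOnlyFromVpcEndpoint",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedNotebookInstanceUrl",
                "sagemaker:DescribeNotebookInstance",
            ],
            "Resource": "*",
            # Hypothetical interface endpoint ID for the SageMaker API.
            "Condition": {"StringEquals": {"aws:SourceVpce": ["vpce-0123456789abcdef0"]}},
        }
    ],
}

iam.create_policy(
    PolicyName="NotebookAccessFromVpceOnly",
    PolicyDocument=json.dumps(policy_document),
)
```

Attaching the resulting policy to every user, group, and role that opens the notebook means presigned URLs can only be created from calls that originate inside the VPC.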
143
143 - A company will use Amazon SageMaker to train and host a machine learning (ML) model for a marketing campaign. The majority of data is sensitive customer data. The data must be encrypted at rest. The company wants AWS to maintain the root of trust for the master keys and wants encryption key usage to be logged. Which implementation will meet these requirements? - A.. Use encryption keys that are stored in AWS Cloud HSM to encrypt the ML data volumes, and to encrypt the model artifacts and data in Amazon S3. B.. Use SageMaker built-in transient keys to encrypt the ML data volumes. Enable default encryption for new Amazon Elastic Block Store (Amazon EBS) volumes. C.. Use customer managed keys in AWS Key Management Service (AWS KMS) to encrypt the ML data volumes, and to encrypt the model artifacts and data in Amazon S3. D.. Use AWS Security Token Service (AWS STS) to create temporary tokens to encrypt the ML storage volumes, and to encrypt the model artifacts and data in Amazon S3.
C - C is correct answer. Straight forward to use KMS. "The company wants AWS to maintain the root of trust for the master keys" The reason A is wrong. So C option C Using customer managed keys in AWS KMS will allow the company to maintain the root of trust for the master keys, and AWS KMS will log key usage. This ensures that the encryption keys used to encrypt the ML data volumes and model artifacts are properly managed and secured. Additionally, using customer managed keys allows the company to have greater control over the encryption process. "AWS Security Token Service (AWS STS) to create temporary tokens" - AWS STS also using KMS keys. https://docs.aws.amazon.com/kms/latest/developerguide/security-logging-monitoring.html - https://www.examtopics.com/discussions/amazon/view/76355-exam-aws-certified-machine-learning-specialty-topic-1/
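A short sketch of how option C looks in the SageMaker Python SDK, passing a customer managed KMS key for both the training volume and the S3 model artifacts; the key ARN, role, image URI, and bucket are placeholders:

```python
from sagemaker.estimator import Estimator

kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/1111aaaa-example"  # hypothetical CMK
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"            # hypothetical

estimator = Estimator(
    image_uri="<training-image-uri>",            # any SageMaker training container
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-secure-bucket/models/",  # hypothetical bucket
    volume_kms_key=kms_key_arn,   # encrypts the ML storage volume at rest
    output_kms_key=kms_key_arn,   # encrypts model artifacts written to S3
)
```

Because the key is a customer managed key in AWS KMS, every encrypt/decrypt call is logged through CloudTrail, satisfying the key-usage logging requirement.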
144
144 - A machine learning specialist stores IoT soil sensor data in Amazon DynamoDB table and stores weather event data as JSON files in Amazon S3. The dataset in DynamoDB is 10 GB in size and the dataset in Amazon S3 is 5 GB in size. The specialist wants to train a model on this data to help predict soil moisture levels as a function of weather events using Amazon SageMaker. Which solution will accomplish the necessary transformation to train the Amazon SageMaker model with the LEAST amount of administrative overhead? - A.. Launch an Amazon EMR cluster. Create an Apache Hive external table for the DynamoDB table and S3 data. Join the Hive tables and write the results out to Amazon S3. B.. Crawl the data using AWS Glue crawlers. Write an AWS Glue ETL job that merges the two tables and writes the output to an Amazon Redshift cluster. C.. Enable Amazon DynamoDB Streams on the sensor table. Write an AWS Lambda function that consumes the stream and appends the results to the existing weather files in Amazon S3. D.. Crawl the data using AWS Glue crawlers. Write an AWS Glue ETL job that merges the two tables and writes the output in CSV format to Amazon S3.
D - D. AWS Glue can connect with DynamoDB and join both datasets together via Glue Studio, requiring minimal overhead. Option D with AWS Glue crawlers and ETL job provides a straightforward and efficient way to merge the data from DynamoDB and Amazon S3 into a format suitable for training the Amazon SageMaker model with minimal administrative overhead. D. https://aws.amazon.com/blogs/big-data/accelerate-amazon-dynamodb-data-access-in-aws-glue-jobs-using-the-new-aws-glue-dynamodb-elt-connector/ 12-sep exam - https://www.examtopics.com/discussions/amazon/view/74392-exam-aws-certified-machine-learning-specialty-topic-1/
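As an illustration of option D, a sketch of a Glue ETL (PySpark) job that joins the two crawled tables and writes CSV to S3; the database, table, join key, and bucket names are assumptions:

```python
import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Both tables were populated by Glue crawlers; names are hypothetical.
sensors = glue_context.create_dynamic_frame.from_catalog(database="iot_db", table_name="soil_sensors")
weather = glue_context.create_dynamic_frame.from_catalog(database="iot_db", table_name="weather_events")

# Join on a shared timestamp key (assumed column name).
joined = Join.apply(sensors, weather, "event_time", "event_time")

# Write the merged training data as CSV for SageMaker.
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://my-training-bucket/merged/"},
    format="csv",
)
job.commit()
```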
145
145 - A company sells thousands of products on a public website and wants to automatically identify products with potential durability problems. The company has 1.000 reviews with date, star rating, review text, review summary, and customer email fields, but many reviews are incomplete and have empty fields. Each review has already been labeled with the correct durability result. A machine learning specialist must train a model to identify reviews expressing concerns over product durability. The first model needs to be trained and ready to review in 2 days. What is the MOST direct approach to solve this problem within 2 days? - A.. Train a custom classifier by using Amazon Comprehend. B.. Build a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet. C.. Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker. D.. Use a built-in seq2seq model in Amazon SageMaker.
A - A should be the answer it's a sentiment analysis problem => comprehend BlazingText can also do supervised text classification yes but only in TextClassification mode, note W2V mode... so A Built-in BlazingText model using Word2Vec mode in Amazon SageMaker would likely be quicker to set up compared to using Amazon Comprehend for this specific use case. Since the problem statement mentions that the review data is already labeled with the correct durability result, preparing the training data should be relatively straightforward. Additionally, as a built-in algorithm, BlazingText is optimized and pre-configured for text classification tasks, reducing the need for extensive customization and configuration compared to using Amazon Comprehend for this specific use case. It's important to note that while BlazingText may be quicker to set up for this particular task, Amazon Comprehend offers a broader range of NLP capabilities and may be more suitable for other NLP tasks or scenarios where more customization and flexibility are required. However, given the time constraint of 2 days and the specific requirement of identifying product durability concerns from reviews, training a built-in BlazingText model using Word2Vec mode in Amazon SageMaker is likely to be the more direct and quicker approach to get a working solution set up and running. yes but only in TextClassification mode, note W2V mode... so A Given the time constraint of 2 days and the need for a quick solution, the most direct approach would be to choose an option that provides a ready-to-use solution without the need for extensive customization or training. Among the given options, the most direct approach would be: C. Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker. This option allows you to leverage a pre-built model (BlazingText) that is optimized for text classification tasks. Word2Vec mode is suitable for analyzing text data and can quickly provide insights into sentiment or, in this case, concerns over product durability. This approach minimizes the need for extensive data preprocessing and model tuning, allowing you to focus on training and deploying the model within the given timeframe. Using a existing model to do the task in 2 days. A I would say A A. YES - Amazon Comprehend with multi-class mode and Augmented manifest file B. NO - Gluon is for timeseries C. NO - still a lot of work after generating embedding D. NO - seq2seq is to generate text, we want to classify To solve the problem in 2 days, and dealing with sentiment analysis so A will be the right answer using the comprehend AWS Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. It provides a range of functionalities, including detecting language and sentiment, extracting named entities and key phrases, and tagging parts of speech5. AWS Comprehend can automatically break down concepts like entities, phrases, and syntax in a document, which is particularly helpful for identifying events, organizations, persons, or products referenced in a document The most direct approach to solve this problem within 2 days is option A, train a custom classifier by using Amazon Comprehend. By doing so, you can use Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights and relationships in text, to create a custom classifier that can identify reviews expressing concerns over product durability. 
You can use the labeled reviews as your training data and specify the durability result as the class label. Amazon Comprehend will automatically preprocess the text, extract features, and train the classifier for you. You can also use Amazon Comprehend to evaluate the performance of your classifier and deploy it as an endpoint. This way, you can train a model to solve this problem within 2 days without requiring much coding or infrastructure management. A: You can customize Amazon Comprehend for your specific requirements without the skillset required to build machine learning-based NLP solutions. Using automatic machine learning, or AutoML, Comprehend Custom builds customized NLP models on your behalf, using training data that you provide. Comprehend can do Custom Classification. Comprehend can do Sentiment Analysis. The answer is C, because of the amount of data and the time constraint; C is the most efficient solution. Conventionally A would be the right answer, but given the time constraint the answer is C. I would say BlazingText, because Comprehend needs custom code and we only have 2 days. https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html If the problem needs to be solved in 2 days I would avoid going with any customised solution, which would eliminate A and B. As the data is labelled already we don't need an unsupervised algorithm, therefore eliminating C. Which leaves us with D. It's exactly the opposite: because it needs to be ready in 2 days I would use Comprehend ;) You don't need to write code, you have the data already available, so it's faster than D - https://www.examtopics.com/discussions/amazon/view/74394-exam-aws-certified-machine-learning-specialty-topic-1/
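For option A, a minimal boto3 sketch of training a Comprehend custom classifier on the labeled reviews; the classifier name, role ARN, and S3 path are hypothetical:

```python
import boto3

comprehend = boto3.client("comprehend")

# The training CSV has two columns: the durability label and the review text.
response = comprehend.create_document_classifier(
    DocumentClassifierName="durability-concern-classifier",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    LanguageCode="en",
    InputDataConfig={"S3Uri": "s3://my-bucket/reviews/train.csv"},
)
print(response["DocumentClassifierArn"])
```

Comprehend handles preprocessing, feature extraction, training, and evaluation metrics itself, which is what makes the 2-day deadline realistic with almost no custom code.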
146
146 - A company that runs an online library is implementing a chatbot using Amazon Lex to provide book recommendations based on category. This intent is fulfilled by an AWS Lambda function that queries an Amazon DynamoDB table for a list of book titles, given a particular category. For testing, there are only three categories implemented as the custom slot types: "comedy," "adventure,` and "documentary.` A machine learning (ML) specialist notices that sometimes the request cannot be fulfilled because Amazon Lex cannot understand the category spoken by users with utterances such as "funny," "fun," and "humor." The ML specialist needs to fix the problem without changing the Lambda code or data in DynamoDB. How should the ML specialist fix the problem? - A.. Add the unrecognized words in the enumeration values list as new values in the slot type. B.. Create a new custom slot type, add the unrecognized words to this slot type as enumeration values, and use this slot type for the slot. C.. Use the AMAZON.SearchQuery built-in slot types for custom searches in the database. D.. Add the unrecognized words as synonyms in the custom slot type.
D - D is the answer. The unrecognized words are synonyms for "comedy", so they should be added as synonyms under the comedy slot type see the excerpt: "For each intent, you can specify parameters that indicate the information that the intent needs to fulfill the user's request. These parameters, or slots, have a type. A slot type is a list of values that Amazon Lex uses to train the machine learning model to recognize values for a slot. For example, you can define a slot type called "Genres." Each value in the slot type is the name of a genre, "comedy," "adventure," "documentary," etc. You can define a synonym for a slot type value. For example, you can define the synonyms "funny" and "humorous" for the value "comedy."" https://docs.aws.amazon.com/lex/latest/dg/howitworks-custom-slots.html D? can not be C.Amazon Lex doesn't support the AMAZON.LITERAL or the AMAZON.SearchQuery built-in slot types. https://docs.aws.amazon.com/lex/latest/dg/howitworks-builtins-slots.html https://docs.aws.amazon.com/lex/latest/dg/howitworks-custom-slots.html D is the answer. The best way to fix the problem is option D, add the unrecognized words as synonyms in the custom slot type. By doing so, you can map different words that have the same meaning to the same slot value, without changing the Lambda code or data in DynamoDB. For example, you can add “funny”, “fun”, and “humor” as synonyms for the slot value “comedy”. This way, Amazon Lex can understand the category spoken by users and pass it to the Lambda function that queries the DynamoDB table for a list of book titles. Option A, adding the unrecognized words in the enumeration values list as new values in the slot type, is not a good choice because it would create new slot values that do not match the existing categories in the DynamoDB table. For example, if you add “funny” as a new value in the slot type, Amazon Lex would pass it to the Lambda function, which would not find any book titles for that category in the DynamoDB table. C is the answer AMAZON.SearchQuery As you think about what users are likely to ask, consider using a built-in or custom slot type to capture user input that is more predictable, and the AMAZON.SearchQuery slot type to capture less-predictable input that makes up the search query. The following example shows an intent schema for SearchIntent, which uses the AMAZON.SearchQuery slot type and also includes a CityList slot that uses the AMAZON.City slot type. Make sure that your skill uses no more than one AMAZON.SearchQuery slot per intent. The Amazon.SearchQuery slot type cannot be combined with another intent slot in sample utterances. Each sample utterance must include a carrier phrase. The exception is that you can omit the carrier phrase in slot samples. A carrier phrase is the word or words that are part of the utterance, but not the slot, such as "search for" or "find out". The ML specialist should add the unrecognized words as synonyms in the custom slot type. This will allow Amazon Lex to understand the user's intent even if they use synonyms for the predefined slot values. By adding the synonyms, Amazon Lex will recognize them as variations of the predefined slot values and map them to the appropriate slot value. This approach can be a quick and effective way to improve the accuracy of the chatbot's understanding of user requests without having to change the Lambda code or the data in DynamoDB. B is correct - https://www.examtopics.com/discussions/amazon/view/74078-exam-aws-certified-machine-learning-specialty-topic-1/
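A minimal sketch of option D using the Lex (V1) model-building API: the slot type keeps the same three values the Lambda function expects, and the unrecognized words become synonyms. The slot type name is an assumption:

```python
import boto3

lex = boto3.client("lex-models")  # Lex V1 model-building API

lex.put_slot_type(
    name="BookCategory",                       # hypothetical slot type name
    valueSelectionStrategy="TOP_RESOLUTION",   # resolve synonyms to the canonical value
    enumerationValues=[
        {"value": "comedy", "synonyms": ["funny", "fun", "humor"]},
        {"value": "adventure"},
        {"value": "documentary"},
    ],
)
```

With TOP_RESOLUTION, an utterance containing "funny" resolves to the canonical value "comedy", so the Lambda code and the DynamoDB data stay unchanged.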
147
147 - A manufacturing company uses machine learning (ML) models to detect quality issues. The models use images that are taken of the company's product at the end of each production step. The company has thousands of machines at the production site that generate one image per second on average. The company ran a successful pilot with a single manufacturing machine. For the pilot, ML specialists used an industrial PC that ran AWS IoT Greengrass with a long-running AWS Lambda function that uploaded the images to Amazon S3. The uploaded images invoked a Lambda function that was written in Python to perform inference by using an Amazon SageMaker endpoint that ran a custom model. The inference results were forwarded back to a web service that was hosted at the production site to prevent faulty products from being shipped. The company scaled the solution out to all manufacturing machines by installing similarly configured industrial PCs on each production machine. However, latency for predictions increased beyond acceptable limits. Analysis shows that the internet connection is at its capacity limit. How can the company resolve this issue MOST cost-effectively? - A.. Set up a 10 Gbps AWS Direct Connect connection between the production site and the nearest AWS Region. Use the Direct Connect connection to upload the images. Increase the size of the instances and the number of instances that are used by the SageMaker endpoint. B.. Extend the long-running Lambda function that runs on AWS IoT Greengrass to compress the images and upload the compressed files to Amazon S3. Decompress the files by using a separate Lambda function that invokes the existing Lambda function to run the inference pipeline. C.. Use auto scaling for SageMaker. Set up an AWS Direct Connect connection between the production site and the nearest AWS Region. Use the Direct Connect connection to upload the images. D.. Deploy the Lambda function and the ML models onto the AWS IoT Greengrass core that is running on the industrial PCs that are installed on each machine. Extend the long-running Lambda function that runs on AWS IoT Greengrass to invoke the Lambda function with the captured images and run the inference on the edge component that forwards the results directly to the web service.
D - D is correct according to official documentation. https://docs.aws.amazon.com/greengrass/v1/developerguide/ml-inference.html A-C: excluded out, Direct Connect is expensive Option D eliminates the need for internet connection since the inference is done on the edge component, and the results are directly forwarded to the web service. This approach also reduces the need for larger instances and direct connect connections, thus being the most cost-effective solution. - https://www.examtopics.com/discussions/amazon/view/74395-exam-aws-certified-machine-learning-specialty-topic-1/
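A rough sketch of the kind of local inference handler option D would package into the Greengrass Lambda, assuming the MXNet model artifacts are deployed to the core; the file names and on-site web service URL are hypothetical:

```python
import mxnet as mx
import requests

WEB_SERVICE_URL = "http://onsite-webservice.local/results"  # hypothetical local service

# Load the model once when the long-running function starts.
net = mx.gluon.SymbolBlock.imports("model-symbol.json", ["data"], "model-0000.params")

def infer_and_forward(image_batch):
    """image_batch: preprocessed numpy array shaped as the model expects."""
    prediction = net(mx.nd.array(image_batch)).asnumpy().tolist()
    # Forward results directly to the local web service instead of the cloud.
    requests.post(WEB_SERVICE_URL, json={"prediction": prediction}, timeout=2)
    return prediction
```

Because the images never leave the production site, prediction latency no longer depends on internet bandwidth.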
148
148 - A data scientist is using an Amazon SageMaker notebook instance and needs to securely access data stored in a specific Amazon S3 bucket. How should the data scientist accomplish this? - A.. Add an S3 bucket policy allowing GetObject, PutObject, and ListBucket permissions to the Amazon SageMaker notebook ARN as principal. B.. Encrypt the objects in the S3 bucket with a custom AWS Key Management Service (AWS KMS) key that only the notebook owner has access to. C.. Attach the policy to the IAM role associated with the notebook that allows GetObject, PutObject, and ListBucket operations to the specific S3 bucket. D.. Use a script in a lifecycle configuration to configure the AWS CLI on the instance with an access key ID and secret.
C - Agree with the Answer C. Attach the policy to the IAM roal associated with the notebook. c is the right answer Amazon SageMaker notebook ARN , I don't think there is such a thing. So A is not right . So C C. Attach policy to IAM role associated with the notebook: This is a standard and recommended approach in AWS. By attaching a policy to the IAM role that the SageMaker notebook instance assumes, you can precisely control the notebook's access to the specific S3 bucket. This method follows the AWS best practice of using IAM roles for managing permissions and also allows for easier management and scalability. A. Add an S3 bucket policy: This approach involves modifying the S3 bucket policy to grant permissions directly to the SageMaker notebook instance's ARN. While this method can effectively grant access, it is less flexible and scalable compared to using IAM roles. It directly ties the bucket's access policy to a specific resource (the notebook instance), which might not be ideal for managing access in a larger environment. The best way for the data scientist to securely access data stored in a specific Amazon S3 bucket from an Amazon SageMaker notebook instance is option C, attach the policy to the IAM role associated with the notebook that allows GetObject, PutObject, and ListBucket operations to the specific S3 bucket. By doing so, the data scientist can use IAM role-based access control to grant permissions to the notebook instance to access the S3 bucket without exposing any credentials or keys. The data scientist can also limit the scope of the permissions to only the necessary operations and resources, following the principle of least privilege. Option A suggests adding an S3 bucket policy, but it is not the recommended way to grant permissions to specific IAM roles associated with SageMaker notebook instances. Bucket policies are generally used for granting cross-account access or public access, not for specifying access for specific IAM roles. An IAM policy cannot attach to an ARN. An IAM policy can only attach to an IAM role or an IAM user. So the answer is C A - we allow access to specific notebook. AIM role policy can be global and related to all user notebooks. On the other hand, in C they state "specific S3 bucket" and in the A - only "an S3 bucket". Maybe in A they add global policy to allow access to all S3 buckets? AC are both correct answer, but A is better than C, mostly due to the limitation of IAM policy. IAM policies: The maximum size of an IAM policy document is 6,144 characters. You can attach up to 10 policies to an IAM user, role, or group. Option C ensures that the notebook instance is granted permission to access the S3 bucket without the need to provide credentials. Option A is incorrect because it suggests adding a bucket policy that grants permission to a specific IAM principal, which is less secure than granting permission to an IAM role. I dont agree with this. Restrict bucket access only to limited principal is much secure than grant specific IAM prinicap. Restrict specific principal eliminate other visits, but grant specific IAM user permission does not exclude other visit. 12-sep exam C is correct Quoting the book "Data Science on AWS": "Generally, we would use IAM identity-based policies if we need to define permissions for more than just S3, or if we have a number of S3 buckets, each with different permissions requirements. We might want to keep access control policies in the IAM environment. 
We would use S3 bucket policies if we need a simple way to grant cross-account access to our S3 environment without using IAM roles, or if we reach the size limit for our IAM policy. We might want to keep access control policies in the S3 environment." A would be the choice then. For A, only some operations are allowed; no specific users or roles have been granted permission for these operations. I am not sure, but in the question we don't have a cross-account situation? Based on this logic, indeed A would be better. B is the answer. Only "securely access" is required, not encryption. A - for me - https://www.examtopics.com/discussions/amazon/view/75091-exam-aws-certified-machine-learning-specialty-topic-1/
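A minimal sketch of option C with boto3: an inline policy scoped to one bucket, attached to the notebook's execution role; the bucket and role names are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

bucket = "my-data-bucket"  # hypothetical bucket name
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        },
    ],
}

# Attach as an inline policy on the notebook's SageMaker execution role.
iam.put_role_policy(
    RoleName="SageMakerNotebookExecutionRole",  # hypothetical role name
    PolicyName="SpecificBucketAccess",
    PolicyDocument=json.dumps(policy_document),
)
```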
149
149 - A company is launching a new product and needs to build a mechanism to monitor comments about the company and its new product on social media. The company needs to be able to evaluate the sentiment expressed in social media posts, and visualize trends and configure alarms based on various thresholds. The company needs to implement this solution quickly, and wants to minimize the infrastructure and data science resources needed to evaluate the messages. The company already has a solution in place to collect posts and store them within an Amazon S3 bucket. What services should the data science team use to deliver this solution? - A.. Train a model in Amazon SageMaker by using the BlazingText algorithm to detect sentiment in the corpus of social media posts. Expose an endpoint that can be called by AWS Lambda. Trigger a Lambda function when posts are added to the S3 bucket to invoke the endpoint and record the sentiment in an Amazon DynamoDB table and in a custom Amazon CloudWatch metric. Use CloudWatch alarms to notify analysts of trends. B.. Train a model in Amazon SageMaker by using the semantic segmentation algorithm to model the semantic content in the corpus of social media posts. Expose an endpoint that can be called by AWS Lambda. Trigger a Lambda function when objects are added to the S3 bucket to invoke the endpoint and record the sentiment in an Amazon DynamoDB table. Schedule a second Lambda function to query recently added records and send an Amazon Simple Notification Service (Amazon SNS) notification to notify analysts of trends. C.. Trigger an AWS Lambda function when social media posts are added to the S3 bucket. Call Amazon Comprehend for each post to capture the sentiment in the message and record the sentiment in an Amazon DynamoDB table. Schedule a second Lambda function to query recently added records and send an Amazon Simple Notification Service (Amazon SNS) notification to notify analysts of trends. D.. Trigger an AWS Lambda function when social media posts are added to the S3 bucket. Call Amazon Comprehend for each post to capture the sentiment in the message and record the sentiment in a custom Amazon CloudWatch metric and in S3. Use CloudWatch alarms to notify analysts of trends.
D - D is the correct answer. Following from the previous comment. The company wants to minimize the infrastructure and data science resources needed to evaluate the messages. Therefore any custom services would be eliminated (A and B). Similarly DynamoDB would add complexity to the infrastructure there C is eliminated. leaving D Recording Sentiment in cloudwatch metric seems odd. DynamoDB seems more accurate. Option D is the right answer. Following are the key terms in question to notice, sentiment expressed in social media posts --> Comprehend configure alarms based on various thresholds --> CloudWatch (can send alerts without SNS) wants to minimize the infrastructure and data science resources --> AWS S3 The best services for the data science team to use to deliver this solution are option D, trigger an AWS Lambda function when social media posts are added to the S3 bucket, call Amazon Comprehend for each post to capture the sentiment in the message and record the sentiment in a custom Amazon CloudWatch metric and in S3, and use CloudWatch alarms to notify analysts of trends. By doing so, the data science team can use Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights and relationships in text, to evaluate the sentiment expressed in social media posts. Amazon Comprehend can detect positive, negative, neutral, or mixed sentiment from text input. The data science team can also use AWS Lambda, a service that lets you run code without provisioning or managing servers, to trigger a function when posts are added to the S3 bucket and call Amazon Comprehend for each post. Amazingly D is possible - https://catalog.us-east-1.prod.workshops.aws/workshops/4faab440-8c3a-4527-bd11-0c88a6e6213c/en-US/30-build-the-application/400-send-sentiment-to-cloudwatch I was so sure of option C, because sending a sentiment to a custom CloudWatch metric just didn't make any sense. But you learn something new everyday. This is a puzzling question, as both answers C and D miss essential steps: C is missing DynamoDB Streams to capture new records D is missing a notification mechanism like SNS, as CloudWatch Alarms alone can only be used as a trigger, but are not sufficient for notification I agree that A and B should be eliminated for requiring data science development I also do agree that D is correct answer. In A, why we are adding extra dependency of Dynamo DB. D, blazing text is not for sentiment analysis. The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. BlazingText can do sentiment analysis: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html - https://www.examtopics.com/discussions/amazon/view/74079-exam-aws-certified-machine-learning-specialty-topic-1/
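To illustrate option D, a sketch of the S3-triggered Lambda handler that calls Comprehend and publishes a custom CloudWatch metric; the namespace and metric name are hypothetical, and each S3 object is assumed to be one plain-text post:

```python
import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        post = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Truncate to stay under the detect_sentiment size limit (rough, chars vs bytes).
        result = comprehend.detect_sentiment(Text=post[:5000], LanguageCode="en")
        negative_score = result["SentimentScore"]["Negative"]

        # Custom metric; a CloudWatch alarm on this metric notifies analysts of trends.
        cloudwatch.put_metric_data(
            Namespace="SocialMedia",  # hypothetical namespace/metric name
            MetricData=[{"MetricName": "NegativeSentiment", "Value": negative_score, "Unit": "None"}],
        )
```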
150
150 - A bank wants to launch a low-rate credit promotion. The bank is located in a town that recently experienced economic hardship. Only some of the bank's customers were affected by the crisis, so the bank's credit team must identify which customers to target with the promotion. However, the credit team wants to make sure that loyal customers' full credit history is considered when the decision is made. The bank's data science team developed a model that classifies account transactions and understands credit eligibility. The data science team used the XGBoost algorithm to train the model. The team used 7 years of bank transaction historical data for training and hyperparameter tuning over the course of several days. The accuracy of the model is sufficient, but the credit team is struggling to explain accurately why the model denies credit to some customers. The credit team has almost no skill in data science. What should the data science team do to address this issue in the MOST operationally efficient manner? - A.. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Deploy the model at an endpoint. Enable Amazon SageMaker Model Monitor to store inferences. Use the inferences to create Shapley values that help explain model behavior. Create a chart that shows features and SHapley Additive exPlanations (SHAP) values to explain to the credit team how the features affect the model outcomes. B.. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Activate Amazon SageMaker Debugger, and configure it to calculate and collect Shapley values. Create a chart that shows features and SHapley Additive exPlanations (SHAP) values to explain to the credit team how the features affect the model outcomes. C.. Create an Amazon SageMaker notebook instance. Use the notebook instance and the XGBoost library to locally retrain the model. Use the plot_importance() method in the Python XGBoost interface to create a feature importance chart. Use that chart to explain to the credit team how the features affect the model outcomes. D.. Use Amazon SageMaker Studio to rebuild the model. Create a notebook that uses the XGBoost training container to perform model training. Deploy the model at an endpoint. Use Amazon SageMaker Processing to post-analyze the model and create a feature importance explainability chart automatically for the credit team.
B - B, SageMaker Model Debugger is used to generate SHAP values https://aws.amazon.com/blogs/machine-learning/ml-explainability-with-amazon-sagemaker-debugger/ I believe C is the right answer, it is simpler and more accurate than B. It will show only importance of features not their contribution to the final score The problem is at inference time, not training time. So its A I hesitate between A and B... In the question, the credit team wants to understand the reason why the model denies credit at inference time, not at training time... Sagemaker Model Monitor compute SHAP values at inference time while Sagemaker Debugger compute SHAP values at training time... I'm leading more for A as an answer. The best option is to use Amazon SageMaker Studio to rebuild the model and deploy it at an endpoint. Then, use Amazon SageMaker Model Monitor to store inferences and use the inferences to create Shapley values that help explain model behavior. Shapley values are a way of attributing the contribution of each feature to the model output. They can help the credit team understand why the model makes certain decisions and how the features affect the model outcomes. A chart that shows features and SHapley Additive exPlanations (SHAP) values can be created using the SHAP library in Python. This option is the most operationally efficient because it leverages the existing XGBoost training container and the built-in capabilities of Amazon SageMaker Model Monitor and SHAP library. A. NO - too complicated to compute SHAP B. YES - Debugger supports built-in SHAP C. NO - too complicated to compute SHAP D. NO - too complicated to compute SHAP Option B utilizes Amazon SageMaker Studio to build and train the model, and it also activates Amazon SageMaker Debugger, which allows calculating and collecting Shapley values. These Shapley values will help explain accurately why the model denies credit to certain customers. Generating a chart that displays the features and their SHAP values will provide a visual and clear explanation of the impact of each feature on the model's decisions, making it easier for the credit team with limited data science skills to understand. Either A or B Sage Maker Monitor require no experience so A is preferred while B can provide more details but depend if require knowledge to use it. More towards B SageMaker Model Monitor is a tool that helps monitor the quality of model predictions over time by analyzing data inputs and outputs during inference. It can detect and alert when data drift or concept drift occurs, and can identify features that are most responsible for the changes in model behavior. Model Monitor can be used to continuously monitor and improve model performance, and can be integrated with SageMaker endpoints or SageMaker Pipelines. SageMaker Debugger is a tool that helps debug machine learning models during training by analyzing the internal states of the model, such as weights and gradients, as well as the data inputs and outputs during training. It can detect and alert when common training issues occur, such as overfitting or underfitting, and can identify the root causes of these issues. Debugger can be used to improve model accuracy and convergence, and can be integrated with SageMaker training jobs. After reconsideration, it is actually B. 
https://aws.amazon.com/blogs/machine-learning/ml-explainability-with-amazon-sagemaker-debugger/ Debugger, because we are in the context of training data. There are so many explanations, but most of them are superficial, focusing only on which service is related to SHAP. This is the only one that really addresses the difference between the options. 1. Both SageMaker Model Monitor and Debugger can explain a model and generate SHAP values, so it should be either A or C. 2. Monitor is about inference. After deploying the model, we may find some attributes start to contribute more to the model, contradicting the training dataset; in that case we use SageMaker Model Monitor. But our problem is not about deployment, it is still at the training stage. We only want to figure out why some customers with specific characteristics are more likely to get a loan, in other words, which features contribute more to the prediction. It is C !!!! If you don't fully understand the question, stop explaining !!! It is not a comparison between A and C; it should be A and B. C is the most straightforward and simpler. Why not C? 'C' is the easiest way to find out! This is Debugger's work. Option A suggests using Amazon SageMaker Model Monitor to store inferences and create Shapley values that can help explain the model's behavior. This option can be more operationally efficient because it doesn't require the credit team to understand the complexities of Shapley values, and it doesn't necessarily slow down the model's inference time. After a review, I go with option B - https://www.examtopics.com/discussions/amazon/view/74996-exam-aws-certified-machine-learning-specialty-topic-1/
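For option B, a sketch of enabling the SHAP-related Debugger collections on the XGBoost training job; the image URI, role, and S3 path are placeholders, and the collection names follow the blog post linked above:

```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
from sagemaker.estimator import Estimator

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical
xgboost_image = "<xgboost-training-image-uri>"                   # the XGBoost container used for retraining

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger/",   # hypothetical output location
    collection_configs=[
        # Built-in XGBoost collections that capture SHAP values and feature importance.
        CollectionConfig(name="average_shap", parameters={"save_interval": "5"}),
        CollectionConfig(name="feature_importance", parameters={"save_interval": "5"}),
    ],
)

estimator = Estimator(
    image_uri=xgboost_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    debugger_hook_config=hook_config,
)
# estimator.fit({"train": ..., "validation": ...}) would then emit tensors that can be
# read with smdebug to build the SHAP chart for the credit team.
```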
151
151 - A data science team is planning to build a natural language processing (NLP) application. The application's text preprocessing stage will include part-of-speech tagging and key phase extraction. The preprocessed text will be input to a custom classification algorithm that the data science team has already written and trained using Apache MXNet. Which solution can the team build MOST quickly to meet these requirements? - A.. Use Amazon Comprehend for the part-of-speech tagging, key phase extraction, and classification tasks. B.. Use an NLP library in Amazon SageMaker for the part-of-speech tagging. Use Amazon Comprehend for the key phase extraction. Use AWS Deep Learning Containers with Amazon SageMaker to build the custom classifier. C.. Use Amazon Comprehend for the part-of-speech tagging and key phase extraction tasks. Use Amazon SageMaker built-in Latent Dirichlet Allocation (LDA) algorithm to build the custom classifier. D.. Use Amazon Comprehend for the part-of-speech tagging and key phase extraction tasks. Use AWS Deep Learning Containers with Amazon SageMaker to build the custom classifier.
D - I will go with A. Refer to link: https://aws.amazon.com/comprehend/features/ Whoever selects A misunderstands "Custom classification": it is a model for custom classification, not for submitting your own script!!!! And for the reply above with the document, read the document first. Agree. A is my answer. 1. Part-of-speech tagging: https://docs.aws.amazon.com/comprehend/latest/dg/API_PartOfSpeechTag.html 2. Key phrase extraction: https://docs.aws.amazon.com/comprehend/latest/dg/how-key-phrases.html 3. Custom classification algorithm: https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html D is the answer. Using Apache MXNet rules out Comprehend for the classification task. Any reference? "Automatically improve performance with optimized model training for popular frameworks like TensorFlow, PyTorch, and Apache MXNet." https://aws.amazon.com/cn/machine-learning/containers/ Preprocessing using Comprehend, then use the preprocessed text as input to the custom classifier. Amazon Comprehend can't bring your own model; the "custom classification" feature means that you can train a classifier on the service with your own data, not bring your own model. So the answer is definitely D. Amazon Comprehend is a natural language processing (NLP) service that can perform part-of-speech tagging and key phrase extraction tasks. AWS Deep Learning Containers are Docker images that are pre-installed with popular deep learning frameworks such as Apache MXNet. Amazon SageMaker is a fully managed service that can help build, train, and deploy machine learning models. Using Amazon Comprehend for the text preprocessing tasks and AWS Deep Learning Containers with Amazon SageMaker to build the custom classifier is the solution that can be built most quickly to meet the requirements. References: Amazon Comprehend, AWS Deep Learning Containers. The custom classification in AWS Comprehend cannot choose an algorithm; you cannot use your own algorithm in it, you only feed a dataset to it. So A is wrong. The data science team wants to use their own MXNet model, so D. Will go with A. A is the quickest solution. We have to solve two NLP problems: part-of-speech tagging and key phrase extraction. Note that the custom classifier already exists and has been trained! The question asks that it be done as quickly as possible, so the idea is to use a ready-made service. Letter A is wrong, as it uses another service instead of the already created model to classify. Letter B requires development and therefore would not be the fastest solution. Letter C is wrong for the same reason as Letter A; in addition it proposes an unsupervised service (LDA) for a supervised problem. Letter D is correct. Therefore, option D is the most efficient solution for building an NLP application that meets the requirements of the data science team. Quickest: A. Latest is A. The other MXNet model is the key. Option D is the most appropriate answer, given that the team has already written and trained a custom classification algorithm using Apache MXNet. Option D allows the team to use Amazon Comprehend for part-of-speech tagging and key phrase extraction, while also using AWS Deep Learning Containers with Amazon SageMaker to build and deploy the custom classifier. D for me. Question says "The preprocessed text WILL be input to a custom classification algorithm that the data science team has already written and trained using Apache MXNet". So for some reason they want to use MXNet to do the classification, not Amazon Comprehend. 
So using MXNet for classification is a part of their requirement. How do we meet these requirements quickly? Well, use Amazon Comprehend for part-of-speech and key phrase tasks; and use container for the MXNet stuff. I had selected "A" in my first go, thanks for understanding the question. Although, comprehend does all three, since they have already built custom classification, we only need to provide solution for first two. D for me too. The question did not make it clear whether the new solution has to use the custom model that the team built or not. A for me Agreed with A, Comprehend 3 functions - https://www.examtopics.com/discussions/amazon/view/75436-exam-aws-certified-machine-learning-specialty-topic-1/
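For readers who want to see what option D looks like in code, here is a minimal sketch of the preprocessing half using the boto3 Comprehend client. The sample text and the idea of feeding the output to the already-trained MXNet classifier hosted in an AWS Deep Learning Container are illustrative assumptions, not part of the original question.

```python
import boto3

comprehend = boto3.client("comprehend")

text = "The data science team is building an NLP application on AWS."  # placeholder

# Part-of-speech tagging: each token comes back with a PartOfSpeech tag.
syntax = comprehend.detect_syntax(Text=text, LanguageCode="en")
pos_tags = [(t["Text"], t["PartOfSpeech"]["Tag"]) for t in syntax["SyntaxTokens"]]

# Key phrase extraction: noun phrases with confidence scores.
phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
key_phrases = [p["Text"] for p in phrases["KeyPhrases"]]

print(pos_tags)
print(key_phrases)
# The resulting features would then be passed to the team's existing MXNet
# classifier, hosted with an AWS Deep Learning Container on SageMaker.
```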
152
152 - A machine learning (ML) specialist must develop a classification model for a financial services company. A domain expert provides the dataset, which is tabular with 10,000 rows and 1,020 features. During exploratory data analysis, the specialist finds no missing values and a small percentage of duplicate rows. There are correlation scores of > 0.9 for 200 feature pairs. The mean value of each feature is similar to its 50th percentile. Which feature engineering strategy should the ML specialist use with Amazon SageMaker? - A.. Apply dimensionality reduction by using the principal component analysis (PCA) algorithm. B.. Drop the features with low correlation scores by using a Jupyter notebook. C.. Apply anomaly detection by using the Random Cut Forest (RCF) algorithm. D.. Concatenate the features with high correlation scores by using a Jupyter notebook.
A - Dimensions are too high. Use PCA A should be the answer to avoid the curse of dimensionality Easy choice. Always choose PCA for dim reduction the best feature engineering strategy for the ML specialist to use with Amazon SageMaker is to apply dimensionality reduction by using the PCA algorithm. Selected Answer: A Given that the dataset has 1,020 features and 200 of them are highly correlated, it is likely that the dataset suffers from multicollinearity. In such cases, dimensionality reduction techniques like principal component analysis (PCA) can be used to transform the data into a lower dimensional space without losing much information. Therefore, option A, "Apply dimensionality reduction by using the principal component analysis (PCA) algorithm" is the most appropriate feature engineering strategy for the ML specialist to use with Amazon SageMaker. This would help reduce the computational complexity of the model, improve model performance, and help to avoid overfitting. A. Apply dimensionality reduction by using the principal component analysis (PCA) algorithm. Since the dataset has many features, and a significant number of them have high correlation scores, the model may suffer from the curse of dimensionality. To reduce the dimensionality of the dataset, the specialist can use a technique like PCA, which reduces the number of features while still retaining the maximum amount of information. PCA can help remove redundant features and improve the model's performance by reducing the chances of overfitting. Additionally, since there are no missing values and a small percentage of duplicate rows, no data cleaning techniques like anomaly detection or dropping the features are required. Concatenating features with high correlation scores is not an appropriate strategy since it may lead to collinearity issues. A PCA: PCA is a linear dimensionality reduction technique (algorithm) that transforms a set of correlated variables (p) into a smaller k (k
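A minimal sketch of option A with the SageMaker built-in PCA estimator, assuming a placeholder IAM role, a random stand-in array for the 10,000 x 1,020 table, and arbitrary instance and component settings.

```python
import numpy as np
import sagemaker
from sagemaker import PCA

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Stand-in for the real 10,000-row, 1,020-feature dataset (must be float32).
X = np.random.rand(10000, 1020).astype("float32")

pca = PCA(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_components=50,          # keep the top 50 principal components
    sagemaker_session=session,
)

# Amazon-algorithm estimators consume RecordSet objects built from numpy arrays.
records = pca.record_set(X)
pca.fit(records, mini_batch_size=500)
```

The transformed components would then replace the original correlated features as model input.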
153
153 - A manufacturing company asks its machine learning specialist to develop a model that classifies defective parts into one of eight defect types. The company has provided roughly 100,000 images per defect type for training. During the initial training of the image classification model, the specialist notices that the validation accuracy is 80%, while the training accuracy is 90%. It is known that human-level performance for this type of image classification is around 90%. What should the specialist consider to fix this issue? - A.. A longer training time B.. Making the network larger C.. Using a different optimizer D.. Using some form of regularization
D - D - over fitting problem. The specialist should consider using some form of regularization to fix this issue. Regularization techniques such as dropout or L2 regularization can help prevent overfitting, which can occur when the model performs well on the training data but poorly on the validation data. Option A, a longer training time, might not necessarily fix the issue and could lead to overfitting if the model is already performing well on the training data. Option B, making the network larger, could also lead to overfitting and may not be necessary if the current network architecture is sufficient to perform the classification task. Option C, using a different optimizer, might not necessarily fix the issue and could lead to slower convergence or worse performance. Therefore, option D, using some form of regularization, is the most appropriate solution to consider in this situation. some form of regularization I wouldn't go with D since it doesn't seem an overfitting problem considering training accuracy is not so high. So the main problem here is to get an higher accuracy even on training set. I would go with A or B A - IMO it's an underfitting problem, as training accuracy is not better than baseline error (human accuracy). Would consider B as well, but it may actually decrease accuracy. typical overfitting problem typical overfitting problem C - It is not a overfitting problem as the training accuracy stands at 90%, which is at same level of human performance. That means the algorithm used is not optimized for this problem. So, some other algorithm should applied for this problem. I'd go A. Regularization could not guarantee higher validation accuracy. I believe answer is B , because clearly it is a overfiting problem , if we reduce complexity the accurate will reduce close to 80% ... But human works can reach up to 90% . I mean looks like a overfitting problem.... - https://www.examtopics.com/discussions/amazon/view/75020-exam-aws-certified-machine-learning-specialty-topic-1/
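Since the discussion centers on dropout and L2 regularization, here is a generic MXNet Gluon sketch of those two levers. The SageMaker built-in image classification algorithm exposes hyperparameters rather than arbitrary layers, so this only illustrates the concept; the architecture and values are made up.

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn

# A small CNN with dropout layers; dropout randomly zeroes activations during
# training, which discourages the network from memorizing the training images.
net = nn.Sequential()
net.add(
    nn.Conv2D(channels=32, kernel_size=3, activation="relu"),
    nn.MaxPool2D(pool_size=2),
    nn.Dropout(0.25),
    nn.Flatten(),
    nn.Dense(128, activation="relu"),
    nn.Dropout(0.5),
    nn.Dense(8),   # eight defect classes
)
net.initialize(mx.init.Xavier())

# L2 regularization (weight decay) is applied through the trainer's 'wd' option.
trainer = gluon.Trainer(
    net.collect_params(),
    "sgd",
    {"learning_rate": 0.01, "wd": 1e-4},
)
```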
154
154 - A machine learning specialist needs to analyze comments on a news website with users across the globe. The specialist must find the most discussed topics in the comments that are in either English or Spanish. What steps could be used to accomplish this task? (Choose two.) - A.. Use an Amazon SageMaker BlazingText algorithm to find the topics independently of language. Proceed with the analysis. B.. Use an Amazon SageMaker seq2seq algorithm to translate from Spanish to English, if necessary. Use a SageMaker Latent Dirichlet Allocation (LDA) algorithm to find the topics. C.. Use Amazon Translate to translate from Spanish to English, if necessary. Use Amazon Comprehend topic modeling to find the topics. D.. Use Amazon Translate to translate from Spanish to English, if necessary. Use Amazon Lex to extract topics from the content. E.. Use Amazon Translate to translate from Spanish to English, if necessary. Use Amazon SageMaker Neural Topic Model (NTM) to find the topics.
C - C and E B needs to build custom model The SageMaker seq2seq algorithm is a supervised learning algorithm. And it needs to train then translate. translate can directly use to translate from Spanish to English The question did not say you cannot build a custom model. They have a ML specialist, so building a custom model shouldn't be a problem. It asked 2 answers, but I can see only one answer. Please advise. Thanks! (Choose two.) C,E C E - Amazon Translate can handle the translation from Spanish to English, ensuring all comments are in a single language. Amazon Comprehend provides robust topic modeling capabilities to identify the most discussed topics in the translated comments. Use Amazon Translate to translate from Spanish to English, if necessary. Use Amazon SageMaker Neural Topic Model (NTM) to find the topics. Amazon SageMaker Neural Topic Model (NTM) is an unsupervised learning algorithm designed for topic modeling, which can effectively identify topics in the translated comments CE are the answers. C use Amazon Comprehend for topic modeling - use LDA (https://docs.aws.amazon.com/comprehend/latest/dg/topic-modeling.html) E is using NTM (https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html) For ease of use, I will start with Amazon Comprehend and go to NTM if the success criteria isn't met. C, E is the correct answer. C & E based on comments, but you are not allowed to select multiple choices. C and E - Use translate so that text is in common language - In options with translate only Comprehend and NTM allow for topic modeling (C & E) Other options Blazingtext is for text classification, not topic modelling, LDA is requireres user specified topics and Lex is for conversational interfaces C & E Option C (Amazon Translate and Amazon Comprehend): This is a strong combination. Amazon Translate can be used to translate Spanish comments into English, and then Amazon Comprehend, which supports topic modeling, can be used to identify the most discussed topics. Option E (Amazon Translate and Amazon SageMaker Neural Topic Model): This is also a viable combination. Amazon Translate would handle the translation of Spanish comments, and the Neural Topic Model (NTM) in Amazon SageMaker can then be used for topic modeling. NTM uses neural networks for topic discovery and is well-suited for analyzing large sets of text data. B and E I dont think amazon comprehend can do topic modelling. LDA is used for topic modelling BCE are all right. https://docs.amazonaws.cn/en_us/sagemaker/latest/dg/algos.html LDA and NTM are all topic modeling tools. A. NO - BlazingText is word2vec, will not do topic modeling alone B. NO - Translate better than custom seq2seq C. NO - NTM better than LDA used by Comprehend D. NO - Lex is for chatbots E. YES The right answers are C & E The other steps are not suitable because: A. The BlazingText algorithm is for word embeddings and text classification, not topic modeling. B. The LDA algorithm is an unsupervised learning algorithm that requires a user-specified number of topics. D. Amazon Lex is for building conversational interfaces, not extracting topics from content Correct answer is BE It has to be B + C .. for spanish to English use Translate. For Topics it has to be LDA. Sorry and NTM.. in that case, C is a winner for translation.. then pick E to be consistent.. final answer B + E. For me: B - C - E are correct: it's solved translation + topic modelling. The question is not well construct from my POV. 
C and E - https://www.examtopics.com/discussions/amazon/view/74991-exam-aws-certified-machine-learning-specialty-topic-1/
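A minimal sketch of the C/E pipeline with boto3: translate a comment to English, then start a Comprehend topic modeling batch job over the translated corpus. The sample comment, S3 URIs, role ARN, and topic count are placeholders.

```python
import boto3

translate = boto3.client("translate")
comprehend = boto3.client("comprehend")

comment = "Me encanta esta noticia sobre el nuevo estadio."  # placeholder comment

# Step 1: translate Spanish comments to English; English comments pass through.
result = translate.translate_text(
    Text=comment,
    SourceLanguageCode="auto",   # let the service detect English vs. Spanish
    TargetLanguageCode="en",
)
english_text = result["TranslatedText"]

# Step 2 (option C): run Comprehend topic modeling as an asynchronous batch job
# over the translated comments previously written to S3.
comprehend.start_topics_detection_job(
    InputDataConfig={
        "S3Uri": "s3://example-bucket/translated-comments/",        # placeholder
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://example-bucket/topic-output/"},  # placeholder
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",  # placeholder
    NumberOfTopics=20,
)
```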
155
155 - A machine learning (ML) specialist is administering a production Amazon SageMaker endpoint with model monitoring configured. Amazon SageMaker Model Monitor detects violations on the SageMaker endpoint, so the ML specialist retrains the model with the latest dataset. This dataset is statistically representative of the current production traffic. The ML specialist notices that even after deploying the new SageMaker model and running the first monitoring job, the SageMaker endpoint still has violations. What should the ML specialist do to resolve the violations? - A.. Manually trigger the monitoring job to re-evaluate the SageMaker endpoint traffic sample. B.. Run the Model Monitor baseline job again on the new training set. Configure Model Monitor to use the new baseline. C.. Delete the endpoint and recreate it with the original configuration. D.. Retrain the model again by using a combination of the original training set and the new training set.
B - I would go with B: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-create-baseline.html Agree, the answer is B. From the document, the violation file contains several checks and "The violations file is generated as the output of a MonitoringExecution" . https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-interpreting-violations.html. The baseline job computes baseline statistics and constraints for the new training set. By using this updated baseline, Model Monitor can better detect any drift or violations in the production traffic. B. Run the Model Monitor baseline job again on the new training set: This is a key step after retraining the model. Since the model has been retrained with a new dataset, the baseline against which its predictions are compared should also be updated. Running the baseline job again on the new training set and configuring Model Monitor to use this new baseline will ensure that the monitoring is relevant to the current state of the model and the data it's processing. D. Retrain the model again with a combination of the original and new training sets: While retraining the model can be a good approach in some scenarios, there's no indication in this case that the issue lies with the model's performance itself. The issue seems to be with the Model Monitor's baseline not aligning with the current model. https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-interpreting-violations.html running the Model Monitor baseline job again on the new training set and configuring Model Monitor to use the new baseline, is the most appropriate step to resolve the violations and ensure the SageMaker endpoint's performance is in line with expectations. Running the Model Monitor baseline job again with the new training set and configuring Model Monitor to use the new baseline is a valid option to resolve the violations. By running the baseline job with the new training set, a new baseline is created, which can be used to compare with the new data to detect any drifts in the data distribution. Then, the updated baseline can be set as the new baseline for monitoring the endpoint. So, option B is also a valid solution to resolve the violations. - https://www.examtopics.com/discussions/amazon/view/75415-exam-aws-certified-machine-learning-specialty-topic-1/
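A minimal sketch of option B with the SageMaker Python SDK: re-run the baselining job on the new training set so that future monitoring executions compare production traffic against the retrained model's data. The role, S3 URIs, and instance settings are placeholders; after the job finishes, the monitoring schedule would be pointed at the new statistics and constraints.

```python
from sagemaker.model_monitor import DefaultModelMonitor, DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Recompute baseline statistics and constraints from the new training data.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/new-training-data/train.csv",   # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/model-monitor/new-baseline/",      # placeholder
    wait=True,
)
```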
156
156 - A company supplies wholesale clothing to thousands of retail stores. A data scientist must create a model that predicts the daily sales volume for each item for each store. The data scientist discovers that more than half of the stores have been in business for less than 6 months. Sales data is highly consistent from week to week. Daily data from the database has been aggregated weekly, and weeks with no sales are omitted from the current dataset. Five years (100 MB) of sales data is available in Amazon S3. Which factors will adversely impact the performance of the forecast model to be developed, and which actions should the data scientist take to mitigate them? (Choose two.) - A.. Detecting seasonality for the majority of stores will be an issue. Request categorical data to relate new stores with similar stores that have more historical data. B.. The sales data does not have enough variance. Request external sales data from other industries to improve the model's ability to generalize. C.. Sales data is aggregated by week. Request daily sales data from the source database to enable building a daily model. D.. The sales data is missing zero entries for item sales. Request that item sales data from the source database include zero entries to enable building the model. E.. Only 100 MB of sales data is available in Amazon S3. Request 10 years of sales data, which would provide 200 MB of training data for the model.
AC - AC would be my answer. As half the stores have only been open for 6 months, no seasonality would be captured. The aggregation of the daily also removes trends we see during the week which is also not great when we are looking for the daily predicated sales figure B - no reason to assume there is not enough variance D - missing data can be assumed to be 0, no need to ask for empty data E - no reason to ask for two years of data having one already Missing data and achieving daily predictions with weekly data will be issues. I would go for AD A : Many stores have been in business for < 6 months --> unable to capture seasonality D : Zero sales are also sales records and will result in bias if omitted. Since half of the stores are 6 months old seasonality would be a problem for them. instead of omitting weeks with no sales could lead to bias, requesting zero entries will help in predicting better I changed my mind. It should be C and D. Since both of them foundation aspect of training. A as missing seasonality is an issue for the majority of the stores. D as we need to impute zeros as we would otherwise miss data. C won't do anything on performance. The factors that will adversely impact the performance of the forecast model are: Sales data is aggregated by week. This will reduce the granularity and resolution of the data, and make it harder to capture the daily patterns and variations in sales volume. The data scientist should request daily sales data from the source database to enable building a daily model, which will be more accurate and useful for the prediction task. Sales data is missing zero entries for item sales. This will introduce bias and incompleteness in the data, and make it difficult to account for the items that have no demand or are out of stock. The data scientist should request that item sales data from the source database include zero entries to enable building the model, which will be more robust and realistic C. Aggregated Weekly Data: Since the objective is to predict daily sales volume, weekly aggregated data might mask important daily trends and variations. Requesting daily sales data will provide a finer granularity of information that is crucial for building an accurate daily sales prediction model. D. Missing Zero Entries for Item Sales: The omission of weeks with no sales can lead to biased predictions, as the model might not correctly account for periods of no sales. Including zero entries for item sales would provide a more accurate representation of sales patterns, including the absence of sales, which is valuable information for the model. Based on this analysis, the factors that would most adversely impact the model's performance are the aggregated weekly data (Option C) and the omission of weeks with no sales (Option D). A - six months is likely not enough to detect clear seasonality C - Can do weekly from daily but cant reliably do daily from weekly Letters A and C are correct: we want to do a daily model (our base is on weeks) and we need to deal with new stores VS old stores. It is important to emphasize that the letter D also makes sense: we need to know the days when there were no sales, however the way it is written means saving lines (days of sales) with zero in the database, which is not practical. the two factors that will adversely impact the forecast model's performance are seasonality detection for new stores and the aggregation of sales data on a weekly basis. 
The data scientist should request categorical data to relate new stores with historical data and request daily sales data from the source database to build a daily model, respectively, to mitigate these issues effectively. AD. rest makes no sense. A. Since more than half of the stores have been in business for less than 6 months, it will be challenging to detect seasonality patterns for these new stores. Therefore, one solution is to request categorical data to relate new stores with similar stores that have more historical data. This will help the model to identify common patterns and accurately forecast sales for new stores. C. Since the sales data is aggregated by week, it may not be possible to identify daily patterns or trends. Hence, one solution is to request daily sales data from the source database to enable building a daily model. This will help the model to identify daily patterns and improve its forecasting accuracy. I go with CD. How could we ignore the days with 0 sales? The model should be trained so that it can predict 0 sales days as well. B, C, D are possible. A couldn't be an answer because the model must predict daily sales volumes while A says 'Request categorical data'. - https://www.examtopics.com/discussions/amazon/view/74397-exam-aws-certified-machine-learning-specialty-topic-1/
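To make options C and D concrete, here is a small pandas sketch that turns a sparse daily extract into a continuous daily series with explicit zero entries for days without sales; the toy rows are made up.

```python
import pandas as pd

# Toy daily sales extract from the source database (placeholder values).
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"]),
        "store_id": ["S1", "S1", "S1"],
        "item_id": ["A", "A", "A"],
        "units_sold": [3, 5, 2],
    }
)

# Resample to a continuous daily calendar so that days with no sales appear
# explicitly as zeros instead of being omitted from the dataset.
daily = (
    sales.set_index("date")
    .groupby(["store_id", "item_id"])["units_sold"]
    .resample("D")
    .sum()
    .fillna(0)
    .reset_index()
)
print(daily)
```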
157
157 - An ecommerce company is automating the categorization of its products based on images. A data scientist has trained a computer vision model using the Amazon SageMaker image classification algorithm. The images for each product are classified according to specific product lines. The accuracy of the model is too low when categorizing new products. All of the product images have the same dimensions and are stored within an Amazon S3 bucket. The company wants to improve the model so it can be used for new products as soon as possible. Which steps would improve the accuracy of the solution? (Choose three.) - A.. Use the SageMaker semantic segmentation algorithm to train a new model to achieve improved accuracy. B.. Use the Amazon Rekognition DetectLabels API to classify the products in the dataset. C.. Augment the images in the dataset. Use open source libraries to crop, resize, flip, rotate, and adjust the brightness and contrast of the images. D.. Use a SageMaker notebook to implement the normalization of pixels and scaling of the images. Store the new dataset in Amazon S3. E.. Use Amazon Rekognition Custom Labels to train a new model. F.. Check whether there are class imbalances in the product categories, and apply oversampling or undersampling as required. Store the new dataset in Amazon S3.
CDF - B CE is correct. (Option C): Using open source libraries to crop, resize, flip, rotate, and adjust the brightness and contrast of the images can increase the diversity of the training data, helping the model generalize better to new products1. (Option D): Normalizing and scaling the images can help the model learn more effectively by ensuring that the input data is consistent2. (Option F): Addressing class imbalances can prevent the model from being biased towards more frequent classes, improving its overall accuracy. The questions says "The images for each product are classified according to specific product lines." why do we need Amazon Rekognition Custom Labels then? Because the goal is to increase the accuracy of the existing model, not using a built-in service. Option C is correct because augmenting the images in the dataset can help the model learn more features and generalize better to new products. Image augmentation is a common technique to increase the diversity and size of the training data. Option E is correct because Amazon Rekognition Custom Labels can train a custom model to detect specific objects and scenes that are relevant to the business use case. It can also leverage the existing models from Amazon Rekognition that are trained on tens of millions of images across many categories. Option F is correct because class imbalance can affect the performance and accuracy of the model, as it can cause the model to be biased towards the majority class and ignore the minority class. Applying oversampling or undersampling can help balance the classes and improve the model's ability to learn from the data assuming improve accuracy of the (existing) solution Hopefully final answer this time CEF. was initially looking for D but changed to E now C & F for sure the confusion between D and E but lets go for D as E will need more steps The question asks for quick solutions and to improve the classifier's accuracy. Since we want a quick fix, I'm going to avoid solutions that requires a new model implementation. Therefore, the alternatives that can improve the performance of the current classification are: Letter F, C and D. Letters B and E would bring a new development cost from zero and Letter A does not solve the classification problem. NVM, D is wrong! the three steps that would improve the accuracy of the solution are C (data augmentation), D (image normalization and scaling), and F (addressing class imbalances) See community answer is CEF due to images all same dimension so D removed. CEF : C&F for Overfitting; E : "Rekognition DetectLabel" is the general image labeling capability of Amazon Rekognition, which provides predefined labels for common objects and concepts out-of-the-box. On the other hand, "Rekognition Custom Labels" allows you to create custom models to detect specific labels or objects that are not covered by the default labels, CEF better choose This is CDF. No idea why this is unclear here. The problem is about "Overfitting", because the new products doesn't work well. It is not about simply improve model accuracy. C is great answer, augmentation is for overfitting. D is wrong, because normalization of pixel is not for overfitting, and "all images have the same dimensions. no need for scaling, they are already scaled. F is for imbalance data. if the data is imbalanced, they should perform poor on both training and testing data(new product). And the new product should perform bad only on those cold category, not overall poor performance. 
B and E are both about Rekognition: one is Rekognition DetectLabels, a built-in image classification model; the other is Rekognition Custom Labels, a pre-trained model with fine-tuning. You fix the images (C), train Rekognition with those images (E), and finally infer to get the classes (B). An Amazon Rekognition Custom Labels model requires time, expertise, and resources, often taking months to complete. Additionally, it often requires thousands or tens of thousands of hand-labeled images to give the model enough data to make accurate decisions. The solution must be quick. https://aws.amazon.com/rekognition/custom-labels-features/ Sorry, never mind my answer, it's actually CEF. Not F? Think carefully about what imbalanced data is, what its effect is, and whether it only affects new products. CEF is correct. D seems to do nothing for new products - https://www.examtopics.com/discussions/amazon/view/74934-exam-aws-certified-machine-learning-specialty-topic-1/
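A minimal Pillow sketch of the augmentation step in option C (crop, flip, rotate, brightness and contrast jitter); the file names and jitter ranges are arbitrary assumptions.

```python
import random

from PIL import Image, ImageEnhance, ImageOps


def augment(image: Image.Image) -> Image.Image:
    """Apply a random crop, flip, rotation, and brightness/contrast jitter."""
    w, h = image.size

    # Random crop to 90% of the original size, then resize back.
    cw, ch = int(w * 0.9), int(h * 0.9)
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    image = image.crop((left, top, left + cw, top + ch)).resize((w, h))

    # Random horizontal flip and small rotation.
    if random.random() < 0.5:
        image = ImageOps.mirror(image)
    image = image.rotate(random.uniform(-15, 15))

    # Brightness and contrast jitter.
    image = ImageEnhance.Brightness(image).enhance(random.uniform(0.8, 1.2))
    image = ImageEnhance.Contrast(image).enhance(random.uniform(0.8, 1.2))
    return image


# Usage (placeholder path): write several augmented copies per product image.
original = Image.open("product_001.jpg")
for i in range(5):
    augment(original).save(f"product_001_aug_{i}.jpg")
```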
158
158 - A data scientist is training a text classification model by using the Amazon SageMaker built-in BlazingText algorithm. There are 5 classes in the dataset, with 300 samples for category A, 292 samples for category B, 240 samples for category C, 258 samples for category D, and 310 samples for category E. The data scientist shuffles the data and splits off 10% for testing. After training the model, the data scientist generates confusion matrices for the training and test sets. What could the data scientist conclude from these results? [https://www.examtopics.com/assets/media/exam-media/04145/0009500001.png, https://www.examtopics.com/assets/media/exam-media/04145/0009600001.png] - A.. Classes C and D are too similar. B.. The dataset is too small for holdout cross-validation. C.. The data distribution is skewed. D.. The model is overfitting for classes B and E.
A - Isn't it A? the model doesn't classify C & D well. the correct answer should be A, the model is clearly unable to tell C and D apart the reason why B is incorrect is subtle - there is holdout validation or cross-validation, but not holdout cross-validation; while I think it would be more reasonable to use CV with such a small dataset rather than holdout, the answer is mixing terms and therefore should be wrong also, the test set confusion matrix is still pretty comparable to the train set one, so I wouldn't say there is objective evidence to claim holdout is a wrong choice here I would go for A as well. I think option A is correct as C & D are behaving similarly. I think the answer is D. A => C, D are similar in train but the testing results contradict that. There are many As and Bs for C These results indicate that the model is overfitting for classes B and E, meaning that it is memorizing the specific features of these classes in the training data, but failing to capture the general features that are applicable to the test data. Overfitting is a common problem in machine learning, where the model performs well on the training data, but poorly on the test data3. Some possible causes of overfitting are: The model is too complex or has too many parameters for the given data. This makes the model flexible enough to fit the noise and outliers in the training data, but reduces its ability to generalize to new data Actually, both A and D are true. It would be an easy one if we had to choose two answers. But we need to choose only one. So how to make sure that the person who created this question thought about A only? Also if we take a look into the test confusion matrix. We can see that the A class also missed with C class at the same rate as the C and D classes. I would even say that here the model is generally overfitted. I would go for B Also because of random peeking of test set entries, we got the wrong proportions of labels between train and test sets. So the answer can be even C Letter A is correct. The model gets confused between (C) and (D) in training and testing. But on the test set it's even confused between A and C classes Selected Answer: B Hold-out Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is using 80% of data for training and the remaining 20% of the data for testing. Hold-out Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is using 80% of data for training and the remaining 20% of the data for testing. Refere: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f Model in unable to tell c&D D - Training accuracies of B and E are higher than those of test, whereas A has similar accuracy in both. For C and D, test accuracy has actually improved. B and C has below 50% accuracy. D has 98% in train and 86% accuracy in test. And you are telling me, the take away is overfitting of D, Seriously??? I think the answer is A. The model doesn't perform well on class C and D in both training and testing dataset. I don't think B is relevant to the question(cross-validation is not mentioned in the question) What means holdout cross validation. 
It should be either holdout validation or cross-validation; "holdout cross-validation" mixes the two terms. B should be the correct answer - https://www.examtopics.com/discussions/amazon/view/75080-exam-aws-certified-machine-learning-specialty-topic-1/
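For anyone reproducing this analysis, a short scikit-learn sketch that prints the confusion matrix and per-class precision and recall, which is the quickest way to see whether two classes (such as C and D) are being confused with each other. The labels here are random stand-ins, not the actual BlazingText output.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels standing in for the model's train or test predictions.
classes = ["A", "B", "C", "D", "E"]
rng = np.random.default_rng(0)
y_true = rng.choice(classes, size=140)
y_pred = rng.choice(classes, size=140)

# Per-class precision/recall exposes classes the model systematically confuses.
print(confusion_matrix(y_true, y_pred, labels=classes))
print(classification_report(y_true, y_pred, labels=classes))
```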
159
159 - A company that manufactures mobile devices wants to determine and calibrate the appropriate sales price for its devices. The company is collecting the relevant data and is determining data features that it can use to train machine learning (ML) models. There are more than 1,000 features, and the company wants to determine the primary features that contribute to the sales price. Which techniques should the company use for feature selection? (Choose three.) - A.. Data scaling with standardization and normalization B.. Correlation plot with heat maps C.. Data binning D.. Univariate selection E.. Feature importance with a tree-based classifier F.. Data augmentation
BDE - i will go for B, D and E. B and D for me are like doing partial regression and corr plot can actually tell you briefly how well the univariate is correlated with your target and i guess that also apply for D.. and E , feature importance ranking that's what feature selection strategy want from my POV. And for Data Binning is data enrichment just like augmentations , but then the question was saying they want to do feature selection over 1k+ variables which implies they actually care more about which variable(s) can contribute more on determining the price ? B. Correlation plot with heat maps: This technique can be used to identify the relationship between each feature and the target variable (sales price). By creating a correlation plot with heat maps, the company can quickly visualize the strength and direction of the relationship between each feature and the target variable. D. Univariate selection: This technique can be used to select the features that have the strongest relationship with the target variable. It involves analyzing each feature independently and selecting the ones that have the highest correlation with the target variable. E. Feature importance with a tree-based classifier: This technique can be used to determine the most important features that contribute to the target variable. By using a tree-based classifier such as Random Forest or Gradient Boosting, the company can rank the importance of each feature and select the ones that have the highest importance. For feature selection in machine learning, you can use the following techniques: B. Correlation plot with heat maps: Correlation analysis helps identify relationships between features and the target variable. A heat map can visually represent the correlation matrix, helping to identify highly correlated features. D. Univariate selection: Univariate selection methods evaluate the relationship between each feature and the target variable independently. Common techniques include statistical tests such as chi-squared tests, ANOVA, or mutual information. E. Feature importance with a tree-based classifier: Tree-based classifiers, such as decision trees or random forests, can provide feature importance scores. These scores help identify which features contribute the most to the predictive performance of the model. A, C and F are not feature selection techniques. BDE seem to be the only viable feature selection methods here Those are the only ones for FS. the most appropriate feature selection techniques for the company to determine the primary features contributing to the sales price are B (correlation plot with heat maps), D (univariate selection), and E (feature importance with a tree-based classifier). BDE as stated here: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e BDE for me throwing my weight behind B D E. Correlation with heatmaps help us eliminate multicollinearity, Univariate testing helps us see which ones are correlated with the target, same as feature importances of tree-based algorithms. CDF for me - https://www.examtopics.com/discussions/amazon/view/74806-exam-aws-certified-machine-learning-specialty-topic-1/
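A small scikit-learn sketch of the three selected techniques (correlation matrix, univariate selection, tree-based feature importance) on synthetic data. Because the target (sales price) is continuous, a regressor is used here in place of the classifier mentioned in option E; all names and sizes are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the 1,000+ feature dataset and the sales price target.
X, y = make_regression(n_samples=500, n_features=50, n_informative=8, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# (B) Correlation matrix: highly correlated feature pairs are candidates to drop.
corr = X.corr().abs()
high_corr_pairs = [(a, b) for a in corr.columns for b in corr.columns
                   if a < b and corr.loc[a, b] > 0.9]

# (D) Univariate selection: score each feature independently against the target.
selector = SelectKBest(score_func=f_regression, k=10).fit(X, y)
univariate_top = X.columns[selector.get_support()]

# (E) Feature importance from a tree-based model.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importance_top = X.columns[np.argsort(forest.feature_importances_)[::-1][:10]]

print(high_corr_pairs)
print(sorted(univariate_top))
print(sorted(importance_top))
```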
160
160 - A power company wants to forecast future energy consumption for its customers in residential properties and commercial business properties. Historical power consumption data for the last 10 years is available. A team of data scientists who performed the initial data analysis and feature selection will include the historical power consumption data along with data such as weather, the number of individuals on the property, and public holidays. The data scientists are using Amazon Forecast to generate the forecasts. Which algorithm in Forecast should the data scientists use to meet these requirements? - A.. Autoregressive Integrated Moving Average (ARIMA) B.. Exponential Smoothing (ETS) C.. Convolutional Neural Network - Quantile Regression (CNN-QR) D.. Prophet
C - Answer is C, CNN-QR and DeepAR accepts related time series data (weather data, number of people on property, etc.,) Option C: Convolutional Neural Network - Quantile Regression (CNN-QR). This algorithm is well-suited for handling complex datasets with multiple features, such as historical power consumption, weather, number of individuals, and public holidays, providing accurate and robust forecasts. Answer is C, CNN-QR and DeepAR accepts related time series data (weather data, number of people on property, etc.,). Classic forecasting methods, such as ARIMA or exponential smoothing (ETS), fit a single model to each individual time series. In contrast, DeepAR+ creates a global model (one model for all the time series) with the potential benefit of learning across time series. Source: https://aws.amazon.com/blogs/machine-learning/making-accurate-energy-consumption-predictions-with-amazon-forecast/ As per https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-choosing-recipes.html A. NO - no as powerful as NN B. NO - no as powerful as NN C. NO - works best with 100's of time series D. YES - best for strong seasonnability, expected for power Based on this only CNN-QR can accept historical data https://www.examtopics.com/exams/amazon/aws-certified-machine-learning-specialty/view/32/ CNN-QR is a deep learning algorithm that can model complex relationships between the inputs and outputs, such as the weather and public holidays, with historical power consumption data. CNN-QR has been shown to be effective in generating accurate predictions in many different types of forecasting use cases, including demand forecasting. ETS (Exponential Smoothing) is a classical time series algorithm that is often used for forecasting. It can be effective for simple time series data that have regular patterns, but may not be sufficient to handle the complexity of the given data. ARIMA (Autoregressive Integrated Moving Average) is another classical time series algorithm that can model complex patterns in data. However, it may be difficult to use in cases where there are many different inputs and the relationships between the inputs and outputs are complex. ARIMA & ES are both base time series algos that are available. DeeoAR+ & CNN-QR are refined and able to utilize external data as well to complement the time series data available C, as explained here: https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-choosing-recipes.html According to the link below, it is either ARIMA or DeepAR. So A is the answer here https://aws.amazon.com/blogs/machine-learning/making-accurate-energy-consumption-predictions-with-amazon-forecast/ Given the provided data, I would discard A and B. Amazon Forecast CNN-QR, Convolutional Neural Network - Quantile Regression, is a proprietary machine learning algorithm for forecasting scalar (one-dimensional) time series I would choose D, Prophet. https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-prophet.html How Prophet Works Prophet is especially useful for datasets that: Contain an extended time period (months or years) of detailed historical observations (hourly, daily, or weekly) Have multiple strong seasonalities Include previously known important, but irregular, events Have missing data points or large outliers Have non-linear growth trends that are approaching a limit Prophet is an additive regression model with a piecewise linear or logistic growth curve trend. 
It includes a yearly seasonal component modeled using Fourier series and a weekly seasonal component modeled using dummy variables. Prophet won't be able to use the additional related data that is available in the question. Prophet doesn't accept historical related time series, so it won't work here: https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-choosing-recipes.html#comparing-algos - https://www.examtopics.com/discussions/amazon/view/74920-exam-aws-certified-machine-learning-specialty-topic-1/
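A minimal boto3 sketch of option C: creating a Forecast predictor with the CNN-QR algorithm over a dataset group that already contains the target and related time series. The dataset group ARN, horizon, and predictor name are placeholders.

```python
import boto3

forecast = boto3.client("forecast")

# Placeholder ARNs; the dataset group must already contain the target time
# series plus the related time series (weather, occupants, holidays).
forecast.create_predictor(
    PredictorName="energy-cnn-qr",
    AlgorithmArn="arn:aws:forecast:::algorithm/CNN-QR",
    ForecastHorizon=14,
    PerformAutoML=False,
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/energy",  # placeholder
        "SupplementaryFeatures": [{"Name": "holiday", "Value": "US"}],
    },
    FeaturizationConfig={"ForecastFrequency": "D"},
)
```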
161
161 - A company wants to use automatic speech recognition (ASR) to transcribe messages that are less than 60 seconds long from a voicemail-style application. The company requires the correct identification of 200 unique product names, some of which have unique spellings or pronunciations. The company has 4,000 words of Amazon SageMaker Ground Truth voicemail transcripts it can use to customize the chosen ASR model. The company needs to ensure that everyone can update their customizations multiple times each hour. Which approach will maximize transcription accuracy during the development phase? - A.. Use a voice-driven Amazon Lex bot to perform the ASR customization. Create custom slots within the bot that specifically identify each of the required product names. Use the Amazon Lex synonym mechanism to provide additional variations of each product name as mis-transcriptions are identified in development. B.. Use Amazon Transcribe to perform the ASR customization. Analyze the word confidence scores in the transcript, and automatically create or update a custom vocabulary file with any word that has a confidence score below an acceptable threshold value. Use this updated custom vocabulary file in all future transcription tasks. C.. Create a custom vocabulary file containing each product name with phonetic pronunciations, and use it with Amazon Transcribe to perform the ASR customization. Analyze the transcripts and manually update the custom vocabulary file to include updated or additional entries for those names that are not being correctly identified. D.. Use the audio transcripts to create a training dataset and build an Amazon Transcribe custom language model. Analyze the transcripts and update the training dataset with a manually corrected version of transcripts where product names are not being transcribed correctly. Create an updated custom language model.
C - Answer is C. https://aws.amazon.com/blogs/machine-learning/build-a-custom-vocabulary-to-enhance-speech-to-text-transcription-accuracy-with-amazon-transcribe/ Option D involves using the available audio transcripts to create a training dataset and building a custom language model with Amazon Transcribe. This approach provides a high degree of control over the transcription process and the ability to fine-tune the model to the specific vocabulary and pronunciation requirements of the company. Analyzing the transcripts and updating the training dataset with corrected versions is a crucial step in improving transcription accuracy. It enables the model to learn from mistakes and to incorporate the unique spelling and pronunciation of the 200 required product names. Thank you AjoseO for all these detailed explanations! They are very useful! say thank you to chat gpt D is an ideal answer however, the question ask for "The company needs to ensure that everyone can update their customizations multiple times each hour". To retrain custom model each hour when we have changes, will be tedious and time consuming. I go with c, where we can ask everyone to just update the config file. The company requires the correct identification of 200 unique product names, some of which have unique spellings or pronunciations. -> Use the audio transcripts to create a training dataset and build an Amazon Transcribe custom language model. Analyze the transcripts and update the training dataset with a manually corrected version of transcripts where product names are not being transcribed correctly. Create an updated custom language model. why? i think it's C -Creating a custom vocabulary file allows you to explicitly define the correct pronunciation of each product name. -Manually updating the custom vocabulary file based on these observations allows you to continuously improve the ASR system. - As new product names or variations emerge, you can easily add them to the custom vocabulary file without retraining the entire ASR model. D was my initial choice however looking at the requirement "The company needs to ensure that everyone can update their customizations multiple times each hour." I changed my mind due to having to retrain the model with new vocabulary. C gives you the ability to update the vocabulary and have it take effect immediately Answer is C D would required to build a model. It's well known the quantity of products, so it's not necessary. the best approach to maximize transcription accuracy during the development phase is to use the audio transcripts to create a training dataset and build an Amazon Transcribe custom language model. Analyze the transcripts and update the training dataset with a manually corrected version of transcripts where product names are not being transcribed correctly. Create an updated custom language model. Why not D though? I think C is correct. A? any thought? - https://www.examtopics.com/discussions/amazon/view/75078-exam-aws-certified-machine-learning-specialty-topic-1/
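A minimal boto3 sketch of option C: create a custom vocabulary and reference it from a transcription job. The product names, job name, and S3 URI are made up; for phonetic pronunciations (IPA or SoundsLike hints), a vocabulary table file passed via VocabularyFileUri would be used instead of the plain Phrases list.

```python
import boto3

transcribe = boto3.client("transcribe")

# Custom vocabulary with product names (made-up examples). Updating it later
# with update_vocabulary takes effect without retraining any model.
transcribe.create_vocabulary(
    VocabularyName="product-names-v1",
    LanguageCode="en-US",
    Phrases=["Examplify", "HyperWidget-Pro", "Zentrix"],
)

# Reference the vocabulary in each transcription job.
transcribe.start_transcription_job(
    TranscriptionJobName="voicemail-0001",
    LanguageCode="en-US",
    Media={"MediaFileUri": "s3://example-bucket/voicemail/0001.wav"},  # placeholder
    Settings={"VocabularyName": "product-names-v1"},
)
```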
162
162 - A company is building a demand forecasting model based on machine learning (ML). In the development stage, an ML specialist uses an Amazon SageMaker notebook to perform feature engineering during work hours that consumes low amounts of CPU and memory resources. A data engineer uses the same notebook to perform data preprocessing once a day on average that requires very high memory and completes in only 2 hours. The data preprocessing is not configured to use GPU. All the processes are running well on an ml.m5.4xlarge notebook instance. The company receives an AWS Budgets alert that the billing for this month exceeds the allocated budget. Which solution will result in the MOST cost savings? - A.. Change the notebook instance type to a memory optimized instance with the same vCPU number as the ml.m5.4xlarge instance has. Stop the notebook when it is not in use. Run both data preprocessing and feature engineering development on that instance. B.. Keep the notebook instance type and size the same. Stop the notebook when it is not in use. Run data preprocessing on a P3 instance type with the same memory as the ml.m5.4xlarge instance by using Amazon SageMaker Processing. C.. Change the notebook instance type to a smaller general purpose instance. Stop the notebook when it is not in use. Run data preprocessing on an ml.r5 instance with the same memory size as the ml.m5.4xlarge instance by using Amazon SageMaker Processing. D.. Change the notebook instance type to a smaller general purpose instance. Stop the notebook when it is not in use. Run data preprocessing on an R5 instance with the same memory size as the ml.m5.4xlarge instance by using the Reserved Instance option.
C - B is wrong as it says it doesn't take advantage of GPUs I believe answer should be C. a) Initial processing needs less cpu and memory so that can be done on a smaller instance. b) Second operation is memory intensive so instance type should be changed to R5 type instance. It's C: 1. Change the notebook instance to a smaller one (fewer resources and lower cost), ideal for feature engineering work. 2. Move data preprocessing to an ml.r5 instance, which is memory-optimized and therefore better suited to the preprocessing workload. 3. Using Amazon SageMaker Processing to perform preprocessing allows you to allocate resources only when needed (2 hours per day), reducing operational costs. WHY IS NOT D? While using Reserved Instances can reduce costs, it involves a long-term commitment that may not be ideal for variable or seasonal workloads. I'd opt for C. A and B are wrong for obvious reasons. D sounds good but it doesn't have a ML instance and also it's just the development phase and we might not want to reserve an instance for too long. Option A need only one instance all other options talks about 2 instances. so why can't it be A... C Memory-optimized instances means provide a high memory In D they mention reserved instance. so it is costly C is correct. Due that B is wrong, is not to use a GPU Instance based. "Which solution will result in the MOST cost savings" Because of this, D is wrong: are you sure that allocating an instance for months / years for a 2h/day is cost saving? Correct is C offers the best balance of cost savings and resource adequacy for both feature engineering and data preprocessing tasks. It' C, as Reserved Instance no good for only 2 hours of daily work. R instance with processes that uses lot of memory. Reserved instances for less cost Selection of D is totally wrong, because you don't understand what "Reserved Instance" is!!! You cannot reserve a instance only for hours a day!!!! this is like apartment rent, can you just rent an apartment for nap time???? D over C because if the EC2 instance is being used consistently for the same two hours each day, customers could consider using a Reserved Instance with a term of 1 or 3 years and payment option that aligns with their usage pattern. This would provide significant cost savings compared to On-Demand pricing for those two hours each day. is better not use everytime chatgpt, and read AWS documentation about instances. B. C is wrong as ml.r5 is not stopped when not in use IMO D is correct. The reserved instance option for an R5 instance, as in Option D, would provide the greatest cost savings, as reserved instances offer a discounted hourly rate in exchange for a one-time payment for a committed usage term. I think C is correct. It only runs for 2 hours once a day, so RI is wasted. So I think D is wrong. "Scheduled RIs: These are available to launch within the time windows you reserve. This option allows you to match your capacity reservation to a predictable recurring schedule that only requires a fraction of a day, a week, or a month." You have the option to reserve for a fraction of a day. Since the question specify precisely how long the job is, it makes it suitable. https://aws.amazon.com/ec2/pricing/reserved-instances/ 12-sep exam I think it is D. Using RIs the customer can have the greatest cost savings, as stated by the question - https://www.examtopics.com/discussions/amazon/view/74919-exam-aws-certified-machine-learning-specialty-topic-1/
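A minimal sketch of the SageMaker Processing part of option C, running the daily preprocessing on a memory-optimized ml.r5 instance that exists only for the duration of the job. The role, script name, S3 paths, and framework version are placeholders.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# A memory-optimized instance is provisioned for the ~2-hour daily job and
# released afterwards, while the notebook itself runs on a smaller instance.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.r5.4xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # placeholder preprocessing script
    inputs=[ProcessingInput(source="s3://example-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://example-bucket/processed/")],
)
```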
163
163 - A machine learning specialist is developing a regression model to predict rental rates from rental listings. A variable named Wall_Color represents the most prominent exterior wall color of the property. The following is the sample data, excluding all other variables: The specialist chose a model that needs numerical input data. Which feature engineering approaches should the specialist use to allow the regression model to learn from the Wall_Color data? (Choose two.) [https://www.examtopics.com/assets/media/exam-media/04145/0009900001.png] - A.. Apply integer transformation and set Red = 1, White = 5, and Green = 10. B.. Add new columns that store one-hot representation of colors. C.. Replace the color name string by its length. D.. Create three columns to encode the color in RGB format. E.. Replace each color name by its training set frequency.
BE - B, and E (frequency encoding) BD? any thought? I think D cannot be because distances in RGB format are not representive of points. CIELAB correlates numerical color values consistently with human visual perception. These methods ensure that the color data is represented numerically while preserving the information’s integrity and relevance for the regression model Using frequency encoding may help in some contexts but can introduce bias, especially if the frequency of a color is not related to the rental rate. This method does not leverage the actual differences between colors. In this scenario, the specialist should use one-hot encoding and RGB encoding to allow the regression model to learn from the Wall_Color data. One-hot encoding is a technique used to convert categorical data into numerical data. It creates new columns that store one-hot representation of colors. For example, a variable named color has three categories: red, green, and blue. After one-hot encoding RGB encoding can capture the intensity and hue of a color, but it may also introduce correlation among the three columns. Therefore, using both one-hot encoding and RGB encoding can providemore information to the regression model than using either one alone. Here we have a non-ordinal categorical variable to receive a numerical conversion for the model. Letter A is wrong as it is not an ordinal variable. Letter C is wrong as we are not going to retain any significant information for the model. The best solutions would be: Letter B and E. Letter D would be very interesting, but it would generate a problem of information fragmentation: most models consider the variables as being independent of each other, and these 3 columns by definition would not be independent. B, and E (frequency encoding) A+B make sense to me B for sure D. This approach involves breaking down each color into its Red, Green, and Blue components and creating separate columns for each component. This allows the model to capture the information about the intensity of each color component, which can be useful in predicting the target variable. A, C, and E are not suitable for encoding color data in a way that can be used by a regression model. The integer transformation approach in option A arbitrarily assigns values to colors without any meaningful relationship between them. The approach in option C replaces the color names with their length, which does not provide any useful information for the model. Option E replaces each color name with its frequency in the training set, which does not capture any information about the color itself. I think frequency encoding cannot be. What if some colors have same amount of frequency? B. Add new columns that store one-hot representation of colors. One-hot encoding is a common approach to represent categorical variables as numerical values. This approach creates new binary variables for each category and assigns a value of 1 to the corresponding category and 0 to the others. In this case, the specialist can create three new binary variables, one for each color (Red, White, and Green) and use them as input to the regression model. E. Replace each color name by its training set frequency. Another approach to convert categorical variables into numerical ones is to replace each category with its frequency of occurrence in the training set. In this case, the specialist can replace the color names with their respective frequencies (1/3 for Red, 1/3 for White, and 1/3 for Green) to represent them numerically. 
Frequency encoding is a feature engineering technique used to convert categorical variables into numerical ones by replacing each category with the frequency of its occurrence in the training set. This approach can be useful when dealing with high-cardinality categorical variables, which are categorical variables with a large number of distinct categories. These are the only options preserving what "color" is. One-hot encoding is a default standard for any categorical data to be fed to a model that takes in numeric input. RGB format is a good numeric representation of any color by preserving its nature A&B https://victorzhou.com/blog/one-hot/#:~:text=One-Hot%20Encoding%20takes%20a%20single%20integer%20and%20produces,of%20colors%20are%20possible%3A%20red%2C%20green%2C%20or%20blue. B & E. It cannot be A because your URL specifically states that: "This is known as integer encoding. For Machine Learning, this encoding can be problematic - in this example, we’re essentially saying “green” is the average of “red” and “blue”, which can lead to weird unexpected outcomes." Frequency encoding B and E BE is correct. For e, please refer: https://medium.com/analytics-vidhya/different-type-of-feature-engineering-encoding-techniques-for-categorical-variable-encoding-214363a016fb#:~:text=One%20Hot%20Encoding%3A%20%E2%80%94%20In%20this,slows%20down%20the%20learning%20significantly. - https://www.examtopics.com/discussions/amazon/view/75059-exam-aws-certified-machine-learning-specialty-topic-1/
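A small pandas sketch of the two chosen encodings (one-hot and training-set frequency) applied to a toy Wall_Color column; the sample values are made up.

```python
import pandas as pd

listings = pd.DataFrame(
    {"Wall_Color": ["Red", "White", "Green", "White", "Red", "White"]}
)

# (B) One-hot encoding: one binary column per color.
one_hot = pd.get_dummies(listings["Wall_Color"], prefix="Wall_Color")

# (E) Frequency encoding: replace each color with how often it appears in the
# training set (computed on training data only to avoid leakage).
freq = listings["Wall_Color"].value_counts(normalize=True)
listings["Wall_Color_freq"] = listings["Wall_Color"].map(freq)

print(pd.concat([listings, one_hot], axis=1))
```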
164
164 - A data scientist is working on a public sector project for an urban traffic system. While studying the traffic patterns, it is clear to the data scientist that the traffic behavior at each light is correlated, subject to a small stochastic error term. The data scientist must model the traffic behavior to analyze the traffic patterns and reduce congestion. How will the data scientist MOST effectively model the problem? - A.. The data scientist should obtain a correlated equilibrium policy by formulating this problem as a multi-agent reinforcement learning problem. B.. The data scientist should obtain the optimal equilibrium policy by formulating this problem as a single-agent reinforcement learning problem. C.. Rather than finding an equilibrium policy, the data scientist should obtain accurate predictors of traffic flow by using historical data through a supervised learning approach. D.. Rather than finding an equilibrium policy, the data scientist should obtain accurate predictors of traffic flow by using unlabeled simulated data representing the new traffic patterns in the city and applying an unsupervised learning approach.
A - answer : A, because the setting needs multi agents and is constrained with traffic light correlation. A. The data scientist should obtain a correlated equilibrium policy by formulating this problem as a multi-agent reinforcement learning problem. In this scenario, where the traffic behavior at each light is correlated, a multi-agent reinforcement learning (MARL) approach is well-suited to model the problem. In MARL, multiple agents interact with each other and the environment, and their behavior is influenced by the behavior of other agents. This approach is particularly useful in modeling traffic systems, where the behavior of each vehicle is affected by the behavior of other vehicles and traffic lights. Formulating the problem as a MARL problem can help the data scientist obtain a correlated equilibrium policy, which can optimize traffic flow across multiple traffic lights by taking into account the correlations between them. By optimizing traffic flow across all traffic lights in a correlated way, it may be possible to reduce congestion and improve overall traffic efficiency. thank you chatgpt i am wondering how is this actually implemented, i am learning deep RL right now It's too complex problem for supervised or unsupervised. It's a multi-agent problem. Answer is A https://www.researchgate.net/publication/221456376_Multi-Agent_Reinforcement_Learning_for_Simulating_Pedestrian_Navigation - https://www.examtopics.com/discussions/amazon/view/77501-exam-aws-certified-machine-learning-specialty-topic-1/
165
165 - A data scientist is using the Amazon SageMaker Neural Topic Model (NTM) algorithm to build a model that recommends tags from blog posts. The raw blog post data is stored in an Amazon S3 bucket in JSON format. During model evaluation, the data scientist discovered that the model recommends certain stopwords such as "a," "an," and "the" as tags to certain blog posts, along with a few rare words that are present only in certain blog entries. After a few iterations of tag review with the content team, the data scientist notices that the rare words are unusual but feasible. The data scientist also must ensure that the tag recommendations of the generated model do not include the stopwords. What should the data scientist do to meet these requirements? - A.. Use the Amazon Comprehend entity recognition API operations. Remove the detected words from the blog post data. Replace the blog post data source in the S3 bucket. B.. Run the SageMaker built-in principal component analysis (PCA) algorithm with the blog post data from the S3 bucket as the data source. Replace the blog post data in the S3 bucket with the results of the training job. C.. Use the SageMaker built-in Object Detection algorithm instead of the NTM algorithm for the training job to process the blog post data. D.. Remove the stopwords from the blog post data by using the CountVectorizer function in the scikit-learn library. Replace the blog post data in the S3 bucket with the results of the vectorizer.
D - D: Option A, B & C don't make sense. D removes the stop words and help in count vectors D The only solution that solves our problem: remove stopwords ASAP D: Option A, B & C don't make sense. D removes the stop words and help in count vectors D, ChatGPT confirm :) Needs to remove stopwords and the rare worlds are feasible. The stop words need to be removed. The rare words don't need to be removed because it has been found that they are feasible tags. Why not A? check the requirement in question: "the generated model do not include the stopwords" - https://www.examtopics.com/discussions/amazon/view/76486-exam-aws-certified-machine-learning-specialty-topic-1/
166
166 - A company wants to create a data repository in the AWS Cloud for machine learning (ML) projects. The company wants to use AWS to perform complete ML lifecycles and wants to use Amazon S3 for the data storage. All of the company's data currently resides on premises and is 40 TB in size. The company wants a solution that can transfer and automatically update data between the on-premises object storage and Amazon S3. The solution must support encryption, scheduling, monitoring, and data integrity validation. Which solution meets these requirements? - A.. Use the S3 sync command to compare the source S3 bucket and the destination S3 bucket. Determine which source files do not exist in the destination S3 bucket and which source files were modified. B.. Use AWS Transfer for FTPS to transfer the files from the on-premises storage to Amazon S3. C.. Use AWS DataSync to make an initial copy of the entire dataset. Schedule subsequent incremental transfers of changing data until the final cutover from on premises to AWS. D.. Use S3 Batch Operations to pull data periodically from the on-premises storage. Enable S3 Versioning on the S3 bucket to protect against accidental overwrites.
C - https://aws.amazon.com/datasync/faqs/ Based on the answer to "How do I use AWS DataSync to migrate data to AWS?", it's DataSync. All the other options would require manual coding to meet all of the requirements, so DataSync is the best choice.

Option C, using AWS DataSync, is the most appropriate solution. DataSync is a service designed for data transfer between on-premises storage and AWS, and it provides the features the company needs: it can transfer large amounts of data between on-premises storage and Amazon S3, EFS, or FSx for Windows File Server; it is optimized for fast, automated, and secure transfers; and it supports scheduling, monitoring, and data integrity validation. In this scenario, the company wants to transfer and automatically update data between the on-premises object storage and Amazon S3 with support for encryption, scheduling, monitoring, and data integrity validation. DataSync meets all of these requirements: it transfers data over secure network connections, schedules transfers, verifies data integrity, and encrypts data in transit and at rest. - https://www.examtopics.com/discussions/amazon/view/75399-exam-aws-certified-machine-learning-specialty-topic-1/
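A hedged boto3 sketch of the DataSync setup described above. The API calls (create_location_object_storage, create_location_s3, create_task) exist, but every ARN, hostname, and bucket name below is a placeholder, and the on-premises agent is assumed to be deployed already.

import boto3

datasync = boto3.client("datasync")

src = datasync.create_location_object_storage(
    ServerHostname="onprem-object-store.example.com",   # hypothetical on-premises endpoint
    BucketName="ml-data",
    AgentArns=["arn:aws:datasync:us-east-1:111122223333:agent/agent-EXAMPLE"],
)

dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::company-ml-repository",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="onprem-to-s3-ml-data",
    Options={"VerifyMode": "POINT_IN_TIME_CONSISTENT"},  # data integrity validation
    Schedule={"ScheduleExpression": "rate(1 day)"},       # scheduled incremental sync
)
print(task["TaskArn"])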
167
167 - A company has video feeds and images of a subway train station. The company wants to create a deep learning model that will alert the station manager if any passenger crosses the yellow safety line when there is no train in the station. The alert will be based on the video feeds. The company wants the model to detect the yellow line, the passengers who cross the yellow line, and the trains in the video feeds. This task requires labeling. The video data must remain confidential. A data scientist creates a bounding box to label the sample data and uses an object detection model. However, the object detection model cannot clearly demarcate the yellow line, the passengers who cross the yellow line, and the trains. Which labeling approach will help the company improve this model? - A.. Use Amazon Rekognition Custom Labels to label the dataset and create a custom Amazon Rekognition object detection model. Create a private workforce. Use Amazon Augmented AI (Amazon A2I) to review the low-confidence predictions and retrain the custom Amazon Rekognition model. B.. Use an Amazon SageMaker Ground Truth object detection labeling task. Use Amazon Mechanical Turk as the labeling workforce. C.. Use Amazon Rekognition Custom Labels to label the dataset and create a custom Amazon Rekognition object detection model. Create a workforce with a third-party AWS Marketplace vendor. Use Amazon Augmented AI (Amazon A2I) to review the low-confidence predictions and retrain the custom Amazon Rekognition model. D.. Use an Amazon SageMaker Ground Truth semantic segmentation labeling task. Use a private workforce as the labeling workforce.
D - Arguments for A: the data already has labels, so what is needed is to enforce accuracy by reviewing the low-confidence predictions internally. The Rekognition Custom Labels console provides a visual interface for labeling, including bounding boxes (https://aws.amazon.com/rekognition/custom-labels-features/), and on this view semantic segmentation is not wanted because it applies a label to every pixel, whereas labels on bounding boxes would suffice.

Rebuttals: Rekognition is a computer vision service, not a labeling tool, and the same document says "Alternately, if you have a large data set, you can use Amazon SageMaker Ground Truth to efficiently label your images at scale"; Rekognition's labeling function is meant for individual cases. Also, should this remain an object detection problem at all? The original object detection model did not work, so pixel-level classification (semantic segmentation) may be the solution, as it is, for example, in self-driving.

B uses Amazon Mechanical Turk, a public workforce, which violates the requirement that the video data remain confidential. SageMaker Ground Truth lets you create your own private labeling workforce and send labeling jobs to it: "You have options to work with labelers inside and outside your organization. For example, you can send labeling jobs to your own labelers, or you can access a workforce of over 500,000 independent contractors who are already performing ML-related tasks through Amazon Mechanical Turk. If your data requires confidentiality or special skills, you can also use vendors that are pre-screened by AWS for quality and security procedures."

One objection to D was that Ground Truth offers no semantic segmentation labeling task, only video classification and video frame object detection/tracking jobs. However, Ground Truth does offer a semantic segmentation labeling task (https://docs.aws.amazon.com/sagemaker/latest/dg/sms-semantic-segmentation.html; see also https://docs.aws.amazon.com/sagemaker/latest/dg/sms-video.html), so D is viable. Another commenter asked whether pixel-level classification is something only machines can do; in Ground Truth, human labelers paint the segmentation masks with the provided tooling. Option A also relies on Amazon Augmented AI (Amazon A2I), which some consider a poor fit for reviewing confidential data.

The key sentence is "the object detection model cannot clearly demarcate the yellow line, the passengers who cross the yellow line, and the trains," which points to D: segmentation will give better accuracy here. Semantic segmentation provides precise pixel-level labels and more detailed information about the spatial layout of objects, making it more suitable for closely placed objects such as the safety line and the passengers, and a private workforce keeps the video data confidential. Amazon Rekognition does not support creating a private workforce, so between A and D, only D allows it.

Summary for D: (1) semantic segmentation labels every pixel, which is more granular than bounding boxes and ideal for accurately demarcating the yellow line, the passengers, and the trains; (2) a private workforce ensures the data is handled only by trusted, authorized personnel; (3) SageMaker Ground Truth provides the tooling for efficient, accurate labeling, which is essential for training a robust semantic segmentation model.

Counterarguments for A: D may be more complex than necessary, and object detection might be sufficient; Rekognition Custom Labels can label the dataset much as Ground Truth does; A describes a complete workflow in which the private workforce keeps the video confidential while Amazon A2I reviews low-confidence predictions and the custom model is retrained; and the question asks for a labeling approach for "this model," which suggests staying with object detection rather than switching to segmentation.

Counter-counterarguments: Rekognition is not guaranteed not to use your data to improve its models, Mechanical Turk will not keep the data private, and B and C are excluded because they use public or third-party workforces, so the only viable option is D. Finally, the question asks which labeling approach to use; A adds a human-review workflow around the existing bounding-box approach, while D addresses the requirement directly. - https://www.examtopics.com/discussions/amazon/view/74997-exam-aws-certified-machine-learning-specialty-topic-1/
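A heavily hedged boto3 sketch of starting a Ground Truth semantic segmentation labeling job with a private workforce (option D). The create_labeling_job parameters shown are real, but every ARN, bucket, manifest, and template path is a placeholder; in particular, the pre-annotation and consolidation Lambda ARNs are region-specific values documented by AWS for the built-in semantic segmentation task type and are not reproduced here.

import boto3

sm = boto3.client("sagemaker")

sm.create_labeling_job(
    LabelingJobName="subway-safety-line-segmentation",
    LabelAttributeName="safety-line-ref",          # segmentation label attributes end in "-ref"
    InputConfig={"DataSource": {"S3DataSource": {
        "ManifestS3Uri": "s3://example-bucket/manifests/frames.manifest"}}},
    OutputConfig={"S3OutputPath": "s3://example-bucket/labels/"},
    RoleArn="arn:aws:iam::111122223333:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://example-bucket/labels/categories.json",  # yellow line, passenger, train
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/station-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://example-bucket/templates/sem-seg.liquid"},
        "PreHumanTaskLambdaArn": "<built-in-semantic-segmentation-pre-lambda-arn>",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "<built-in-consolidation-lambda-arn>"},
        "TaskTitle": "Label the yellow line, passengers, and trains",
        "TaskDescription": "Pixel-level segmentation of subway station frames",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
    },
)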
168
168 - A data engineer at a bank is evaluating a new tabular dataset that includes customer data. The data engineer will use the customer data to create a new model to predict customer behavior. After creating a correlation matrix for the variables, the data engineer notices that many of the 100 features are highly correlated with each other. Which steps should the data engineer take to address this issue? (Choose two.) - A.. Use a linear-based algorithm to train the model. B.. Apply principal component analysis (PCA). C.. Remove a portion of highly correlated features from the dataset. D.. Apply min-max feature scaling to the dataset. E.. Apply one-hot encoding category-based variables.
BC - B and C should be the answer.

Apply principal component analysis (option B): PCA is a dimensionality reduction technique that transforms the original features into a smaller set of uncorrelated components, which reduces the redundancy caused by highly correlated features.

Remove a portion of the highly correlated features (option C): removing some of the highly correlated features simplifies the model and reduces multicollinearity, which can improve the model's performance and interpretability.

Some argued for B and D: PCA is sensitive to the variance of the features, so it is common practice to standardize (for example, z-score normalization) or scale (for example, min-max scaling) the features before applying PCA; if the features are on different scales, the choice of the most important principal components can be distorted. On this view, C only partially solves the problem, while D (putting all variables on the same scale) plus B (PCA, which removes the correlation while keeping most of the information) reads naturally as a sequence of steps: "scale the features and apply PCA."

The counterargument is that min-max scaling by itself does nothing to remove multicollinearity, whereas B and C both directly address the high correlation among the 100 features: PCA reduces dimensionality while retaining as much of the original variability as possible, and removing a portion of the correlated features makes the model less complex and less prone to overfitting. So B and C. - https://www.examtopics.com/discussions/amazon/view/75262-exam-aws-certified-machine-learning-specialty-topic-1/
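A minimal scikit-learn sketch of both steps on synthetic data (the bank's real 100-feature dataset is assumed, not shown): drop one feature from each highly correlated pair, then standardize and apply PCA.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=(500, 10))
# Five extra columns that are near-copies of the first five, to simulate high correlation.
X = pd.DataFrame(np.hstack([base, base[:, :5] + 0.01 * rng.normal(size=(500, 5))]),
                 columns=[f"f{i}" for i in range(15)])

# Option C: drop one feature from every highly correlated pair (|r| > 0.9).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Option B: standardize, then project onto uncorrelated principal components.
X_scaled = StandardScaler().fit_transform(X_reduced)
pca = PCA(n_components=0.95)            # keep 95% of the variance
X_pca = pca.fit_transform(X_scaled)
print(len(to_drop), X_pca.shape)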
169
169 - A company is building a new version of a recommendation engine. Machine learning (ML) specialists need to keep adding new data from users to improve personalized recommendations. The ML specialists gather data from the users' interactions on the platform and from sources such as external websites and social media. The pipeline cleans, transforms, enriches, and compresses terabytes of data daily, and this data is stored in Amazon S3. A set of Python scripts was coded to do the job and is stored in a large Amazon EC2 instance. The whole process takes more than 20 hours to finish, with each script taking at least an hour. The company wants to move the scripts out of Amazon EC2 into a more managed solution that will eliminate the need to maintain servers. Which approach will address all of these requirements with the LEAST development effort? - A.. Load the data into an Amazon Redshift cluster. Execute the pipeline by using SQL. Store the results in Amazon S3. B.. Load the data into Amazon DynamoDB. Convert the scripts to an AWS Lambda function. Execute the pipeline by triggering Lambda executions. Store the results in Amazon S3. C.. Create an AWS Glue job. Convert the scripts to PySpark. Execute the pipeline. Store the results in Amazon S3. D.. Create a set of individual AWS Lambda functions to execute each of the scripts. Build a step function by using the AWS Step Functions Data Science SDK. Store the results in Amazon S3.
C - Lambda has a hard execution-time limit of 15 minutes, which is not enough for this processing (each script takes at least an hour), so B and D are out even though C requires some coding effort. PySpark is still Python, so converting the existing scripts requires only modest code changes.

Redshift (A) would require significant effort to set up and to refactor the pipeline into SQL, which is also limited in the transformations it can express. DynamoDB in B adds nothing here, and Step Functions with Lambda in D still hits the 15-minute Lambda timeout. AWS Glue is a serverless, managed ETL service that runs PySpark scripts, so it eliminates server management and handles terabyte-scale processing; storing the results in Amazon S3 completes the requirement. One commenter noted that C does not spell out how the data is loaded, but it still fits all the requirements with the least development effort, delivering the best performance for the least code change. So C. - https://www.examtopics.com/discussions/amazon/view/74998-exam-aws-certified-machine-learning-specialty-topic-1/
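A minimal AWS Glue PySpark job skeleton, as a sketch of what the converted scripts could look like. The bucket paths, column names, and transformation logic are placeholders, not the company's actual pipeline.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw interaction data from S3, clean/transform it, and write compressed output.
raw = spark.read.json("s3://example-bucket/raw/interactions/")        # placeholder path
cleaned = (raw.dropna(subset=["user_id"])
              .withColumn("event_date", F.to_date("event_timestamp")))
cleaned.write.mode("overwrite").parquet("s3://example-bucket/curated/interactions/")

job.commit()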
170
170 - A retail company is selling products through a global online marketplace. The company wants to use machine learning (ML) to analyze customer feedback and identify specific areas for improvement. A developer has built a tool that collects customer reviews from the online marketplace and stores them in an Amazon S3 bucket. This process yields a dataset of 40 reviews. A data scientist building the ML models must identify additional sources of data to increase the size of the dataset. Which data sources should the data scientist use to augment the dataset of reviews? (Choose three.) - A.. Emails exchanged by customers and the company's customer service agents B.. Social media posts containing the name of the company or its products C.. A publicly available collection of news articles D.. A publicly available collection of customer reviews E.. Product sales revenue figures for the company F.. Instruction manuals for the company's products
ABD - Emails exchanged between customers and customer service agents are a valuable data source.

A: emails exchanged by customers and the company's customer service agents provide additional customer feedback and opinions about the products or services. B: social media posts containing the name of the company or its products likewise provide additional customer feedback. D: a publicly available collection of customer reviews can augment the existing dataset of reviews and increase its size, which helps improve the accuracy of the model.

C (a publicly available collection of news articles) and F (instruction manuals) are not directly related to customer feedback and are unlikely to be relevant here. E (product sales revenue figures) gives insight into the company's financial performance but is not customer feedback either.

A minority argued for A, B, and E (or possibly F): E could help find correlations between negative/positive reviews and sales, and F might help interpret the other sources. The counterpoint: don't overthink it; the goal is more volume of reviews, or anything resembling reviews, as is the case with the email exchanges. One commenter claimed B, D, F, but the consensus is A, B, D. - https://www.examtopics.com/discussions/amazon/view/74809-exam-aws-certified-machine-learning-specialty-topic-1/
171
171 - A machine learning (ML) specialist wants to create a data preparation job that uses a PySpark script with complex window aggregation operations to create data for training and testing. The ML specialist needs to evaluate the impact of the number of features and the sample count on model performance. Which approach should the ML specialist use to determine the ideal data transformations for the model? - A.. Add an Amazon SageMaker Debugger hook to the script to capture key metrics. Run the script as an AWS Glue job. B.. Add an Amazon SageMaker Experiments tracker to the script to capture key metrics. Run the script as an AWS Glue job. C.. Add an Amazon SageMaker Debugger hook to the script to capture key parameters. Run the script as a SageMaker processing job. D.. Add an Amazon SageMaker Experiments tracker to the script to capture key parameters. Run the script as a SageMaker processing job.
D - While SageMaker Experiments is the way to go, it only supports training, processing, and transform jobs, so the right answer is to run the script as a SageMaker processing job, hence D and not B. https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-create.html

"Generally, you use load_run with no arguments to track metrics, parameters, and artifacts within a SageMaker training or processing job script." (https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-create.html)

A PySpark script can be run as a SageMaker processing job by using the PySparkProcessor class (https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html#pysparkprocessor), and a processing job can use SageMaker Experiments to track the input parameters, output metrics, and artifacts of each run (https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html; worked example: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb). SageMaker Debugger captures tensors and analyzes training behavior, which suits deep learning models rather than data preparation, so A and C are out.

Some argued for B on the grounds that PySpark maps naturally to AWS Glue (https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html; Glue can also connect SageMaker notebooks to development endpoints: https://aws.amazon.com/about-aws/whats-new/2018/10/aws-glue-now-supports-connecting-amazon-sagemaker-notebooks-to-development-endpoints/). However, SageMaker Experiments tracks SageMaker jobs, not Glue jobs (https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html: "experiments automatically created from SageMaker jobs"), so running the script as a Glue job would not allow the Experiments tracker to be used. The point of tracking the key metrics and parameters is to compare runs and determine the impact of the number of features and the sample count, so the deciding factor is "Glue" versus "SageMaker processing," and D is the right answer (see the sketch below). - https://www.examtopics.com/discussions/amazon/view/74810-exam-aws-certified-machine-learning-specialty-topic-1/
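A hedged sketch of the tracking side of option D, using the SageMaker Experiments Run API inside a script that is executed as a processing job. The parameter and metric names and values are illustrative assumptions, and the sagemaker SDK version is assumed to support sagemaker.experiments.load_run.

from sagemaker.experiments import load_run
from sagemaker.session import Session

# Inside the data preparation script running as a SageMaker processing job:
with load_run(sagemaker_session=Session()) as run:
    run.log_parameter("num_features", 42)            # hypothetical experiment inputs
    run.log_parameter("sample_count", 1_000_000)
    run.log_metric(name="validation:auc", value=0.87)  # hypothetical downstream metric

The script itself would be submitted as a processing job, for example with sagemaker.spark.processing.PySparkProcessor's run(submit_app="prepare_data.py") from a launcher.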
172
172 - A data scientist has a dataset of machine part images stored in Amazon Elastic File System (Amazon EFS). The data scientist needs to use Amazon SageMaker to create and train an image classification machine learning model based on this dataset. Because of budget and time constraints, management wants the data scientist to create and train a model with the least number of steps and integration work required. How should the data scientist meet these requirements? - A.. Mount the EFS file system to a SageMaker notebook and run a script that copies the data to an Amazon FSx for Lustre file system. Run the SageMaker training job with the FSx for Lustre file system as the data source. B.. Launch a transient Amazon EMR cluster. Configure steps to mount the EFS file system and copy the data to an Amazon S3 bucket by using S3DistCp. Run the SageMaker training job with Amazon S3 as the data source. C.. Mount the EFS file system to an Amazon EC2 instance and use the AWS CLI to copy the data to an Amazon S3 bucket. Run the SageMaker training job with Amazon S3 as the data source. D.. Run a SageMaker training job with an EFS file system as the data source.
D - It should be D according to this article: https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/

SageMaker training jobs can take input data directly from (1) Amazon S3, (2) Amazon Elastic File System (EFS), and (3) Amazon FSx for Lustre: "When you create a training job, you specify the location of a training dataset and an input mode for accessing the dataset. For data location, Amazon SageMaker supports Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre." (https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html; see also https://aws.amazon.com/blogs/machine-learning/mount-an-efs-file-system-to-an-amazon-sagemaker-notebook-with-lifecycle-configurations/)

Since the question asks only for the least number of steps and integration work, and not for the highest availability or performance, D is the right choice: the training job can use the existing dataset in EFS without copying or moving it to another storage service, which also avoids extra transfer charges and delays.

Some argued for A, because FSx for Lustre reduces start-up time by eliminating the data download step and offers better throughput. That addresses performance, however, not the stated budget and time constraints: A requires EFS -> FSx for Lustre -> SageMaker, while D is simply EFS -> SageMaker, so D wins on the least number of steps. - https://www.examtopics.com/discussions/amazon/view/74877-exam-aws-certified-machine-learning-specialty-topic-1/
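A hedged SageMaker Python SDK sketch of option D: pointing a training job directly at EFS via FileSystemInput. The file system ID, image URI, role, and directory path are placeholders, and the estimator would additionally need VPC settings (subnets and security groups) that can reach the file system, which are omitted here.

from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

estimator = Estimator(
    image_uri="<image-classification-image-uri>",    # built-in algorithm image, placeholder
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",           # hypothetical EFS file system ID
    file_system_type="EFS",
    directory_path="/machine-part-images/train",
    file_system_access_mode="ro",
)

estimator.fit({"train": train_input})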
173
173 - A retail company uses a machine learning (ML) model for daily sales forecasting. The company's brand manager reports that the model has provided inaccurate results for the past 3 weeks. At the end of each day, an AWS Glue job consolidates the input data that is used for the forecasting with the actual daily sales data and the predictions of the model. The AWS Glue job stores the data in Amazon S3. The company's ML team is using an Amazon SageMaker Studio notebook to gain an understanding about the source of the model's inaccuracies. What should the ML team do on the SageMaker Studio notebook to visualize the model's degradation MOST accurately? - A.. Create a histogram of the daily sales over the last 3 weeks. In addition, create a histogram of the daily sales from before that period. B.. Create a histogram of the model errors over the last 3 weeks. In addition, create a histogram of the model errors from before that period. C.. Create a line chart with the weekly mean absolute error (MAE) of the model. D.. Create a scatter plot of daily sales versus model error for the last 3 weeks. In addition, create a scatter plot of daily sales versus model error from before that period.
C - C is the suggested answer, but the thread is split between B, C, and D.

For D: a scatter plot of daily sales versus model error for the last 3 weeks, plus one for the period before, allows a direct visual analysis of the relationship between actual sales and prediction errors (see also https://docs.aws.amazon.com/forecast/latest/dg/predictor-monitoring-results.html). It helps identify whether errors are concentrated in specific sales ranges and compares model performance before and after the degradation. A weekly MAE line chart aggregates errors over a larger window, which can mask daily fluctuations or sudden changes within a week; and since the prediction is made at daily granularity, why aggregate weekly at all?

For B: histograms of the model errors for the last 3 weeks versus the period before directly visualize the distribution and patterns of the model's inaccuracies. Comparing the two distributions shows whether the errors have shifted (for example, whether the model is now over- or underestimating), which points to the source of the degradation. In practice one would use both B and C, but B arguably carries more information than C; this aligns with model-monitoring practice such as SageMaker Model Monitor.

For C: the key sentence is that at the end of each day a Glue job consolidates the input data with the actual daily sales and the predictions, which is exactly what MAE needs: the mean absolute error is the average of the absolute differences between the predicted and actual values. A line chart of the weekly MAE shows the model's degradation as a trend over time, which is what "visualize the model's degradation" asks for; line charts are the standard tool for time series analysis, whereas histograms visualize distributions. Option A is out because plotting daily sales alone says nothing about model error: you need the predictions or the errors.

One structured take: A - no, a daily sales histogram does not show model error; B - error histograms are good, but there is no need for two separate before/after charts; C - yes, one chart of the model errors over time is ideal; D - no, two scatter plots are again unnecessary. So C. - https://www.examtopics.com/discussions/amazon/view/76292-exam-aws-certified-machine-learning-specialty-topic-1/
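A small pandas/matplotlib sketch of the two main proposals above (weekly MAE line chart and before/after error histograms). The file name and column names ("date", "actual_sales", "predicted_sales") are assumptions standing in for the consolidated Glue output.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("daily_forecast_vs_actuals.csv")          # placeholder for the consolidated data
df["date"] = pd.to_datetime(df["date"])
df["abs_error"] = (df["actual_sales"] - df["predicted_sales"]).abs()

# Option C: weekly mean absolute error as a line chart.
weekly_mae = df.set_index("date")["abs_error"].resample("W").mean()
weekly_mae.plot(marker="o", title="Weekly MAE of the sales forecast")
plt.ylabel("MAE")
plt.show()

# Option B for comparison: error distributions before vs. during the last 3 weeks.
cutoff = df["date"].max() - pd.Timedelta(weeks=3)
plt.hist(df.loc[df["date"] < cutoff, "abs_error"], alpha=0.5, label="before")
plt.hist(df.loc[df["date"] >= cutoff, "abs_error"], alpha=0.5, label="last 3 weeks")
plt.legend()
plt.show()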
174
174 - An ecommerce company sends a weekly email newsletter to all of its customers. Management has hired a team of writers to create additional targeted content. A data scientist needs to identify five customer segments based on age, income, and location. The customers' current segmentation is unknown. The data scientist previously built an XGBoost model to predict the likelihood of a customer responding to an email based on age, income, and location. Why does the XGBoost model NOT meet the current requirements, and how can this be fixed? - A.. The XGBoost model provides a true/false binary output. Apply principal component analysis (PCA) with five feature dimensions to predict a segment. B.. The XGBoost model provides a true/false binary output. Increase the number of classes the XGBoost model predicts to five classes to predict a segment. C.. The XGBoost model is a supervised machine learning algorithm. Train a k-Nearest-Neighbors (kNN) model with K = 5 on the same dataset to predict a segment. D.. The XGBoost model is a supervised machine learning algorithm. Train a k-means model with K = 5 on the same dataset to predict a segment.
D - The answer is D: k-means is used for customer segmentation.

Both kNN and k-means come up in segmentation discussions, but k-means is an unsupervised learning algorithm while kNN is supervised (see https://rstudio-pubs-static.s3.amazonaws.com/599866_59be74824ca7482ba99dbc8466dc36a0.html#:~:text=The%20difference%20between%20the%20two,to%20predict%20the%20unlabelled%20data.). The question says the customers' current segmentation is unknown, so there are no labels: this is a clustering problem, and k-means with K = 5 will partition the customers into five segments based on age, income, and location. kNN, by contrast, still requires labeled data (it is typically used for classification or for imputing missing values in supervised problems), so C is wrong.

The XGBoost model is a supervised algorithm and needs labeled data to learn from; since no segment labels exist, it cannot meet the requirement. k-means is an unsupervised algorithm that partitions the data into K groups based on similarity; setting K = 5 yields the five customer segments.

There was a side discussion about what "k" means in each algorithm: in kNN, k is the number of nearest neighbors whose classes are examined under majority voting, whereas in k-means, K is the number of clusters; in this question K = 5 refers to the number of segments. The keyword is that the existing classification is "unknown," therefore k-means. - https://www.examtopics.com/discussions/amazon/view/74922-exam-aws-certified-machine-learning-specialty-topic-1/
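A minimal scikit-learn sketch of option D on made-up customer records (the column names and values are illustrative assumptions): scale the numeric features, one-hot encode location, and cluster with K = 5.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

customers = pd.DataFrame({
    "age": [23, 45, 36, 52, 29, 61],
    "income": [38000, 92000, 61000, 120000, 45000, 83000],
    "location": ["NYC", "LA", "NYC", "CHI", "LA", "CHI"],
})

prep = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(), ["location"]),
])

# K = 5 produces the five customer segments requested by the writers' team.
model = make_pipeline(prep, KMeans(n_clusters=5, random_state=0, n_init=10))
customers["segment"] = model.fit_predict(customers)
print(customers)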
175
175 - A global financial company is using machine learning to automate its loan approval process. The company has a dataset of customer information. The dataset contains some categorical fields, such as customer location by city and housing status. The dataset also includes financial fields in different units, such as account balances in US dollars and monthly interest in US cents. The company's data scientists are using a gradient boosting regression model to infer the credit score for each customer. The model has a training accuracy of 99% and a testing accuracy of 75%. The data scientists want to improve the model's testing accuracy. Which process will improve the testing accuracy the MOST? - A.. Use a one-hot encoder for the categorical fields in the dataset. Perform standardization on the financial fields in the dataset. Apply L1 regularization to the data. B.. Use tokenization of the categorical fields in the dataset. Perform binning on the financial fields in the dataset. Remove the outliers in the data by using the z- score. C.. Use a label encoder for the categorical fields in the dataset. Perform L1 regularization on the financial fields in the dataset. Apply L2 regularization to the data. D.. Use a logarithm transformation on the categorical fields in the dataset. Perform binning on the financial fields in the dataset. Use imputation to populate missing values in the dataset.
A - Agreed, it's A.

The model is overfitting (99% training accuracy versus 75% testing accuracy), so regularization is needed. The financial fields are in different units (US dollars and US cents), so standardization is appropriate for the numeric features in a regression problem, and one-hot encoding is the standard treatment for the categorical fields such as city and housing status. Option A is the only one that combines appropriate preprocessing for both the categorical and numerical data with a regularization technique that reduces overfitting and therefore improves generalization to unseen data, i.e. testing accuracy. L1 regularization can also help with feature selection.

One objection: A says "apply L1 regularization to the data," which is loosely worded, since regularization is applied to the model's loss function (with the effect that some features receive zero weight), not to the dataset directly; a commenter preferred B because of this wording. However, "tokenization of the categorical fields" is not a real treatment for categorical variables, and binning plus z-score outlier removal does not address the overfitting, so A remains the best answer despite the imprecise phrasing. - https://www.examtopics.com/discussions/amazon/view/74994-exam-aws-certified-machine-learning-specialty-topic-1/
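A sketch of option A's preprocessing plus L1 regularization, assuming the xgboost package as the gradient boosting implementation (its reg_alpha parameter is the L1 term). The column names and the tiny dataset are made up for illustration.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

X = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "CHI"],
    "housing_status": ["own", "rent", "rent", "own"],
    "balance_usd": [1200.0, 5300.5, 220.0, 9800.0],
    "monthly_interest_cents": [310, 1250, 40, 2210],
})
y = [680, 720, 590, 760]   # made-up credit scores

prep = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city", "housing_status"]),
    ("scale", StandardScaler(), ["balance_usd", "monthly_interest_cents"]),
])

model = Pipeline([
    ("prep", prep),
    ("gbm", XGBRegressor(reg_alpha=1.0, n_estimators=200, max_depth=3)),  # L1 regularization
])
model.fit(X, y)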
176
176 - A machine learning (ML) specialist needs to extract embedding vectors from a text series. The goal is to provide a ready-to-ingest feature space for a data scientist to develop downstream ML predictive models. The text consists of curated sentences in English. Many sentences use similar words but in different contexts. There are questions and answers among the sentences, and the embedding space must differentiate between them. Which options can produce the required embedding vectors that capture word context and sequential QA information? (Choose two.) - A.. Amazon SageMaker seq2seq algorithm B.. Amazon SageMaker BlazingText algorithm in Skip-gram mode C.. Amazon SageMaker Object2Vec algorithm D.. Amazon SageMaker BlazingText algorithm in continuous bag-of-words (CBOW) mode E.. Combination of the Amazon SageMaker BlazingText algorithm in Batch Skip-gram mode with a custom recurrent neural network (RNN)
AC - For A and C: seq2seq and Object2Vec take care of more than just individual words. Any answer built only on BlazingText is questionable because its Word2Vec modes (CBOW, Skip-gram, Batch Skip-gram) produce embeddings for individual words, not sentences: "One of the well-known embedding techniques is Word2Vec, which provides embeddings for words... In addition to word embeddings, there are also use cases where we want to learn the embeddings of more general-purpose objects such as sentences, customers, and products... This is where Amazon SageMaker Object2Vec comes in." (https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/)

For B and D: the objective is an embedding space that places similar words close together, which is exactly BlazingText in unsupervised Word2Vec mode (CBOW, Skip-gram, or Batch Skip-gram), and E adds a custom RNN that goes beyond what is asked. The counterpoint is that B and D embed words, not sentences, so on their own they cannot capture the sequential question-and-answer structure.

For C and E: Object2Vec provides an end-to-end way to learn embeddings that capture contextual and relational information between pairs of objects (such as question-answer pairs), while BlazingText in Batch Skip-gram mode combined with a custom RNN captures both word context and sequence-level dependencies, since RNNs learn temporal and long-term dependencies between words. One caveat raised against C: Object2Vec is trained on pairs of objects and minimizes the difference between their embeddings, so whether it differentiates questions from answers depends on how the pairs and labels are constructed.

For A and E: seq2seq models convert one sequence into another (for example, questions into answers, or translation and summarization) and therefore model context and order; E adds an RNN on top of Skip-gram word vectors to capture the sequential QA structure. The counterargument to A is that seq2seq is built for sequence transduction and does not by itself expose a ready-to-ingest embedding feature space for downstream models.

One commenter marked A, C, and E as "no" and B and D as "yes," yet concluded that the answers should be B and C, citing the BlazingText skip-gram paper (https://arxiv.org/pdf/1604.04661.pdf, linked from https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) and the Object2Vec documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/object2vec.html). Others were torn between A or E plus C, or went with B and C. The suggested answer here is A and C. - https://www.examtopics.com/discussions/amazon/view/75431-exam-aws-certified-machine-learning-specialty-topic-1/
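To make the Skip-gram versus CBOW distinction discussed above concrete, here is a conceptual sketch using gensim's Word2Vec, which computes the same kind of per-word embeddings as BlazingText's Word2Vec modes (using gensim locally is an assumption for illustration; BlazingText itself runs as a SageMaker training job).

from gensim.models import Word2Vec

sentences = [
    ["what", "is", "the", "return", "policy"],
    ["the", "return", "policy", "lasts", "thirty", "days"],
]

skipgram = Word2Vec(sentences, vector_size=50, sg=1, min_count=1)   # Skip-gram
cbow = Word2Vec(sentences, vector_size=50, sg=0, min_count=1)       # CBOW

# Per-word vectors only: sentence-level or question/answer structure is NOT captured,
# which is the core argument against B and D on their own.
print(skipgram.wv["return"][:5])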
177
177 - A retail company wants to update its customer support system. The company wants to implement automatic routing of customer claims to different queues to prioritize the claims by category. Currently, an operator manually performs the category assignment and routing. After the operator classifies and routes the claim, the company stores the claim's record in a central database. The claim's record includes the claim's category. The company has no data science team or experience in the field of machine learning (ML). The company's small development team needs a solution that requires no ML expertise. Which solution meets these requirements? - A.. Export the database to a .csv file with two columns: claim_label and claim_text. Use the Amazon SageMaker Object2Vec algorithm and the .csv file to train a model. Use SageMaker to deploy the model to an inference endpoint. Develop a service in the application to use the inference endpoint to process incoming claims, predict the labels, and route the claims to the appropriate queue. B.. Export the database to a .csv file with one column: claim_text. Use the Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm and the .csv file to train a model. Use the LDA algorithm to detect labels automatically. Use SageMaker to deploy the model to an inference endpoint. Develop a service in the application to use the inference endpoint to process incoming claims, predict the labels, and route the claims to the appropriate queue. C.. Use Amazon Textract to process the database and automatically detect two columns: claim_label and claim_text. Use Amazon Comprehend custom classification and the extracted information to train the custom classifier. Develop a service in the application to use the Amazon Comprehend API to process incoming claims, predict the labels, and route the claims to the appropriate queue. D.. Export the database to a .csv file with two columns: claim_label and claim_text. Use Amazon Comprehend custom classification and the .csv file to train the custom classifier. Develop a service in the application to use the Amazon Comprehend API to process incoming claims, predict the labels, and route the claims to the appropriate queue.
D - I would say D. We shouldn't need Textract to extract columns from a database. I think D because Textract doesn't support CSV, only PNG, JPEG, TIFF, and PDF formats. D because it does not require heavy machine learning expertise. A. NO - Object2Vec is unsupervised; it will create vector representations but will not assign the claims to a category B. NO - we want a supervised method; LDA will create topics in an unsupervised way C. NO - again, we want a supervised method D. YES - that is supervised; no need for ML skills, only basic API programming C is wrong because the columns don't exist. D. Export the database to a .csv file with two columns: claim_label and claim_text. Use Amazon Comprehend custom classification and the .csv file to train the custom classifier. Develop a service in the application to use the Amazon Comprehend API to process incoming claims, predict the labels, and route the claims to the appropriate queue. Option D meets the requirements. The solution requires no ML expertise, and the small development team can use the Amazon Comprehend custom classification API to train a model to automatically detect claim categories. The company can export the database to a .csv file with two columns: claim_label and claim_text. Then, the development team can use the .csv file to train the custom classifier. Finally, the team can develop a service in the application to use the Amazon Comprehend API to process incoming claims, predict the labels, and route the claims to the appropriate queue. This solution is straightforward, does not require extensive expertise, and can be implemented quickly. It should be A because Object2Vec is meant for text classification; the problem is to categorize the text based on the content. B: LDA is for topic modelling based on categories. Comprehend is for extracting entities, sentiments, etc., and Comprehend can also be used for custom classification of text (https://aws.amazon.com/ko/comprehend/features/). LDA can find document topics and the word distribution for each topic, but it would be necessary to manually link the topics to the predefined customer categories. And the question says the solution should not require ML expertise; LDA requires ML expertise. The firm needs a solution for its manual process, and this includes, among other things, the routing of its client orders; to do that you will need Textract. - https://www.examtopics.com/discussions/amazon/view/74871-exam-aws-certified-machine-learning-specialty-topic-1/
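To make the Comprehend-based answer concrete, here is a minimal sketch of training a custom classifier from the exported two-column CSV and then classifying an incoming claim; the classifier name, role ARN, and bucket paths are hypothetical placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

# Train a custom classifier from a CSV whose rows are: claim_label,claim_text
response = comprehend.create_document_classifier(
    DocumentClassifierName="claims-router",  # hypothetical name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # hypothetical
    LanguageCode="en",
    InputDataConfig={"S3Uri": "s3://example-bucket/claims-export.csv"},  # assumed export location
)
model_arn = response["DocumentClassifierArn"]

# Once training finishes, host the model on a real-time endpoint and classify new claims:
# endpoint = comprehend.create_endpoint(
#     EndpointName="claims-router-endpoint", ModelArn=model_arn, DesiredInferenceUnits=1)
# result = comprehend.classify_document(
#     Text="My package arrived damaged and I want a refund",
#     EndpointArn=endpoint["EndpointArn"])
# result["Classes"] then drives routing to the appropriate queue.
```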
178
178 - A machine learning (ML) specialist is using Amazon SageMaker hyperparameter optimization (HPO) to improve a model's accuracy. The learning rate parameter is specified in the following HPO configuration: During the results analysis, the ML specialist determines that most of the training jobs had a learning rate between 0.01 and 0.1. The best result had a learning rate of less than 0.01. Training jobs need to run regularly over a changing dataset. The ML specialist needs to find a tuning mechanism that uses different learning rates more evenly from the provided range between MinValue and MaxValue. Which solution provides the MOST accurate result? [https://www.examtopics.com/assets/media/exam-media/04145/0010800001.png] - A.. Modify the HPO configuration as follows: Select the most accurate hyperparameter configuration from this HPO job. [Image(s): https://www.examtopics.com/assets/media/exam-media/04145/0010900001.png] B.. Run three different HPO jobs that use different learning rates from the following intervals for MinValue and MaxValue while using the same number of training jobs for each HPO job: ✑ [0.01, 0.1] ✑ [0.001, 0.01] ✑ [0.0001, 0.001] Select the most accurate hyperparameter configuration from these three HPO jobs. C.. Modify the HPO configuration as follows: Select the most accurate hyperparameter configuration from this training job. [Image(s): https://www.examtopics.com/assets/media/exam-media/04145/0010900005.png] D.. Run three different HPO jobs that use different learning rates from the following intervals for MinValue and MaxValue. Divide the number of training jobs for each HPO job by three: ✑ [0.01, 0.1] ✑ [0.001, 0.01] ✑ [0.0001, 0.001] Select the most accurate hyperparameter configuration from these three HPO jobs. [Image(s): https://www.examtopics.com/assets/media/exam-media/04145/0010900008.png]
C - "Choose logarithmic scaling when you are searching a range that spans several orders of magnitude. For example, if you are tuning a Tune a linear learner model model, and you specify a range of values between .0001 and 1.0 for the learning_rate hyperparameter, searching uniformly on a logarithmic scale gives you a better sample of the entire range than searching on a linear scale would, because searching on a linear scale would, on average, devote 90 percent of your training budget to only the values between .1 and 1.0, leaving only 10 percent of your training budget for the values between .0001 and .1." based on the above from this link https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html C is clearly the answer I would choose C: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html But according to the doc you gave, "Logarithmic scaling works only for ranges that have only values greater than 0." I think choosing ScalingType=Linear is the best fit, but there's no such option. Yes and 0.0001 is greater than 0 C is the way. not A since you choose reverse logarithmic scaling when you are searching a range that is highly sensitive to small changes that are very close to 1. It should be B. In logarithmic parameters the min value is the maximum value. This is the reason that C is not correct "Choose logarithmic scaling when you are searching a range that spans several orders of magnitude. For example, if you are tuning a Tune a linear learner model model, and you specify a range of values between .0001 and 1.0 for the learning_rate hyperparameter, searching uniformly on a logarithmic scale gives you a better sample of the entire range than searching on a linear scale would, because searching on a linear scale would, on average, devote 90 percent of your training budget to only the values between .1 and 1.0, leaving only 10 percent of your training budget for the values between .0001 and .1." B looks better , because learning rates were splited up base on a previous experience (0.1 - 0,01) in this case we are changing the structure . On the other hand A and B only change scaletype and this means no real changes - https://www.examtopics.com/discussions/amazon/view/75200-exam-aws-certified-machine-learning-specialty-topic-1/
179
179 - A manufacturing company wants to use machine learning (ML) to automate quality control in its facilities. The facilities are in remote locations and have limited internet connectivity. The company has 20 TB of training data that consists of labeled images of defective product parts. The training data is in the corporate on-premises data center. The company will use this data to train a model for real-time defect detection in new parts as the parts move on a conveyor belt in the facilities. The company needs a solution that minimizes costs for compute infrastructure and that maximizes the scalability of resources for training. The solution also must facilitate the company's use of an ML model in the low-connectivity environments. Which solution will meet these requirements? - A.. Move the training data to an Amazon S3 bucket. Train and evaluate the model by using Amazon SageMaker. Optimize the model by using SageMaker Neo. Deploy the model on a SageMaker hosting services endpoint. B.. Train and evaluate the model on premises. Upload the model to an Amazon S3 bucket. Deploy the model on an Amazon SageMaker hosting services endpoint. C.. Move the training data to an Amazon S3 bucket. Train and evaluate the model by using Amazon SageMaker. Optimize the model by using SageMaker Neo. Set up an edge device in the manufacturing facilities with AWS IoT Greengrass. Deploy the model on the edge device. D.. Train the model on premises. Upload the model to an Amazon S3 bucket. Set up an edge device in the manufacturing facilities with AWS IoT Greengrass. Deploy the model on the edge device.
C - C; using S3 for scalable training and SageMaker Neo for compiling the model for edge devices. A. NO - a SageMaker endpoint does not address low connectivity for inference B. NO - training on premises does not address scalability for training C. YES - maximizes training scalability and works with low connectivity D. NO - training on premises does not address scalability for training. The company needs a solution that minimizes costs for compute infrastructure and that maximizes the scalability of resources for training --> S3. The solution also must facilitate the company's use of an ML model in the low-connectivity environments ---> edge devices and AWS IoT Greengrass. Answer is C. Moving the training data to an Amazon S3 bucket and training and evaluating the model by using Amazon SageMaker will reduce the company's compute infrastructure costs and maximize the scalability of resources for training. Optimizing the model by using SageMaker Neo will further reduce costs by allowing the model to run on inexpensive edge devices. Setting up an edge device in the manufacturing facilities with AWS IoT Greengrass and deploying the model on the edge device will enable the company to use the ML model in the low-connectivity environments. This solution provides a complete end-to-end solution for the company's needs, from data storage to model deployment, while minimizing costs and providing scalability and offline capabilities. C best satisfies the options of minimising cost and taking care of the lack of connectivity through edge deployment. Same arguments as below. C is the correct answer: upload the 20 TB of data to S3 for model training, use SageMaker for creating and training a model, and once it is ready, deploy at the edge using IoT Greengrass (this takes care of the poor internet connectivity issue, which is not addressed by option A). - https://www.examtopics.com/discussions/amazon/view/74972-exam-aws-certified-machine-learning-specialty-topic-1/
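The Neo step in answer C corresponds to a compilation job that converts the trained model artifact for a specific edge target before it is deployed through IoT Greengrass. A hedged boto3 sketch is below; the job name, bucket paths, framework, input shape, and target device are assumptions for illustration.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="defect-detector-neo",  # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role
    InputConfig={
        "S3Uri": "s3://example-bucket/model/model.tar.gz",        # trained model artifact
        "DataInputConfig": '{"input_1": [1, 224, 224, 3]}',       # assumed image input shape
        "Framework": "TENSORFLOW",                                # assumed training framework
    },
    OutputConfig={
        "S3OutputLocation": "s3://example-bucket/compiled/",
        "TargetDevice": "jetson_xavier",                          # assumed edge hardware at the facility
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
# The compiled artifact in S3OutputLocation is what the Greengrass component
# pulls down to the edge device for offline, real-time inference.
```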
180
180 - A company has an ecommerce website with a product recommendation engine built in TensorFlow. The recommendation engine endpoint is hosted by Amazon SageMaker. Three compute-optimized instances support the expected peak load of the website. Response times on the product recommendation page are increasing at the beginning of each month. Some users are encountering errors. The website receives the majority of its traffic between 8 AM and 6 PM on weekdays in a single time zone. Which of the following options are the MOST effective in solving the issue while keeping costs to a minimum? (Choose two.) - A.. Configure the endpoint to use Amazon Elastic Inference (EI) accelerators. B.. Create a new endpoint configuration with two production variants. C.. Configure the endpoint to automatically scale with the InvocationsPerInstance metric. D.. Deploy a second instance pool to support a blue/green deployment of models. E.. Reconfigure the endpoint to use burstable instances.
AC - AC for me: https://aws.amazon.com/machine-learning/elastic-inference/ https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/ Agree. The problem with E is that the question mentions "majority of its traffic between 8 AM and 6 PM on weekdays", so burstable instances are not cost effective during the period from 6 PM to 8 AM on weekdays and on weekends, whereas auto scaling (C) can save money all the time. AC looks correct. Traffic is between 8 AM and 6 PM on weekdays in a single time zone --- reconfigure the endpoint to use burstable instances; configure the endpoint to automatically scale with the InvocationsPerInstance metric. A. YES - EI provides GPU access (Sept 2023: now deprecated) B. NO - not best practice for scaling (although it might help?) C. YES - you want to scale with traffic D. NO - not best practice for scaling (although it might help?) E. NO - burstable instances are for unpredictable traffic; bursting for long periods of time is not cost effective. Either AC or CE; very confusing, but C is confirmed and I will go with E. Or, since the question is really old, maybe AC is what the answer needs to be. A and C for me. A and C for me. E is costly. E is not correct. A and E are both effective methods to optimize inference, but A is used for GPU and E is used for CPU; since this question is about TensorFlow, it should be A, and E is not effective. A and C are the right answers. Interesting whether option A will be relevant anymore, as AWS is discontinuing Elastic Inference starting Apr 15: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html. I wonder if they will change the option to include the Inf instance type. Why Elastic Inference, given that a GPU is not necessary? I guess C and E (IMO). A - This is using TensorFlow, which means Elastic Inference can be used to save costs for GPU, thereby reducing the compute time. E - Since the load is not uniform, it will help to use burstable instances to operate above the threshold when the situation demands. C is not at all cost effective. AC is the most correct. I'd go AE if it asks for minimum cost. AC is correct. AC for me. - https://www.examtopics.com/discussions/amazon/view/75022-exam-aws-certified-machine-learning-specialty-topic-1/
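Answer C's target-tracking policy on invocations per instance is configured through Application Auto Scaling rather than through SageMaker itself. A minimal sketch follows; the endpoint name, variant name, capacity limits, and target value are assumptions.

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/recommender-endpoint/variant/AllTraffic"  # hypothetical endpoint/variant

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=3,
)

aas.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 4000.0,  # assumed target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

With this in place the variant scales out for the weekday daytime peak and back in overnight, which is the cost argument made above against always-on or burstable capacity.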
181
181 - A real-estate company is launching a new product that predicts the prices of new houses. The historical data for the properties and prices is stored in .csv format in an Amazon S3 bucket. The data has a header, some categorical fields, and some missing values. The company's data scientists have used Python with a common open-source library to fill the missing values with zeros. The data scientists have dropped all of the categorical fields and have trained a model by using the open-source linear regression algorithm with the default parameters. The accuracy of the predictions with the current model is below 50%. The company wants to improve the model performance and launch the new product as soon as possible. Which solution will meet these requirements with the LEAST operational overhead? - A.. Create a service-linked role for Amazon Elastic Container Service (Amazon ECS) with access to the S3 bucket. Create an ECS cluster that is based on an AWS Deep Learning Containers image. Write the code to perform the feature engineering. Train a logistic regression model for predicting the price, pointing to the bucket with the dataset. Wait for the training job to complete. Perform the inferences. B.. Create an Amazon SageMaker notebook with a new IAM role that is associated with the notebook. Pull the dataset from the S3 bucket. Explore different combinations of feature engineering transformations, regression algorithms, and hyperparameters. Compare all the results in the notebook, and deploy the most accurate configuration in an endpoint for predictions. C.. Create an IAM role with access to Amazon S3, Amazon SageMaker, and AWS Lambda. Create a training job with the SageMaker built-in XGBoost model pointing to the bucket with the dataset. Specify the price as the target feature. Wait for the job to complete. Load the model artifact to a Lambda function for inference on prices of new houses. D.. Create an IAM role for Amazon SageMaker with access to the S3 bucket. Create a SageMaker AutoML job with SageMaker Autopilot pointing to the bucket with the dataset. Specify the price as the target attribute. Wait for the job to complete. Deploy the best model for predictions.
D - D; A involves too much effort and management overhead. Agree, but A has feature engineering, which is the problem with the current model... confusing. We don't use logistic regression to predict a price. AutoML also contains feature engineering/preprocessing tools. A. NO - a logistic regression model is for classification, not for predicting numerical values B. NO - this approach gives the highest quality, but takes time C. NO - more manual work, and XGBoost alone does not address the feature engineering D. YES - simplest option D for me also. It's D, as the rest require more operational activity. D is correct: the trick is to eliminate A, which cannot work because logistic regression is a classification algorithm that gives a binary outcome; B and C seem like a lot of work. The problem is not a classification problem, so A is incorrect, as logistic regression is used for binary problems. D is the correct solution. The problem is not which model you chose but the primitive level of feature engineering; therefore the correct answer should be "B". D. https://aws.amazon.com/sagemaker/autopilot/ Supports missing values, categorical features, etc. The simplest solution for this case. Why not B? - https://www.examtopics.com/discussions/amazon/view/74970-exam-aws-certified-machine-learning-specialty-topic-1/
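For answer D, an Autopilot (AutoML) job takes over the feature engineering, algorithm selection, and tuning that the data scientists skipped. A hedged boto3 sketch is below; the job name, bucket paths, role, and objective metric are illustrative assumptions.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="house-price-autopilot",  # hypothetical name
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/houses/",   # assumed location of the .csv data
        }},
        "TargetAttributeName": "price",               # the column to predict
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/autopilot-output/"},
    ProblemType="Regression",
    AutoMLJobObjective={"MetricName": "MSE"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role
)
# Autopilot handles missing values and categorical encoding automatically;
# when the job finishes, the best candidate can be deployed to an endpoint.
```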
182
182 - A data scientist is reviewing customer comments about a company's products. The data scientist needs to present an initial exploratory analysis by using charts and a word cloud. The data scientist must use feature engineering techniques to prepare this analysis before starting a natural language processing (NLP) model. Which combination of feature engineering techniques should the data scientist use to meet these requirements? (Choose two.) - A.. Named entity recognition B.. Coreference C.. Stemming D.. Term frequency-inverse document frequency (TF-IDF) E.. Sentiment analysis
CD - Sentiment analysis is the result of analysis, not feature engineering. I think the answer should be C & D. D is also wrong: for creating a word cloud, the frequency of words (not their inverse frequency, as used in TF-IDF) is typically the most appropriate metric. Word clouds are designed to visually represent how often a word appears in the text, with more frequent words appearing larger and more prominent. A. NO - it is categorization of words and thus inferencing, not pre-processing B. NO - coreference (e.g., linking "He" to "Mark" seen in a previous sentence) is a complex task, not pre-processing C. YES - it consists of reducing words to their base form to reduce dimensionality D. YES - fast pre-processing task E. NO - it is not feature engineering, it is a model output. C and D for me: C to merge similar word forms, D to down-weight unimportant words like "the", "is", "a". Appeared on the 12-Sep exam. https://towardsdatascience.com/text-analysis-feature-engineering-with-nlp-502d6ea9225d Sentiment analysis IS a part of feature engineering in NLP. Sentiment analysis is not part of feature engineering; I agree A, B, E are not feature engineering. A, B, E are not feature engineering, so why not C & D? - https://www.examtopics.com/discussions/amazon/view/75045-exam-aws-certified-machine-learning-specialty-topic-1/
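A quick illustration of the two chosen techniques (C and D) on toy comments, using NLTK's Porter stemmer and scikit-learn's TfidfVectorizer; the sample sentences are made up.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "The batteries died quickly and the screen cracked",
    "Great battery life, but the screens could be brighter",
]

stemmer = PorterStemmer()

def stem_text(text: str) -> str:
    # Stemming collapses variants such as "battery"/"batteries" to one token
    return " ".join(stemmer.stem(token) for token in text.lower().split())

stemmed = [stem_text(c) for c in comments]

# TF-IDF down-weights words that appear in every comment and highlights
# distinctive terms, which feeds charts and word-cloud weights nicely
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(stemmed)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```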
183
183 - A data scientist is evaluating a GluonTS on Amazon SageMaker DeepAR model. The evaluation metrics on the test set indicate that the coverage score is 0.489 and 0.889 at the 0.5 and 0.9 quantiles, respectively. What can the data scientist reasonably conclude about the distributional forecast related to the test set? - A.. The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should be approximately equal to each other at all quantiles. B.. The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should peak at the median and be lower at the tails. C.. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should always fall below the quantile itself. D.. The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should be approximately equal to the quantile itself.
D - https://ts.gluon.ai/tutorials/forecasting/quick_start_tutorial.html https://ts.gluon.ai/stable/tutorials/forecasting/quick_start_tutorial.html C is correct based on this blog - https://aws.amazon.com/blogs/machine-learning/training-debugging-and-running-time-series-forecasting-models-with-the-gluonts-toolkit-on-amazon-sagemaker/ D for me. D: A well-calibrated model should have quantile coverage close to the desired coverage level (e.g., 90% quantile coverage should be close to 90%). If the quantile coverage is consistently off from the desired level, it may indicate the need to recalibrate the model or investigate the sources of uncertainty estimation errors. I think it is D. Thanks to ChatGPT: Given the coverage score results, the data scientist can conclude that the distributional forecast related to the test set is well calibrated. Specifically, around 50% of the true values should fall below the predicted 0.5 quantile and around 90% of the true values should fall below the predicted 0.9 quantile; the DeepAR model's performance on the test set is therefore reasonable with respect to the coverage of the predicted quantiles. My ChatGPT is the latest: The coverage of a distributional forecast at a given quantile is the fraction of observations that fall below the predicted quantile. In a well-calibrated forecast, the coverage score should be approximately equal to the quantile itself. Given the information: the coverage score is 0.489 at the 0.5 quantile and 0.889 at the 0.9 quantile. For a well-calibrated forecast: at the 0.5 quantile (the median), the coverage should be approximately 0.5; at the 0.9 quantile, the coverage should be approximately 0.9. The provided coverage scores closely match the quantiles, with slight deviations. Therefore, the correct conclusion is Option D: The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should be approximately equal to the quantile itself. Scores should always fall below the quantile itself. Ref: https://d1.awsstatic.com/asset-repository/Amazon%20Forecast%20Technical%20Guide%20to%20Time-Series%20Forecasting%20Principles.pdf -- Pg 18 (PDF Pg 23) https://docs.aws.amazon.com/forecast/latest/dg/metrics.html#metrics-wQL A more concise doc. C is correct. - https://www.examtopics.com/discussions/amazon/view/75279-exam-aws-certified-machine-learning-specialty-topic-1/
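The calibration argument for answer D is easy to check numerically: coverage at quantile q is simply the fraction of actual values that fall at or below the predicted q-quantile, and it should land near q. A small NumPy sketch with made-up numbers:

```python
import numpy as np

# Toy example: actual test-set values and the model's predicted quantiles
actuals  = np.array([12.0, 15.0,  9.0, 20.0, 11.0, 14.0, 18.0, 10.0])
pred_q50 = np.array([13.0, 14.0, 10.0, 19.0, 12.0, 13.0, 17.0, 11.0])
pred_q90 = np.array([16.0, 14.5, 13.0, 24.0, 15.0, 17.0, 22.0, 14.0])

coverage_50 = np.mean(actuals <= pred_q50)  # -> 0.5   (close to the 0.5 quantile)
coverage_90 = np.mean(actuals <= pred_q90)  # -> 0.875 (close to the 0.9 quantile)
print(coverage_50, coverage_90)
```

Coverage scores of 0.489 and 0.889 track their quantiles in the same way, which is why D is the reasonable conclusion.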
184
184 - An energy company has wind turbines, weather stations, and solar panels that generate telemetry data. The company wants to perform predictive maintenance on these devices. The devices are in various locations and have unstable internet connectivity. A team of data scientists is using the telemetry data to perform machine learning (ML) to conduct anomaly detection and predict maintenance before the devices start to deteriorate. The team needs a scalable, secure, high-velocity data ingestion mechanism. The team has decided to use Amazon S3 as the data storage location. Which approach meets these requirements? - A.. Ingest the data by using an HTTP API call to a web server that is hosted on Amazon EC2. Set up EC2 instances in an Auto Scaling configuration behind an Elastic Load Balancer to load the data into Amazon S3. B.. Ingest the data over Message Queuing Telemetry Transport (MQTT) to AWS IoT Core. Set up a rule in AWS IoT Core to use Amazon Kinesis Data Firehose to send data to an Amazon Kinesis data stream that is configured to write to an S3 bucket. C.. Ingest the data over Message Queuing Telemetry Transport (MQTT) to AWS IoT Core. Set up a rule in AWS IoT Core to direct all MQTT data to an Amazon Kinesis Data Firehose delivery stream that is configured to write to an S3 bucket. D.. Ingest the data over Message Queuing Telemetry Transport (MQTT) to Amazon Kinesis data stream that is configured to write to an S3 bucket.
C - Answer is C. B and D are wrong because a Kinesis data stream cannot write to S3 directly. A: not enough. C: correct; check https://docs.aws.amazon.com/firehose/latest/dev/writing-with-iot.html B, D: wrong, KDS doesn't write directly to S3. A. NO - HTTP is not the best protocol for IoT B. NO - no need to buffer the write from Firehose to S3 with Kinesis/Kafka in the middle C. YES - Firehose is a good connector from MQTT (via IoT Core) to S3 D. NO - Kinesis/Kafka cannot ingest MQTT out of the box; Firehose is the right connector. Answer is C. C is correct. - https://www.examtopics.com/discussions/amazon/view/75149-exam-aws-certified-machine-learning-specialty-topic-1/
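Answer C wires an AWS IoT Core topic rule directly to a Kinesis Data Firehose delivery stream whose destination is the S3 bucket. A hedged boto3 sketch is below; the rule name, MQTT topic filter, role ARN, and delivery stream name are assumptions.

```python
import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="telemetry_to_firehose",  # hypothetical rule name
    topicRulePayload={
        # Assumed MQTT topic layout for the turbines, weather stations, and panels
        "sql": "SELECT * FROM 'devices/+/telemetry'",
        "actions": [{
            "firehose": {
                "roleArn": "arn:aws:iam::123456789012:role/IoTToFirehoseRole",  # hypothetical
                "deliveryStreamName": "telemetry-to-s3",  # Firehose stream that writes to the S3 bucket
                "separator": "\n",
            }
        }],
    },
)
```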
185
185 - A retail company collects customer comments about its products from social media, the company website, and customer call logs. A team of data scientists and engineers wants to find common topics and determine which products the customers are referring to in their comments. The team is using natural language processing (NLP) to build a model to help with this classification. Each product can be classified into multiple categories that the company defines. These categories are related but are not mutually exclusive. For example, if there is mention of "Sample Yogurt" in the document of customer comments, then "Sample Yogurt" should be classified as "yogurt," "snack," and "dairy product." The team is using Amazon Comprehend to train the model and must complete the project as soon as possible. Which functionality of Amazon Comprehend should the team use to meet these requirements? - A.. Custom classification with multi-class mode B.. Custom classification with multi-label mode C.. Custom entity recognition D.. Built-in models
B - The answer is B. In multi-label mode, individual classes represent different categories, but these categories are not mutually exclusive while individual classes are mutually exclusive in multi-class mode B - In simple language, it is tagging A. NO - multi-class means more than binomial/2 classes possible targets, but still the document belongs to only 1 B. YES - multiple class can be assigned (eg. using SoftMax for different probabilities) C. NO - it is about assigning Entities to terms in the input documents, not classifying the documents D. NO - you need to customize the classes Answer b In multi-label classification, individual classes represent different categories, but these categories are somehow related and are not mutually exclusive. As a result, each document has at least one class assigned to it, but can have more. For example, a movie can simply be an action movie, or it can be an action movie, a science fiction movie, and a comedy, all at the same time. In multi-class classification, each document can have one and only one class assigned to it. The individual classes are mutually exclusive. For example, a movie can be classed as a documentary or as science fiction, but not both at the same time. https://docs.aws.amazon.com/comprehend/latest/dg/prep-classifier-data-multi-label.html 12-sep exam B. Multi-label: https://docs.aws.amazon.com/comprehend/latest/dg/prep-classifier-data-multi-label.html Written in the technical guide related but not mutually exclusive - https://www.examtopics.com/discussions/amazon/view/74925-exam-aws-certified-machine-learning-specialty-topic-1/
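In multi-label mode the training CSV carries several labels per row, separated by a delimiter that is declared when the classifier is created. A hedged sketch of the relevant boto3 call; the classifier name, role, and bucket path are placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

# Training rows look like: "yogurt|snack|dairy product","Sample Yogurt tastes great ..."
comprehend.create_document_classifier(
    DocumentClassifierName="product-topics",  # hypothetical name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # hypothetical
    LanguageCode="en",
    Mode="MULTI_LABEL",  # categories are related but not mutually exclusive
    InputDataConfig={
        "S3Uri": "s3://example-bucket/comments-labeled.csv",
        "LabelDelimiter": "|",  # separates the multiple labels in each row
    },
)
```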
186
186 - A data engineer is using AWS Glue to create optimized, secure datasets in Amazon S3. The data science team wants the ability to access the ETL scripts directly from Amazon SageMaker notebooks within a VPC. After this setup is complete, the data science team wants the ability to run the AWS Glue job and invoke the SageMaker training job. Which combination of steps should the data engineer take to meet these requirements? (Choose three.) - A.. Create a SageMaker development endpoint in the data science team's VPC. B.. Create an AWS Glue development endpoint in the data science team's VPC. C.. Create SageMaker notebooks by using the AWS Glue development endpoint. D.. Create SageMaker notebooks by using the SageMaker console. E.. Attach a decryption policy to the SageMaker notebooks. F.. Create an IAM policy and an IAM role for the SageMaker notebooks.
BCF - BCF; This tutorial doc says so: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-sage.html BCF is correct "C" is definitely wrong SageMaker cannot be created by using the AWS Glue development endpoint. Explanation: This option is not correct. SageMaker notebooks are created using the SageMaker console or APIs, not through AWS Glue development endpoints. https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-sage.html it says ... "Select the check box next to the name of a development endpoint that you want to use, and on the Action menu, choose Create SageMaker notebook" BDF is correct when creating notebook why do we need Glue development endpoint? it should be D A. YES - By requirement, the notebook must be in a VPC B. NO - Data is already in S3, we do not need to know it was made with AWS Glue C. NO - Data is already in S3 thanks to AWS Glue, no runtime relationship with SageMaker D. YES - need to create the Notebooks at some point E. NO - no need to decrypt, it is about ACL F. YES - notebooks need to be able to read S3 Sorry, B is true. Scripts must be accessible they say. Then A is wrong ? Creating a SageMaker development endpoint in the data science team's VPC will allow the data science team to access the ETL scripts and the AWS Glue job from within their VPC. Creating an IAM policy and an IAM role for the SageMaker notebooks will allow the data science team to access the ETL scripts and the AWS Glue job with the appropriate permissions. Creating SageMaker notebooks by using the SageMaker console will allow the data science team to easily create and manage the SageMaker notebooks. In the AWS Glue console, choose Dev endpoints to navigate to the development endpoints list. Select the check box next to the name of a development endpoint that you want to use, and on the Action menu, choose Create SageMaker notebook. Fill out the Create and configure a notebook page as follows: Enter a notebook name. Under Attach to development endpoint, verify the development endpoint. Create or choose an AWS Identity and Access Management (IAM) role. - https://www.examtopics.com/discussions/amazon/view/74999-exam-aws-certified-machine-learning-specialty-topic-1/
187
187 - A data engineer needs to provide a team of data scientists with the appropriate dataset to run machine learning training jobs. The data will be stored in Amazon S3. The data engineer is obtaining the data from an Amazon Redshift database and is using join queries to extract a single tabular dataset. A portion of the schema is as follows: TransactionTimestamp (Timestamp) CardName (Varchar) CardNo (Varchar) The data engineer must provide the data so that any row with a CardNo value of NULL is removed. Also, the TransactionTimestamp column must be separated into a TransactionDate column and a TransactionTime column. Finally, the CardName column must be renamed to NameOnCard. The data will be extracted on a monthly basis and will be loaded into an S3 bucket. The solution must minimize the effort that is needed to set up infrastructure for the ingestion and transformation. The solution also must be automated and must minimize the load on the Amazon Redshift cluster. Which solution meets these requirements? - A.. Set up an Amazon EMR cluster. Create an Apache Spark job to read the data from the Amazon Redshift cluster and transform the data. Load the data into the S3 bucket. Schedule the job to run monthly. B.. Set up an Amazon EC2 instance with a SQL client tool, such as SQL Workbench/J, to query the data from the Amazon Redshift cluster directly Export the resulting dataset into a file. Upload the file into the S3 bucket. Perform these tasks monthly. C.. Set up an AWS Glue job that has the Amazon Redshift cluster as the source and the S3 bucket as the destination. Use the built-in transforms Filter, Map, and RenameField to perform the required transformations. Schedule the job to run monthly. D.. Use Amazon Redshift Spectrum to run a query that writes the data directly to the S3 bucket. Create an AWS Lambda function to run the query monthly.
C - Agreed with C. It's always Glue. The answer was between C and D, but we are supposed to minimize use of the Redshift cluster, so the answer is C; A and B are too much effort, so they don't fit the constraints of the question. Simply put, the requirements are a full ETL process where data will be extracted from Redshift (E), then transformed by renaming, removing null values, and separating the timestamp column (T), and finally loaded into S3 (L), all with the least overhead, which makes AWS Glue ideal for these requirements. A. NO - AWS Glue (serverless) is a simpler option than EMR to run Spark jobs B. NO - Spark is a better option for data pipelines; it avoids the need for intermediary files C. YES - Spark and AWS Glue are the best combination D. NO - Amazon Redshift Spectrum is a "Lake House" architecture, meant to run SQL against both the DW and S3; here, we want to query only from the DW. The reason is that this solution can leverage the existing capabilities of AWS Glue, which is a fully managed service that can help users create, run, and manage ETL (extract, transform, and load) workflows. According to the web search results, AWS Glue can connect to various data sources and destinations, such as Amazon Redshift and Amazon S3, and use Apache Spark as the underlying processing engine. AWS Glue also provides various built-in transforms that can perform common data manipulation operations, such as filtering, mapping, renaming, or joining. Moreover, AWS Glue supports scheduling and automation of ETL jobs using triggers or workflows. Agree with C. C. Reason: we want to minimize infrastructure effort, so we should prioritize serverless solutions; we want something automated that minimizes the load on the Redshift cluster. That said, option A is wrong as it requires managing a cluster (EMR), just like option B (EC2). Option D brings in Redshift Spectrum; however, the data is in Redshift, not S3, so this option is discarded, since that service is meant to query data in S3 from Redshift using SQL. Option C is correct. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html - https://www.examtopics.com/discussions/amazon/view/89019-exam-aws-certified-machine-learning-specialty-topic-1/
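Answer C's transformations map directly onto Glue's built-in transforms. The sketch below is a Glue PySpark job script (it runs inside a Glue job, not locally); the catalog database, table name, and output path are assumptions, and the timestamp handling presumes the column arrives as a Python datetime value.

```python
from awsglue.context import GlueContext
from awsglue.transforms import Filter, Map, RenameField
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Source: a Data Catalog table backed by the Redshift connection (assumed names)
dyf = glue_ctx.create_dynamic_frame.from_catalog(database="warehouse", table_name="transactions")

# Filter: drop rows where CardNo is NULL
dyf = Filter.apply(frame=dyf, f=lambda row: row["CardNo"] is not None)

# Map: split TransactionTimestamp into TransactionDate and TransactionTime
def split_timestamp(row):
    ts = row["TransactionTimestamp"]
    row["TransactionDate"] = ts.date().isoformat()
    row["TransactionTime"] = ts.time().isoformat()
    return row

dyf = Map.apply(frame=dyf, f=split_timestamp)

# RenameField: CardName -> NameOnCard
dyf = RenameField.apply(frame=dyf, old_name="CardName", new_name="NameOnCard")

# Destination: the S3 bucket, as CSV; a monthly trigger schedules the job
glue_ctx.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/monthly-extract/"},
    format="csv",
)
```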
188
188 - A machine learning (ML) specialist wants to bring a custom training algorithm to Amazon SageMaker. The ML specialist implements the algorithm in a Docker container that is supported by SageMaker. How should the ML specialist package the Docker container so that SageMaker can launch the training correctly? - A.. Specify the server argument in the ENTRYPOINT instruction in the Dockerfile. B.. Specify the training program in the ENTRYPOINT instruction in the Dockerfile. C.. Include the path to the training data in the docker build command when packaging the container. D.. Use a COPY instruction in the Dockerfile to copy the training program to the /opt/ml/train directory.
B - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html The /opt/ml directory is the default directory where SageMaker expects the training script and other related files to be located. The script at location above is triggered by setting environment variable SAGEMAKER_PROGRAM and *not* through an ENTRYPOINT in docker file A. NO - There is no server here, we do training not inference B. YES C. NO - path to training data is externally provided, not hardcoded in the image D. NO - /opt/ml/train is the working directory of the ENTRYPOINT Amazon SageMaker supports bringing custom training algorithms by using Docker containers, which are software packages that can contain all the dependencies and configurations needed to run an application. Dockerfile is a text file that contains the instructions for building a Docker image, which is a snapshot of a Docker container. ENTRYPOINT is an instruction in the Dockerfile that specifies the default executable or command that will run when the container is started. By specifying the training program in the ENTRYPOINT instruction, the ML specialist can ensure that Amazon SageMaker can run the training program automatically when it creates and runs a Docker container for the training job. https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html In Step 2, it is mentioned to use this instruction on dockerfile: # Defines train.py as script entrypoint ENV SAGEMAKER_PROGRAM train.py - https://www.examtopics.com/discussions/amazon/view/88925-exam-aws-certified-machine-learning-specialty-topic-1/
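To tie the answer together: the image's Dockerfile sets ENTRYPOINT to the training script, and SageMaker then runs the container for each training job. A hedged sketch follows, with an assumed Dockerfile shown as comments and a SageMaker Python SDK call that launches training with that custom image; the ECR URI, role, and paths are placeholders.

```python
# Assumed Dockerfile for the custom training image:
#   FROM python:3.10-slim
#   RUN pip install --no-cache-dir pandas scikit-learn
#   COPY train.py /opt/ml/code/train.py
#   ENTRYPOINT ["python", "/opt/ml/code/train.py"]   # <- what SageMaker executes for training
#
# Launching a training job that uses the image pushed to ECR:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/custom-train:latest",  # hypothetical
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",                  # hypothetical
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/output/",
)
# Input channels are mounted under /opt/ml/input/data/<channel> inside the container
estimator.fit({"train": "s3://example-bucket/training-data/"})
```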
189
189 - An ecommerce company wants to use machine learning (ML) to monitor fraudulent transactions on its website. The company is using Amazon SageMaker to research, train, deploy, and monitor the ML models. The historical transactions data is in a .csv file that is stored in Amazon S3. The data contains features such as the user's IP address, navigation time, average time on each page, and the number of clicks for each session. There is no label in the data to indicate if a transaction is anomalous. Which models should the company use in combination to detect anomalous transactions? (Choose two.) - A.. IP Insights B.. K-nearest neighbors (k-NN) C.. Linear learner with a logistic function D.. Random Cut Forest (RCF) E.. XGBoost
AD - A and D are the right answer. Random Cut Forest (RCF) (Option D): RCF is an unsupervised algorithm designed for anomaly detection. It can identify unusual patterns in the data without requiring labeled examples of fraudulent transactions. IP Insights (Option A): IP Insights is another unsupervised algorithm that can detect anomalies based on IP address usage patterns. It is particularly useful for identifying suspicious activities related to IP addresses. A. YES - IP Insights works unsupervised on IP addresses; built-in algorithm B. NO - k-NN is a supervised algorithm that needs labels and does not help with anomalies here C. NO - Linear learner is supervised D. YES - Random Cut Forest (RCF) is unsupervised anomaly detection E. NO - XGBoost is supervised. IP Insights is an unsupervised learning algorithm that learns the usage patterns of IP addresses. It can capture associations between IP addresses and various entities, such as user IDs or account numbers. It can also identify anomalous events, such as a user attempting to log in from an unusual IP address, or an account that is creating resources from a suspicious IP address. Random Cut Forest (RCF) is another unsupervised algorithm for detecting anomalous data points within a dataset. It can handle arbitrary-dimensional input and scale well with respect to number of features, data set size, and number of instances. It can detect anomalies such as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. Can't be A, as we don't have data in the format expected by the IP Insights algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights-training-data-formats.html). It expects CSV format, and the question mentions the data is in CSV format, so IP Insights is correct. A and D are correct: A. IP Insights for pattern recognition; D. Random Cut Forest (RCF) for anomaly detection. B, C, and E are normally supervised learning algorithms, which conflicts with the wording "There is no label ...". C is not part of the answer. IP Insights because the data contains IP addresses; RCF because the data is unlabeled and anomalies are being detected for fraud. AD are correct, apparently. AD. AC is the correct answer to detect anomalies. - https://www.examtopics.com/discussions/amazon/view/88731-exam-aws-certified-machine-learning-specialty-topic-1/
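A brief sketch of training the Random Cut Forest half of the A + D answer with the SageMaker Python SDK; the role, bucket paths, and hyperparameter values are assumptions, and the feature matrix stands in for the numeric session features from the .csv file.

```python
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()

rcf = RandomCutForest(
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=256,
    num_trees=100,
    data_location="s3://example-bucket/rcf-input/",    # assumed staging location
    output_path="s3://example-bucket/rcf-output/",
    sagemaker_session=session,
)

# Unlabeled numeric matrix (navigation time, time on page, clicks, ...)
features = np.random.rand(1000, 4).astype("float32")   # stand-in for the real data
rcf.fit(rcf.record_set(features))

# After deployment, higher anomaly scores flag potentially fraudulent sessions.
```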
190
190 - A healthcare company is using an Amazon SageMaker notebook instance to develop machine learning (ML) models. The company's data scientists will need to be able to access datasets stored in Amazon S3 to train the models. Due to regulatory requirements, access to the data from instances and services used for training must not be transmitted over the internet. Which combination of steps should an ML specialist take to provide this access? (Choose two.) - A.. Configure the SageMaker notebook instance to be launched with a VPC attached and internet access disabled. B.. Create and configure a VPN tunnel between SageMaker and Amazon S3. C.. Create and configure an S3 VPC endpoint Attach it to the VPC. D.. Create an S3 bucket policy that allows traffic from the VPC and denies traffic from the internet. E.. Deploy AWS Transit Gateway Attach the S3 bucket and the SageMaker instance to the gateway.
AC - I think the answer is CD. A is wrong. SageMaker notebook does not need to have internet access disabled. Agree. The setting should be relevant to S3 and VPC, not the notebook. if notebook is not within vpc, then having s3 bucket policy to allow traffic only from vpc will block notebook to get data from s3. A C A and C seems fine CD is right AD is the answer While creating Sagemaker notebook instances we have to decide on the access (via VPC and/or direct internet). Here we will select access only from VPC. The same VPC should become a requirement to access S3 bucket via S3 bucket policy. C would have been fine but fails to mention creation of S3 access points and those access points can be restricted to VPC. A&C as explained in this blog as well - https://aws.amazon.com/blogs/machine-learning/secure-amazon-s3-access-for-isolated-amazon-sagemaker-notebook-instances/ A. NO - Notebook must run in a VPC (SageMaker will provision an instance), but with a private subnet there is no need to disable internet traffic B. NO - VPN tunnel is to encrypt traffic with the Internet C. YES - Endpoint will prevent S3 traffic to flow over the internet D. YES - Create an S3 bucket policy that allows traffic from the VPC and denies traffic from the internet. E. NO - AWS Transit Gateway is for multiple VPCs By configuring the SageMaker notebook instance to be launched with a VPC attached and internet access disabled, the data scientists can access the resources within the VPC, such as Amazon EFS or Amazon EC2 instances, without exposing them to the internet1. This also prevents the notebook instance from accessing any resources outside the VPC, such as Amazon S3, unless a VPC endpoint is configured2. By creating and configuring an S3 VPC endpoint and attaching it to the VPC, the data scientists can access the datasets stored in Amazon S3 from the notebook instance using private IP addresses. The S3 VPC endpoint is a gateway endpoint that routes the traffic between the VPC and Amazon S3 within the AWS network, without requiring an internet gateway or a NAT device3. This also enhances the security and performance of the data access1. CD : Question is about make S3 Data not accessible from Internet & VPC Endpoint Only. S3 by default is not public, you don't have to deny traffic from internet. Just not make it public. the requirements are about providing secure access from notebooks to S3, nothing else. the answer is C,D. Firstly, the VPC need to connect with S3 through gateway endpoint, check "https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html" secondly, after connection is created, we need to define the policy from s3 side. restrict access to s3 only from specified VPC or VPC endpoint. "https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#bucket-policies-s3" the confusion about A is tricky. Ideally, you need to create sagemaker in private subnet with no internet access. But I assume the question "access to the data from instances and services" only requires the process from obtaining data from s3, you don't need to specify the requirement about data egress from training service(even though disable internet connection from sagemaker is crucial) A and C are correct. To disable direct internet access, you can specify a VPC for your notebook instance. By doing so, you prevent SageMaker from providing internet access to your notebook instance. 
As a result, the notebook instance can't train or host models unless your VPC has an interface endpoint (AWS PrivateLink) or a NAT gateway and your security groups allow outbound connections. https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-interface-endpoint.html D is wrong: a bucket policy can't be used to deny internet access; it can only enforce access from a VPC or VPC endpoint. Your statement "a bucket policy can't be used to deny internet access" is completely wrong; you can specify either "Allow" or "Deny" in a bucket policy, see "https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#bucket-policies-s3": You can create a bucket policy that restricts access to a specific endpoint by using the aws:sourceVpce condition key. You can use Amazon S3 bucket policies to control access to buckets from specific virtual private cloud (VPC) endpoints, or specific VPCs. This section contains example bucket policies that can be used to control Amazon S3 bucket access from VPC endpoints. The notebook doesn't need to be created within a VPC. The question requires that access to the data from instances and services used for training must not be transmitted over the internet, so the traffic has to go through the VPC endpoint, and thus the notebook has to live in the VPC. - https://www.examtopics.com/discussions/amazon/view/89092-exam-aws-certified-machine-learning-specialty-topic-1/
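The bucket-policy side of this discussion (the aws:sourceVpce condition) looks roughly like the sketch below; the bucket name and VPC endpoint ID are placeholders, and the policy denies any request that does not arrive through the gateway endpoint.

```python
import json

import boto3

s3 = boto3.client("s3")

bucket = "example-training-datasets"       # hypothetical bucket
vpce_id = "vpce-0a1b2c3d4e5f67890"         # the S3 gateway endpoint attached to the VPC

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnlessThroughVpcEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        "Condition": {"StringNotEquals": {"aws:sourceVpce": vpce_id}},
    }],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```

In practice you would also exempt administrative roles in the condition so the deny statement cannot lock you out of the bucket.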
191
191 - A machine learning (ML) specialist at a retail company is forecasting sales for one of the company's stores. The ML specialist is using data from the past 10 years. The company has provided a dataset that includes the total amount of money in sales each day for the store. Approximately 5% of the days are missing sales data. The ML specialist builds a simple forecasting model with the dataset and discovers that the model performs poorly. The performance is poor around the time of seasonal events, when the model consistently predicts sales figures that are too low or too high. Which actions should the ML specialist take to try to improve the model's performance? (Choose two.) - A.. Add information about the store's sales periods to the dataset. B.. Aggregate sales figures from stores in the same proximity. C.. Apply smoothing to correct for seasonal variation. D.. Change the forecast frequency from daily to weekly. E.. Replace missing values in the dataset by using linear interpolation.
AC - I think the answer is BC BC - https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-ets.html While smoothing can help, it doesn’t directly address the missing data issue or provide the model with additional context about specific sales periods, which are crucial for improving the model’s accuracy around seasonal events A is for sure. C, watch the keywords mentioned in the question "too low or too high"! Ans: AC A. Add Information About the Store's Sales Periods: This directly targets the issue of seasonality affecting the sales forecast. C. Apply Smoothing to Correct for Seasonal Variation: Smoothing techniques will help in handling the seasonal trends more effectively, which seems to be a major factor in the model's current performance issues. both of A and C solves the seasonality issues A&E for sure. For option E, refer to this blog - https://aws.amazon.com/blogs/machine-learning/prepare-time-series-data-with-amazon-sagemaker-data-wrangler/ A. YES - valuable contextual information B. NO - irrelevant to seasonal events C. YES - Removes noise and can help make patterns easier to identify D. NO - not point to loose precious information such as weekend days E. NO - 5% data loss is not a big deal, might as well drop them Adding information about the store’s sales periods to the dataset can help the model learn about patterns in sales that are specific to certain times of year. This can help the model make more accurate predictions around seasonal events. ------------------- Smoothing can help correct for seasonal variation by removing some of the noise from the data. This can help the model make more accurate predictions -------------------- None of the other options address the seasonal variation in my opinion B&C were my top choices without looking at the key. E for sure then either C or A. would go for A Answer: AC Smoothing has different uses. Please find the definition Data smoothing can be defined as a statistical approach to eliminating outliers from datasets to make the patterns more noticeable. ChatGPT say C + E CE C. Apply smoothing to correct for seasonal variation: Seasonal variation can have a significant impact on sales data. By applying smoothing techniques such as moving averages or exponential smoothing, the ML specialist can reduce the noise and fluctuations caused by seasonal effects, allowing the model to capture the underlying patterns more effectively. E. Replace missing values in the dataset by using linear interpolation: Missing data can introduce biases and affect the accuracy of the model. Linear interpolation is a common technique for filling in missing values by estimating the missing data points based on the available data. By replacing the missing values, the ML specialist ensures that the model has a complete and representative dataset to learn from. A to improve model in seasonal periods E to fill missing data I would go for C and E C is quite obvious I think E Linear interpolation is a technique to fill the missing data https://towardsdatascience.com/4-techniques-to-handle-missing-values-in-time-series-data-c3568589b5a8 why not C and D? Maybe the sales event can last longer than a week? - https://www.examtopics.com/discussions/amazon/view/89093-exam-aws-certified-machine-learning-specialty-topic-1/
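Two of the fixes debated above (adding sales-period information and interpolating the missing ~5% of days) are simple to express with pandas; the dates and values below are made up, and the holiday window is an assumed promotion period.

```python
import pandas as pd

# Toy daily sales series with some missing values
sales = pd.DataFrame(
    {"sales": [1200.0, None, 1350.0, 5100.0, 4900.0, None, 1300.0]},
    index=pd.date_range("2023-12-20", periods=7, freq="D"),
)

# Option E: fill missing days with linear interpolation instead of dropping them
sales["sales"] = sales["sales"].interpolate(method="linear")

# Option A: add a feature marking the store's sales periods / seasonal events
promo_window = pd.date_range("2023-12-23", "2023-12-26")  # assumed promotion window
sales["is_sales_period"] = sales.index.isin(promo_window).astype(int)

print(sales)
```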
192
192 - A newspaper publisher has a table of customer data that consists of several numerical and categorical features, such as age and education history, as well as subscription status. The company wants to build a targeted marketing model for predicting the subscription status based on the table data. Which Amazon SageMaker built-in algorithm should be used to model the targeted marketing? - A.. Random Cut Forest (RCF) B.. XGBoost C.. Neural Topic Model (NTM) D.. DeepAR forecasting
B - B is correct. IMO: A - no, Random Cut Forest is for anomaly detection. B - yes, exactly what XGBoost is good for: binary classification based on a variety of input features. C - no, NTM is unsupervised; the problem states the table already has the subscription status, therefore we need a supervised algorithm. D - no, DeepAR is used for time-series data. Whether the subscription status is binary or multi-class, XGBoost can handle the problem in this case. A. NO - Random Cut Forest (RCF) is used for anomaly detection B. YES - XGBoost is good for classification C. NO - Neural Topic Model (NTM) is for finding topics, not classifying D. NO - that is for time series. XGBoost. Option B is correct, as we have a supervised classification problem here. XGBoost for binary classification: XGBoost is a popular and powerful algorithm for binary classification problems such as this one, where the goal is to predict a binary outcome (e.g., whether a customer subscribes or not). It is particularly effective when the dataset has a mix of numerical and categorical features. The answer is B: A. Random Cut Forest (RCF): anomaly detection B. XGBoost: supports classification tasks like the use case in the question C. Neural Topic Model (NTM): topic modelling D. DeepAR forecasting: time series. I think the answer is B. There looks to be no time-series component, so A and D may not be suitable. D. Refer to https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html Can you please let us know where in the question it states that a date/time column is available? It is all about numerical and categorical columns, and we need to predict the subscription status, which can be done by using XGBoost. You're right, there is no time series. - https://www.examtopics.com/discussions/amazon/view/88927-exam-aws-certified-machine-learning-specialty-topic-1/
193
193 - A company will use Amazon SageMaker to train and host a machine learning model for a marketing campaign. The data must be encrypted at rest. Most of the data is sensitive customer data. The company wants AWS to maintain the root of trust for the encryption keys and wants key usage to be logged. Which solution will meet these requirements with the LEAST operational overhead? - A.. Use AWS Security Token Service (AWS STS) to create temporary tokens to encrypt the storage volumes for all SageMaker instances and to encrypt the model artifacts and data in Amazon S3. B.. Use customer managed keys in AWS Key Management Service (AWS KMS) to encrypt the storage volumes for all SageMaker instances and to encrypt the model artifacts and data in Amazon S3. C.. Use encryption keys stored in AWS CloudHSM to encrypt the storage volumes for all SageMaker instances and to encrypt the model artifacts and data in Amazon S3. D.. Use SageMaker built-in transient keys to encrypt the storage volumes for all SageMaker instances. Enable default encryption for new Amazon Elastic Block Store (Amazon EBS) volumes.
B - It seems to be B. A. NO - you don't want temporary access B. YES - best practice C. NO - CloudHSM is overkill vs. KMS D. NO - transient keys are transient. AWS Key Management Service (AWS KMS) is a service that allows customers to create and manage encryption keys that can be used to encrypt data at rest in AWS services. AWS KMS provides a high level of security and control over the encryption keys, as well as integration with AWS CloudTrail to log key usage. By using customer managed keys in AWS KMS, the company can encrypt the storage volumes for all SageMaker instances, such as notebook instances, training instances, and endpoint instances. This can be done by specifying the KMS key ID when creating or updating the instances. The company can also encrypt the model artifacts and data in Amazon S3 by using the same or different KMS keys. This can be done by enabling server-side encryption with KMS keys when creating or updating the S3 buckets or objects. Option B: generally, when you talk about encryption at rest on AWS, you talk about KMS. Why not A? I think using AWS STS to create temporary tokens is easier than creating a customer managed AWS KMS key. STS is for a different purpose (temporary credentials, not encryption). Why not B? It should be B. This question is the same as Q143. - https://www.examtopics.com/discussions/amazon/view/88928-exam-aws-certified-machine-learning-specialty-topic-1/
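With the SageMaker Python SDK, answer B comes down to passing the customer managed KMS key for both the training volumes and the S3 output; the key ARN, image URI, role, and paths below are placeholders.

```python
from sagemaker.estimator import Estimator

kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"  # hypothetical CMK

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/training-image:latest",  # hypothetical
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",                    # hypothetical
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_kms_key=kms_key_arn,   # encrypts the training instances' storage volumes
    output_kms_key=kms_key_arn,   # encrypts the model artifacts written to S3
    output_path="s3://example-bucket/model-artifacts/",
)
# Key usage (Encrypt/Decrypt/GenerateDataKey calls) shows up in AWS CloudTrail.
```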
194
194 - A data scientist is working on a model to predict a company's required inventory stock levels. All historical data is stored in .csv files in the company's data lake on Amazon S3. The dataset consists of approximately 500 GB of data The data scientist wants to use SQL to explore the data before training the model. The company wants to minimize costs. Which option meets these requirements with the LEAST operational overhead? - A.. Create an Amazon EMR cluster. Create external tables in the Apache Hive metastore, referencing the data that is stored in the S3 bucket. Explore the data from the Hive console. B.. Use AWS Glue to crawl the S3 bucket and create tables in the AWS Glue Data Catalog. Use Amazon Athena to explore the data. C.. Create an Amazon Redshift cluster. Use the COPY command to ingest the data from Amazon S3. Explore the data from the Amazon Redshift query editor GUI. D.. Create an Amazon Redshift cluster. Create external tables in an external schema, referencing the S3 bucket that contains the data. Explore the data from the Amazon Redshift query editor GUI.
B - I think the answer is B. The others are quite expensive and complicated. A. NO - B is easier B. YES - works natively against S3 C. NO - no need to import S3 data into Redshift when Presto/Athena allows you to query it directly D. NO - Redshift is overkill. AWS Glue: we want to use SQL to explore the 500 GB dataset saved in S3, and we also want to minimize costs and have the least operational headache. Options A, C, and D require provisioning and managing infrastructure, hence: headache. The correct alternative is option B. B, as it has the "LEAST operational overhead". The advantage of D is that Redshift, as a data warehouse, can handle large datasets (>1 TB) and complex, frequent queries. In this example, with a 500 GB dataset and infrequent queries (I consider this a one-time ad-hoc exploration, just to verify the data before training), Athena would be a much better option. The option that meets these requirements with the LEAST operational overhead is option B: Use AWS Glue to crawl the S3 bucket and create tables in the AWS Glue Data Catalog. Use Amazon Athena to explore the data. AWS Glue is a fully managed ETL service that can automatically discover and catalog metadata about data stored in various data stores, including Amazon S3. By using AWS Glue to crawl the S3 bucket, the data scientist can easily create tables in the AWS Glue Data Catalog, without needing to create or manage any infrastructure. Amazon Athena is an interactive query service that allows querying data stored in Amazon S3 using SQL. By using Amazon Athena, the data scientist can easily explore the data using SQL, without needing to set up any infrastructure. Both Glue and Athena are serverless, hence cost effective. B is fully managed, unlike the other options. It seems to be B. - https://www.examtopics.com/discussions/amazon/view/89135-exam-aws-certified-machine-learning-specialty-topic-1/
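A hedged sketch of option B's moving parts: a Glue crawler that catalogs the .csv files and an Athena query against the resulting table; the crawler name, role, database, S3 paths, and column names are assumptions.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# 1) Crawl the data lake prefix so the .csv files appear as a table in the Data Catalog
glue.create_crawler(
    Name="inventory-crawler",                                     # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",        # hypothetical role
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/inventory/"}]},
)
glue.start_crawler(Name="inventory-crawler")

# 2) Explore the cataloged table with standard SQL in Athena (serverless, pay per query)
athena.start_query_execution(
    QueryString="SELECT product_id, AVG(stock_level) AS avg_stock "
                "FROM inventory GROUP BY product_id LIMIT 100",   # assumed column names
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
```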
195
196 - A company is using Amazon SageMaker to build a machine learning (ML) model to predict customer churn based on customer call transcripts. Audio files from customer calls are located in an on-premises VoIP system that has petabytes of recorded calls. The on-premises infrastructure has high-velocity networking and connects to the company's AWS infrastructure through a VPN connection over a 100 Mbps connection. The company has an algorithm for transcribing customer calls that requires GPUs for inference. The company wants to store these transcriptions in an Amazon S3 bucket in the AWS Cloud for model development. Which solution should an ML specialist use to deliver the transcriptions to the S3 bucket as quickly as possible? - A.. Order and use an AWS Snowball Edge Compute Optimized device with an NVIDIA Tesla module to run the transcription algorithm. Use AWS DataSync to send the resulting transcriptions to the transcription S3 bucket. B.. Order and use an AWS Snowcone device with Amazon EC2 Inf1 instances to run the transcription algorithm. Use AWS DataSync to send the resulting transcriptions to the transcription S3 bucket. C.. Order and use AWS Outposts to run the transcription algorithm on GPU-based Amazon EC2 instances. Store the resulting transcriptions in the transcription S3 bucket. D.. Use AWS DataSync to ingest the audio files to Amazon S3. Create an AWS Lambda function to run the transcription algorithm on the audio files when they are uploaded to Amazon S3. Configure the function to write the resulting transcriptions to the transcription S3 bucket.
A - I think the answer is A. B: Snowcone is limited to 8 TB. C: is an AWS on-premises solution, but the company wants to store these transcriptions in an Amazon S3 bucket in the AWS Cloud for model development. D: 100 Mbps cannot handle a DataSync transfer of petabytes. A is the right answer. Folks, please... Outposts is very complicated to implement, and the question does not ask about continuing this afterwards. We know that Snowball has limited storage, but the idea is not to send petabytes of data to S3, only the processed results. So the idea is to use the Snowball as a compute-optimized machine that can process the data and send it to S3. The Snowball Edge Compute Optimized device provides 52 vCPUs, 208 GiB of memory, and an optional NVIDIA Tesla V100 GPU. For storage, the device provides 42 TB usable HDD capacity for Amazon S3 or Amazon EBS, as well as 7.68 TB of usable NVMe SSD capacity for EBS block volumes. Snowball Edge Compute Optimized devices run Amazon EC2 sbe-c and sbe-g instances, which are equivalent to C5, M5a, G3, and P3 instances. A. NO - Snowball needs to be shipped back to AWS; that does not use DataSync B. YES - that is an edge computing device C. NO - Too big an admin overhead to have your local AWS Cloud D. NO - Too slow over the 100 Mbps connection A, as this is the faster option. A ==> bring the compute closer to the data. The key is faster transfer: the device supports network speeds of up to 100 Gb per second. It's either A or C. But C is too complicated to set up: you order a rack and AWS installs it, plus you need enterprise support, and the biggest reason it is not possible in this case is that it requires at least a 1 Gbps connection. The question clearly asks for "as quickly as possible", so A is the best choice in my opinion. The key is to deliver the transcripts to S3 as early as possible. Outposts ordering and provisioning takes months. I would go for A as it's logical to do local inference and send the transcripts to S3. The model needs to run on premises so that it is not necessary to upload all the audio data before running it, and this solution can use DataSync to upload the results afterwards, so it's a good choice! Outposts is for use cases where the customer doesn't want to transfer their data out. It is A. C can't be the answer: transferring 1 PB of data may take about 1,000 days over a 100 Mbps network. My vote goes to C. A Snowball device only stores around 80 TB of data, and uploading the newly transcribed data through DataSync still goes through the slow connection between on-site and AWS. Outposts seems like the only feasible solution here that can satisfy both requirements. Correct Answer A. D is incorrect as it takes approximately 1,024 days to transfer a petabyte of data. The best solutions would be options A and D. Option A: Order and use an AWS Snowball Edge Compute Optimized device with NVIDIA Tesla modules to run the transcription algorithm. Use AWS DataSync to send the generated transcriptions to a transcription S3 bucket. This option allows using a device that has the necessary GPU for running the transcription algorithm and then using AWS DataSync to send the generated transcriptions to the S3 bucket. Option D: Use AWS DataSync to ingest the audio files to Amazon S3. Create an AWS Lambda function to run the transcription algorithm on the audio file as it is uploaded to Amazon S3. Configure the function to write the generated transcriptions to the transcriptions S3 bucket. This option automatically transcribes the audio files as they are uploaded to S3.
This means that the transcriptions are ready as soon as the audio files are uploaded and eliminates the need to transcribe the audio files separately. https://aws.amazon.com/about-aws/whats-new/2020/07/aws-snowball-edge-compute-optimized-now-available-additional-aws-regions/ I guess C, given that "The company wants to store these transcriptions". Petabytes of audio data can stay on premises. - https://www.examtopics.com/discussions/amazon/view/88943-exam-aws-certified-machine-learning-specialty-topic-1/
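The "~1,000 days" figure in the discussion comes from simple arithmetic; a quick sanity check, assuming 1 PB of audio and no protocol overhead:

# Sanity check on option D: moving ~1 PB of audio over the 100 Mbps VPN link.
data_bits = 1e15 * 8              # 1 PB (decimal) expressed in bits
link_bps = 100e6                  # 100 megabits per second
seconds = data_bits / link_bps
print(round(seconds / 86400))     # ~926 days, before any protocol overhead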
196
197 - A company has a podcast platform that has thousands of users. The company has implemented an anomaly detection algorithm to detect low podcast engagement based on a 10-minute running window of user events such as listening, pausing, and exiting the podcast. A machine learning (ML) specialist is designing the data ingestion of these events with the knowledge that the event payload needs some small transformations before inference. How should the ML specialist design the data ingestion to meet these requirements with the LEAST operational overhead? - A.. Ingest event data by using a GraphQLAPI in AWS AppSync. Store the data in an Amazon DynamoDB table. Use DynamoDB Streams to call an AWS Lambda function to transform the most recent 10 minutes of data before inference. B.. Ingest event data by using Amazon Kinesis Data Streams. Store the data in Amazon S3 by using Amazon Kinesis Data Firehose. Use AWS Glue to transform the most recent 10 minutes of data before inference. C.. Ingest event data by using Amazon Kinesis Data Streams. Use an Amazon Kinesis Data Analytics for Apache Flink application to transform the most recent 10 minutes of data before inference. D.. Ingest event data by using Amazon Managed Streaming for Apache Kafka (Amazon MSK). Use an AWS Lambda function to transform the most recent 10 minutes of data before inference.
C - B, but it was possible to use Kinesis Data Firehose directly instead of Kinesis Data Streams. I think the answer is B. This has the least latency, a moving window, and fewer components. C: Doesn't say how to store the data in Amazon S3. With Amazon Kinesis Data Analytics for Apache Flink, the ML specialist needs to manage the scaling and resource allocation for the Flink application, including determining the appropriate number of processing units (KPUs) and handling scaling based on the incoming data volume. This requires monitoring and adjusting resources as needed, adding to the operational overhead. C is the correct answer. B is workable but is not a good fit for the small transformation required in the question. C. Amazon Managed Service for Apache Flink was previously known as Amazon Kinesis Data Analytics for Apache Flink. It allows you to process and analyze streaming data, providing the capability to perform transformations on the stream. B - no need to use an extra service (AWS Glue). I would choose C. We need to implement our detection on a running window, and B only allows us to perform operations on the latest 10 minutes of data. If we choose B, we also need to decide how frequently to run the Glue job, and that involves some orchestrator tools. C, on the other hand, works in real-time mode, and we don't need an orchestration tool to move the window. Based on this, I would go with C as it has less overhead. Would choose C, as the transformation required is minimal and could easily be achieved with KDA (a Flink job). Not sure between B & C. A. NO - too many moving parts B. YES - clean & elegant C. YES - works as well in batch mode D. NO - MSK is outdated https://aws.amazon.com/blogs/architecture/realtime-in-stream-inference-kinesis-sagemaker-flink/ Amazon Kinesis Data Streams is a fully managed real-time streaming service that can be used to ingest large amounts of data from multiple sources. This makes it a good choice for ingesting the event data from the podcast platform. Amazon Kinesis Data Analytics for Apache Flink is a fully managed service that can be used to process streaming data using Apache Flink. Apache Flink is a popular stream processing framework that is known for its scalability and fault tolerance. This makes it a good choice for transforming the event data before inference. It's B, "LEAST operational overhead"; C is more operational overhead. No, that's not true: for B we need some orchestrator to run the Glue job frequently, but C runs constantly, so C has no orchestration step. Flink distributes the data across one or more stream partitions, and user-defined operators can transform the data stream. B is the answer, least management overhead. And for C you have to author and build your Apache Flink application - extra work. And for Glue you need to write a SQL or Spark script - extra work. Ah, yes, and for B you need to create an orchestrator to run the ETL jobs frequently. Answer should be C. https://aws.amazon.com/blogs/architecture/realtime-in-stream-inference-kinesis-sagemaker-flink/ Flink distributes the data across one or more stream partitions, and user-defined operators can transform the data stream. - https://www.examtopics.com/discussions/amazon/view/89148-exam-aws-certified-machine-learning-specialty-topic-1/
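For concreteness, a minimal sketch of the ingestion side of option C: the podcast client writes one record per user event to Kinesis Data Streams, and the Flink application downstream maintains the 10-minute window and applies the small transformation. Stream name and payload fields are hypothetical.

import json
import boto3

kinesis = boto3.client("kinesis")

event = {
    "user_id": "u-123",                       # illustrative payload
    "podcast_id": "p-456",
    "action": "pause",                        # listen / pause / exit
    "timestamp": "2024-01-01T12:00:00Z",
}

# One call per user event; partitioning by user keeps each user's events ordered.
kinesis.put_record(
    StreamName="podcast-engagement-events",   # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)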
197
198 - A company wants to predict the classification of documents that are created from an application. New documents are saved to an Amazon S3 bucket every 3 seconds. The company has developed three versions of a machine learning (ML) model within Amazon SageMaker to classify document text. The company wants to deploy these three versions to predict the classification of each document. Which approach will meet these requirements with the LEAST operational overhead? - A.. Configure an S3 event notification that invokes an AWS Lambda function when new documents are created. Configure the Lambda function to create three SageMaker batch transform jobs, one batch transform job for each model for each document. B.. Deploy all the models to a single SageMaker endpoint. Treat each model as a production variant. Configure an S3 event notification that invokes an AWS Lambda function when new documents are created. Configure the Lambda function to call each production variant and return the results of each model. C.. Deploy each model to its own SageMaker endpoint Configure an S3 event notification that invokes an AWS Lambda function when new documents are created. Configure the Lambda function to call each endpoint and return the results of each model. D.. Deploy each model to its own SageMaker endpoint. Create three AWS Lambda functions. Configure each Lambda function to call a different endpoint and return the results. Configure three S3 event notifications to invoke the Lambda functions when new documents are created.
B - yes, It seems to be 'B' agreed, shadow testing is supported on SageMaker. https://aws.amazon.com/cn/blogs/aws/new-for-amazon-sagemaker-perform-shadow-tests-to-compare-inference-performance-between-ml-model-variants/ Not sure if the shadow testing fits into the purpose here. "The company has developed three versions of a machine learning (ML) model within Amazon SageMaker to classify document text." B is correct, from within single endpoint, we can create multiple production variant. When lambda called, it should have been each target variant instead of production variant in the verbiage C is fine B is not possible as it is single sagemaker endpoint (so we won't get prediction from all models for each document) D is wrong as we do not need three lambda functions A is also wrong as time gap is 3 seconds for which we should be running batch transform jobs Will go with B A. NO - you don't want to create a new job for each Lambda invokation B. YES - best practice C. NO - could work but does not leverage production variants which in-turn disable some built-in model performance evaluation features D. NO - more operationnal overhead to have 3 endpoints Answer is B Although C sounds like a better option but B is less operational overhead at least for short term. It's B, you can use Invoke a Multi-Model Endpoint, when you call invoke_endpoint you need to provide which model filw to use. response1 = runtime_sagemaker_client.invoke_endpoint( EndpointName = "MAIN_ENDPOINT", TargetModel = "model1.tar.gz", Body = body) response2 = runtime_sagemaker_client.invoke_endpoint( EndpointName = "MAIN_ENDPOINT", TargetModel = "model2.tar.gz", Body = body) response3 = runtime_sagemaker_client.invoke_endpoint( EndpointName = "MAIN_ENDPOINT", TargetModel = "model3.tar.gz", Body = body) Ref: https://docs.aws.amazon.com/sagemaker/latest/dg/invoke-multi-model-endpoint.html B - the reason is not shadow testing since it is not named and does not require client logic. The reason is that it is possible to target a model https://docs.aws.amazon.com/sagemaker/latest/dg/invoke-multi-model-endpoint.html B it is which involves using single endpoint for multiple model versions I think the answer should be C. As there is no production version of the model identified, all the 3 models need to be invoked. C, prod variant is used for traffic routing. All model needs to be invoked. C is correct I think the answer is B. - https://www.examtopics.com/discussions/amazon/view/89150-exam-aws-certified-machine-learning-specialty-topic-1/
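A sketch of how the Lambda function in option B can call each production variant explicitly with TargetVariant (the snippet in the discussion shows the multi-model variant with TargetModel instead). Endpoint name, variant names, and the payload are placeholders.

import boto3

runtime = boto3.client("sagemaker-runtime")
document_text = "Example document text pulled from the new S3 object"   # placeholder payload

results = {}
for variant in ("variant-1", "variant-2", "variant-3"):                  # hypothetical variant names
    response = runtime.invoke_endpoint(
        EndpointName="document-classifier",                              # hypothetical endpoint
        ContentType="text/csv",
        Body=document_text,
        TargetVariant=variant,    # without this, traffic is split by the variants' weights
    )
    results[variant] = response["Body"].read()

Design note: a single endpoint with three weighted production variants keeps one thing to deploy and monitor, while TargetVariant still lets the Lambda collect a prediction from every model version for each document.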
198
199 - A manufacturing company needs to identify returned smartphones that have been damaged by moisture. The company has an automated process that produces 2,000 diagnostic values for each phone. The database contains more than five million phone evaluations. The evaluation process is consistent, and there are no missing values in the data. A machine learning (ML) specialist has trained an Amazon SageMaker linear learner ML model to classify phones as moisture damaged or not moisture damaged by using all available features. The model's F1 score is 0.6. Which changes in model training would MOST likely improve the model's F1 score? (Choose two.) - A.. Continue to use the SageMaker linear learner algorithm. Reduce the number of features with the SageMaker principal component analysis (PCA) algorithm. B.. Continue to use the SageMaker linear learner algorithm. Reduce the number of features with the scikit-learn multi-dimensional scaling (MDS) algorithm. C.. Continue to use the SageMaker linear learner algorithm. Set the predictor type to regressor. D.. Use the SageMaker k-means algorithm with k of less than 1,000 to train the model. E.. Use the SageMaker k-nearest neighbors (k-NN) algorithm. Set a dimension reduction target of less than 1,000 to train the model.
AE - KNN can be used for dimensionality reduction through NCA (https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html#) A. Correct B. Incorrect. MDS is a non-linear dimensionality reduction method. https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b C. Incorrect. This is a classification problem, not regression. D. Incorrect. K-means is for clustering (unsupervised learning). E. Correct. Why is a non-linear dimensionality reduction method a problem? We can add non-linear features such as x^2 to a model to improve performance. E: https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html B is wrong as multidimensional scaling sits under the unsupervised branch of machine learning algorithms. A. YES - F1 score is low; reducing the feature count could improve the F1 score. B. NO - MDS is for visualization C. NO - a regressor is to predict a numerical value, we want classification D. NO - K-means is clustering, we want classification E. YES - k-NN could work if a linear model is not best A & E are correct A and B are correct. But I understand E being correct as well. https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html I think the answer is AB; k-means and k-nearest neighbors are not for dimension reduction. Actually, the SageMaker k-NN algorithm does support it. https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html - https://www.examtopics.com/discussions/amazon/view/89151-exam-aws-certified-machine-learning-specialty-topic-1/
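To illustrate the idea behind option A, a scikit-learn PCA is used here as a local stand-in for the SageMaker PCA algorithm: compress the 2,000 diagnostic values into a few hundred components before retraining the linear learner. The array, sample counts, and target dimensionality are illustrative only.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the 5M+ phone evaluations with 2,000 diagnostic values each.
X = np.random.rand(10_000, 2_000)

pca = PCA(n_components=200)              # illustrative target dimensionality
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (10000, 200)
print(pca.explained_variance_ratio_.sum())   # how much variance the 200 components retain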
199
200 - A company is building a machine learning (ML) model to classify images of plants. An ML specialist has trained the model using the Amazon SageMaker built-in Image Classification algorithm. The model is hosted using a SageMaker endpoint on an ml.m5.xlarge instance for real-time inference. When used by researchers in the field, the inference has greater latency than is acceptable. The latency gets worse when multiple researchers perform inference at the same time on their devices. Using Amazon CloudWatch metrics, the ML specialist notices that the ModelLatency metric shows a high value and is responsible for most of the response latency. The ML specialist needs to fix the performance issue so that researchers can experience less latency when performing inference from their devices. Which action should the ML specialist take to meet this requirement? - A.. Change the endpoint instance to an ml.t3 burstable instance with the same vCPU number as the ml.m5.xlarge instance has. B.. Attach an Amazon Elastic Inference ml.eia2.medium accelerator to the endpoint instance. C.. Enable Amazon SageMaker Autopilot to automatically tune performance of the model. D.. Change the endpoint instance to use a memory optimized ML instance.
B - It's B https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-endpoint-latency/ False, it's C. In the link you shared, under High ModelLatency, it states "If an endpoint is overused, it might cause higher model latency. You can add Auto scaling to an endpoint to dynamically increase and decrease the number of instances available for an instance." C is the most wrong solution: Autopilot is not autoscaling in AWS. Autopilot is for model training; autoscaling happens during inference. A. NO - this is image processing, so more CPU would only provide incremental improvement B. YES - this is image processing, so a GPU would provide a step change; supported by the built-in algorithm C. NO - Autopilot is for training, not inference D. NO - usually inference uses little memory Attach an Amazon Elastic Inference ml.eia2.medium accelerator to the endpoint instance. Amazon Elastic Inference allows users to attach low-cost GPU-powered acceleration to Amazon EC2 and SageMaker instances or Amazon ECS tasks, to reduce the cost of running deep learning inference by up to 75%. B is not correct anymore. After April 15, 2023, new customers will not be able to launch instances with Amazon EI accelerators in Amazon SageMaker, Amazon ECS, or Amazon EC2. (https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html) Changes in exams apply 6 months after the change has been applied (Oct 2023). Freak! It's B. Burstable instances only help when a lot of users are making inferences at the same time. Bro, isn't that exactly what the question is asking about? The ModelLatency metric shows that the model inference time is causing the latency issue. Amazon Elastic Inference is designed to speed up the inference process of a machine learning model without needing to deploy the model on a more powerful instance. By attaching an Elastic Inference accelerator to the endpoint instance, the ML specialist can offload the compute-intensive parts of the inference process to the accelerator, resulting in faster inference times and lower latency. B - https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-endpoint-latency/ Elastic Inference accelerator (and autoscaling, but there is no autoscaling in the options). Be aware that Autopilot is not autoscaling. - https://www.examtopics.com/discussions/amazon/view/88924-exam-aws-certified-machine-learning-specialty-topic-1/
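A minimal SageMaker Python SDK sketch of option B, attaching an Elastic Inference accelerator at deploy time. Note the discussion's caveat that Elastic Inference is no longer offered to new customers; the image URI, model artifact path, and role below are placeholders.

from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-uri>",                               # placeholder container image
    model_data="s3://example-bucket/model/model.tar.gz",             # placeholder artifact
    role="arn:aws:iam::111122223333:role/SageMakerRole",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    accelerator_type="ml.eia2.medium",   # GPU-backed acceleration for the CNN inference path
)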
200
201 - An automotive company is using computer vision in its autonomous cars. The company has trained its models successfully by using transfer learning from a convolutional neural network (CNN). The models are trained with PyTorch through the use of the Amazon SageMaker SDK. The company wants to reduce the time that is required for performing inferences, given the low latency that is required for self-driving. Which solution should the company use to evaluate and improve the performance of the models? - A.. Use Amazon CloudWatch algorithm metrics for visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on this information. Apply pruning to remove the low-ranking filters. Set the new weights. Run a new training job with the pruned model. B.. Use SageMaker Debugger for visibility into the training weights, gradients, biases, and activation outputs. Adjust the model hyperparameters, and look for lower inference times. Run a new training job. C.. Use SageMaker Debugger for visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on this information. Apply pruning to remove the low-ranking filters. Set the new weights. Run a new training job with the pruned model. D.. Use SageMaker Model Monitor for visibility into the ModelLatency metric and OverheadLatency metric of the model after the model is deployed. Adjust the model hyperparameters, and look for lower inference times. Run a new training job.
C - A. NO - Must use SageMaker Debugger for visibility into model insights B. NO - Hyperparameters will most likely influence model accuracy but not response time C. YES - SageMaker Debugger is the right tool for model insights; filter (or "kernels") slides in CNN to identify specific features D. NO - SageMaker Model Monitor is for model performance Pruning is a technique that reduces the complexity of convolutional neural networks (CNNs) by removing unimportant filters or neurons. This can lead to faster inference times and lower memory consumption, which are desirable for self-driving applications. Pruning can be done by ranking the filters based on some criteria, such as the norm of the weights, the activation outputs, or the Taylor expansion of the loss function123. ChatGPT is an awesome tool, but please ML colleagues: study! you are very right, about how awesome ChatGPT, but since we find it's answers over here, so some colleagues are trying to help us in proving why these could be the right answers without wasting our time to prove it. All the names over here are without any way of connection and most of the names are fictitious, so when the leave their answers, we don't know them but still we know the right answers with the right proof. The company should use solution C. Use SageMaker Debugger for visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on this information. Apply pruning to remove the low-ranking filters. Set the new weights. Run a new training job with the pruned model. Same example here: https://aws.amazon.com/blogs/machine-learning/pruning-machine-learning-models-with-amazon-sagemaker-debugger-and-amazon-sagemaker-experiments/ Selected Answer: B To reduce the time required for performing inferences in autonomous cars, the automotive company should use SageMaker Debugger for visibility into the training weights, gradients, biases, and activation outputs. They can adjust the model hyperparameters and look for lower inference times. They can also use SageMaker Model Monitor for visibility into the ModelLatency metric and OverheadLatency metric of the model after the model is deployed. However, option C, which suggests computing the filter ranks based on the training outputs and applying pruning to remove the low-ranking filters, is not applicable for transfer learning models since the layers in the pre-trained model are already trained and cannot be changed. Therefore, the correct solution is B. better not use chatgpt without knowing something of AWS, it will trick you Even if a better machine could help, the problem is about the model, not about the general or the machine in specific. Using SageMaker Debugger, the company can monitor the training process and evaluate the performance of the model by computing filter ranks based on information like weights, gradients, biases, and activation outputs. After identifying the low-ranking filters, the company can apply pruning to remove them and set new weights. By doing so, the company can reduce the model size and improve the inference time. Finally, a new training job with the pruned model can be run to verify the performance improvements Not D because Model Monitor is a tool for monitoring the performance of deployed models, and it does not provide any direct feedback or insights into the model training process or ways to improve model inference time. 
Therefore, while Model Monitor can be useful for monitoring the performance of deployed models, it is not the best choice for evaluating and improving the performance of the models during the training phase, which is what the question is asking for. It's between C and D. But I think it's C. C. https://aws.amazon.com/blogs/machine-learning/pruning-machine-learning-models-with-amazon-sagemaker-debugger-and-amazon-sagemaker-experiments/ Everything is there. D: https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-endpoint-latency/ Here it says use Cloudwatch to view ModelLatency and OverheadLatency, not Model Monitor. I think Model Monitor is just for model performance i.e. drift, bias, accuracy etc. The answer I guess D per below , they should have said Sagemaker model monitor using cloud watch https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html C, https://aws.amazon.com/blogs/machine-learning/pruning-machine-learning-models-with-amazon-sagemaker-debugger-and-amazon-sagemaker-experiments/ I would say 'D as a more generic approach than C. The problem can be caused not just filters. The answer is "c" as the question is asking for evaluate and improve the performance of the models? "https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-visualization.html" I think the answer is D. - https://www.examtopics.com/discussions/amazon/view/89262-exam-aws-certified-machine-learning-specialty-topic-1/
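A hedged sketch of the first half of option C: enabling SageMaker Debugger tensor collections on a PyTorch training job so weights and gradients can later be used to compute filter ranks for pruning. Bucket paths, role, instance type, and framework versions are illustrative, not values from the question.

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
from sagemaker.pytorch import PyTorch

# Capture weights and gradients during training for later filter ranking.
hook_config = DebuggerHookConfig(
    s3_output_path="s3://example-bucket/debugger-tensors/",
    collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="gradients"),
    ],
)

estimator = PyTorch(
    entry_point="train.py",                                          # the existing training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13.1",                                      # illustrative version pair
    py_version="py39",
    debugger_hook_config=hook_config,
)
estimator.fit({"training": "s3://example-bucket/train/"})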
201
202 - A company's machine learning (ML) specialist is designing a scalable data storage solution for Amazon SageMaker. The company has an existing TensorFlow-based model that uses a train.py script. The model relies on static training data that is currently stored in TFRecord format. What should the ML specialist do to provide the training data to SageMaker with the LEAST development overhead? - A.. Put the TFRecord data into an Amazon S3 bucket. Use AWS Glue or AWS Lambda to reformat the data to protobuf format and store the data in a second S3 bucket. Point the SageMaker training invocation to the second S3 bucket. B.. Rewrite the train.py script to add a section that converts TFRecord data to protobuf format. Point the SageMaker training invocation to the local path of the data. Ingest the protobuf data instead of the TFRecord data. C.. Use SageMaker script mode, and use train.py unchanged. Point the SageMaker training invocation to the local path of the data without reformatting the training data. D.. Use SageMaker script mode, and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the SageMaker training invocation to the S3 bucket without reformatting the training data.
D - Should be D. TFRecord could be uploaded to S3 directly and be used as SageMaker's data source. https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_batch_transform/working_with_tfrecords/working-with-tfrecords.html#Upload-dataset-to-S3 Then why not C then how is local path a "scalable data storage solution" ? answer is D It has to option D. A. NO - SageMaker can use TFRecods as-is in S3 B. NO - SageMaker can use TFRecods as-is in S3 C. NO - SageMaker must use S3 as input, it cannot read your local data D. YES - SageMaker can use TFRecods as-is in S3 SageMaker script mode allows you to use your existing TensorFlow training scripts without any modifications. You can use the same TFRecord data format that your model expects, and point the SageMaker training invocation to the S3 bucket where the data is stored. SageMaker will automatically download the data to the local path of the training instance and pass it as an argument to your train.py script. You don’t need to reformat the data to protobuf format or rewrite your script to convert the data12. This option allows the ML specialist to use the existing train.py script and TFRecord data without any changes, minimizing development overhead. By using SageMaker script mode, the specialist can run the existing TensorFlow script as-is, and by pointing the SageMaker training invocation to the S3 bucket containing the TFRecord data, the specialist can provide the training data to SageMaker without reformatting it. This option leverages SageMaker's built-in support for the TensorFlow framework and script mode. The existing train.py script can be used without any modifications. SageMaker will automatically download the training data from the specified S3 location to the instance running the training job. This option saves development time by avoiding the need to rewrite the train.py script or reformat the training data. - https://www.examtopics.com/discussions/amazon/view/89051-exam-aws-certified-machine-learning-specialty-topic-1/
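A short sketch of option D with the SageMaker Python SDK: script mode runs the existing train.py unchanged and the training channel simply points at the TFRecord files in S3. Bucket, role, and framework versions are placeholders.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                                          # existing script, unchanged
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.12",                                        # illustrative version pair
    py_version="py310",
)

# SageMaker downloads the TFRecord files from S3 to the training instance;
# train.py reads them from the channel directory exactly as it would locally.
estimator.fit({"training": "s3://example-bucket/tfrecords/"})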
202
203 - An ecommerce company wants to train a large image classification model with 10,000 classes. The company runs multiple model training iterations and needs to minimize operational overhead and cost. The company also needs to avoid loss of work and model retraining. Which solution will meet these requirements? - A.. Create the training jobs as AWS Batch jobs that use Amazon EC2 Spot Instances in a managed compute environment. B.. Use Amazon EC2 Spot Instances to run the training jobs. Use a Spot Instance interruption notice to save a snapshot of the model to Amazon S3 before an instance is terminated. C.. Use AWS Lambda to run the training jobs. Save model weights to Amazon S3. D.. Use managed spot training in Amazon SageMaker. Launch the training jobs with checkpointing enabled.
D - https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html Managed spot training can optimize the cost of training models up to 90% over on-demand instances. SageMaker manages the Spot interruptions on your behalf. "Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting." It has to be D. With Spot training we can reduce the cost and save the model weights with checkpoint enabled. agree. managed spot training is also cost effective A. NO - D is simpler B. NO - D is simpler C. NO - D is simpler D. YES - works out-of-the-box Managed spot training in Amazon SageMaker uses Amazon EC2 Spot instances to run training jobs, which can optimize the cost of training models by up to 90% over on-demand instances 1. SageMaker manages the Spot interruptions on the company’s behalf 1. By enabling checkpointing, the company can ensure that if a Spot instance is interrupted, the training job can resume from the last checkpoint instead of restarting, avoiding loss of work and model retraining 1 Selected Answer: D Use managed spot training in Amazon SageMaker. Launch the training jobs with checkpointing enabled. Managed spot training in Amazon SageMaker can help minimize operational overhead and cost by using spot instances to perform the training. This can significantly reduce the cost of training, while still achieving the same accuracy. SageMaker provides built-in checkpointing capability, which allows saving model weights and progress to Amazon S3 periodically. This ensures that even if the spot instances are terminated, the training can resume from the last saved checkpoint. Additionally, SageMaker provides a managed service, so the ecommerce company does not need to worry about managing the infrastructure, and can focus on building and tuning their model. Selected Answer: D The ML specialist should choose option D, which provides the training data to SageMaker with the least development overhead. This option involves putting the TFRecord data into an Amazon S3 bucket and pointing the SageMaker training invocation to the S3 bucket without reformatting the training data. Using SageMaker script mode is a convenient way to execute training scripts without any modification. Since the training script train.py already works with TFRecord data, it can be used as is without any changes. By storing the data in S3 and accessing it from there, the specialist can take advantage of SageMaker's built-in data distribution and parallelization capabilities, which can significantly speed up training. Rewriting the train.py script or using additional services like AWS Glue or Lambda would add unnecessary complexity and increase development overhead. Managed spot training in Amazon SageMaker provides a cost-effective way to run large machine learning workloads. With managed spot training, the training jobs are executed using Amazon EC2 Spot instances, which can significantly reduce the cost of training. Additionally, by launching training jobs with checkpointing enabled, the work done up to the last checkpoint is saved to Amazon S3. This ensures that the training job can be resumed from the last checkpoint in case of instance failure or termination. 
This minimizes the risk of data loss and avoids the need for retraining the model from scratch. Using Amazon SageMaker also reduces the operational overhead required to set up and manage the training environment. - https://www.examtopics.com/discussions/amazon/view/89213-exam-aws-certified-machine-learning-specialty-topic-1/
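A minimal sketch of option D: managed spot training with checkpointing enabled so an interrupted job resumes from the last checkpoint instead of restarting. The training image URI, bucket paths, role, and time limits are placeholders.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                                # placeholder training image
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    use_spot_instances=True,
    max_run=72 * 3600,                 # max training time in seconds
    max_wait=96 * 3600,                # must be >= max_run; includes waiting for Spot capacity
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",   # SageMaker syncs local checkpoints here
)
estimator.fit({"training": "s3://example-bucket/images/"})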
203
204 - A retail company uses a machine learning (ML) model for daily sales forecasting. The model has provided inaccurate results for the past 3 weeks. At the end of each day, an AWS Glue job consolidates the input data that is used for the forecasting with the actual daily sales data and the predictions of the model. The AWS Glue job stores the data in Amazon S3. The company's ML team determines that the inaccuracies are occurring because of a change in the value distributions of the model features. The ML team must implement a solution that will detect when this type of change occurs in the future. Which solution will meet these requirements with the LEAST amount of operational overhead? - A.. Use Amazon SageMaker Model Monitor to create a data quality baseline. Confirm that the emit_metrics option is set to Enabled in the baseline constraints file. Set up an Amazon CloudWatch alarm for the metric. B.. Use Amazon SageMaker Model Monitor to create a model quality baseline. Confirm that the emit_metrics option is set to Enabled in the baseline constraints file. Set up an Amazon CloudWatch alarm for the metric. C.. Use Amazon SageMaker Debugger to create rules to capture feature values Set up an Amazon CloudWatch alarm for the rules. D.. Use Amazon CloudWatch to monitor Amazon SageMaker endpoints. Analyze logs in Amazon CloudWatch Logs to check for data drift.
A - A is correct. "If the statistical nature of the data that your model receives while in production drifts away from the nature of the baseline data it was trained on, the model begins to lose accuracy in its predictions. Amazon SageMaker Model Monitor uses rules to detect data drift and alerts you when it happens." https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-quality.html A. YES - Model Monitor can validate the distribution of input data B. NO - a model quality baseline is for model performance, e.g. precision, F1 score, etc. C. NO - Model Monitor is the right tool D. NO - Model Monitor is the right tool; it's a problem of monitoring data distributions. The reason for this choice is that Amazon SageMaker Model Monitor is a feature of Amazon SageMaker that allows you to monitor and analyze your machine learning models in production. Model Monitor can automatically detect data drift and other data quality issues by comparing your live data with a baseline dataset that you provide. Model Monitor can also emit metrics and alerts when it detects violations of the constraints that you define or that it suggests based on the baseline. A is correct. The best solution to meet the requirements is to use Amazon SageMaker Model Monitor to create a data quality baseline. The ML team can set up a data quality baseline to detect when the input data to the model has drifted significantly from the historical distribution of the data. When data drift occurs, Model Monitor emits a metric that can trigger an alarm in Amazon CloudWatch. The ML team can use this alarm to investigate and take corrective action. Option B is incorrect because a model quality baseline monitors model performance, not input data quality. Option C is incorrect because Amazon SageMaker Debugger is used to debug machine learning models and to identify problems with model training, not data quality. Option D is incorrect because Amazon CloudWatch does not provide any features to monitor data drift in the input data used for the machine learning model. Data monitor: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-quality.html - the properties of the independent variables change due to seasonality, customer preferences. Model monitor: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality.html - the concept of what a spam email is changes over time. "The company's ML team determines that the inaccuracies are occurring because of a change in the value distributions of the model features." They know the model features, i.e. the data fed to the model, are changing, so we monitor the data. https://pair-code.github.io/what-if-tool/learn/tutorials/features-overview/ A. I think the answer is B. Data quality can be monitored via the Model Monitor model quality baseline. B, since it's "a change in the value distributions of the model features"; model features = data. What is the difference between answers A and B? 
model quality baseline vs data quality baseline - https://www.examtopics.com/discussions/amazon/view/88838-exam-aws-certified-machine-learning-specialty-topic-1/
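A small SageMaker Python SDK sketch of the data quality baseline that option A relies on; a monitoring schedule would then compare production inputs against the suggested constraints, and a CloudWatch alarm can watch the emitted drift metric. Role, instance type, and S3 paths are placeholders.

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/SageMakerRole",             # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Compute baseline statistics and suggested constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/baseline/training-data.csv",   # placeholder path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/monitor/baseline/",
)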
204
205 - A machine learning (ML) specialist has prepared and used a custom container image with Amazon SageMaker to train an image classification model. The ML specialist is performing hyperparameter optimization (HPO) with this custom container image to produce a higher quality image classifier. The ML specialist needs to determine whether HPO with the SageMaker built-in image classification algorithm will produce a better model than the model produced by HPO with the custom container image. All ML experiments and HPO jobs must be invoked from scripts inside SageMaker Studio notebooks. How can the ML specialist meet these requirements in the LEAST amount of time? - A.. Prepare a custom HPO script that runs multiple training jobs in SageMaker Studio in local mode to tune the model of the custom container image. Use the automatic model tuning capability of SageMaker with early stopping enabled to tune the model of the built-in image classification algorithm. Select the model with the best objective metric value. B.. Use SageMaker Autopilot to tune the model of the custom container image. Use the automatic model tuning capability of SageMaker with early stopping enabled to tune the model of the built-in image classification algorithm. Compare the objective metric values of the resulting models of the SageMaker AutopilotAutoML job and the automatic model tuning job. Select the model with the best objective metric value. C.. Use SageMaker Experiments to run and manage multiple training jobs and tune the model of the custom container image. Use the automatic model tuning capability of SageMaker to tune the model of the built-in image classification algorithm. Select the model with the best objective metric value. D.. Use the automatic model tuning capability of SageMaker to tune the models of the custom container image and the built-in image classification algorithm at the same time. Select the model with the best objective metric value.
C - https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html Autopilot is faster: "Amazon SageMaker Autopilot experiments are now up to 2x faster in Hyperparameter Optimization training mode". Refer to https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-sagemaker-autopilot-experiments-2x-faster-hyperparameter-optimization-training-mode/?nc1=h_ls SageMaker Autopilot is designed to automatically build, train, and tune the best machine learning model based on a dataset, without the user needing to choose an algorithm. It's not designed to be used with custom container images. It seems that Autopilot doesn't support image data (image classification), so B will be incorrect in this case https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-datasets-problem-types.html#autopilot-datasets This is the MOST efficient approach. SageMaker's automatic model tuning (HPO) is designed to efficiently search the hyperparameter space and find the best model. By using it for both the custom container image and the built-in algorithm, the ML specialist can directly compare the performance of the two approaches in the least amount of time. Since AMT works seamlessly with both the built-in image classification algorithm and custom container images, answer D ("tune both models at the same time") might seem feasible at first glance. However, each HPO job is independent: you cannot run a single AMT job for multiple algorithms or containers simultaneously. This limitation makes C the more appropriate choice for managing and comparing experiments systematically. C. The question asks to "determine whether HPO with the SageMaker built-in image classification algorithm will produce a better model than the model produced by HPO with the custom container image", meaning experiment with both options and then determine which is better. "All ML experiments and HPO jobs must be invoked from scripts inside SageMaker Studio notebooks"; SageMaker Experiments provides more capability https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html SageMaker's automatic model tuning (also known as hyperparameter optimization, or HPO) is designed to find the best hyperparameters for your model by running multiple training jobs with different hyperparameter configurations. It supports both built-in algorithms and custom container images, making it a versatile tool for this task. Option D. Not C, because using SageMaker Experiments to manage multiple training jobs adds an extra layer of management complexity. While it helps in tracking experiments, it does not inherently speed up the HPO process compared to running them concurrently. In summary, Option D provides the most efficient and straightforward approach to determine the best model by leveraging SageMaker's automatic model tuning capabilities to run HPO on both models simultaneously. Option D stands out as the most effective approach because it leverages SageMaker's automatic model tuning capabilities for both the custom container image and the built-in image classification algorithm. This ensures: D. The question is talking about how to do HPO using AWS SageMaker for a model in a custom image. Experiments is not for doing HPO, because you would need to input the parameters manually. So D. Amazon SageMaker Experiments is ideal for this. D: By using SageMaker's automatic model tuning capability to tune both the custom container image model and the built-in image classification algorithm model simultaneously, it leverages the parallel processing capabilities of SageMaker. 
This approach allows for efficient utilization of compute resources and can potentially complete the tuning process for both models in a shorter amount of time compared to running separate tuning jobs sequentially. Additionally, option D aligns with the requirement of invoking all ML experiments and HPO jobs from scripts inside SageMaker Studio notebooks, as SageMaker's automatic model tuning can be initiated and managed through notebook scripts. While options B and C could potentially work, option D provides the most direct and efficient path to meeting the requirements in the least amount of time by leveraging SageMaker's parallel processing capabilities and avoiding potential development overhead or limitations associated with other approaches. The best option to meet the requirements in the least amount of time is D. Use the automatic model tuning capability of SageMaker to tune the models of the custom container image and the built-in image classification algorithm at the same time. This approach directly utilizes SageMaker's built-in capabilities for HPO, applies to both custom containers and built-in algorithms, and avoids the inefficiencies associated with local mode or manual management of experiments. It's important to note that while the tuning jobs would not literally run "at the same time" in a single operation, this option represents the most efficient use of SageMaker's capabilities for both scenarios. Should be C. We are looking at comparing 2 models here, where Sagemaker Experiments fits the bill. D is out because "Amazon SageMaker automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset." https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html D can be done easily https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html "You can use SageMaker AMT with built-in algorithms, custom algorithms, or SageMaker pre-built containers for machine learning frameworks." I will go with D https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html Will go with B and Autopilot supports Image classification as per this link - https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html Autopilot currently supports the following problem types: Regression, binary, and multiclass classification with tabular data formatted as CSV or Parquet files in which each column contains a feature with a specific data type and each row contains an observation. The column data types accepted include numerical, categorical, text, and time series that consists of strings of comma-separated numbers. Text classification with data formatted as CSV or Parquet files in which a column provides the sentences to be classified, while another column should provide the corresponding class label. Image classification with images formats such as PNG, JPEG or a combination of both. Time-series forecasting with time-series data formatted as CSV or as Parquet files. SageMaker Autopilot is designed to automatically build, train, and tune the best machine learning model based on a dataset, without the user needing to choose an algorithm. It's not designed to be used with custom container images. A. NO - try AMT (=Automatic Model Tuning) before using custom HPO scripts; further, no reason to use the local mode B. NO - Autopilot is not for HPO only, it will also select a model etc. C. 
NO - requires manual parameter setting for each experiments D. YES - AMT (=Automatic Model Tuning) work with custom containers - https://www.examtopics.com/discussions/amazon/view/89266-exam-aws-certified-machine-learning-specialty-topic-1/
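For reference, automatic model tuning is driven from a notebook like this, once per algorithm. The sketch assumes an already-built estimator named custom_estimator for the custom container (a second tuner would wrap the built-in Image Classification estimator); the metric name, regex, ranges, and job counts are illustrative.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=custom_estimator,                                # assumed to exist already
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-4, 1e-1)},
    metric_definitions=[{                                      # needed for custom containers:
        "Name": "validation:accuracy",                         # regex must match the container's logs
        "Regex": "validation accuracy: ([0-9\\.]+)",
    }],
    max_jobs=20,
    max_parallel_jobs=4,
)
tuner.fit({
    "training": "s3://example-bucket/train/",
    "validation": "s3://example-bucket/val/",
})

# After both tuning jobs finish, compare each tuner's best objective metric value.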
205
206 - A company wants to deliver digital car management services to its customers. The company plans to analyze data to predict the likelihood of users changing cars. The company has 10 TB of data that is stored in an Amazon Redshift cluster. The company's data engineering team is using Amazon SageMaker Studio for data analysis and model development. Only a subset of the data is relevant for developing the machine learning models. The data engineering team needs a secure and cost-effective way to export the data to a data repository in Amazon S3 for model development. Which solutions will meet these requirements? (Choose two.) - A.. Launch multiple medium-sized instances in a distributed SageMaker Processing job. Use the prebuilt Docker images for Apache Spark to query and plot the relevant data and to export the relevant data from Amazon Redshift to Amazon S3. B.. Launch multiple medium-sized notebook instances with a PySpark kernel in distributed mode. Download the data from Amazon Redshift to the notebook cluster. Query and plot the relevant data. Export the relevant data from the notebook cluster to Amazon S3. C.. Use AWS Secrets Manager to store the Amazon Redshift credentials. From a SageMaker Studio notebook, use the stored credentials to connect to Amazon Redshift with a Python adapter. Use the Python client to query the relevant data and to export the relevant data from Amazon Redshift to Amazon S3. D.. Use AWS Secrets Manager to store the Amazon Redshift credentials. Launch a SageMaker extra-large notebook instance with block storage that is slightly larger than 10 TB. Use the stored credentials to connect to Amazon Redshift with a Python adapter. Download, query, and plot the relevant data. Export the relevant data from the local notebook drive to Amazon S3. E.. Use SageMaker Data Wrangler to query and plot the relevant data and to export the relevant data from Amazon Redshift to Amazon S3.
CE - CE Option A: Launching multiple medium-sized instances in a distributed SageMaker Processing job and using the prebuilt Docker images for Apache Spark to query and plot the relevant data is a possible solution, but it may not be the most cost-effective solution as it requires spinning up multiple instances. Option B: Launching multiple medium-sized notebook instances with a PySpark kernel in distributed mode is another solution, but it may not be the most secure solution as the data would be stored on the instances and not in a centralized data repository. Option D: Using AWS Secrets Manager to store the Amazon Redshift credentials and launching a SageMaker extra-large notebook instance is a solution, but the block storage requirement of slightly larger than 10 TB could be costly and may not be necessary. C and E. There is no security control in option A. Option A can do it as well but could be expensive and not as easy as option E. Changed my mind to AC because Data Wrangler may struggle with 10 TB. By using distributed SageMaker Processing jobs with Apache Spark and securely managing credentials with AWS Secrets Manager, the data engineering team can efficiently and securely export the relevant data from Amazon Redshift to Amazon S3. As soon as I see "download" and "Python client", I am worried about speed and efficiency, so I would say A and E. A. NO - a SageMaker Processing job is a self-contained feature using the sagemaker.processing API; it does not rely on invoking Spark directly B. NO - you want to identify the relevant slice of data without having to download everything first C. YES - minimize data movement D. NO - you want to identify the relevant slice of data without having to download everything first E. YES - built-in tool specifically designed for that use case E for sure, but I was a bit confused between A and C; based on the link I would go for C https://aws.amazon.com/blogs/big-data/using-the-amazon-redshift-data-api-to-interact-from-an-amazon-sagemaker-jupyter-notebook/ E is the obvious choice; the other is C: https://aws.amazon.com/blogs/big-data/using-the-amazon-redshift-data-api-to-interact-from-an-amazon-sagemaker-jupyter-notebook/ C & E seem right - https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html - https://www.examtopics.com/discussions/amazon/view/89069-exam-aws-certified-machine-learning-specialty-topic-1/
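A hedged sketch in the spirit of option C: from a Studio notebook, use the Redshift Data API with credentials held in Secrets Manager and push only the relevant subset to S3 with UNLOAD, so the 10 TB never flows through the notebook. Cluster, database, secret, role, bucket, table, and column names are all placeholders.

import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",                                        # placeholder cluster
    Database="vehicles",                                                          # placeholder database
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",
    Sql=(
        "UNLOAD ('SELECT customer_id, vehicle_age, annual_mileage, changed_car "
        "FROM vehicle_usage WHERE vehicle_age < 10') "
        "TO 's3://example-ml-bucket/vehicle-usage/' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftUnloadRole' "
        "FORMAT AS PARQUET"
    ),
)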
206
207 - A company is building an application that can predict spam email messages based on email text. The company can generate a few thousand human-labeled datasets that contain a list of email messages and a label of "spam" or "not spam" for each email message. A machine learning (ML) specialist wants to use transfer learning with a Bidirectional Encoder Representations from Transformers (BERT) model that is trained on English Wikipedia text data. What should the ML specialist do to initialize the model to fine-tune the model with the custom data? - A.. Initialize the model with pretrained weights in all layers except the last fully connected layer. B.. Initialize the model with pretrained weights in all layers. Stack a classifier on top of the first output position. Train the classifier with the labeled data. C.. Initialize the model with random weights in all layers. Replace the last fully connected layer with a classifier. Train the classifier with the labeled data. D.. Initialize the model with pretrained weights in all layers. Replace the last fully connected layer with a classifier. Train the classifier with the labeled data.
D - Its B. The [CLS] token (first position) represents the embedding of the entire sentence. Doing classification over this token on top makes the most sense I don't see why everyone is voting for D. To fine tune BERT you should add a classifier on top of the [CLS] token representing the hidden state. So it's not clear to me what does the question mean with "last fully connected layer" D is the right option since initializing the model with pretrained weights, the model can leverage the knowledge learned from a large corpus of text data, such as English Wikipedia text data, to improve its performance on a specific task, such as spam email classification . And Replacing the last fully connected layer with a classifier is necessary because the last layer of BERT is designed for predicting masked words in a sentence, which is different from the task of spam email classification A. NO - the last fully connected layer will not do SoftMax classification B. YES - output of BERT (word embeddings) can be used as input of classification C. NO - random weights will discard previous transfer learning D. NO - we don't want to loose the word embeddings; "cut the head off" (replacing the last layer) is if we want to learn different classes than what the model was trained for, but here we want to augment You should consider that Stacking a classifier on top of the first output position and training it with labeled data is not recommended because it does not take advantage of the knowledge learned from pretraining on a large corpus of text data D although was leaning towards B on a second thought going for B Cut the Head Off D seems correct D is a best practice Is B correct? - https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/ Freeze the entire architecture – We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training. You would have two classifiers stacked, so your predictions would be based in the other classifier. I think the answer is D. - https://www.examtopics.com/discussions/amazon/view/89267-exam-aws-certified-machine-learning-specialty-topic-1/
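To make the B vs. D debate concrete, a Hugging Face Transformers sketch (not from the question) of the common fine-tuning setup: every BERT encoder layer keeps its pretrained weights, and a fresh two-class classification head over the pooled [CLS] output is trained on the labeled spam / not-spam emails.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # pretrained weights loaded into all encoder layers
    num_labels=2,          # new classification head, randomly initialized, trained on the labels
)

# Untrained head, so the logits are meaningless until fine-tuning; this just shows the shapes.
inputs = tokenizer("Win a free cruise, click now!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # torch.Size([1, 2]) -> spam / not spam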
207
208 - A company is using a legacy telephony platform and has several years remaining on its contract. The company wants to move to AWS and wants to implement the following machine learning features: • Call transcription in multiple languages • Categorization of calls based on the transcript • Detection of the main customer issues in the calls • Customer sentiment analysis for each line of the transcript, with positive or negative indication and scoring of that sentiment Which AWS solution will meet these requirements with the LEAST amount of custom model training? - A.. Use Amazon Transcribe to process audio calls to produce transcripts, categorize calls, and detect issues. Use Amazon Comprehend to analyze sentiment. B.. Use Amazon Transcribe to process audio calls to produce transcripts. Use Amazon Comprehend to categorize calls, detect issues, and analyze sentiment C.. Use Contact Lens for Amazon Connect to process audio calls to produce transcripts, categorize calls, detect issues, and analyze sentiment. D.. Use Contact Lens for Amazon Connect to process audio calls to produce transcripts. Use Amazon Comprehend to categorize calls, detect issues, and analyze sentiment.
C - The correct answer is C. Use Contact Lens for Amazon Connect to process audio calls to produce transcripts, categorize calls, detect issues, and analyze sentiment. Contact Lens is a fully managed service that provides advanced analytics for customer service interactions in Amazon Connect. It includes call transcription, sentiment analysis, and issue detection, which meets all the requirements in the question. Using a single service like Contact Lens will reduce the complexity of integrating multiple AWS services and also minimize the need for custom model training. While Amazon Transcribe and Amazon Comprehend are also valuable AWS services, they are not designed specifically for customer service interactions and may require additional configuration and custom model training to meet the specific requirements listed in the question. You can see detail in https://www.udemy.com/course/aws-certified-machine-learning-specialty-2023/ I think C. Contact Lens can do all. https://aws.amazon.com/connect/contact-lens/ Contact Lens for Amazon Connect is not within scope of the exam. this is AWS Q answer Contact Lens for Amazon Connect is primarily designed to analyze conversations that occur within an Amazon Connect contact center. It provides real-time capabilities like sentiment analysis, issue detection and compliance monitoring for calls and chats handled by Amazon Connect agents. While it cannot directly analyze past recorded conversations outside of Amazon Connect, you could potentially use other AWS services like Amazon Transcribe to transcribe pre-recorded calls. Amazon Comprehend could then be used to analyze the transcripts for things like sentiment, topics, entities and key phrases. After further review and research. Contact Lens can indeed process past audio and chat records. Please ignore previous explanation. Chat GPT or AWS Q are not fully reliable as yet. Answer is indeed C Based on the information provided and the goal of minimizing custom model training, Option C (Use Contact Lens for Amazon Connect to process audio calls to produce transcripts, categorize calls, detect issues, and analyze sentiment) is the best solution. Contact Lens for Amazon Connect is specifically designed to handle tasks related to call center operations, including all the features listed in the requirements, without the need for extensive custom model training. "several years remaining on its contract" means no Amazon Connect. Amazon Connect is a pay-as-you-go cloud contact center. There are no required minimum monthly fees, long-term commitments, or upfront license charges, and pricing is not based on peak capacity, agent seats, or maintenance; you only pay for what you use. correct. But the contract with a legacy system will not disappear. So the company will need to pay for both Amazon Connect and Legacy system. First I choose B but question says "The company wants to move to AWS", so the company wants to replace legacy system with AWS I think C is better. Amazon connect contact lens can fulfill all these requirements and on an enterprise level it provides contact center analytics and quality management capabilities that enable you to monitor, measure, and continuously improve contact quality and agent performance for a better overall customer experience Ref: https://docs.aws.amazon.com/connect/latest/adminguide/contact-lens.html https://aws.amazon.com/connect/contact-lens/ Answer is B.. 
No doubt, the correct answer is C. It's B: the customer is not replacing its telephony software with Amazon Connect, hence C and D are ruled out. Answer B. B, as they cannot use Amazon Connect due to several years remaining on the contract. "Several years remaining on its contract" means no Amazon Connect. C https://docs.aws.amazon.com/connect/latest/adminguide/analyze-conversations.html The company has several years remaining on its contract. Does this change the solution approach? That means they can't use Amazon Connect, so Contact Lens may not be applicable here. If that assumption is true, B should be the answer. I don't think it changes the solution approach. The question was about LEAST development effort, not lowest cost. Yes, it will cost more to use Connect, but the dev cost is lower. Having the same thought. If the company needs to remain on the legacy telephony platform, the answer would be B. Otherwise, C. Contact Lens is an out-of-the-box feature for Amazon Connect that leverages Amazon Transcribe to generate call transcripts and Amazon Comprehend to apply natural language processing (NLP) on these transcripts, with no coding required. https://aws.amazon.com/connect/contact-lens/#:~:text=Contact%20Lens%20is%20an%20out,transcripts%2C%20with%20no%20coding%20required. C seems to be correct - https://www.examtopics.com/discussions/amazon/view/88837-exam-aws-certified-machine-learning-specialty-topic-1/
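For the option B path (recordings processed outside Amazon Connect), a hedged boto3 sketch with hypothetical bucket, job, and file names: Transcribe produces the transcript, Comprehend then scores sentiment line by line (topic detection or custom classification in Comprehend could cover the categorization requirement).

```python
import boto3

transcribe = boto3.client("transcribe")
comprehend = boto3.client("comprehend")

# Hypothetical S3 location of a recorded call.
transcribe.start_transcription_job(
    TranscriptionJobName="call-0001",
    Media={"MediaFileUri": "s3://call-recordings/call-0001.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
    OutputBucketName="call-transcripts",
)

# Once the transcript is available, Comprehend can score sentiment per line.
line = "I have been waiting forty minutes and nobody can help me."
sentiment = comprehend.detect_sentiment(Text=line, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])
```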
208
209 - A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset. How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance? - A.. Pick a date so that 80% of the data points precede the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset. B.. Pick a date so that 80% of the data points occur after the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset. C.. Starting from the earliest date in the dataset, pick eight data points for the training dataset and two data points for the validation dataset. Repeat this stratified sampling until no data points remain. D.. Sample data points randomly without replacement so that 80% of the data points are in the training dataset. Assign all the remaining data points to the validation dataset.
A - For time series data, it is important to split the dataset chronologically, with the training dataset containing the earlier dates and the validation dataset containing the later dates. A. YES - it is forecasting, so you want to predict the future, and 20% of the data points after a date will do so B. NO - it is forecasting; we want to simulate an actual use case and not predict the past C. NO - there is data leakage as future data points are used in the predictions D. NO - there is data leakage as future data points are used in the predictions Time series: keep the order. A, as it's a time series problem. It's a time series problem, so the split needs to be made by date. Option A is the recommended approach, where the training dataset contains historical prices that precede a certain date, and the validation dataset contains prices that occur after that date. This ensures that the model is trained on past data and evaluated on future data, which is more representative of real-world performance. Option D is NOT the recommended approach for time series data because it ignores the time aspect of the data. Randomly sampling data points without considering the time sequence can result in data leakage and poor model performance. A, since this is a time series problem. A https://towardsdatascience.com/time-series-from-scratch-train-test-splits-and-evaluation-metrics-4fd654de1b37 Because it randomly selects data points for both the training and validation datasets, ensuring that the samples are representative of the entire dataset and reducing the chances of overfitting. By randomly sampling without replacement, the data scientist can avoid any biases in the selection of data points and ensure that the training and validation datasets are independent. For time series you should keep the order. I think it should be A! - https://www.examtopics.com/discussions/amazon/view/98541-exam-aws-certified-machine-learning-specialty-topic-1/
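A small pandas sketch of the chronological split described in answer A, assuming a hypothetical CSV with date and price columns.

```python
import pandas as pd

# Hypothetical daily price history with 'date' and 'price' columns.
df = pd.read_csv("commodity_prices.csv", parse_dates=["date"]).sort_values("date")

# Pick the date below which 80% of the observations fall (answer A).
cutoff = df["date"].quantile(0.8)
train = df[df["date"] <= cutoff]
validation = df[df["date"] > cutoff]

print(len(train), len(validation))  # roughly an 80/20 chronological split
```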
209
210 - A retail company wants to build a recommendation system for the company's website. The system needs to provide recommendations for existing users and needs to base those recommendations on each user's past browsing history. The system also must filter out any items that the user previously purchased. Which solution will meet these requirements with the LEAST development effort? - A.. Train a model by using a user-based collaborative filtering algorithm on Amazon SageMaker. Host the model on a SageMaker real-time endpoint. Configure an Amazon API Gateway API and an AWS Lambda function to handle real-time inference requests that the web application sends. Exclude the items that the user previously purchased from the results before sending the results back to the web application. B.. Use an Amazon Personalize PERSONALIZED_RANKING recipe to train a model. Create a real-time filter to exclude items that the user previously purchased. Create and deploy a campaign on Amazon Personalize. Use the GetPersonalizedRanking API operation to get the real-time recommendations. C.. Use an Amazon Personalize USER_PERSONALIZATION recipe to train a model. Create a real-time filter to exclude items that the user previously purchased. Create and deploy a campaign on Amazon Personalize. Use the GetRecommendations API operation to get the real-time recommendations. D.. Train a neural collaborative filtering model on Amazon SageMaker by using GPU instances. Host the model on a SageMaker real-time endpoint. Configure an Amazon API Gateway API and an AWS Lambda function to handle real-time inference requests that the web application sends. Exclude the items that the user previously purchased from the results before sending the results back to the web application.
C - A. NO - we want to leverage a prebuilt model for efficiency B. NO - PERSONALIZED_RANKING uses a predefined list of items as input C. YES - USER_PERSONALIZATION uses past user history as input D. NO - we want to leverage a prebuilt model for efficiency Answer C User personalization: Recommendations tailored to a user’s profile, behavior, preferences, and history. This is most commonly used to boost customer engagement and satisfaction. It can also drive higher conversion rates. Personalized ranking: Items re-ranked in a category or search response based on user preference or history. This use case is used to surface relevant items or content to a specific user ensuring a better customer experience. Amazon Personalize supports re-ranking while optimizing for business priorities such as revenue, promotions, or trending items. https://aws.amazon.com/personalize/faqs/ B is the correct answer C looks like the right choice It's C, User Personalization is recommended for user interaction scenarios https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html It's C, User Personalization is recommended for this use case. The User-Personalization (aws-user-personalization) recipe is optimized for all personalized recommendation scenarios. It predicts the items that a user will interact with based on Interactions, Items, and Users datasets. When recommending items, it uses automatic item exploration. Option B: https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-search.html "With Personalized-Ranking, you must manually create a new solution version (retrain the model) to reflect updates to your catalog and update the model with your user’s most recent behavior." Option B has the disadvantage of requiring catalog updates for a retail company, so Option C takes less effort to operate than Option B. Option B is a better fit for the given requirements since it specifically mentions the need to filter out items that the user has previously purchased. The PERSONALIZED_RANKING recipe in Amazon Personalize is designed to provide personalized recommendations while allowing for exclusion of previously purchased items using a filter. In contrast, the USER_PERSONALIZATION recipe in option C is designed to provide personalized recommendations without the ability to filter out previously purchased items. Therefore, option B is the best choice for meeting the given requirements with the least development effort. Answer is C https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html Answer is C https://docs.aws.amazon.com/personalize/latest - https://www.examtopics.com/discussions/amazon/view/98527-exam-aws-certified-machine-learning-specialty-topic-1/
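A hedged boto3 sketch of answer C with hypothetical ARNs, user IDs, and event-type names: create a real-time filter that excludes previously purchased items, then call GetRecommendations against a USER_PERSONALIZATION campaign.

```python
import boto3

personalize = boto3.client("personalize")
runtime = boto3.client("personalize-runtime")

# Hypothetical ARNs for illustration.
dataset_group_arn = "arn:aws:personalize:us-east-1:111122223333:dataset-group/retail"
campaign_arn = "arn:aws:personalize:us-east-1:111122223333:campaign/user-personalization"

# Real-time filter that removes items the user has already purchased; the
# event type must match what was ingested into the interactions dataset.
flt = personalize.create_filter(
    name="exclude-purchased",
    datasetGroupArn=dataset_group_arn,
    filterExpression='EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("Purchase")',
)
# (Wait for the filter to become ACTIVE before using it.)

# Real-time recommendations from the USER_PERSONALIZATION campaign (answer C).
recs = runtime.get_recommendations(
    campaignArn=campaign_arn,
    userId="user-123",
    filterArn=flt["filterArn"],
    numResults=10,
)
print([item["itemId"] for item in recs["itemList"]])
```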
210
211 - A bank wants to use a machine learning (ML) model to predict if users will default on credit card payments. The training data consists of 30,000 labeled records and is evenly balanced between two categories. For the model, an ML specialist selects the Amazon SageMaker built-in XGBoost algorithm and configures a SageMaker automatic hyperparameter optimization job with the Bayesian method. The ML specialist uses the validation accuracy as the objective metric. When the bank implements the solution with this model, the prediction accuracy is 75%. The bank has given the ML specialist 1 day to improve the model in production. Which approach is the FASTEST way to improve the model's accuracy? - A.. Run a SageMaker incremental training based on the best candidate from the current model's tuning job. Monitor the same metric that was used as the objective metric in the previous tuning, and look for improvements. B.. Set the Area Under the ROC Curve (AUC) as the objective metric for a new SageMaker automatic hyperparameter tuning job. Use the same maximum training jobs parameter that was used in the previous tuning job. C.. Run a SageMaker warm start hyperparameter tuning job based on the current model’s tuning job. Use the same objective metric that was used in the previous tuning. D.. Set the F1 score as the objective metric for a new SageMaker automatic hyperparameter tuning job. Double the maximum training jobs parameter that was used in the previous tuning job.
C - A. NO - Incremental training not supported by XGBoost (https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html) B. NO - we don't want to change the objective and restart from scratch C. YES - warm start can leverage new data from production for further tuning D. NO - we don't want to start the training from scratch or use F1 score as the objective Answer C Given the time constraint, I believe that C is the correct one. C is the correct answer because it uses the results from past HPO jobs and builds upon them to improve accuracy. I go with C - warm start. A is not supported on XGBoost, and the other options would start tuning from scratch and might be just as bad as the initial tuning job. We only have 1 day, so more tuning that uses the existing job to inform the new training job is the only option here https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html C is the correct answer. You can't use incremental training with the XGBoost algorithm https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html It appears in 2023-April-3 Since ROC-AUC is presumed to be one of the best metrics for binary classification, hence option B. Option A -- incremental training is suited to cases where the training dataset gets updated frequently. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html https://francesca-donadoni.medium.com/training-an-xgboost-model-for-pricing-analysis-using-aws-sagemaker-55d777708e52 https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html C; also it cannot be A: "Only three built-in algorithms currently support incremental training: Object Detection - MXNet, Image Classification - MXNet, and Semantic Segmentation Algorithm." from https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html - https://www.examtopics.com/discussions/amazon/view/99688-exam-aws-certified-machine-learning-specialty-topic-1/
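A rough SageMaker Python SDK sketch of answer C, assuming the original XGBoost estimator (xgb_estimator), its S3 inputs, and the parent tuning job name already exist (all hypothetical here): the warm start reuses the previous tuning job's results instead of starting from scratch.

```python
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    WarmStartConfig,
    WarmStartTypes,
)

# Reuse results from the previous tuning job instead of restarting from scratch.
warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"xgboost-tuning-2024-01-01"},   # hypothetical parent tuning job name
)

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                 # the existing XGBoost estimator (assumed defined)
    objective_metric_name="validation:accuracy",  # same objective as before
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "subsample": ContinuousParameter(0.5, 1.0),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=4,
    warm_start_config=warm_start,
)
tuner.fit({"train": train_s3_uri, "validation": validation_s3_uri})  # assumed S3 inputs
```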
211
212 - A data scientist has 20 TB of data in CSV format in an Amazon S3 bucket. The data scientist needs to convert the data to Apache Parquet format. How can the data scientist convert the file format with the LEAST amount of effort? - A.. Use an AWS Glue crawler to convert the file format. B.. Write a script to convert the file format. Run the script as an AWS Glue job. C.. Write a script to convert the file format. Run the script on an Amazon EMR cluster. D.. Write a script to convert the file format. Run the script in an Amazon SageMaker notebook.
B - B is right. Is very simple to create a conversion file JOB in AWS Glue, using just 3 workflow steps. WITH NO CODE.. CREATED AUTOMATICALLY BY GLUE (Scala or Python) (s3 - source data file) --> (Data Mapping) --> (target transformed data file) A. NO - Crawler is to populate the data catalog B. YES - leverage serverless for distributed processing C. NO - Altough EMR can run Spark like Glue, it is not serverless D. NO - using the PySpark kernel will be single instance (running in the notebook) Option B is better than option A because option A uses an AWS Glue crawler to convert the file format. A crawler is a component of AWS Glue that scans your data sources and infers the schema, format, partitioning, and other properties of your data. A crawler can create or update a table in the AWS Glue Data Catalog that points to your data source. However, a crawler cannot change the format of your data source itself. You still need to write a script or use a tool to convert your CSV files to Parquet files. Option B. A - Glue crawler creates Glue Data Catalog from S3 buckets. It can be used to query by athena. C, D - not serverless and not generally used for etl. AWS Glue is a fully-managed ETL service that makes it easy to move data between data stores. AWS Glue can be used to automate the conversion of CSV files to Parquet format with minimal effort. AWS Glue supports reading data from CSV files, transforming the data, and writing the transformed data to Parquet files. Option A is incorrect because AWS Glue crawler is used to infer the schema of data stored in S3 and create AWS Glue Data Catalog tables. Option C is incorrect because while Amazon EMR can be used to process large amounts of data and perform data conversions, it requires more operational effort than AWS Glue. Option D is incorrect because Amazon SageMaker is a machine learning service, and while it can be used for data processing, it is not the best option for simple data format conversion tasks. in sagemaker notebook, you'd have to write python code but question is asking for something easy so i choose option b https://blog.searce.com/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f From you link, A(Glue crawler) Should be correct. crawler just creates the data catalog (schema), it does not actually converts the data to another format. As per details in that article, you are creating a job where source is schema created by crawler and destination is output s3 where we store formatted data. - https://www.examtopics.com/discussions/amazon/view/98756-exam-aws-certified-machine-learning-specialty-topic-1/
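A minimal AWS Glue job script for answer B, with hypothetical S3 paths: read the CSV files as a DynamicFrame and write them back out as Parquet.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV files from S3 (hypothetical paths) into a DynamicFrame.
csv_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw-csv/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same data back to S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=csv_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet/"},
    format="parquet",
)

job.commit()
```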
212
213 - A company is building a pipeline that periodically retrains its machine learning (ML) models by using new streaming data from devices. The company's data engineering team wants to build a data ingestion system that has high throughput, durable storage, and scalability. The company can tolerate up to 5 minutes of latency for data ingestion. The company needs a solution that can apply basic data transformation during the ingestion process. Which solution will meet these requirements with the MOST operational efficiency? - A.. Configure the devices to send streaming data to an Amazon Kinesis data stream. Configure an Amazon Kinesis Data Firehose delivery stream to automatically consume the Kinesis data stream, transform the data with an AWS Lambda function, and save the output into an Amazon S3 bucket. B.. Configure the devices to send streaming data to an Amazon S3 bucket. Configure an AWS Lambda function that is invoked by S3 event notifications to transform the data and load the data into an Amazon Kinesis data stream. Configure an Amazon Kinesis Data Firehose delivery stream to automatically consume the Kinesis data stream and load the output back into the S3 bucket. C.. Configure the devices to send streaming data to an Amazon S3 bucket. Configure an AWS Glue job that is invoked by S3 event notifications to read the data, transform the data, and load the output into a new S3 bucket. D.. Configure the devices to send streaming data to an Amazon Kinesis Data Firehose delivery stream. Configure an AWS Glue job that connects to the delivery stream to transform the data and load the output into an Amazon S3 bucket.
A - A. YES - Kinesis/Kafka acts as a buffer for ingestion, and Firehose provides good integration with Lambda (transformation) & S3 (storage) B. NO - no point in saving the data twice in S3 (raw and transformed) C. NO - since we do single-record transformation, Glue/Spark is overkill D. NO - since we do single-record transformation, Glue/Spark is overkill; further, we can reasonably expect devices to produce Kafka events, but deploying a Firehose client API seems complicated Answer A. AWS Glue cannot get data from Kinesis Data Firehose, only from a Kinesis data stream. It's not D. https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html It is D. Glue can't read from Firehose. It's A. It is C. Option C uses AWS Glue, which can perform data transformation and load data into S3 buckets. However, Glue may not be the most efficient option for this use case, as it requires setting up a Glue job, which can introduce additional latency. Option A uses an Amazon Kinesis data stream, which is optimized for high throughput, durable storage, and scalability. Why not C? Firehose buffering takes at most 5 minutes, so it's the best solution for the transformations. A general architecture for (near) real-time ingestion and processing of data: Kinesis Data Streams -> Kinesis Data Firehose -> (Lambda, if ETL is needed) -> S3 (or Redshift, ...). This solution provides a highly scalable and efficient way to ingest streaming data from devices with high throughput and durable storage by using Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. By configuring an AWS Lambda function to transform the data during the ingestion process, the solution also applies basic data transformation with low latency. Additionally, Amazon S3 provides highly durable and scalable storage for the transformed data, which can be easily accessed by downstream processes such as machine learning model training. A: https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html - https://www.examtopics.com/discussions/amazon/view/99814-exam-aws-certified-machine-learning-specialty-topic-1/
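A hedged boto3 sketch of the answer A wiring, with hypothetical ARNs: a Firehose delivery stream that reads from the Kinesis data stream, invokes a Lambda transform, and buffers output to S3 within the 5-minute latency budget.

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical ARNs; the delivery stream consumes the Kinesis data stream,
# invokes a Lambda transform, and writes the output to S3 (answer A).
firehose.create_delivery_stream(
    DeliveryStreamName="device-telemetry",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/devices",
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::device-telemetry-curated",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 64},  # within the 5-minute budget
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:us-east-1:111122223333:function:transform-telemetry",
                }],
            }],
        },
    },
)
```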
213
214 - A retail company is ingesting purchasing records from its network of 20,000 stores to Amazon S3 by using Amazon Kinesis Data Firehose. The company uses a small, server-based application in each store to send the data to AWS over the internet. The company uses this data to train a machine learning model that is retrained each day. The company's data science team has identified existing attributes on these records that could be combined to create an improved model. Which change will create the required transformed records with the LEAST operational overhead? - A.. Create an AWS Lambda function that can transform the incoming records. Enable data transformation on the ingestion Kinesis Data Firehose delivery stream. Use the Lambda function as the invocation target. B.. Deploy an Amazon EMR cluster that runs Apache Spark and includes the transformation logic. Use Amazon EventBridge (Amazon CloudWatch Events) to schedule an AWS Lambda function to launch the cluster each day and transform the records that accumulate in Amazon S3. Deliver the transformed records to Amazon S3. C.. Deploy an Amazon S3 File Gateway in the stores. Update the in-store software to deliver data to the S3 File Gateway. Use a scheduled daily AWS Glue job to transform the data that the S3 File Gateway delivers to Amazon S3. D.. Launch a fleet of Amazon EC2 instances that include the transformation logic. Configure the EC2 instances with a daily cron job to transform the records that accumulate in Amazon S3. Deliver the transformed records to Amazon S3.
A - A is correct. Why not C: C requires updating software in each of the 20,000 stores, which is operationally intensive. Moreover, the S3 File Gateway is designed for on-premises integration with S3. Answer A. Firehose can use Lambda functions to do data transformations! A is the best option for this use case. By creating an AWS Lambda function that can transform the incoming records and enabling data transformation on the ingestion Kinesis Data Firehose delivery stream, the company can transform the data with minimal operational overhead. The Lambda function can be the invocation target for Kinesis Data Firehose, so that data is transformed as it is ingested. This approach is serverless and scalable, and it does not require the company to manage any additional infrastructure. A: Lambda as the invocation target just means that the function will be invoked in response to the Firehose stream. See the following: https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html https://docs.aws.amazon.com/lambda/latest/dg/services-kinesisfirehose.html note: invocationId in the Firehose message event. A - seems to be an easy-to-manage solution; however, the phrase "Use the Lambda function as the invocation target" confuses me a bit. Well, that is used by Kinesis Data Firehose. - https://www.examtopics.com/discussions/amazon/view/99689-exam-aws-certified-machine-learning-specialty-topic-1/
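A minimal sketch of a Firehose data-transformation Lambda handler for answer A; the attribute names being combined into the new feature are hypothetical.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler: combine two existing attributes
    (hypothetical field names) into a new derived feature on each record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: derive a new attribute from existing ones.
        payload["revenue"] = payload.get("quantity", 0) * payload.get("unit_price", 0.0)

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```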
214
215 - A sports broadcasting company is planning to introduce subtitles in multiple languages for a live broadcast. The commentary is in English. The company needs the transcriptions to appear on screen in French or Spanish, depending on the broadcasting country. The transcriptions must be able to capture domain-specific terminology, names, and locations based on the commentary context. The company needs a solution that can support options to provide tuning data. Which combination of AWS services and features will meet these requirements with the LEAST operational overhead? (Choose two.) - A.. Amazon Transcribe with custom vocabularies B.. Amazon Transcribe with custom language models C.. Amazon SageMaker Seq2Seq D.. Amazon SageMaker with Hugging Face Speech2Text E.. Amazon Translate
BE - https://docs.aws.amazon.com/transcribe/latest/dg/improving-accuracy.html A: Only specific words can be corrected. AE is most straightforward. In contrast, Amazon Transcribe with custom vocabularies (option A) and Amazon Translate (option E) provide a simpler, more efficient solution with lower operational overhead, making them better suited for the company’s needs. From the AWS docs, no doubt. B is a solution that can fix more problems than this requirement needs, but there is no need. It is clear from the AWS docs, no doubt. A custom language model will be needed, as a custom vocabulary will just help with pronunciation, and the requirement clearly states handling domain-specific terminology, which cannot be handled by a custom vocabulary. Option A - Amazon Transcribe with custom vocabularies - allows you to enhance the transcription accuracy by providing domain-specific terminology, names, and locations. Custom vocabularies enable you to train the transcription model to recognize specific words or phrases commonly used in the context of sports commentary. This would help ensure that the transcriptions accurately capture the specialized terminology and context of the commentary, meeting the requirements of the sports broadcasting company. Additionally, the option mentions supporting options to provide tuning data, which further enhances the flexibility and customization of the solution. Could someone explain why A is not correct? As per AWS documentation: Use custom vocabularies to improve transcription accuracy for one or more specific words. These are generally domain-specific terms, such as brand names and acronyms, proper nouns, and words that Amazon Transcribe isn't rendering correctly. Custom vocabularies can be used with all supported languages. A - Yes - Transcribe custom vocabularies allow domain-specific terms B - No - Custom language models are typically used for fine-tuning for specific accents, dialects, or unique speech patterns. This question is about domain-specific terminology C - No - building and training requires more overhead D - No - building and training requires more overhead E - Once transcribed in English, Translate can perform the language transformation A. NO - Amazon Transcribe with custom vocabularies does not allow taking into account the broader context (https://docs.aws.amazon.com/transcribe/latest/dg/custom-vocabulary-create-list.html) B. YES - Custom language models are designed to improve transcription accuracy for domain-specific speech (https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html) C. NO - better to use the built-in Translate service than base Seq2Seq D. NO - Hugging Face Speech2Text is a custom model, use standard Transcribe E. YES - we need to translate the English; custom language model + Translate Answer BE. BE: for a specific language, use a custom language model. Commentary context calls for a custom language model. It's BE; custom language models are for domain-specific speech like terminology, custom vocabularies are for words like nouns. https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html Two sub-processes are needed: speech to text and text to text. We can consider Amazon Transcribe for speech to text. If we use custom language models or SageMaker, we would need to gather our own data to train or retrain models. For less effort, option (A) is better than option (B). Then, Amazon Translate can be used to translate the transcription to the other language: option (E). Option A cannot capture "commentary context". B and E should be correct.
- https://www.examtopics.com/discussions/amazon/view/99696-exam-aws-certified-machine-learning-specialty-topic-1/
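A hedged boto3 sketch of the Transcribe-plus-Translate pipeline debated above, with hypothetical job names, URIs, and vocabulary names: Transcribe the English commentary with tuning data attached (a custom vocabulary is shown; a custom language model would be referenced through the job's ModelSettings instead), then Translate each transcript line into French or Spanish.

```python
import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")

# Transcribe the English commentary, biasing recognition toward domain terms,
# names, and locations stored in a custom vocabulary (assumed to already exist).
transcribe.start_transcription_job(
    TranscriptionJobName="broadcast-segment-42",
    Media={"MediaFileUri": "s3://broadcast-audio/segment-42.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
    Settings={"VocabularyName": "sports-terms"},
)

# Translate each finished transcript line into the broadcasting country's language.
line = "The leaders are approaching the final corner with two miles to go."
for target in ("fr", "es"):
    result = translate.translate_text(Text=line, SourceLanguageCode="en", TargetLanguageCode=target)
    print(target, result["TranslatedText"])
```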
215
216 - A data scientist at a retail company is forecasting sales for a product over the next 3 months. After preliminary analysis, the data scientist identifies that sales are seasonal and that holidays affect sales. The data scientist also determines that sales of the product are correlated with sales of other products in the same category. The data scientist needs to train a sales forecasting model that incorporates this information. Which solution will meet this requirement with the LEAST development effort? - A.. Use Amazon Forecast with Holidays featurization and the built-in autoregressive integrated moving average (ARIMA) algorithm to train the model. B.. Use Amazon Forecast with Holidays featurization and the built-in DeepAR+ algorithm to train the model. C.. Use Amazon SageMaker Processing to enrich the data with holiday information. Train the model by using the SageMaker DeepAR built-in algorithm. D.. Use Amazon SageMaker Processing to enrich the data with holiday information. Train the model by using the Gluon Time Series (GluonTS) toolkit.
B - Amazon Forecast is an AWS service that uses machine learning to build accurate time-series forecasts. It provides several built-in algorithms that support holiday featurization, and the DeepAR+ algorithm can handle the seasonality and correlation with other products with minimal development effort. With Amazon Forecast, the data scientist can easily configure the forecast horizon, select the appropriate forecast frequency, and configure the model training to incorporate the available historical data. Using Amazon SageMaker Processing to enrich the data with holiday information may require more development effort and does not offer the same level of automation and integration as Amazon Forecast. While ARIMA is a classic time series forecasting method, it might not capture complex patterns (like seasonality and related product sales) as effectively as DeepAR+. B - the DeepAR+ algorithm with Holidays featurization in Amazon Forecast works better than Option A; Option A may be suboptimal but is totally doable. DeepAR shines with multiple correlated time series and is meant for seasonal data patterns A. YES - fully managed solution B. NO - DeepAR+ is more for multiple time series ("In many applications, however, you have many similar time series across a set of cross-sectional units"- https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html) C. NO - SageMaker DeepAR is too low-level D. NO - Gluon is too low-level Option B is a great choice but requires more development effort than A, which is also a great choice. Since the question asked for least development effort, I am going with A. DeepAR can understand seasonal effects. B is correct. DeepAR accepts exogenous regressors, unlike ARIMA, and can understand seasonal effects; ARIMA can't do that either. It is DeepAR. Option B is a good choice, as the DeepAR+ algorithm is specifically designed for forecasting in time series data with seasonality and long-term dependencies. However, it may require more development effort compared to the ARIMA algorithm. B https://docs.aws.amazon.com/forecast/latest/dg/holidays.html https://docs.aws.amazon.com/whitepapers/latest/time-series-forecasting-principles-with-amazon-forecast/appendix-a-faqs.html - https://www.examtopics.com/discussions/amazon/view/98990-exam-aws-certified-machine-learning-specialty-topic-1/
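A rough sketch of answer B using the Forecast CreatePredictor API, with a hypothetical dataset group ARN: DeepAR+ plus the Holidays supplementary feature, forecasting daily for roughly 3 months.

```python
import boto3

forecast = boto3.client("forecast")

# Hypothetical dataset group ARN; DeepAR+ with the Holidays supplementary
# feature lets one model learn seasonality across related products (answer B).
forecast.create_predictor(
    PredictorName="sales-deepar-plus",
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
    ForecastHorizon=90,  # roughly 3 months of daily forecasts
    PerformAutoML=False,
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:111122223333:dataset-group/retail-sales",
        "SupplementaryFeatures": [{"Name": "holiday", "Value": "US"}],
    },
    FeaturizationConfig={"ForecastFrequency": "D"},
)
```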
216
217 - A company is building a predictive maintenance model for its warehouse equipment. The model must predict the probability of failure of all machines in the warehouse. The company has collected 10,000 event samples within 3 months. The event samples include 100 failure cases that are evenly distributed across 50 different machine types. How should the company prepare the data for the model to improve the model's accuracy? - A.. Adjust the class weight to account for each machine type. B.. Oversample the failure cases by using the Synthetic Minority Oversampling Technique (SMOTE). C.. Undersample the non-failure events. Stratify the non-failure events by machine type. D.. Undersample the non-failure events by using the Synthetic Minority Oversampling Technique (SMOTE).
B - oversample the minority class Undersampling not an option for already limited observations. SMOTE clearly MOST promising first action before trying to balance classes Do we need to stratify the non-failure events by machine type? B. Oversample the failure cases by using the Synthetic Minority Oversampling Technique (SMOTE). Since the number of failure cases is relatively small, oversampling the failure cases using techniques like SMOTE can help balance the class distribution and prevent the model from being biased towards the majority class. SMOTE creates synthetic samples for the minority class by interpolating new samples between existing ones. This will help improve the model's accuracy in predicting failure cases. Adjusting class weights (A) or undersampling (C, D) may not be as effective in this scenario. The data provided is imbalanced, with only 100 failure cases out of 10,000 event samples. Therefore, it is important to address this imbalance to improve the accuracy of the predictive maintenance model. - https://www.examtopics.com/discussions/amazon/view/97866-exam-aws-certified-machine-learning-specialty-topic-1/
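A minimal SMOTE sketch with imbalanced-learn, run on synthetic stand-in data shaped like the scenario (roughly 1% failure cases).

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the telemetry features: about 1% failure cases.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
print("before:", Counter(y))

# Oversample only the minority (failure) class with synthetic examples.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_resampled))
```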
217
218 - A company stores its documents in Amazon S3 with no predefined product categories. A data scientist needs to build a machine learning model to categorize the documents for all the company's products. Which solution will meet these requirements with the MOST operational efficiency? - A.. Build a custom clustering model. Create a Dockerfile and build a Docker image. Register the Docker image in Amazon Elastic Container Registry (Amazon ECR). Use the custom image in Amazon SageMaker to generate a trained model. B.. Tokenize the data and transform the data into tabular data. Train an Amazon SageMaker k-means model to generate the product categories. C.. Train an Amazon SageMaker Neural Topic Model (NTM) model to generate the product categories. D.. Train an Amazon SageMaker Blazing Text model to generate the product categories.
C - C. Train an Amazon SageMaker Neural Topic Model (NTM) model to generate the product categories. The task is to build a machine learning model to categorize documents for all the company's products. Among the given options, training an Amazon SageMaker Neural Topic Model (NTM) model would be the most efficient and effective solution. An NTM model can identify topics in text data and group similar documents into specific categories, making it a suitable model for document categorization. With an NTM model, the data scientist would not need to define product categories beforehand, as the model would automatically group similar documents into topics. This saves time and resources compared to the other options. Thank you, ChatGPT. Thanks to ChatGPT and also thanks to Ajose O for saving our time looking for some evidence or proof of the right answer. Ajose, you did some good work bringing this clarification for us, so thank you so much, thanks friend :) A. NO - no need to build a custom model B. NO - k-means is a supervised model C. YES - unsupervised clustering algorithm D. NO - Blazing Text will do word embedding, not classification No, k-means is an unsupervised learning algorithm. I vote C. Train an Amazon SageMaker Neural Topic Model (NTM) model to generate the product categories -- the option doesn't talk about any classification activity Neural Topic Model (NTM) is one of the built-in algorithms of Amazon SageMaker that can perform topic modeling on text data. Topic modeling is a technique that can discover latent topics or themes from a collection of documents. Topic modeling can be used for document categorization by assigning each document to one or more topics based on its content. Blazing Text is only for supervised problems. Assign pre-defined categories to documents in a corpus: categorize books in a library into academic disciplines - BlazingText algorithm https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html Reading it again - C Option C is wrong because it suggests using a Neural Topic Model (NTM) to categorize documents. While NTM can be used to discover the underlying topics in a corpus of documents, it may not be the most efficient solution for categorizing documents for specific products. NTM is more suited for unsupervised learning problems where the goal is to discover the underlying themes or topics of the document corpus. In this scenario, the data scientist needs to categorize documents based on predefined product categories. Therefore, a supervised learning algorithm like a text classification model would be more suitable. Amazon SageMaker Blazing Text algorithm provides an efficient and scalable solution for text classification problems. "no predefined product categories" -> unsupervised learning, C. What is k-means? Good catch. The problem with B is the fact that it is an incomplete option: "Tokenize the data and transform the data into tabular data" — how are you going to do this, conrad? No predefined product category: topic modeling with NTM or LDA (Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document.) Predefined product category: text classification with BlazingText (categorize books in a library into academic disciplines) C - https://www.examtopics.com/discussions/amazon/view/98762-exam-aws-certified-machine-learning-specialty-topic-1/
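A rough SageMaker SDK sketch of answer C, assuming the documents have already been converted into bag-of-words vectors (e.g. with a CountVectorizer) and that the role and file names are hypothetical.

```python
import numpy as np
import sagemaker
from sagemaker import NTM

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical role

# Documents are assumed to already be bag-of-words vectors, shape (n_docs, vocab_size).
bow_vectors = np.load("document_vectors.npy").astype("float32")  # hypothetical file

ntm = NTM(
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_topics=20,                 # number of latent product categories to discover
    sagemaker_session=session,
)
ntm.fit(ntm.record_set(bow_vectors), mini_batch_size=128)
```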
218
219 - A sports analytics company is providing services at a marathon. Each runner in the marathon will have their race ID printed as text on the front of their shirt. The company needs to extract race IDs from images of the runners. Which solution will meet these requirements with the LEAST operational overhead? - A.. Use Amazon Rekognition. B.. Use a custom convolutional neural network (CNN). C.. Use the Amazon SageMaker Object Detection algorithm. D.. Use Amazon Lookout for Vision.
A - Amazon Rekognition can handle large-scale and high-quality images with low latency and high accuracy. You can use Amazon Rekognition to process images from various sources, such as cameras, webcams, or media files. You can also use Amazon Rekognition to process images in real time or in batch mode. It's A; Rekognition has methods to do text detection. https://aws.amazon.com/rekognition/ OCR: Amazon Rekognition. Rekognition's Text Detection feature allows you to easily extract text from images and videos without the need to create a custom model or perform complex training. It's a fully managed service that provides accurate and scalable text detection, recognition, and analysis capabilities. Additionally, Rekognition provides a simple API and SDKs for integrating text detection functionality into your applications. https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html?pg=ln&sec=ft A - LEAST operational overhead - https://www.examtopics.com/discussions/amazon/view/98763-exam-aws-certified-machine-learning-specialty-topic-1/
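A minimal boto3 sketch of answer A with a hypothetical bucket and key: DetectText returns the text lines found in the image, from which numeric race IDs can be filtered.

```python
import boto3

rekognition = boto3.client("rekognition")

# Detect the race ID printed on a runner's shirt (hypothetical bucket/key).
response = rekognition.detect_text(
    Image={"S3Object": {"Bucket": "marathon-photos", "Name": "runner-0421.jpg"}}
)

# Keep only whole detected lines that are purely numeric, i.e. likely race IDs.
race_ids = [
    detection["DetectedText"]
    for detection in response["TextDetections"]
    if detection["Type"] == "LINE" and detection["DetectedText"].isdigit()
]
print(race_ids)
```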
219
220 - A manufacturing company wants to monitor its devices for anomalous behavior. A data scientist has trained an Amazon SageMaker scikit-learn model that classifies a device as normal or anomalous based on its 4-day telemetry. The 4-day telemetry of each device is collected in a separate file and is placed in an Amazon S3 bucket once every hour. The total time to run the model across the telemetry for all devices is 5 minutes. What is the MOST cost-effective solution for the company to use to run the model across the telemetry for all the devices? - A.. SageMaker Batch Transform B.. SageMaker Asynchronous Inference C.. SageMaker Processing D.. A SageMaker multi-container endpoint
A - Batch Transform can efficiently handle this workload by splitting the files into mini-batches and distributing them across multiple instances. Batch Transform can also scale down the instances when there are no files to process, so you only pay for the duration that the instances are actively processing files. Batch Transform is more cost-effective than Asynchronous Inference because Asynchronous Inference is designed for workloads with large payload sizes (up to 1GB) and long processing times (up to 15 minutes) that need near real-time responses. Asynchronous Inference queues incoming requests and processes them asynchronously, returning an output location as a response. Asynchronous Inference also autoscales the instance count to zero when there are no requests to process. However, Asynchronous Inference charges you for both request processing and request queuing time, which may be higher than Batch Transform for your use case. A. YES - Batch Transform can pick up new files from S3 B. NO - no need for asynch, high-throughpt queues C. NO - Processing is not for model inference D. NO - no need for scaling through multiple endpoints Key point is data is collected every hour. seems like a batch solution is most cost effective I will go with A. The Async inference seems promising but the size of telemetry file is not known. As per https://docs.aws.amazon.com/sagemaker/latest/dg/inference-cost-optimization.html "Use batch inference for workloads for which you need inference for a large set of data for processes that happen offline (that is, you don’t need a persistent endpoint). You pay for the instance for the duration of the batch inference job". As you pay for the batch job duration, cost should not be an issue with Batch transform. "Use asynchronous inference for asynchronous workloads that process up to 1 GB of data (such as text corpus, image, video, and audio) that are latency insensitive and cost sensitive. With asynchronous inference, you can control costs by specifying a fixed number of instances for the optimal processing rate instead of provisioning for the peak. You can also scale down to zero to save additional costs." Based on the requirements and constraints given in the scenario, the MOST cost-effective solution for the company to use to run the model across the telemetry for all the devices is SageMaker Batch Transform. SageMaker Batch Transform is a cost-effective solution for performing offline inference, as it allows for large amounts of data to be processed at a lower cost compared to real-time inference. In this case, the telemetry data for each device is collected hourly and can be processed in batches using SageMaker Batch Transform. This can help to reduce the cost of inference, as the data is not being processed in real-time and can be processed offline. B -- based on what Drcock87 said, as well as this: "Amazon SageMaker Asynchronous Inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. Compared to Batch Transform Asynchronous Inference provides immediate access to the results of the inference job rather than waiting for the job to complete" I still think its A because: - "The 4-day telemetry of each device is collected in a separate file and is placed in an Amazon S3 bucket once every hour." Which means this is use-case where data is available upfront for inferencing. - Also, unlike async the batch transform does not keep an active endpoint all the time. 
async is similar to real-time inference, used when you need inference right-away; the question is not asking for real-time inference. Real-Time Inference is suitable for workloads where payload sizes are up to 6MB and need to be processed with low latency requirements in the order of milliseconds or seconds. Serverless Inference: Serverless inference is ideal when you have intermittent or unpredictable traffic patterns. Batch transform is ideal for offline predictions on large batches of data that is available upfront. We are introducing Amazon SageMaker Asynchronous Inference, a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for inferences with large payload sizes (up to 1GB) and/or long processing times (up to 15 minutes) that need to be processed as requests arrive. Asynchronous inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests. a Real-time inference is suitable for workloads where payload sizes are up to 6MB and need to be processed with low latency requirements in the order of milliseconds or seconds. Batch transform is ideal for offline predictions on large batches of data that is available upfront. The new asynchronous inference option is ideal for workloads where the request sizes are large (up to 1GB) and inference processing times are in the order of minutes (up to 15 minutes). Example workloads for asynchronous inference include running predictions on high resolution images generated from a mobile device at different intervals during the day and providing responses within minutes of receiving the request. - https://www.examtopics.com/discussions/amazon/view/98764-exam-aws-certified-machine-learning-specialty-topic-1/
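A rough SageMaker SDK sketch of answer A, with hypothetical S3 paths, role, and entry-point script: an hourly batch transform job over the per-device telemetry files, paying only for the few minutes each run takes rather than for an always-on endpoint.

```python
from sagemaker.sklearn.model import SKLearnModel

# Hypothetical model artifact, role, and inference script.
model = SKLearnModel(
    model_data="s3://my-bucket/models/anomaly/model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    entry_point="inference.py",
    framework_version="1.2-1",
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/telemetry-predictions/",
)
transformer.transform(
    data="s3://my-bucket/telemetry/latest-hour/",  # one file per device
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()  # instances are released when the job completes
```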
220
221 - A company wants to segment a large group of customers into subgroups based on shared characteristics. The company’s data scientist is planning to use the Amazon SageMaker built-in k-means clustering algorithm for this task. The data scientist needs to determine the optimal number of subgroups (k) to use. Which data visualization approach will MOST accurately determine the optimal value of k? - A.. Calculate the principal component analysis (PCA) components. Run the k-means clustering algorithm for a range of k by using only the first two PCA components. For each value of k, create a scatter plot with a different color for each cluster. The optimal value of k is the value where the clusters start to look reasonably separated. B.. Calculate the principal component analysis (PCA) components. Create a line plot of the number of components against the explained variance. The optimal value of k is the number of PCA components after which the curve starts decreasing in a linear fashion. C.. Create a t-distributed stochastic neighbor embedding (t-SNE) plot for a range of perplexity values. The optimal value of k is the value of perplexity, where the clusters start to look reasonably separated. D.. Run the k-means clustering algorithm for a range of k. For each value of k, calculate the sum of squared errors (SSE). Plot a line chart of the SSE for each value of k. The optimal value of k is the point after which the curve starts decreasing in a linear fashion.
D - The Answer is D Option D uses the elbow method, which is a popular and well-known method for determining the optimal value of k for k-means clustering1. It plots the sum of squared errors (SSE) for different values of k, and looks for the point where the SSE starts to decrease in a linear fashion. This point is called the elbow, and it indicates that adding more clusters does not improve the model significantly2. The Sum of square shows variation within each cluster D. Run the k-means clustering algorithm for a range of k. For each value of k, calculate the sum of squared errors (SSE). Plot a line chart of the SSE for each value of k. The optimal value of k is the point after which the curve starts decreasing in a linear fashion. The sum of squared errors (SSE) measures the total variation within each cluster, and the optimal value of k is typically the point where the SSE begins to level off or decrease sharply. Plotting the SSE against the number of clusters (k) allows the data scientist to identify the optimal number of clusters based on where the SSE curve starts decreasing linearly. d https://towardsdatascience.com/explain-ml-in-a-simple-way-k-means-clustering-e925d019743b - https://www.examtopics.com/discussions/amazon/view/98201-exam-aws-certified-machine-learning-specialty-topic-1/
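A minimal scikit-learn sketch of the elbow method described in answer D, on synthetic stand-in data: plot the SSE (inertia) against k and pick the point where the curve flattens.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer feature matrix.
X, _ = make_blobs(n_samples=2000, centers=5, n_features=8, random_state=0)

# Train k-means for a range of k and record the SSE (inertia) for each.
k_values = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_values]

plt.plot(list(k_values), sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of squared errors (SSE)")
plt.title("Elbow plot: pick k where the curve flattens")
plt.show()
```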
221
222 - A data scientist at a financial services company used Amazon SageMaker to train and deploy a model that predicts loan defaults. The model analyzes new loan applications and predicts the risk of loan default. To train the model, the data scientist manually extracted loan data from a database. The data scientist performed the model training and deployment steps in a Jupyter notebook that is hosted on SageMaker Studio notebooks. The model's prediction accuracy is decreasing over time. Which combination of steps is the MOST operationally efficient way for the data scientist to maintain the model's accuracy? (Choose two.) - A.. Use SageMaker Pipelines to create an automated workflow that extracts fresh data, trains the model, and deploys a new version of the model. B.. Configure SageMaker Model Monitor with an accuracy threshold to check for model drift. Initiate an Amazon CloudWatch alarm when the threshold is exceeded. Connect the workflow in SageMaker Pipelines with the CloudWatch alarm to automatically initiate retraining. C.. Store the model predictions in Amazon S3. Create a daily SageMaker Processing job that reads the predictions from Amazon S3, checks for changes in model prediction accuracy, and sends an email notification if a significant change is detected. D.. Rerun the steps in the Jupyter notebook that is hosted on SageMaker Studio notebooks to retrain the model and redeploy a new version of the model. E.. Export the training and deployment code from the SageMaker Studio notebooks into a Python script. Package the script into an Amazon Elastic Container Service (Amazon ECS) task that an AWS Lambda function can initiate.
AB - A. YES - fully automated pipeline B. YES - triggers the pipeline A as needed C. NO - email notification does not allow automation D. NO - manual steps required, not operationaly efficient E. NO - we need another step to trigger the Lambda Option A uses SageMaker Pipelines to create an automated workflow that extracts fresh data, trains the model, and deploys a new version of the model. This option is operationally efficient because it eliminates the need for manual intervention and ensures that your model is always up to date with the latest data. You can also use SageMaker Pipelines to orchestrate your workflow using a graphical interface or a Python SDK1. Option B configures SageMaker Model Monitor with an accuracy threshold to check for model drift. Model drift occurs when the statistical properties of the target variable change over time, which can affect the performance of your model2. https://aws.amazon.com/blogs/machine-learning/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected/ Retrain the model when the accuracy is decreasing is the most recommended way to take of your models. The MOST operationally efficient way for the data scientist to maintain the model's accuracy would be to choose options A and B: A. Use SageMaker Pipelines to create an automated workflow that extracts fresh data, trains the model, and deploys a new version of the model. Using SageMaker Pipelines allows the data scientist to automate the entire workflow from data extraction to model deployment. This ensures that the model is trained and deployed on the latest data automatically without the need for manual intervention. The data scientist can set up the pipeline to run on a schedule or trigger it based on certain events. B. Configure SageMaker Model Monitor with an accuracy threshold to check for model drift. Initiate an Amazon CloudWatch alarm when the threshold is exceeded. Connect the workflow in SageMaker Pipelines with the CloudWatch alarm to automatically initiate retraining. Using SageMaker Pipelines to create an automated workflow that extracts fresh data, trains the model, and deploys a new version of the model is an efficient way to automate the process of model retraining and deployment. Configuring SageMaker Model Monitor with an accuracy threshold to check for model drift and initiating an Amazon CloudWatch alarm when the threshold is exceeded is an efficient way to monitor the accuracy of the deployed model and initiate retraining when necessary. This approach helps to maintain the accuracy of the model over time. https://aws.amazon.com/blogs/machine-learning/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected/ B first and then A - https://www.examtopics.com/discussions/amazon/view/98765-exam-aws-certified-machine-learning-specialty-topic-1/
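A minimal sketch of the B-to-A handoff: a Lambda function (invoked when the Model Monitor drift alarm fires, for example through SNS) that starts the retraining pipeline. The pipeline name is hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")


def lambda_handler(event, context):
    """Invoked when the accuracy/drift alarm fires; kicks off the retraining
    pipeline that extracts fresh data, retrains, and redeploys the model."""
    response = sm.start_pipeline_execution(
        PipelineName="loan-default-retraining",            # hypothetical pipeline name
        PipelineExecutionDisplayName="drift-triggered-retrain",
    )
    return {"pipelineExecutionArn": response["PipelineExecutionArn"]}
```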
222
223 - A retail company wants to create a system that can predict sales based on the price of an item. A machine learning (ML) engineer built an initial linear model that resulted in the following residual plot: Which actions should the ML engineer take to improve the accuracy of the predictions in the next phase of model building? (Choose three.) [https://img.examtopics.com/aws-certified-machine-learning-specialty/image1.png] - A.. Downsample the data uniformly to reduce the amount of data. B.. Create two different models for different sections of the data. C.. Downsample the data in sections where Price < 50. D.. Offset the input data by a constant value where Price > 50. E.. Examine the input data, and apply non-linear data transformations where appropriate. F.. Use a non-linear model instead of a linear model.
BEF - A. NO - reducing data will not help build a better model; the more the merrier :-) B. YES - it can address non-linearity across the full range C. NO - reducing data will not help build a better model; the more the merrier :-) D. NO - the residual is not constant when price > 50 E. YES - that can help make non-linear data linear F. YES - it can capture more complex relationships Option E suggests that you examine the input data and apply non-linear data transformations where appropriate. This option is helpful because it can reduce the non-linearity in your data and make it more suitable for a linear model. For example, you can apply a logarithmic, square root, or inverse transformation to your price variable and see if it improves the fit of your model. You can also use the Box-Cox transformation, which is a method that automatically finds the best transformation for your data. Option F suggests that you use a non-linear model instead of a linear model. This option is also helpful because it can capture the non-linear relationship between price and sales that is evident in your residual plot. Option B suggests that you create two different models for different sections of the data. This option is also helpful because it can account for the different behavior of your data at different price ranges. The linear model y = ax + b works well for x < 50, but for x > 50 the residual increases linearly, meaning that the slope of the linear model changes, i.e., y = a'x + b' with a' != a. Offset will not help. Downsampling will not help either. The linear model doesn't capture the data complexity. Then, BEF. It appears on 2023-April-03. As wolfsong said: two models, add a constant, or input data transformation; BDE should be the answer. The residual plot shows that the linear model is not fitting the data well, with a clear pattern indicating that the model is underfitting. To improve the accuracy of the predictions, the ML engineer should take the following actions: C. Downsample the data in sections where Price < 50: This could be an option since there seems to be a higher variance in the residuals in the region where Price < 50. D. Offset the input data by a constant value where Price > 50: This could be an option since there seems to be a systematic bias in the residuals in the region where Price > 50. E. Examine the input data, and apply non-linear data transformations where appropriate: This is necessary since the residual plot shows that the linear model is not capturing the non-linear relationships in the data. If D is an answer to this question, isn't B another answer too? Suppose that the initial linear model is y = aX + b; then D means y = a(X - C) + b --> y = aX + b' (when Price > 50). I think that D means we would use two different linear models for different sections (split at Price = 50) of the data. A good residual plot is a flat line at y = 0. So... - Not sure if D is right. If you offset by a constant value, you're just moving the plot up or down. You'd have to add a term like K*Price, where price > 50 and K > 0, for you to flatten that curve beyond Price > 50. - Also unsure about C. The variance looks fairly good for Price < 50 as it's mostly around zero, which is what you want. The problem is the residual value at Price > 50, which goes way off. I'd go with B, E & F: E: obvious F: use a non-linear model instead as it will remove the kink in the plot B: Not an answer I like, but if you can't use a nonlinear model, you need to use a piecewise-linear model that separates the data in two.
Something like this: https://towardsdatascience.com/piecewise-linear-regression-model-what-is-it-and-when-can-we-use-it-93286cfee452 - https://www.examtopics.com/discussions/amazon/view/100146-exam-aws-certified-machine-learning-specialty-topic-1/
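To make E and B concrete, here is a minimal sketch (synthetic data, made-up column names such as "price" and "sales", scikit-learn assumed) of a log transformation of the price feature and a simple piecewise split at Price = 50:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: the "price" and "sales" column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.uniform(1, 100, 500)})
df["sales"] = 200 - 40 * np.log1p(df["price"]) + rng.normal(0, 5, 500)

# Option E: transform the input non-linearly, then fit a linear model.
df["log_price"] = np.log1p(df["price"])
model_e = LinearRegression().fit(df[["log_price"]], df["sales"])

# Option B: fit two separate linear models on the two price regimes.
low, high = df[df["price"] < 50], df[df["price"] >= 50]
model_low = LinearRegression().fit(low[["price"]], low["sales"])
model_high = LinearRegression().fit(high[["price"]], high["sales"])
```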
223
224 - A data scientist at a food production company wants to use an Amazon SageMaker built-in model to classify different vegetables. The current dataset has many features. The company wants to save on memory costs when the data scientist trains and deploys the model. The company also wants to be able to find similar data points for each test data point. Which algorithm will meet these requirements? - A.. K-nearest neighbors (k-NN) with dimension reduction B.. Linear learner with early stopping C.. K-means D.. Principal component analysis (PCA) with the algorithm mode set to random
A - "To be able to find similar data points for each test data point." K-means is unsupervised learning. It is A. Memory efficiency: K-nearest neighbors (k-NN) doesn't require storing a model with learned parameters, as it's an instance-based learning algorithm. It simply memorizes the training dataset. Therefore, it saves on memory costs compared to models with learned parameters like linear learners. Dimension reduction: by employing dimension reduction techniques like Principal Component Analysis (PCA) in conjunction with k-NN, you can reduce the dimensionality of the dataset, which helps in saving memory costs. This makes k-NN with dimension reduction a suitable choice when memory efficiency is a concern. Similar data points: k-nearest neighbors naturally provides a measure of similarity between data points. Given a test data point, it finds the k nearest neighbors in the training data. This fulfills the requirement of being able to find similar data points for each test data point. A. K-nearest neighbors (k-NN) with dimension reduction (KNN is useful for the classification task for different types of veggies based on features, and dimensionality reductions like PCA can be applied prior to KNN to reduce the number of features in the dataset, thereby saving memory costs during training and model deployment; it also removes noise and data redundancy) https://www.linkedin.com/advice/3/what-difference-between-knn-k-means-skills-computer-science-cx1hc This is an unsupervised clustering problem, not a classification one (A). K-means is a better choice from a memory-efficiency perspective. While A is the most voted answer, k-NN is really high on memory usage, as it stores the data points to make predictions. Voting for it just because it mentions dimensionality reduction is obtuse. C is the next most probable candidate that fits the bill on every account. Should be A, as only A can be used for classification, finding similar data points, and dimensionality reduction. A. YES - k-NN will find the similar data points, and dimension reduction will save memory B. NO - Linear learner is for regression or classification, not finding similar data points C. NO - K-means is for unsupervised clustering, not finding the closest data points D. NO - PCA with the algorithm mode set to random reduces dimensions but does not classify or find similar points C doesn't solve the "too many features" problem, and the vegetable classes are well defined. A is the way. KNN with dimensionality reduction may help reduce memory utilisation. It's A; we need to reduce the dimensionality of the dataset. They want fewer features. I will go with A. C is not valid, as K-means is a clustering algorithm that can group similar data points together. However, it does not perform classification, and it is not clear how it addresses the memory cost and similarity search requirements mentioned in the question. It should be C, because it is an unsupervised classification problem. Not sure that classification is an unsupervised problem. Option A suggests using the k-nearest neighbors (k-NN) algorithm with dimension reduction. The k-NN algorithm can be used for classification tasks, and dimension reduction can help reduce memory costs. Additionally, k-NN can be used for finding similar data points. K-NN is a simple algorithm that works well with high-dimensional data and can find similar data points. Agree. By reducing the number of dimensions, you may achieve comparable analysis results using less memory and in a shorter amount of time.
https://ealizadeh.com/blog/knn-and-kmeans/#:~:text=unsupervised%20learning%20algorithm.-,K%20in%20K%2DMeans%20refers%20to%20the%20number%20of%20clusters,using%20different%20values%20for%20K. KNN does use a lot of memory because, as a lazy learner, it stores (memorizes) the training dataset. AWS SageMaker, however, has an improved version of this algorithm. Because the question does not mention that we have labels, we cannot use supervised learning. K-means: unsupervised, and it helps us classify different vegetables based on their many features. It also finds "similar data points for each test data point". C - https://www.examtopics.com/discussions/amazon/view/98308-exam-aws-certified-machine-learning-specialty-topic-1/
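As an illustration of answer A (a rough sketch only; the dataset, component count, and neighbour count are placeholders), dimension reduction can be chained in front of k-NN like this, and the fitted k-NN can then return the most similar training points for any test point:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in for the vegetable feature matrix and labels.
X, y = load_iris(return_X_y=True)

# Reduce the feature space first, then classify by nearest neighbours;
# the smaller representation is what saves memory at train and serve time.
pipeline = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
pipeline.fit(X, y)

# kneighbors() returns the most similar training points for each test point.
reduced = pipeline[:-1].transform(X[:3])
distances, neighbor_indices = pipeline[-1].kneighbors(reduced)
```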
224
225 - A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal. What should the data scientist do to identify and address training issues with the LEAST development effort? - A.. Use CPU utilization metrics that are captured in Amazon CloudWatch. Configure a CloudWatch alarm to stop the training job early if low CPU utilization occurs. B.. Use high-resolution custom metrics that are captured in Amazon CloudWatch. Configure an AWS Lambda function to analyze the metrics and to stop the training job early if issues are detected. C.. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. D.. Use the SageMaker Debugger confusion and feature_importance_overweight built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
C - It has to be C. Option C uses the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. This option is the most efficient because it leverages the existing features of SageMaker Debugger to monitor and troubleshoot your training job without requiring any additional development effort. You can use the following steps to implement this option. Answer is C The best option for the data scientist to identify and address training issues with the least development effort is option C: Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. SageMaker Debugger is a tool that helps to debug machine learning training processes. It provides several built-in rules that can detect and diagnose common issues that can occur during training. In this case, the data scientist suspects that the training is not converging and that resource utilization is not optimal. The vanishing_gradient and LowGPUUtilization rules can help to identify these issues. C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. The SageMaker Debugger is a built-in tool that helps with debugging and profiling machine learning models trained in SageMaker. In this scenario, the data scientist suspects that there are issues with the training process, so using the SageMaker Debugger is the most appropriate solution. The vanishing_gradient and LowGPUUtilization built-in rules can detect common training issues such as a vanishing gradient problem or low GPU utilization, which could affect the training convergence and resource utilization. By launching the StopTrainingJob action if issues are detected, the training job can be stopped early, which can help to save resources and time. This approach requires the least development effort, as it is built-in to SageMaker and does not require the data scientist to create custom metrics or configure CloudWatch alarms. https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html - https://www.examtopics.com/discussions/amazon/view/98200-exam-aws-certified-machine-learning-specialty-topic-1/
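For reference, a rough sketch of how option C might be wired up with the SageMaker Python SDK. The role, script, instance types, S3 paths, and framework version are placeholders, and the rule and action helpers are named as I recall them from the SDK's built-in rule configs, so verify them against the current documentation:

```python
from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.pytorch import PyTorch

# vanishing_gradient watches for non-converging training; LowGPUUtilization
# flags wasted accelerator capacity. StopTraining ends the job when a rule fires.
rules = [
    Rule.sagemaker(
        rule_configs.vanishing_gradient(),
        actions=rule_configs.ActionList(rule_configs.StopTraining()),
    ),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
]

estimator = PyTorch(
    entry_point="train.py",                                   # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerRole",      # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py39",
    rules=rules,
)
estimator.fit({"training": "s3://my-bucket/train/"})          # placeholder S3 URI
```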
225
226 - A bank wants to launch a low-rate credit promotion campaign. The bank must identify which customers to target with the promotion and wants to make sure that each customer's full credit history is considered when an approval or denial decision is made. The bank's data science team used the XGBoost algorithm to train a classification model based on account transaction features. The data science team deployed the model by using the Amazon SageMaker model hosting service. The accuracy of the model is sufficient, but the data science team wants to be able to explain why the model denies the promotion to some customers. What should the data science team do to meet this requirement in the MOST operationally efficient manner? - A.. Create a SageMaker notebook instance. Upload the model artifact to the notebook. Use the plot_importance() method in the Python XGBoost interface to create a feature importance chart for the individual predictions. B.. Retrain the model by using SageMaker Debugger. Configure Debugger to calculate and collect Shapley values. Create a chart that shows features and Shapley Additive explanations (SHAP) values to explain how the features affect the model outcomes. C.. Set up and run an explainability job powered by SageMaker Clarify to analyze the individual customer data, using the training data as a baseline. Create a chart that shows features and Shapley Additive explanations (SHAP) values to explain how the features affect the model outcomes. D.. Use SageMaker Model Monitor to create Shapley values that help explain model behavior. Store the Shapley values in Amazon S3. Create a chart that shows features and Shapley Additive explanations (SHAP) values to explain how the features affect the model outcomes.
C - As the model has already been trained and deployed, I will go with C, because B (SageMaker Debugger) is used at training time. C is right. B and C are possible solutions, but the question requested the MOST operationally efficient manner. SageMaker Clarify provides tools to help ML modelers and developers understand model characteristics as a whole prior to deployment and to debug predictions provided by the model after it’s deployed. Option B is not recommended because retraining the model with SageMaker Debugger and configuring Debugger to calculate and collect Shapley values is time-consuming and may not be operationally efficient. It seems that both B and C are possible answers. SHAP baselines can be provided by both. The scenario says nothing about concern of bias, so perhaps Clarify is overkill? This post from AWS seems to be specifically addressing this case, and uses SageMaker Debugger. https://aws.amazon.com/blogs/machine-learning/ml-explainability-with-amazon-sagemaker-debugger/ Selected Answer: B. Retrain the model by using SageMaker Debugger. Configure Debugger to calculate and collect Shapley values. Create a chart that shows features and Shapley Additive explanations (SHAP) values to explain how the features affect the model outcomes. While A, C, and D are all options for explaining the model's behavior, the most efficient way to meet the bank's requirements is to use SageMaker Debugger to calculate and collect Shapley values for each prediction. This allows the data science team to easily explain why the model denied the promotion to certain customers. SageMaker Debugger also provides built-in integration with SageMaker Studio, which enables data scientists to visualize the Shapley values and other debugging information through a user-friendly interface. C for explainability: the question says it "wants to be able to explain why" for the current solution, not retrain it, and asks for explainability. I think it's D. Model Monitor is automatically integrated with Clarify. B. Retrain the model by using SageMaker Debugger. Configure Debugger to calculate and collect Shapley values. Create a chart that shows features and Shapley Additive explanations (SHAP) values to explain how the features affect the model outcomes. That would be the most operationally efficient way to meet the requirement of explaining why the model denies the promotion to some customers. The question needs explainability of the features for the predictions ("wants to be able to explain why"), not retraining of the model. The question asks to "explain how the features affect the model outcome"; Clarify is the best solution. The key is training data https://www.amazonaws.cn/en/sagemaker/clarify/ Amazon SageMaker Clarify provides machine learning developers with greater visibility into their training data and models so they can identify and limit bias and explain predictions. I think B is correct. It's between B and C. SageMaker Clarify is used to promote transparency and accountability in machine learning models. That's what we are looking for: why the model denies the promotion to some customers. "Explain individual model predictions: Customers and internal stakeholders both want transparency into how models make their predictions. SageMaker Clarify integrates with SageMaker Experiments to show you the importance of each model input for a specific prediction. Results can be made available to customer-facing employees so that they have an understanding of the model’s behavior when making decisions based on model predictions."
https://www.amazonaws.cn/en/sagemaker/clarify/ - https://www.examtopics.com/discussions/amazon/view/103014-exam-aws-certified-machine-learning-specialty-topic-1/
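A hedged sketch of option C with the SageMaker Python SDK's Clarify helpers follows. The role, bucket, model name, feature names, and SHAP baseline are all placeholders invented for illustration:

```python
from sagemaker import clarify

# Placeholder role, bucket, deployed model name, and feature columns.
processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",        # training data as the baseline source
    s3_output_path="s3://my-bucket/clarify-output/",
    label="promotion_approved",
    headers=["promotion_approved", "balance", "num_transactions", "tenure"],
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="xgboost-promotion-model",                  # the already-deployed model
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# Per-feature SHAP values explain individual approval/denial decisions.
shap_config = clarify.SHAPConfig(
    baseline=[[0.0, 0.0, 0.0]],                            # one row per baseline sample
    num_samples=100,
    agg_method="mean_abs",
)

processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```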
226
227 - A company has hired a data scientist to create a loan risk model. The dataset contains loan amounts and variables such as loan type, region, and other demographic variables. The data scientist wants to use Amazon SageMaker to test bias regarding the loan amount distribution with respect to some of these categorical variables. Which pretraining bias metrics should the data scientist use to check the bias distribution? (Choose three.) - A.. Class imbalance B.. Conditional demographic disparity C.. Difference in proportions of labels D.. Jensen-Shannon divergence E.. Kullback-Leibler divergence F.. Total variation distance
DEF - All are valid answers, so definitely an unscored question https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html Jensen-Shannon divergence, Kullback-Leibler divergence, and total variation distance are all measures of statistical distance between probability distributions. They are useful for understanding how different two distributions are, but they are not specifically designed to measure bias in categorical variables. https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html It is leaning towards bias in data, rather than probability distribution. Bias in _data_ before training uses the following metrics: B - helps assess whether there is bias in how loan amounts are distributed among different categories; C - compares the proportions of positive (e.g., approved loans) and negative (e.g., rejected loans) outcomes across different facets (demographic groups); F - a high total variation distance between the predicted and observed labels suggests potential bias. The question asks for distributions. DEF are distributions. ABC are imbalances or disparities. It is indicated on the official website that D, E and F are used to determine how different the distributions for loan application outcomes are for different demographic groups https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html All valid answers? They are listed as "pre-training bias" here https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html Jensen-Shannon divergence, Kullback-Leibler divergence, and total variation distance are used to measure differences between probability distributions, but they are not specifically pretraining bias metrics for checking bias distribution concerning categorical variables in this context. Confusing, but I lean towards B, D and F. Confusing with the DEF answers. "How different are the distributions for loan application outcomes for different demographic groups?" https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html D. Jensen-Shannon divergence and E. Kullback-Leibler divergence are post-training bias metrics that measure the distance between two probability distributions. They are not pretraining bias metrics and cannot be used to check the bias distribution of the dataset. F. Total variation distance is a post-training bias metric that measures the difference between two probability distributions. It is not a pretraining bias metric and cannot be used to check the bias distribution of the dataset. They are all pretraining metrics https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html This is correct... they are all valid answers. Seems this is one of the unscored questions... those 15 that are used to calibrate or test possible future questions. Agree with austinoy, the answer should be DEF. https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html Based on the link, shouldn't DEF be the answers? - https://www.examtopics.com/discussions/amazon/view/103018-exam-aws-certified-machine-learning-specialty-topic-1/
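For intuition about what D, E, and F measure, here is a small sketch (made-up binned loan-amount distributions for two facets; SciPy assumed) of the three distribution-distance metrics that SageMaker Clarify computes for you:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Hypothetical binned loan-amount distributions for two demographic facets.
p = np.array([0.30, 0.40, 0.20, 0.10])   # facet A
q = np.array([0.10, 0.30, 0.40, 0.20])   # facet B

kl = entropy(p, q)                  # Kullback-Leibler divergence KL(P || Q)
js = jensenshannon(p, q) ** 2       # SciPy returns the JS distance, i.e. sqrt(divergence)
tvd = 0.5 * np.abs(p - q).sum()     # total variation distance
print(kl, js, tvd)
```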
227
228 - A retail company wants to use Amazon Forecast to predict daily stock levels of inventory. The cost of running out of items in stock is much higher for the company than the cost of having excess inventory. The company has millions of data samples for multiple years for thousands of items. The company’s purchasing department needs to predict demand for 30-day cycles for each item to ensure that restocking occurs. A machine learning (ML) specialist wants to use item-related features such as "category," "brand," and "safety stock count." The ML specialist also wants to use a binary time series feature that has "promotion applied?" as its name. Future promotion information is available only for the next 5 days. The ML specialist must choose an algorithm and an evaluation metric for a solution to produce prediction results that will maximize company profit. Which solution will meet these requirements? - A.. Train a model by using the Autoregressive Integrated Moving Average (ARIMA) algorithm. Evaluate the model by using the Weighted Quantile Loss (wQL) metric at 0.75 (P75). B.. Train a model by using the Autoregressive Integrated Moving Average (ARIMA) algorithm. Evaluate the model by using the Weighted Absolute Percentage Error (WAPE) metric. C.. Train a model by using the Convolutional Neural Network - Quantile Regression (CNN-QR) algorithm. Evaluate the model by using the Weighted Quantile Loss (wQL) metric at 0.75 (P75). D.. Train a model by using the Convolutional Neural Network - Quantile Regression (CNN-QR) algorithm. Evaluate the model by using the Weighted Absolute Percentage Error (WAPE) metric.
C - wQL is particularly useful when there are different costs for underpredicting and overpredicting. By setting the weight (τ) of the wQL function, you can automatically incorporate differing penalties for underpredicting and overpredicting. A. NO - a lot of data is available, better to use CNN-QR B. NO - a lot of data is available, better to use CNN-QR C. YES - wQL is particularly useful when there are different costs for underpredicting and overpredicting (https://docs.aws.amazon.com/forecast/latest/dg/metrics.html#metrics-wQL) D. NO - WAPE will measure deviation, but not over- vs. under-forecasting. It should be A, I think. CNN is for image analysis. Option C also suggests evaluating the model by using the Weighted Quantile Loss (wQL) metric at 0.75 (P75). This metric measures the accuracy of a model at a specified quantile, which is a point in the distribution of possible outcomes. For example, P75 means that 75% of the outcomes are below that point, and 25% are above it. This metric is suitable for your use case because it can incorporate different costs for underpredicting and overpredicting. Since the cost of running out of items in stock is much higher for your company than the cost of having excess inventory, you can set a high weight (τ) for the wQL function to penalize underpredictions more than overpredictions. This way, you can optimize your model to produce prediction results that will maximize your company profit. "A retail company wants to use Amazon Forecast to predict daily stock levels of inventory. The cost of running out of items in stock is much higher for the company than the cost of having excess inventory. The company has millions of data samples for multiple years for thousands of items. The company’s purchasing department needs to predict demand for 30-day cycles for each item to ensure that restocking occurs." So, what is an argument for A here? I'll go with C? https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-algo-cnnqr.html https://docs.aws.amazon.com/forecast/latest/dg/metrics.html#metrics-wQL https://docs.aws.amazon.com/forecast/latest/dg/metrics.html https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-arima.html https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-algo-cnnqr.html - https://www.examtopics.com/discussions/amazon/view/103021-exam-aws-certified-machine-learning-specialty-topic-1/
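As a worked illustration of why wQL at P75 suits asymmetric costs (a sketch only, with made-up numbers; the scaling follows the way Amazon Forecast normalizes the metric, as I understand it):

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    # Under-forecasts are weighted by tau, over-forecasts by (1 - tau);
    # tau = 0.75 penalizes running out of stock more than holding excess.
    under = tau * np.maximum(y_true - y_pred, 0.0)
    over = (1.0 - tau) * np.maximum(y_pred - y_true, 0.0)
    return under + over

y_true = np.array([100.0, 80.0, 120.0])   # made-up actual demand
y_p75 = np.array([110.0, 85.0, 115.0])    # made-up P75 forecasts

# Scaled by total demand, roughly as Amazon Forecast reports wQL[0.75].
wql_075 = 2 * quantile_loss(y_true, y_p75, tau=0.75).sum() / np.abs(y_true).sum()
print(wql_075)
```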
228
229 - An online retail company wants to develop a natural language processing (NLP) model to improve customer service. A machine learning (ML) specialist is setting up distributed training of a Bidirectional Encoder Representations from Transformers (BERT) model on Amazon SageMaker. SageMaker will use eight compute instances for the distributed training. The ML specialist wants to ensure the security of the data during the distributed training. The data is stored in an Amazon S3 bucket. Which combination of steps should the ML specialist take to protect the data during the distributed training? (Choose three.) - A.. Run distributed training jobs in a private VPC. Enable inter-container traffic encryption. B.. Run distributed training jobs across multiple VPCs. Enable VPC peering. C.. Create an S3 VPC endpoint. Then configure network routes, endpoint policies, and S3 bucket policies. D.. Grant read-only access to SageMaker resources by using an IAM role. E.. Create a NAT gateway. Assign an Elastic IP address for the NAT gateway. F.. Configure an inbound rule to allow traffic from a security group that is associated with the training instances.
ACD - A - Running the training jobs in a private VPC will ensure that the data is transmitted over an encrypted channel. Enabling inter-container traffic encryption will encrypt data that is transmitted between containers. This will help protect the data during the distributed training. C - Creating an S3 VPC endpoint will provide a secure and private connection between the VPC and the S3 bucket. Network routes, endpoint policies, and S3 bucket policies can be configured to further secure the data during the distributed training. D - Granting read-only access to SageMaker resources by using an IAM role will ensure that the data is only accessed by the necessary resources during the distributed training. This will help prevent unauthorized access to the data. I don't agree with E because it assigns read-only access to SageMaker; SageMaker doesn't support resource-based policies. https://docs.aws.amazon.com/sagemaker/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-resource-based-policies A, C, D. I am not sure, but F says to allow an inbound rule for the training instance's security group? Which security group's inbound rule, S3? It is a distraction; I think F means an S3 inbound rule, which is nonsense. A and C are final; I think E is the third option, where inbound traffic cannot access VPC resources. Changing my options to ACD. Will go for A, C & F. Please check the keyword "distributed" in the question. In distributed training, instances within the same security group are required to communicate with each other, which is configured by allowing inbound traffic through the security group. Check this section (Configure the VPC Security Group) in this document - https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html A --> This ensures that the data is not exposed to the public internet and that all traffic between containers is encrypted. C --> This ensures that all traffic between the Amazon SageMaker instances and the S3 bucket is kept within the VPC and is not exposed to the public internet. The endpoint policies and S3 bucket policies can be used to control access to the data. D --> This ensures that only authorized users can access the SageMaker resources. Running jobs across multiple VPCs (option B) is not necessary, as running jobs in a private VPC provides sufficient security. Creating a NAT gateway and assigning an Elastic IP address for the NAT gateway (option E) is not necessary, as it does not provide any additional security benefits. Configuring an inbound rule to allow traffic from a security group that is associated with the training instances (option F) is not necessary, as it does not provide any additional security benefits, especially in the presence of the private endpoint. Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html A. YES - need a private VPC; inter-container traffic encryption is optional but OK B. NO - no need for multiple VPCs C. YES - an S3 VPC endpoint will prevent the traffic from flowing through the internet D. YES - SageMaker resources (the instances here) need to read the S3 files E. NO - a NAT gateway allows outbound traffic from a private subnet to the internet; not needed F. NO - the training instances do not need to receive inbound connections ACD is what I am going with. It was a tough choice between D and F, but when I look at protecting the data as the main point of the question, I went with D (read-only to S3). I think the best combination of steps is A, C, and D. Option B is wrong, as it makes the process too complex. Options A, C, and D are correct.
Option F is wrong, as inbound rules are not relevant to S3. Finally, option E is unnecessary. Based on the context, the inbound rule would apply to the data, which is stored in S3; inbound rules are not relevant to S3, so D should be correct instead of F. A, C and F look correct. Configure the VPC security group: in distributed training, you must allow communication between the different containers in the same training job. To do that, configure a rule for your security group that allows inbound connections between members of the same security group. For EFA-enabled instances, ensure that both inbound and outbound connections allow all traffic from the same security group. For information, see Security Group Rules in the Amazon Virtual Private Cloud User Guide. A, B pretty sure; D best guess. https://docs.aws.amazon.com/sagemaker/latest/dg/train-encrypt.html https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html My best "guess" is ACF - https://www.examtopics.com/discussions/amazon/view/103024-exam-aws-certified-machine-learning-specialty-topic-1/
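A minimal sketch of how A, C, and D show up in training code, assuming the S3 VPC endpoint, subnets, security group, and read-only IAM role already exist (all identifiers below are placeholders):

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_bert.py",                              # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerReadOnly",  # placeholder role (D)
    instance_count=8,
    instance_type="ml.p3.16xlarge",
    framework_version="1.13",
    py_version="py39",
    subnets=["subnet-0abc1234"],                # private subnets in the VPC (A)
    security_group_ids=["sg-0abc1234"],
    encrypt_inter_container_traffic=True,       # inter-container encryption (A)
)
# Traffic to S3 stays on the AWS network via the S3 VPC endpoint (C),
# which is configured on the VPC route tables, not in this code.
estimator.fit({"training": "s3://secure-training-bucket/data/"})
```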
229
230 - An analytics company has an Amazon SageMaker hosted endpoint for an image classification model. The model is a custom-built convolutional neural network (CNN) and uses the PyTorch deep learning framework. The company wants to increase throughput and decrease latency for customers that use the model. Which solution will meet these requirements MOST cost-effectively? - A.. Use Amazon Elastic Inference on the SageMaker hosted endpoint. B.. Retrain the CNN with more layers and a larger dataset. C.. Retrain the CNN with more layers and a smaller dataset. D.. Choose a SageMaker instance type that has multiple GPUs.
A - Just additional information: Elastic Inference is being deprecated and the recommendation is to use AWS Inferentia. Option A can help you meet your requirements most cost-effectively because it enables you to choose the instance type that is best suited to the overall compute and memory needs of your application, and then separately specify the amount of inference acceleration that you need. This reduces inference costs by up to 75% because you no longer need to over-provision GPU compute for inference. We want to improve the inference of the model. That said, options B and C do not solve this problem. Option D solves it, but at a very high cost. Option A is correct, as we solve the problem at the lowest possible cost. Using Amazon Elastic Inference on the SageMaker hosted endpoint would be the most cost-effective solution for increasing throughput and decreasing latency. Amazon Elastic Inference is a service that allows you to attach GPU-powered inference acceleration to Amazon SageMaker hosted endpoints and EC2 instances. By attaching an Elastic Inference accelerator to the SageMaker endpoint, you can achieve better performance with lower costs than using a larger, more expensive instance type. "Cost efficient", therefore A, based on slide 20: https://pages.awscloud.com/rs/112-TZM-766/images/AL-ML%20for%20Startups%20-%20Select%20the%20Right%20ML%20Instance.pdf - https://www.examtopics.com/discussions/amazon/view/103027-exam-aws-certified-machine-learning-specialty-topic-1/
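A short sketch of option A with the SageMaker Python SDK (model artifact, role, versions, and accelerator size are illustrative placeholders; note the deprecation caveat above):

```python
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",                 # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",      # placeholder role
    entry_point="inference.py",
    framework_version="1.5.1",
    py_version="py3",
)

# Attach a fractional GPU accelerator to a cheap CPU host instead of
# paying for a full multi-GPU instance (option D).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    accelerator_type="ml.eia2.medium",
)
```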
230
231 - An ecommerce company is collecting structured data and unstructured data from its website, mobile apps, and IoT devices. The data is stored in several databases and Amazon S3 buckets. The company is implementing a scalable repository to store structured data and unstructured data. The company must implement a solution that provides a central data catalog, self-service access to the data, and granular data access policies and encryption to protect the data. Which combination of actions will meet these requirements with the LEAST amount of setup? (Choose three.) - A.. Identify the existing data in the databases and S3 buckets. Link the data to AWS Lake Formation. B.. Identify the existing data in the databases and S3 buckets. Link the data to AWS Glue. C.. Run AWS Glue crawlers on the linked data sources to create a central data catalog. D.. Apply granular access policies by using AWS Identity and Access Management (IAM). Configure server-side encryption on each data source. E.. Apply granular access policies and encryption by using AWS Lake Formation. F.. Apply granular access policies and encryption by using AWS Glue.
ACE - Does Lake Formation provide access policies (IAM)? Encryption? (Lake Formation supports SSE with KMS.) ACE is correct. A. YES - AWS Lake Formation is fully managed & integrated B. NO - we want to use Lake Formation instead of raw Glue (Lake Formation is built on top of Glue) C. YES - Glue is used in conjunction with Lake Formation (https://docs.aws.amazon.com/lake-formation/latest/dg/glue-features-lf.html) D. NO - applying granular access policies through IAM and configuring server-side encryption on each data source separately requires more setup than Lake Formation E. YES - AWS Lake Formation is fully managed & integrated F. NO - we want to use Lake Formation instead of raw Glue Agree, A, C, E https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html Lake Formation provides a single place to manage access controls for data in your data lake. You can define security policies that restrict access to data at the database, table, column, row, and cell levels. These policies apply to IAM users and roles, and to users and groups when federating through an external identity provider. You can use fine-grained controls to access data secured by Lake Formation within Amazon Redshift Spectrum, Athena, AWS Glue ETL, and Amazon EMR for Apache Spark. Whenever you create IAM identities, make sure to follow IAM best practices. ACE looks good. I will choose ACE https://aws.amazon.com/blogs/big-data/build-secure-encrypted-data-lakes-with-aws-lake-formation/ ACE looks legit. ACE looks correct. I'll go with ACE? A: https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html C: https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation.html D: https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html Option D is incorrect; in fact, server-side encryption would be discarded in favor of Lake Formation (Glue) encryption, as per this AWS document - https://docs.aws.amazon.com/glue/latest/dg/encryption-security-configuration.html ACE are the correct options. - https://www.examtopics.com/discussions/amazon/view/103028-exam-aws-certified-machine-learning-specialty-topic-1/
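To give a flavour of the setup behind C and E, a hedged boto3 sketch follows; the crawler name, role, database, table, and principal ARNs are all placeholders:

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# Option C: crawl the linked S3 location to populate the central Data Catalog.
glue.create_crawler(
    Name="ecommerce-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="ecommerce_catalog",
    Targets={"S3Targets": [{"Path": "s3://ecommerce-raw-data/"}]},
)
glue.start_crawler(Name="ecommerce-data-crawler")

# Option E: grant fine-grained, table-level access through Lake Formation.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataScientist"},
    Resource={"Table": {"DatabaseName": "ecommerce_catalog", "Name": "orders"}},
    Permissions=["SELECT"],
)
```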
231
232 - A machine learning (ML) specialist is developing a deep learning sentiment analysis model that is based on data from movie reviews. After the ML specialist trains the model and reviews the model results on the validation set, the ML specialist discovers that the model is overfitting. Which solutions will MOST improve the model generalization and reduce overfitting? (Choose three.) - A.. Shuffle the dataset with a different seed. B.. Decrease the learning rate. C.. Increase the number of layers in the network. D.. Add L1 regularization and L2 regularization. E.. Add dropout. F.. Decrease the number of layers in the network.
DEF - A: possible but unlikely for movie reviews. B: wrong (https://deepchecks.com/question/does-learning-rate-affect-overfitting/). C: wrong because it would increase complexity and potentially overfitting. D: correct. E: correct. F: correct. Overfitting solutions should be regularization, dropout, and adjusting the learning rate. F is wrong; decreasing the number of layers is not a top recommendation to solve overfitting, and it may even cause underfitting. A. NO B. NO - decreasing the learning rate may increase accuracy and thus increase overfitting C. NO - more complexity tends to increase overfitting D. YES - best practice E. YES - best practice, will reduce model complexity and thus increase generalization F. YES - best practice, will reduce model complexity and thus increase generalization D and E for sure. I am a bit confused between F and B. To improve the generalization of the deep learning sentiment analysis model and reduce overfitting, the following three solutions can be implemented: Add dropout: dropout is a regularization technique that randomly drops out (sets to zero) a certain percentage of nodes in the neural network during each training epoch. This helps to prevent overfitting and improve generalization. Add L1 and L2 regularization: L1 and L2 regularization are techniques used to add a penalty to the loss function of the neural network, which helps to prevent overfitting. L1 regularization adds a penalty based on the absolute value of the weights, while L2 regularization adds a penalty based on the squared value of the weights. Decrease the number of layers in the network: deep neural networks with too many layers can be prone to overfitting. Reducing the number of layers in the network can help to prevent overfitting and improve generalization. We don't have to touch the learning rate because the model is overfitting. DEF are correct - https://www.examtopics.com/discussions/amazon/view/103029-exam-aws-certified-machine-learning-specialty-topic-1/
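A minimal PyTorch sketch of D and E (layer sizes, rates, and penalty strengths are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Small sentiment classifier with dropout (option E).
model = nn.Sequential(
    nn.Linear(300, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),       # randomly zeroes activations during training
    nn.Linear(128, 2),
)

# L2 regularization via weight decay (half of option D).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def loss_with_l1(logits, labels, l1_lambda=1e-5):
    # Cross-entropy plus an explicit L1 penalty on the weights (rest of option D).
    ce = nn.functional.cross_entropy(logits, labels)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return ce + l1_lambda * l1
```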
232
233 - An online advertising company is developing a linear model to predict the bid price of advertisements in real time with low-latency predictions. A data scientist has trained the linear model by using many features, but the model is overfitting the training dataset. The data scientist needs to prevent overfitting and must reduce the number of features. Which solution will meet these requirements? - A.. Retrain the model with L1 regularization applied. B.. Retrain the model with L2 regularization applied. C.. Retrain the model with dropout regularization applied. D.. Retrain the model by using more data.
A - A. YES - best to reduce the feature count B. NO - L2 will reduce large weights and smooth features, but not get rid of them C. NO - dropout is for neural networks D. NO - we are already converging, no need for more data. L1 regularization; A is correct. L1 shrinks the less important features' coefficients to zero, thus removing some features altogether. So this works well for feature selection when we have a huge number of features. Yes, L1 for feature reduction https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html - https://www.examtopics.com/discussions/amazon/view/103030-exam-aws-certified-machine-learning-specialty-topic-1/
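A small scikit-learn sketch of option A, using Lasso as a stand-in L1-regularized linear model on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the bid-price data: many features, few informative ones.
X, y = make_regression(n_samples=1000, n_features=50, n_informative=8, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)           # L1 penalty (option A)
kept = np.flatnonzero(lasso.coef_ != 0)      # features whose coefficients survived
print(f"{len(kept)} of {X.shape[1]} features kept")
```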
233
234 - A credit card company wants to identify fraudulent transactions in real time. A data scientist builds a machine learning model for this purpose. The transactional data is captured and stored in Amazon S3. The historic data is already labeled with two classes: fraud (positive) and fair transactions (negative). The data scientist removes all the missing data and builds a classifier by using the XGBoost algorithm in Amazon SageMaker. The model produces the following results: • True positive rate (TPR): 0.700 • False negative rate (FNR): 0.300 • True negative rate (TNR): 0.977 • False positive rate (FPR): 0.023 • Overall accuracy: 0.949 Which solution should the data scientist use to improve the performance of the model? - A.. Apply the Synthetic Minority Oversampling Technique (SMOTE) on the minority class in the training dataset. Retrain the model with the updated training data. B.. Apply the Synthetic Minority Oversampling Technique (SMOTE) on the majority class in the training dataset. Retrain the model with the updated training data. C.. Undersample the minority class. D.. Oversample the majority class.
A - SMOTE on the minority class for unbalanced data. A. YES - we want to oversample the minority class = fraud B. NO - we want more fraudulent cases C. NO - we want more fraudulent cases D. NO - we want more fraudulent cases. By applying SMOTE, you can balance the class distribution and increase the diversity of your data, which can help your model learn better and reduce overfitting. You can use the imbalanced-learn library in Python to implement SMOTE on your data. https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5 - https://www.examtopics.com/discussions/amazon/view/103032-exam-aws-certified-machine-learning-specialty-topic-1/
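A minimal sketch of option A with the imbalanced-learn library (synthetic stand-in data; the class ratio is made up):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced stand-in for the fraud data (about 3% positives).
X, y = make_classification(n_samples=10_000, n_features=20, weights=[0.97, 0.03], random_state=0)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_resampled))
# The XGBoost classifier would then be retrained on (X_resampled, y_resampled).
```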
234
235 - A company is training machine learning (ML) models on Amazon SageMaker by using 200 TB of data that is stored in Amazon S3 buckets. The training data consists of individual files that are each larger than 200 MB in size. The company needs a data access solution that offers the shortest processing time and the least amount of setup. Which solution will meet these requirements? - A.. Use File mode in SageMaker to copy the dataset from the S3 buckets to the ML instance storage. B.. Create an Amazon FSx for Lustre file system. Link the file system to the S3 buckets. C.. Create an Amazon Elastic File System (Amazon EFS) file system. Mount the file system to the training instances. D.. Use FastFile mode in SageMaker to stream the files on demand from the S3 buckets.
D - When to use fast file mode: for larger datasets with larger files (more than 50 MB per file), the first option is to try fast file mode, which is more straightforward to use than FSx for Lustre because it doesn't require creating a file system or connecting to a VPC. Fast file mode is ideal for large file containers (more than 150 MB), and might also do well with files more than 50 MB. https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html#model-access-training-data-best-practices https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-sagemaker-fast-file-mode/ Least setup is D; B could work but requires more setup! D. Amazon SageMaker now supports Fast File Mode for accessing data in training jobs. This enables high-performance data access by streaming directly from Amazon S3 with no code changes from the existing File Mode. For example, training a K-Means clustering model on a 100 GB dataset took 28 minutes with File Mode but only 5 minutes with Fast File Mode (an 82% decrease). https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-sagemaker-fast-file-mode/ B. Yes. Please look at this link (https://aws.amazon.com/blogs/aws/enhanced-amazon-s3-integration-for-amazon-fsx-for-lustre/) A. NO - the files are too big and will fill the instance storage for no reason B. NO - Lustre creates stripes for each file on different hard drives, maximizing throughput; our challenge is more about the volume of data to be made available on the training instance, not throughput C. NO - EFS supports file semantics, but does not change any system property D. YES - FastFile allows training to start before the full file has been downloaded (like Pipe Mode) but does not require code changes. Changing to D. Although D is very tempting, I'm leaning towards B https://aws.amazon.com/blogs/machine-learning/ensure-efficient-compute-resources-on-amazon-sagemaker/ When to use fast file mode: for larger datasets with larger files (more than 50 MB per file), the first option is to try fast file mode, which is more straightforward to use than FSx for Lustre because it doesn't require creating a file system or connecting to a VPC. Fast file mode is ideal for large file containers (more than 150 MB), and might also do well with files more than 50 MB. Because fast file mode provides a POSIX interface, it supports random reads (reading non-sequential byte-ranges). However, this is not the ideal use case, and your throughput might be lower than with sequential reads. However, if you have a relatively large and computationally intensive ML model, fast file mode might still be able to saturate the effective bandwidth of the training pipeline and not result in an IO bottleneck. Option D, FastFile mode, streams files on demand from S3 buckets to the training instance, which can be efficient for small datasets but may not be optimal for large datasets. Moreover, this solution does not provide a file system that is optimized for high performance, and it may require additional development effort to set up. B, because we have 200 TB https://saturncloud.io/blog/using-aws-sagemaker-input-modes-amazon-s3-efs-or-fsx/ Fast File Mode combines the ease of use of the existing File Mode with the performance of Pipe Mode. This provides convenient access to data as if it was downloaded locally, while offering the performance benefit of streaming the data directly from Amazon S3.
No code changes or lengthy setup required. The solution that meets the requirements of the company is B, which involves creating an Amazon FSx for Lustre file system and linking it to the S3 buckets. Amazon FSx for Lustre is a fully managed, high-performance file system optimized for compute-intensive workloads, such as machine learning training. It is designed to provide low latencies and high throughput for processing large data sets, and it can directly access data from S3 buckets without any data movement or copying. This solution requires minimal setup and provides the shortest processing time since the data can be accessed in parallel by multiple instances. I will go with D. https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/ - https://www.examtopics.com/discussions/amazon/view/103033-exam-aws-certified-machine-learning-specialty-topic-1/
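A short sketch of option D with the SageMaker Python SDK (image URI, role, instance types, and S3 prefix are placeholders):

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/training-image:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",                             # placeholder
    instance_count=4,
    instance_type="ml.p4d.24xlarge",
)

# FastFile mode streams objects from S3 on demand: no file system to create,
# no code changes versus File mode.
train_input = TrainingInput(s3_data="s3://training-data-bucket/prefix/", input_mode="FastFile")
estimator.fit({"training": train_input})
```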
235
236 - An online store is predicting future book sales by using a linear regression model that is based on past sales data. The data includes duration, a numerical feature that represents the number of days that a book has been listed in the online store. A data scientist performs an exploratory data analysis and discovers that the relationship between book sales and duration is skewed and non-linear. Which data transformation step should the data scientist take to improve the predictions of the model? - A.. One-hot encoding B.. Cartesian product transformation C.. Quantile binning D.. Normalization
C - C. Quantile binning: quantile binning (or discretization) involves dividing a continuous variable into bins based on quantiles. This can be useful for handling skewed data by distributing the data more evenly across the bins. However, this method transforms the numerical feature into a categorical one, which might not be ideal for preserving the ordinal nature and the detailed variance of the 'duration' feature in a regression model. If the choice must be made from the given options, option C (quantile binning) might be the most suitable, albeit not ideal, as it can at least help in dealing with skewed distributions by distributing the data across bins more evenly. However, the data scientist should consider logarithmic or polynomial transformations for a more direct approach to addressing non-linearity. https://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html A. NO - One-hot encoding is for featurization of categories B. NO - the Cartesian product transformation combines categorical features and does not address skew C. YES - Quantile binning can make data linear (https://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html#quantile-binning-transformation) D. NO - Normalization will recenter the data, not change the relationship. Quantile binning; C is correct. C is the best answer. I guess the correct answer is C, quantile binning. This transformation divides the data into quantiles (equal-sized intervals) based on the values of the feature (in this case, duration) and replaces the values with the bin number. This transformation can help capture non-linear relationships between features by creating more representative categories for skewed data. The transformed data can then be used to train a non-linear regression model, such as a polynomial regression, to better predict future book sales. - https://www.examtopics.com/discussions/amazon/view/103035-exam-aws-certified-machine-learning-specialty-topic-1/
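A small pandas sketch of quantile binning on a hypothetical skewed duration feature (distribution and bin count are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed "duration" feature (days a book has been listed).
rng = np.random.default_rng(0)
df = pd.DataFrame({"duration": rng.exponential(scale=60, size=1000).round()})

# Quantile binning: each bin holds roughly the same number of rows,
# which evens out the skew before the regression model sees the feature.
df["duration_bin"] = pd.qcut(df["duration"], q=10, labels=False, duplicates="drop")
print(df["duration_bin"].value_counts().sort_index())
```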
236
237 - A company's data engineer wants to use Amazon S3 to share datasets with data scientists. The data scientists work in three departments: Finance, Marketing, and Human Resources. Each department has its own IAM user group. Some datasets contain sensitive information and should be accessed only by the data scientists from the Finance department. How can the data engineer set up access to meet these requirements? - A.. Create an S3 bucket for each dataset. Create an ACL for each S3 bucket. For each S3 bucket that contains a sensitive dataset, set the ACL to allow access only from the Finance department user group. Allow all three department user groups to access each S3 bucket that contains a non-sensitive dataset. B.. Create an S3 bucket for each dataset. For each S3 bucket that contains a sensitive dataset, set the bucket policy to allow access only from the Finance department user group. Allow all three department user groups to access each S3 bucket that contains a non-sensitive dataset. C.. Create a single S3 bucket that includes two folders to separate the sensitive datasets from the non-sensitive datasets. For the Finance department user group, attach an IAM policy that provides access to both folders. For the Marketing and Human Resources department user groups, attach an IAM policy that provides access to only the folder that contains the non-sensitive datasets. D.. Create a single S3 bucket that includes two folders to separate the sensitive datasets from the non-sensitive datasets. Set the policy for the S3 bucket to allow only the Finance department user group to access the folder that contains the sensitive datasets. Allow all three department user groups to access the folder that contains the non-sensitive datasets.
C - The goal is to provide secure and efficient access to datasets stored in Amazon S3. Sensitive datasets should be accessible only to the Finance department, while non-sensitive datasets should be accessible to all user groups. S3 bucket policies are the most effective and scalable solution for implementing access control in this scenario. For the Marketing and Human Resources department user groups, attach an IAM policy that provides access to only the folder that contains the non-sensitive datasets. Finance department users also need access to non-sensitive datasets. I think attaching the policy is more flexible, in case this pattern needs to be repeated for another S3 bucket? You cannot identify a user group as a principal in a policy (such as a resource-based policy) because groups relate to permissions, not authentication, and principals are authenticated IAM entities. An Amazon S3 bucket policy cannot have a user group as the principal directly. https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_principal.html I stand corrected. I retract my previous answer. D: use a bucket policy. A user group cannot be a principal in an IAM policy; adding each individual user to the policy is not practical. According to the AWS documentation, you cannot specify an IAM group as a principal in an S3 bucket policy. This is because groups relate to permissions, not authentication, and principals are authenticated IAM entities. You can only specify the following principals in a policy: AWS account and root user, IAM user, federated user, IAM role. If you want to grant permission to an IAM group, you can add the ARNs of all the IAM users in that group to the S3 bucket policy instead. So it is C, to create two IAM policies and attach them to the different groups you have. REF: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-bucket-user-policy-specifying-principal-intro.html https://docs.aws.amazon.com/AmazonS3/latest/userguide/walkthrough1.html It does not show any option to use an IAM-group-based S3 bucket policy (so D cannot be the right answer). Changing to C. Option D suggests creating a single S3 bucket that includes two folders to separate the sensitive datasets from the non-sensitive datasets. This option is helpful because it can simplify the data management and reduce the cost of using multiple S3 buckets. You can use a single S3 bucket to store all your datasets and use folders to organize them by their sensitivity level. You can also use the Amazon S3 console or the AWS CLI to create and manage your folders. First, it is more efficient to use one single bucket; S3 has a limit of 100 buckets by default. Answer C creates two policies, while for answer D it is done with one: use a Deny on the sensitive folder for the two groups that are not Finance, and an Allow on the non-sensitive folder, knowing that Deny takes precedence. In an S3 bucket policy you CANNOT specify an IAM group as principal, but you can specify IAM users. So it's C. Option C https://stackoverflow.com/questions/35944349/iam-aws-s3-to-restrict-to-a-specific-sub-folder https://aws.amazon.com/blogs/security/how-to-restrict-amazon-s3-bucket-access-to-a-specific-iam-role/ I will choose C. I will choose C. Both B and D look correct, but they are not, because in an S3 bucket policy an IAM group can't be the principal. In other words, you can't give a user group access to S3 buckets using an S3 bucket policy.
It can only be an IAM user or role. https://stackoverflow.com/questions/30667678/s3-bucket-policy-how-to-allow-a-iam-group-from-another-account I would go for C; a single bucket looks a better option. Ease of management and still secure. Actually this is not possible. I will go for C https://stackoverflow.com/questions/30667678/s3-bucket-policy-how-to-allow-a-iam-group-from-another-account Creating a single S3 bucket that includes two folders to separate the sensitive datasets from the non-sensitive datasets would be the best approach. The policy can be set to allow only the Finance department user group to access the folder that contains the sensitive datasets. The folder that contains non-sensitive datasets can be made available to all three department user groups. This approach will ensure that sensitive datasets are only accessible to users who need access to them. - https://www.examtopics.com/discussions/amazon/view/103036-exam-aws-certified-machine-learning-specialty-topic-1/
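To illustrate option C, a hedged boto3 sketch of attaching a prefix-scoped policy to the non-Finance groups follows; the bucket, prefixes, group names, and policy name are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical bucket and prefixes; the Finance group would get a similar policy
# that also allows the "sensitive/" prefix.
non_sensitive_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::company-datasets/non-sensitive/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::company-datasets",
            "Condition": {"StringLike": {"s3:prefix": ["non-sensitive/*"]}},
        },
    ],
}

for group in ["Marketing", "HumanResources"]:       # placeholder group names
    iam.put_group_policy(
        GroupName=group,
        PolicyName="NonSensitiveDatasetAccess",
        PolicyDocument=json.dumps(non_sensitive_policy),
    )
```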
237
238 - A company operates an amusement park. The company wants to collect, monitor, and store real-time traffic data at several park entrances by using strategically placed cameras. The company’s security team must be able to immediately access the data for viewing. Stored data must be indexed and must be accessible to the company’s data science team. Which solution will meet these requirements MOST cost-effectively? - A.. Use Amazon Kinesis Video Streams to ingest, index, and store the data. Use the built-in integration with Amazon Rekognition for viewing by the security team. B.. Use Amazon Kinesis Video Streams to ingest, index, and store the data. Use the built-in HTTP live streaming (HLS) capability for viewing by the security team. C.. Use Amazon Rekognition Video and the GStreamer plugin to ingest the data for viewing by the security team. Use Amazon Kinesis Data Streams to index and store the data. D.. Use Amazon Kinesis Data Firehose to ingest, index, and store the data. Use the built-in HTTP live streaming (HLS) capability for viewing by the security team.
B - A and B are the only options to consider; A talks about Rekognition, which is not well suited for viewing video, so B. Amazon Kinesis Video Streams is a fully managed service that makes it easy to ingest, store, and analyze streaming video data. The built-in HTTP live streaming (HLS) capability allows the security team to view the data in real time. Amazon Kinesis Video Streams is a pay-per-use service, so the company will only be charged for the amount of data that it ingests, stores, and analyzes. It's B: real-time video ingestion = KVS (C and D are wrong); watch the footage = HLS (Rekognition would be for ML, which is not required, so A is wrong). No ML is involved here, so it's B. Selected B as per https://aws.amazon.com/about-aws/whats-new/2018/07/kinesis-video-adds-hls-support/ - https://www.examtopics.com/discussions/amazon/view/112449-exam-aws-certified-machine-learning-specialty-topic-1/
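A short boto3 sketch of the HLS viewing path in option B (the stream name is a placeholder):

```python
import boto3

stream = "park-entrance-cam-1"   # placeholder stream name

kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName=stream,
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

kvam = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)
hls_url = kvam.get_hls_streaming_session_url(
    StreamName=stream,
    PlaybackMode="LIVE",
)["HLSStreamingSessionURL"]

# The security team can open hls_url in any HLS-capable player or browser.
print(hls_url)
```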
238
239 - An engraving company wants to automate its quality control process for plaques. The company performs the process before mailing each customized plaque to a customer. The company has created an Amazon S3 bucket that contains images of defects that should cause a plaque to be rejected. Low-confidence predictions must be sent to an internal team of reviewers who are using Amazon Augmented AI (Amazon A2I). Which solution will meet these requirements? - A.. Use Amazon Textract for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review. B.. Use Amazon Rekognition for automatic processing. Use Amazon A2I with a private workforce option for manual review. C.. Use Amazon Transcribe for automatic processing. Use Amazon A2I with a private workforce option for manual review. D.. Use AWS Panorama for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review.
B - Amazon Rekognition for image analysis and an A2I private workforce for manual review. Amazon Rekognition is an image and video analysis service that can detect objects, scenes, and faces in images. The company can use Amazon Rekognition to automatically process images of plaques and identify defects that should cause a plaque to be rejected. Low-confidence predictions can be sent to an internal team of reviewers who are using Amazon A2I with a private workforce option for manual review. This will ensure that the plaques are thoroughly checked before being mailed to customers. A is not suitable because Amazon Textract is a service that extracts text and data from scanned documents. It is not designed for image analysis. C is also not suitable because Amazon Transcribe is a service that converts speech to text. It is not designed for image analysis. D is not suitable because AWS Panorama is a computer vision service that runs on cameras and other edge devices. It is not designed for analyzing images stored in an S3 bucket. Amazon Rekognition is a service that can be used to detect objects, faces, and text in images. Amazon A2I is a service that can be used to manage human review of predictions, such as reviewing images for defects. The private workforce option for Amazon A2I allows the engraving company to create a custom workforce of reviewers who are familiar with the types of defects that should cause a plaque to be rejected. Keywords are image processing and internal users for review. - https://www.examtopics.com/discussions/amazon/view/112450-exam-aws-certified-machine-learning-specialty-topic-1/
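A hedged boto3 sketch of option B's flow: score an image with a Rekognition Custom Labels model assumed to have been trained on the defect images, and send low-confidence results to an Amazon A2I human loop backed by the private workforce. The project version ARN, flow definition ARN, bucket, keys, and confidence threshold are placeholders:

```python
import json
import boto3

rekognition = boto3.client("rekognition")
a2i = boto3.client("sagemaker-a2i-runtime")

# Placeholder ARNs, bucket, key, and threshold.
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/plaque-defects/version/1",
    Image={"S3Object": {"Bucket": "plaque-images", "Name": "plaque-123.jpg"}},
    MinConfidence=0,
)

labels = response["CustomLabels"]
low_confidence = not labels or max(l["Confidence"] for l in labels) < 80

if low_confidence:
    # Route the image to the private reviewer workforce through Amazon A2I.
    a2i.start_human_loop(
        HumanLoopName="plaque-123-review",
        FlowDefinitionArn="arn:aws:sagemaker:us-east-1:123456789012:flow-definition/plaque-review",
        HumanLoopInput={"InputContent": json.dumps({"s3Uri": "s3://plaque-images/plaque-123.jpg"})},
    )
```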
239
240 - A machine learning (ML) engineer at a bank is building a data ingestion solution to provide transaction features to financial ML models. Raw transactional data is available in an Amazon Kinesis data stream. The solution must compute rolling averages of the ingested data from the data stream and must store the results in Amazon SageMaker Feature Store. The solution also must serve the results to the models in near real time. Which solution will meet these requirements? - A.. Load the data into an Amazon S3 bucket by using Amazon Kinesis Data Firehose. Use a SageMaker Processing job to aggregate the data and to load the results into SageMaker Feature Store as an online feature group. B.. Write the data directly from the data stream into SageMaker Feature Store as an online feature group. Calculate the rolling averages in place within SageMaker Feature Store by using the SageMaker GetRecord API operation. C.. Consume the data stream by using an Amazon Kinesis Data Analytics SQL application that calculates the rolling averages. Generate a result stream. Consume the result stream by using a custom AWS Lambda function that publishes the results to SageMaker Feature Store as an online feature group. D.. Load the data into an Amazon S3 bucket by using Amazon Kinesis Data Firehose. Use a SageMaker Processing job to load the data into SageMaker Feature Store as an offline feature group. Compute the rolling averages at query time.
C - From what I see, C is the only option that meets the time constraints. A. NO - no need for intermediary S3 storage. B. NO - Feature Store does not have built-in transformations. C. YES - https://aws.amazon.com/blogs/machine-learning/using-streaming-ingestion-with-amazon-sagemaker-feature-store-to-make-ml-backed-decisions-in-near-real-time/ D. NO - computing at query time is expensive; you want the aggregation done once and cached. Amazon Kinesis Data Analytics is a fully managed service that makes it easy to process streaming data, and its SQL feature can compute rolling averages over a stream. AWS Lambda is a serverless compute service that runs code without provisioning or managing servers, and SageMaker Feature Store is a managed service for storing and serving features to ML models. One question raised: Kinesis Data Analytics needs a Kinesis data stream or Firehose as its input, and the scenario already provides the data stream. B is wrong because Kinesis Data Streams cannot load data into Feature Store by itself (another service, such as Kinesis Data Firehose or a consumer, is needed), and the GetRecord API only reads records; it does not compute aggregations. D is wrong because it stores a feature that must be served quickly in an offline group. Since the pipeline starts with a Kinesis data stream and the rolling averages must be served in near real time, C guarantees this: KDS → Kinesis Data Analytics → Lambda (triggered quickly) → SageMaker Feature Store. A does not guarantee near-real-time results. KDA provides the rolling-average capability and meets the real-time requirement. - https://www.examtopics.com/discussions/amazon/view/112451-exam-aws-certified-machine-learning-specialty-topic-1/
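For reference, a minimal sketch of the Lambda piece of option C, assuming the function is triggered by the result Kinesis stream and that a hypothetical online feature group named transaction-features already exists with the fields shown:

```python
import base64
import json

import boto3

# Runtime client for writing to a SageMaker Feature Store online store
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

FEATURE_GROUP_NAME = "transaction-features"  # hypothetical feature group


def lambda_handler(event, context):
    """Write rolling averages computed upstream into the Feature Store online store."""
    for record in event["Records"]:
        # Kinesis delivers the record payload base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        featurestore_runtime.put_record(
            FeatureGroupName=FEATURE_GROUP_NAME,
            Record=[
                {"FeatureName": "customer_id", "ValueAsString": str(payload["customer_id"])},
                {"FeatureName": "rolling_avg_amount", "ValueAsString": str(payload["rolling_avg"])},
                {"FeatureName": "event_time", "ValueAsString": str(payload["event_time"])},
            ],
        )
    return {"records_written": len(event["Records"])}
```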
240
241 - Each morning, a data scientist at a rental car company creates insights about the previous day’s rental car reservation demands. The company needs to automate this process by streaming the data to Amazon S3 in near real time. The solution must detect high-demand rental cars at each of the company’s locations. The solution also must create a visualization dashboard that automatically refreshes with the most recent data. Which solution will meet these requirements with the LEAST development time? - A.. Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight. B.. Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight. C.. Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight. D.. Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight.
A - A. YES - QuickSight ML Insights has built-in anomaly/outlier detection (it uses a Random Cut Forest-based algorithm under the hood). B. NO - Kinesis Data Streams cannot deliver directly to Amazon S3. C. NO - training an RCF model in SageMaker is overkill when QuickSight already supports anomaly detection. D. NO - Kinesis Data Streams cannot deliver directly to Amazon S3. Both A and C could work technically, but A is quicker: https://aws.amazon.com/quicksight/features-ml/ Amazon Kinesis Data Firehose is a fully managed service that makes it easy to stream data to Amazon S3. Amazon QuickSight ML Insights detects outliers in your data using machine learning, and QuickSight is a fully managed business intelligence service for building dashboards that refresh automatically. B and D are wrong because KDS has no delivery capability of its own (you would need, for example, Firehose attached to the stream). A is correct because QuickSight already has tools to identify outliers; C would also work but requires more development than using what QuickSight provides. It should be Firehose for near-real-time delivery to S3, and for least development effort SageMaker is more complicated, so it's A. Keywords: near real time, visualization with minimal development effort. - https://www.examtopics.com/discussions/amazon/view/112452-exam-aws-certified-machine-learning-specialty-topic-1/
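For illustration only, a rough sketch of the Firehose delivery stream behind option A, created with boto3; the stream name, IAM role, bucket, and buffering settings are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# Firehose delivers the reservation events to S3 in near real time;
# all names and ARNs below are hypothetical.
firehose.create_delivery_stream(
    DeliveryStreamName="reservation-events-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::rental-reservation-data",
        "Prefix": "reservations/",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
    },
)
```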
241
242 - A company is planning a marketing campaign to promote a new product to existing customers. The company has data for past promotions that are similar. The company decides to try an experiment to send a more expensive marketing package to a smaller number of customers. The company wants to target the marketing campaign to customers who are most likely to buy the new product. The experiment requires that at least 90% of the customers who are likely to purchase the new product receive the marketing materials. The company trains a model by using the linear learner algorithm in Amazon SageMaker. The model has a recall score of 80% and a precision of 75%. How should the company retrain the model to meet these requirements? - A.. Set the target_recall hyperparameter to 90%. Set the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision. B.. Set the target_precision hyperparameter to 90%. Set the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall. C.. Use 90% of the historical data for training. Set the number of epochs to 20. D.. Set the normalize_label hyperparameter to true. Set the number of classes to 2.
A - This is simple: assume we identify 100 customers who are likely to purchase. We want the marketing material to reach at least 90 of them, hence as many true positives as possible, which means extra false positives are acceptable; we need to optimize for recall. We need to send the promo to as many potential buyers as possible, so we must reduce false negatives. False positives are fine because the question does not say it is prohibitively expensive to send the package to a few extra customers, but more than 10% false negatives is not acceptable since at least 90% of likely buyers must receive the materials. So we tune the model to target 90% recall and then find the best precision at that recall, which is what the precision_at_target_recall selection criterion does; option A instead pairs target_recall with recall_at_target_precision, but it is still the only option that sets the recall target to 90%, so most voters lean toward A. Per the docs, "If binary_classifier_model_selection_criteria is recall_at_target_precision, then precision is held at this value while recall is maximized." The counterargument for B is that we want to reduce false positives (not give the expensive package to customers who won't buy), but the stated goal is to reach as many potential purchasers as possible (reduce FN), so A. C and D are decoys. For classification problems like this one, the binary_classifier_model_selection_criteria hyperparameter controls how the best model is selected from the validation set; you can choose criteria such as accuracy, precision, recall, F1, or cross-entropy loss. I think it's A - recall. - https://www.examtopics.com/discussions/amazon/view/114485-exam-aws-certified-machine-learning-specialty-topic-1/
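For reference, a hedged sketch of setting these linear learner hyperparameters through the SageMaker Python SDK; the role ARN and S3 paths are placeholders, and the pairing shown (target_recall with precision_at_target_recall) follows the documented convention of holding the targeted metric fixed while optimizing the other:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Built-in linear learner container for the current region
container = image_uris.retrieve("linear-learner", region)

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",   # hypothetical bucket
    sagemaker_session=session,
)

# Hold recall at the 90% target and select the model that maximizes
# precision at that recall.
estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    binary_classifier_model_selection_criteria="precision_at_target_recall",
    target_recall=0.9,
)
```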
242
243 - A wildlife research company has a set of images of lions and cheetahs. The company created a dataset of the images. The company labeled each image with a binary label that indicates whether an image contains a lion or cheetah. The company wants to train a model to identify whether new images contain a lion or cheetah. Which Amazon SageMaker algorithm will meet this requirement? - A.. XGBoost B.. Image Classification - TensorFlow C.. Object Detection - TensorFlow D.. Semantic segmentation - MXNet
B - Image Classification - TensorFlow. We just want to identify which animal is in the image, so object detection and semantic segmentation are out. That leaves A and B, and XGBoost is not suited to raw images, so B. Image Classification - TensorFlow is a built-in algorithm in Amazon SageMaker that can be used to train a model to distinguish images of lions and cheetahs. The algorithm uses a convolutional neural network (CNN) to extract features from the images and then classifies the images based on those features. B, definitely. It's B. B is correct. - https://www.examtopics.com/discussions/amazon/view/114297-exam-aws-certified-machine-learning-specialty-topic-1/
243
244 - A data scientist for a medical diagnostic testing company has developed a machine learning (ML) model to identify patients who have a specific disease. The dataset that the scientist used to train the model is imbalanced. The dataset contains a large number of healthy patients and only a small number of patients who have the disease. The model should consider that patients who are incorrectly identified as positive for the disease will increase costs for the company. Which metric will MOST accurately evaluate the performance of this model? - A.. Recall B.. F1 score C.. Accuracy D.. Precision
D - We want to reduce false positives, so precision. A. NO - Recall = TP / (TP + FN) measures how well we capture the positives. B. NO - F1 score balances recall and precision. C. NO - Accuracy = (TP + TN) / Total, which can be misleading when there is class imbalance. D. YES - Precision = TP / (TP + FP) is penalized by false positives. We want to give more weight to the FP error, so we need to keep an eye on precision; D is correct. F1 score gives the same weight to recall and precision, so it is wrong given the question statement. "The model should consider that patients who are incorrectly identified as positive for the disease will increase costs for the company" means false positives are the costly error, so we must maximize precision. It's D - precision. - https://www.examtopics.com/discussions/amazon/view/114484-exam-aws-certified-machine-learning-specialty-topic-1/
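A quick numeric illustration of why precision is the metric that penalizes false positives, using made-up labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical predictions on an imbalanced test set:
# 1 = has the disease, 0 = healthy
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]  # 2 TP, 1 FN, 2 FP

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2 / 4 = 0.5
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2 / 3 ≈ 0.67
```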
244
245 - A machine learning (ML) specialist is training a linear regression model. The specialist notices that the model is overfitting. The specialist applies an L1 regularization parameter and runs the model again. This change results in all features having zero weights. What should the ML specialist do to improve the model results? - A.. Increase the L1 regularization parameter. Do not change any other training parameters. B.. Decrease the L1 regularization parameter. Do not change any other training parameters. C.. Introduce a large L2 regularization parameter. Do not change the current L1 regularization value. D.. Introduce a small L2 regularization parameter. Do not change the current L1 regularization value.
B - Correct answer B. Why not D? While introducing a small L2 regularization might help in some cases, it doesn't address the main issue, which is that the L1 regularization is too strong; the primary problem needs to be addressed first. Decreasing L1 ensures that not all features are driven to zero. The counterargument for D: combining L1 and L2 regularization (Elastic Net) can handle collinearity and balance sparsity against weight shrinkage, so introducing a small L2 term while keeping L1 could also address overfitting. However, decreasing the L1 parameter directly addresses the reported symptom, whereas adding a small L2 term on top of an L1 penalty that already zeroes every weight does not change that outcome. This looks like a problem of too much regularization, so start by decreasing L1: reducing the penalty on the coefficients allows some features to contribute to the model without being driven to zero, which helps balance overfitting and underfitting. B is correct because L1 is so strong that it is eliminating all variables; A and C would make this worse, and D would not change the current result. If applying an L1 regularization parameter produced all-zero weights, the parameter value was too high and caused too much shrinkage, giving a model that underfits and performs poorly. L1 is too strong; you can improve the fit by decreasing its value. - https://www.examtopics.com/discussions/amazon/view/116362-exam-aws-certified-machine-learning-specialty-topic-1/
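A small illustration of the symptom and the fix described above, using scikit-learn's Lasso (L1) on synthetic data; the alpha values are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# An overly strong L1 penalty drives every coefficient to zero,
# which is the symptom described in the question.
strong_l1 = Lasso(alpha=1000.0).fit(X, y)
print(np.count_nonzero(strong_l1.coef_))  # 0

# Decreasing the L1 parameter lets informative features keep nonzero weights.
weaker_l1 = Lasso(alpha=1.0).fit(X, y)
print(np.count_nonzero(weaker_l1.coef_))  # > 0
```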
245
246 - A machine learning (ML) engineer is integrating a production model with a customer metadata repository for real-time inference. The repository is hosted in Amazon SageMaker Feature Store. The engineer wants to retrieve only the latest version of the customer metadata record for a single customer at a time. Which solution will meet these requirements? - A.. Use the SageMaker Feature Store BatchGetRecord API with the record identifier. Filter to find the latest record. B.. Create an Amazon Athena query to retrieve the data from the feature table. C.. Create an Amazon Athena query to retrieve the data from the feature table. Use the write_time value to find the latest record. D.. Use the SageMaker Feature Store GetRecord API with the record identifier.
D - Use the GetRecord API of Feature Store to retrieve the record. Why not A? BatchGetRecord (https://docs.aws.amazon.com/ko_kr/sagemaker/latest/APIReference/API_feature_store_BatchGetRecord.html) retrieves a batch of records, while GetRecord returns one record per API call from a single feature group, and it returns the latest version of the record by default. The record identifier is a unique value that identifies a record within a feature group. This avoids the need for additional queries or filters to find the latest record, which rules out A, B, and C. https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_GetRecord.html - https://www.examtopics.com/discussions/amazon/view/112456-exam-aws-certified-machine-learning-specialty-topic-1/
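A minimal sketch of the GetRecord call, assuming a hypothetical feature group named customer-metadata and a hypothetical record identifier:

```python
import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Returns the latest version of the record for this identifier by default.
response = featurestore_runtime.get_record(
    FeatureGroupName="customer-metadata",           # hypothetical feature group
    RecordIdentifierValueAsString="customer-1234",  # hypothetical customer ID
)

features = {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}
print(features)
```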
246
247 - A company’s data scientist has trained a new machine learning model that performs better on test data than the company’s existing model performs in the production environment. The data scientist wants to replace the existing model that runs on an Amazon SageMaker endpoint in the production environment. However, the company is concerned that the new model might not work well on the production environment data. The data scientist needs to perform A/B testing in the production environment to evaluate whether the new model performs well on production environment data. Which combination of steps must the data scientist take to perform the A/B testing? (Choose two.) - A.. Create a new endpoint configuration that includes a production variant for each of the two models. B.. Create a new endpoint configuration that includes two target variants that point to different endpoints. C.. Deploy the new model to the existing endpoint. D.. Update the existing endpoint to activate the new model. E.. Update the existing endpoint to use the new endpoint configuration.
AE - Create a new endpoint configuration (option A) that includes a production variant for each of the two models (the existing model and the new model). This allows traffic to be split between the two models, enabling comparative performance analysis. Then update the existing endpoint (option E) to use the newly created endpoint configuration, so both models receive traffic according to the specified distribution, which enables A/B testing on production data. A. YES - we create a new configuration with two production variants (step 1). B. NO - production variants belong to a single endpoint configuration; they do not point to different endpoints. C. NO - that would defeat A. D. NO - that would defeat A. E. YES - we update the existing endpoint to use the new configuration (step 2). A and E are supported by the docs listed below; E is a better choice than C because it is more specific: the endpoint must be updated to point to the new endpoint configuration. https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-update.html https://aws.amazon.com/blogs/machine-learning/a-b-testing-ml-models-in-production-using-amazon-sagemaker/ - https://www.examtopics.com/discussions/amazon/view/112457-exam-aws-certified-machine-learning-specialty-topic-1/
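A hedged sketch of the two steps (options A and E) using boto3; the model names, endpoint name, instance types, and traffic split are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# Step 1 (option A): one endpoint configuration with a production variant per model.
sm.create_endpoint_config(
    EndpointConfigName="churn-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "existing-model",
            "ModelName": "churn-model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,  # 90% of traffic stays on the current model
        },
        {
            "VariantName": "new-model",
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,  # 10% of traffic goes to the candidate
        },
    ],
)

# Step 2 (option E): point the existing endpoint at the new configuration.
sm.update_endpoint(
    EndpointName="churn-endpoint",
    EndpointConfigName="churn-ab-test-config",
)
```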
247
248 - A data scientist is working on a forecast problem by using a dataset that consists of .csv files that are stored in Amazon S3. The files contain a timestamp variable in the following format: March 1st, 2020, 08:14pm - There is a hypothesis about seasonal differences in the dependent variable. This number could be higher or lower for weekdays because some days and hours present varying values, so the day of the week, month, or hour could be an important factor. As a result, the data scientist needs to transform the timestamp into weekdays, month, and day as three separate variables to conduct an analysis. Which solution requires the LEAST operational overhead to create a new dataset with the added features? - A.. Create an Amazon EMR cluster. Develop PySpark code that can read the timestamp variable as a string, transform and create the new variables, and save the dataset as a new file in Amazon S3. B.. Create a processing job in Amazon SageMaker. Develop Python code that can read the timestamp variable as a string, transform and create the new variables, and save the dataset as a new file in Amazon S3. C.. Create a new flow in Amazon SageMaker Data Wrangler. Import the S3 file, use the Featurize date/time transform to generate the new variables, and save the dataset as a new file in Amazon S3. D.. Create an AWS Glue job. Develop code that can read the timestamp variable as a string, transform and create the new variables, and save the dataset as a new file in Amazon S3.
C - Data Wrangler can transform the datetime into the desired features. Amazon SageMaker Data Wrangler is a visual data preparation tool that makes it easy to clean, transform, and featurize data for machine learning. It provides a variety of built-in transformations, including the Featurize date/time transform, which can generate the new variables from the timestamp. The other options require the data scientist to develop code, which is more time-consuming and error-prone. Amazon EMR and AWS Glue can run PySpark or Python code, but they require creating and managing a cluster or job, which adds operational overhead, and a SageMaker Processing job also requires writing code and does not provide the same level of visual, low-code tooling. C is correct because Data Wrangler offers a low-code way to perform this task, and since we want the least operational overhead it is the solution; D is also possible but involves developing code, making it more complex than C; A requires standing up a new cluster; and B falls into the same scenario as D (developing code). https://aws.amazon.com/blogs/machine-learning/prepare-time-series-data-with-amazon-sagemaker-data-wrangler/ "Featurize datetime time series transformation to add the month, day of the month, day of the year, week of the year, and quarter features to our dataset." - https://www.examtopics.com/discussions/amazon/view/112600-exam-aws-certified-machine-learning-specialty-topic-1/
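For intuition, roughly what the Featurize date/time transform produces, reproduced here in pandas; the column name and the ordinal-suffix handling ("1st" → "1") are assumptions about the raw data:

```python
import pandas as pd

# Sample rows in the format described in the question.
df = pd.DataFrame({"timestamp": ["March 1st, 2020, 08:14pm", "April 22nd, 2020, 09:30am"]})

# Strip ordinal suffixes so the string matches a parseable datetime format.
cleaned = df["timestamp"].str.replace(r"(\d+)(st|nd|rd|th)", r"\1", regex=True)
parsed = pd.to_datetime(cleaned, format="%B %d, %Y, %I:%M%p")

df["weekday"] = parsed.dt.day_name()
df["month"] = parsed.dt.month
df["day"] = parsed.dt.day
print(df)
```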
248
249 - A manufacturing company has a production line with sensors that collect hundreds of quality metrics. The company has stored sensor data and manual inspection results in a data lake for several months. To automate quality control, the machine learning team must build an automated mechanism that determines whether the produced goods are good quality, replacement market quality, or scrap quality based on the manual inspection results. Which modeling approach will deliver the MOST accurate prediction of product quality? - A.. Amazon SageMaker DeepAR forecasting algorithm B.. Amazon SageMaker XGBoost algorithm C.. Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm D.. A convolutional neural network (CNN) and ResNet
B - We have a supervised classification problem with numerical features; clearly XGBoost is the best candidate. A. NO - it is not a forecasting problem. B. YES - it is a supervised classification problem on tabular data. C. NO - LDA is for topic modeling. D. NO - ResNet (a type of CNN) is for image recognition. It depends a bit on the sensor data: if it were images, D would be the choice, but since it is tabular metric data, B. We have a tabular multiclass classification problem (good, replacement market, scrap), and B is the correct solution for it. Definitely B. It's B. - https://www.examtopics.com/discussions/amazon/view/113070-exam-aws-certified-machine-learning-specialty-topic-1/
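A minimal sketch of training the built-in XGBoost algorithm for this three-class problem with the SageMaker Python SDK; the role ARN, S3 paths, and hyperparameter values are placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

xgb = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://quality-models/output",              # hypothetical bucket
    sagemaker_session=session,
)

# Three classes: good, replacement-market, scrap
xgb.set_hyperparameters(objective="multi:softmax", num_class=3, num_round=200)

xgb.fit({
    "train": TrainingInput("s3://quality-data/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://quality-data/validation.csv", content_type="text/csv"),
})
```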
249
250 - A healthcare company wants to create a machine learning (ML) model to predict patient outcomes. A data science team developed an ML model by using a custom ML library. The company wants to use Amazon SageMaker to train this model. The data science team creates a custom SageMaker image to train the model. When the team tries to launch the custom image in SageMaker Studio, the data scientists encounter an error within the application. Which service can the data scientists use to access the logs for this error? - A.. Amazon S3 B.. Amazon Elastic Block Store (Amazon EBS) C.. AWS CloudTrail D.. Amazon CloudWatch
D - CloudWatch is the obvious choice for diagnostic logs. A. NO - Amazon S3 is for SageMaker input and output data. B. NO - EBS is a block storage service. C. NO - CloudTrail is for API access and audit logging. D. YES - Amazon CloudWatch captures the error logs. Agree, D for logging: a service for logging (CloudWatch), a service for auditing and access (CloudTrail), and services for storage (S3 and EBS). D is correct; all logs are captured in CloudWatch by default and can be exported to S3 if needed. https://repost.aws/knowledge-center/sagemaker-studio-custom-container Amazon CloudWatch is a monitoring and logging service provided by AWS. It collects and stores log files and metrics from various AWS services, including Amazon SageMaker, giving you visibility into your applications and infrastructure through a unified view of logs, metrics, and events. - https://www.examtopics.com/discussions/amazon/view/111690-exam-aws-certified-machine-learning-specialty-topic-1/
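A small sketch of pulling the error messages with boto3; the log group name shown is assumed to be the SageMaker Studio default, so verify it in your account:

```python
import boto3

logs = boto3.client("logs")

# SageMaker Studio app logs land in CloudWatch Logs; the log group name
# below is an assumption about the default naming and may differ.
response = logs.filter_log_events(
    logGroupName="/aws/sagemaker/studio",
    filterPattern="ERROR",
    limit=50,
)

for event in response["events"]:
    print(event["message"])
```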
250
251 - A data scientist wants to build a financial trading bot to automate investment decisions. The financial bot should recommend the quantity and price of an asset to buy or sell to maximize long-term profit. The data scientist will continuously stream financial transactions to the bot for training purposes. The data scientist must select the appropriate machine learning (ML) algorithm to develop the financial trading bot. Which type of ML algorithm will meet these requirements? - A.. Supervised learning B.. Unsupervised learning C.. Semi-supervised learning D.. Reinforcement learning
D - Reinforcement learning allows the bot to continuously learn from its own experiences, adapt to changing market conditions, and optimize its decision-making process over time. It is well suited for dynamic and uncertain environments like financial markets, where the optimal trading strategies may vary depending on various factors and trends. At first glance this is a problem where we have a long-term goal and an agent (the bot). A. NO - there is no labeled data as input. B. NO - it is not a clustering problem. C. NO - there is no labeled data as input. D. YES - the bot can learn over time which decisions are good and which are bad. - https://www.examtopics.com/discussions/amazon/view/111704-exam-aws-certified-machine-learning-specialty-topic-1/
251
252 - A manufacturing company wants to create a machine learning (ML) model to predict when equipment is likely to fail. A data science team already constructed a deep learning model by using TensorFlow and a custom Python script in a local environment. The company wants to use Amazon SageMaker to train the model. Which TensorFlow estimator configuration will train the model MOST cost-effectively? - A.. Turn on SageMaker Training Compiler by adding compiler_config=TrainingCompilerConfig() as a parameter. Pass the script to the estimator in the call to the TensorFlow fit() method. B.. Turn on SageMaker Training Compiler by adding compiler_config=TrainingCompilerConfig() as a parameter. Turn on managed spot training by setting the use_spot_instances parameter to True. Pass the script to the estimator in the call to the TensorFlow fit() method. C.. Adjust the training script to use distributed data parallelism. Specify appropriate values for the distribution parameter. Pass the script to the estimator in the call to the TensorFlow fit() method. D.. Turn on SageMaker Training Compiler by adding compiler_config=TrainingCompilerConfig() as a parameter. Set the MaxWaitTimeInSeconds parameter to be equal to the MaxRuntimeInSeconds parameter. Pass the script to the estimator in the call to the TensorFlow fit() method.
B - A. NO - it should also use Spot Instances. B. YES - makes use of both the Training Compiler and Spot Instances. C. NO - distributed data parallelism minimizes training time, not cost. D. NO - MaxWaitTimeInSeconds applies to managed spot training (it must be at least MaxRuntimeInSeconds); setting it without enabling Spot Instances does not reduce cost. Managed spot training: B is the only option that mentions Spot Instances, and the question asks for the most cost-effective configuration, so B is the only viable option. B. Turn on SageMaker Training Compiler by adding compiler_config=TrainingCompilerConfig() as a parameter. Turn on managed spot training by setting the use_spot_instances parameter to True. Pass the script to the estimator in the call to the TensorFlow fit() method. - https://www.examtopics.com/discussions/amazon/view/111706-exam-aws-certified-machine-learning-specialty-topic-1/
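A hedged sketch of option B with the SageMaker Python SDK; the entry point, role ARN, instance type, framework version, and data location are placeholders:

```python
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

estimator = TensorFlow(
    entry_point="train.py",                               # the team's existing script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
    compiler_config=TrainingCompilerConfig(),  # SageMaker Training Compiler
    use_spot_instances=True,                   # managed spot training
    max_run=3600,                              # max training time, in seconds
    max_wait=7200,                             # must be >= max_run when using spot
)

estimator.fit("s3://equipment-data/train")  # hypothetical training data location
```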
252
253 - An automotive company uses computer vision in its autonomous cars. The company trained its object detection models successfully by using transfer learning from a convolutional neural network (CNN). The company trained the models by using PyTorch through the Amazon SageMaker SDK. The vehicles have limited hardware and compute power. The company wants to optimize the model to reduce memory, battery, and hardware consumption without a significant sacrifice in accuracy. Which solution will improve the computational efficiency of the models? - A.. Use Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set new weights based on the pruned set of filters. Run a new training job with the pruned model. B.. Use Amazon SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labelling workflows. Run a new training job that uses the new labeled data with previous training data. C.. Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model. D.. Use Amazon SageMaker Model Monitor to gain visibility into the ModelLatency metric and OverheadLatency metric of the model after the company deploys the model. Increase the model learning rate. Run a new training job.
C - A. NO - CloudWatch cannot capture filter ranks or training tensors. B. NO - Ground Truth can help improve model performance with more labeled data; it does not reduce inference cost. C. YES - filter pruning based on ranks is specific to CNNs and is supported by SageMaker Debugger. D. NO - the learning rate affects training, not inference efficiency. https://aws.amazon.com/ko/blogs/machine-learning/pruning-machine-learning-models-with-amazon-sagemaker-debugger-and-amazon-sagemaker-experiments/ Gain insight into the training tensors with Debugger and prune. B uses Ground Truth for something it is not meant for, so it is wrong; D is outside the scope of what the question asks; A has the same problem as B. SageMaker Model Monitor is for model drift, not for utilization metrics, so SageMaker Debugger is the best choice; it's C. C. Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model. - https://www.examtopics.com/discussions/amazon/view/111710-exam-aws-certified-machine-learning-specialty-topic-1/
253
254 - A data scientist wants to improve the fit of a machine learning (ML) model that predicts house prices. The data scientist makes a first attempt to fit the model, but the fitted model has poor accuracy on both the training dataset and the test dataset. Which steps must the data scientist take to improve model accuracy? (Choose three.) - A.. Increase the amount of regularization that the model uses. B.. Decrease the amount of regularization that the model uses. C.. Increase the number of training examples that the model uses. D.. Increase the number of test examples that the model uses. E.. Increase the number of model features that the model uses. F.. Decrease the number of model features that the model uses.
BCE - B. Decrease the amount of regularization: regularization prevents overfitting, but since the model has poor accuracy on both the training and test datasets (underfitting), reducing regularization helps the model capture the underlying patterns. C. Increase the number of training examples: a larger and more diverse dataset helps the model generalize and make more accurate predictions. E. Increase the number of model features: adding more relevant features enhances the model's ability to capture important patterns and relationships in the data. Since the model is underfitting, reducing regularization lets the model use the features more fully, more training examples mean more to learn from, and more features help the model establish the patterns better. The question never uses the words overfitting or underfitting; it only says both training and test accuracy are poor, which is the definition of underfitting, so the remedies are the ones that increase model capacity: B, C, and E. The other options address overfitting or are irrelevant: A increases regularization, F reduces features, and D (more test examples) does not influence model performance at all. Some argued for ACE on the grounds that A fixes overfitting, but A would make underfitting worse; decrease regularization for underfitting and increase it for overfitting, so the answer is BCE. - https://www.examtopics.com/discussions/amazon/view/111711-exam-aws-certified-machine-learning-specialty-topic-1/
254
255 - A car company is developing a machine learning solution to detect whether a car is present in an image. The image dataset consists of one million images. Each image in the dataset is 200 pixels in height by 200 pixels in width. Each image is labeled as either having a car or not having a car. Which architecture is MOST likely to produce a model that detects whether a car is present in an image with the highest accuracy? - A.. Use a deep convolutional neural network (CNN) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car. B.. Use a deep convolutional neural network (CNN) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car. C.. Use a deep multilayer perceptron (MLP) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car. D.. Use a deep multilayer perceptron (MLP) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car.
B - A softmax output produces a probability distribution over the classes. Answer B: use a deep convolutional neural network (CNN) classifier with the images as input and include a softmax output layer that outputs the probability that an image contains a car. This leverages the feature extraction capabilities of CNNs and uses an output layer suited to classification. Some argued for A on the basis that a "linear output layer" could be read as a linear combination followed by a sigmoid for binary classification, but as written a linear output layer does not produce a probability, so A is wrong; see https://ml4.me/a-deep-dive-into-convolutional-neural-network-architectures-with-tensorflow-in-sagemaker/ A. NO - a linear output does not give a probability. B. YES - CNN is the right architecture and softmax gives class probabilities. C. NO - MLP is not well suited to images. D. NO - MLP is not well suited to images. Softmax is typically used to choose among multiple classes; for a single binary output a sigmoid behaves the same way, but the answer is still B because a linear layer is used for regression, not binary classification, and CNNs outperform MLPs on image classification. It is a wording puzzle: A might seem right because it is car vs. no car, but the option mentions probability, so softmax is more appropriate even though this is not multiclass. For image classification we must use a CNN, so we discard C and D; A is wrong because a linear output layer does not generate a probability, softmax does; B is correct. Both MLP and CNN can process images, but a CNN is more accurate for complex images. - https://www.examtopics.com/discussions/amazon/view/112460-exam-aws-certified-machine-learning-specialty-topic-1/
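For intuition, a minimal Keras sketch of the architecture option B describes (CNN plus softmax output); the layer sizes are illustrative only:

```python
import tensorflow as tf

# Minimal CNN for 200x200 RGB images with a softmax output over the two
# classes ("car", "no car").
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(200, 200, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # probability per class
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```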
255
256 - A company is creating an application to identify, count, and classify animal images that are uploaded to the company’s website. The company is using the Amazon SageMaker image classification algorithm with an ImageNetV2 convolutional neural network (CNN). The solution works well for most animal images but does not recognize many animal species that are less common. The company obtains 10,000 labeled images of less common animal species and stores the images in Amazon S3. A machine learning (ML) engineer needs to incorporate the images into the model by using Pipe mode in SageMaker. Which combination of steps should the ML engineer take to train the model? (Choose two.) - A.. Use a ResNet model. Initiate full training mode by initializing the network with random weights. B.. Use an Inception model that is available with the SageMaker image classification algorithm. C.. Create a .lst file that contains a list of image files and corresponding class labels. Upload the .lst file to Amazon S3. D.. Initiate transfer learning. Train the model by using the images of less common species. E.. Use an augmented manifest file in JSON Lines format.
CD - A. NO - no need for full training from random weights; transfer learning is enough. B. NO - the company is already using the SageMaker image classification algorithm, and switching to an Inception model does not give transfer learning on the new data. C. YES - a .lst file is how image/label metadata is given to the SageMaker image classification algorithm (https://medium.com/@texasdave2/itty-bitty-lst-file-format-converter-for-machine-learning-image-classification-on-aws-sagemaker-b3828c7ba9cc). D. YES - transfer learning with the images of less common species is the obvious step. E. NO - there is no extra metadata we need to provide (https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html). The .lst file is the standard way to pass training-file metadata to the image classification algorithm: it lists image files with their corresponding class label indices and must be uploaded to Amazon S3 for the train_lst and validation_lst channels (https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-inputoutput). Transfer learning lets you leverage the pretrained weights and fine-tune them with the 10,000 new labeled images so the model recognizes less common animal species, which is far more efficient than full training. The counterargument for DE: the question highlights Pipe mode, and per https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html the augmented manifest format enables training in Pipe mode using image files without creating RecordIO files, and it is an alternative to preprocessing when you already have labeled data. So some voted D and E (augmented manifest in JSON Lines), while others kept C and D (.lst file plus transfer learning). - https://www.examtopics.com/discussions/amazon/view/112462-exam-aws-certified-machine-learning-specialty-topic-1/
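For reference, a hedged sketch of a transfer-learning training job with the built-in image classification algorithm in Pipe mode; the role ARN, S3 output path, and hyperparameter values are placeholders, and the train/validation/train_lst/validation_lst channels would then be passed to fit():

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("image-classification", session.boto_region_name)

ic = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://animal-models/output",               # hypothetical bucket
    input_mode="Pipe",                                      # Pipe mode, as required
    sagemaker_session=session,
)

# Option D: transfer learning instead of training from random weights.
ic.set_hyperparameters(
    use_pretrained_model=1,     # fine-tune a pretrained network
    num_classes=257,            # hypothetical total class count
    num_training_samples=10000,
    epochs=10,
)
```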
256
257 - A music streaming company is building a pipeline to extract features. The company wants to store the features for offline model training and online inference. The company wants to track feature history and to give the company’s data science teams access to the features. Which solution will meet these requirements with the MOST operational efficiency? - A.. Use Amazon SageMaker Feature Store to store features for model training and inference. Create an online store for online inference. Create an offline store for model training. Create an IAM role for data scientists to access and search through feature groups. B.. Use Amazon SageMaker Feature Store to store features for model training and inference. Create an online store for both online inference and model training. Create an IAM role for data scientists to access and search through feature groups. C.. Create one Amazon S3 bucket to store online inference features. Create a second S3 bucket to store offline model training features. Turn on versioning for the S3 buckets and use tags to specify which tags are for online inference features and which are for offline model training features. Use Amazon Athena to query the S3 bucket for online inference. Connect the S3 bucket for offline model training to a SageMaker training job. Create an IAM policy that allows data scientists to access both buckets. D.. Create two separate Amazon DynamoDB tables to store online inference features and offline model training features. Use time-based versioning on both tables. Query the DynamoDB table for online inference. Move the data from DynamoDB to Amazon S3 when a new SageMaker training job is launched. Create an IAM policy that allows data scientists to access both tables.
A - The answer is A: "SageMaker Feature Store consists of an online and an offline mode for managing features. The online store is used for low-latency real-time inference use cases. The offline store is primarily used for batch predictions and model training." https://aws.amazon.com/blogs/machine-learning/speed-ml-development-using-sagemaker-feature-store-and-apache-iceberg-offline-store-compaction/ Some argued for B because Feature Store acts as a centralized repository, automatically tracks feature lineage and versions, and IAM roles can grant data scientists access to search feature groups; but B keeps model training on the online store, which is designed for low-latency lookups rather than training workloads. A meets all requirements and looks like the easiest to set up. A. YES - the online store is fast for inference and the offline store is cheaper for batch training. B. NO - using the online store for offline training would be too expensive. C. NO - we want to use Feature Store. D. NO - we want to use Feature Store. Amazon SageMaker Feature Store is a managed service that makes it easy to store and manage features for machine learning models, supports both online inference and offline model training, and tracks feature history. Creating separate online and offline stores lets the company optimize storage and performance for each use case: the online store is highly available and performant, while the offline store is cost-effective and scalable. - https://www.examtopics.com/discussions/amazon/view/112605-exam-aws-certified-machine-learning-specialty-topic-1/
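A hedged sketch of option A's feature group, created with boto3 with both an online and an offline store; the feature names, S3 location, and role ARN are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# One feature group with an online store (low-latency inference) and an
# offline store in S3 (model training). All names and ARNs are hypothetical.
sm.create_feature_group(
    FeatureGroupName="track-features",
    RecordIdentifierFeatureName="track_id",
    EventTimeFeatureName="event_time",
    FeatureDefinitions=[
        {"FeatureName": "track_id", "FeatureType": "String"},
        {"FeatureName": "event_time", "FeatureType": "String"},
        {"FeatureName": "play_count_7d", "FeatureType": "Integral"},
    ],
    OnlineStoreConfig={"EnableOnlineStore": True},
    OfflineStoreConfig={
        "S3StorageConfig": {"S3Uri": "s3://feature-store-offline/track-features"}
    },
    RoleArn="arn:aws:iam::123456789012:role/FeatureStoreRole",
)
```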
257
258 - A beauty supply store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to generate a report of hourly visitors from the recordings. The report should group visitors by hair style and hair color. Which solution will meet these requirements with the LEAST amount of effort? - A.. Use an object detection algorithm to identify a visitor’s hair in video frames. Pass the identified hair to an ResNet-50 algorithm to determine hair style and hair color. B.. Use an object detection algorithm to identify a visitor’s hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair color. C.. Use a semantic segmentation algorithm to identify a visitor’s hair in video frames. Pass the identified hair to an ResNet-50 algorithm to determine hair style and hair color. D.. Use a semantic segmentation algorithm to identify a visitor’s hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair.
A - The chosen answer is A, but this one is heavily debated. The case for C: the customer needs hair style, and semantic segmentation classifies the frames pixel by pixel, so it can precisely isolate a visitor's hair by labeling each pixel as hair or not, whereas object detection only draws bounding boxes, which do not isolate hair well when it has no clear boundary; ResNet-50 then classifies the segmented hair region by style and color. The SageMaker semantic segmentation algorithm itself uses a pretrained ResNet-50 or ResNet-101 backbone (https://aws.amazon.com/cn/blogs/machine-learning/semantic-segmentation-algorithm-is-now-available-in-amazon-sagemaker/). The case for A (least effort): semantic segmentation assumes the object of interest is present and requires labeling every pixel, which is far more labeling and training effort than object detection; since the input is video, not every frame contains hair, and object detection is generally easier to implement and fine-tune while still locating and extracting the hair region well enough for a downstream ResNet-50 to determine style and color. B and D are wrong in any case because they pass image data to XGBoost, a tabular algorithm. Since the question asks for the least amount of effort, A (object detection plus ResNet-50) is the pick here, although many voters preferred C. - https://www.examtopics.com/discussions/amazon/view/112719-exam-aws-certified-machine-learning-specialty-topic-1/
258
259 - A financial services company wants to automate its loan approval process by building a machine learning (ML) model. Each loan data point contains credit history from a third-party data source and demographic information about the customer. Each loan approval prediction must come with a report that contains an explanation for why the customer was approved for a loan or was denied for a loan. The company will use Amazon SageMaker to build the model. Which solution will meet these requirements with the LEAST development effort? - A.. Use SageMaker Model Debugger to automatically debug the predictions, generate the explanation, and attach the explanation report. B.. Use AWS Lambda to provide feature importance and partial dependence plots. Use the plots to generate and attach the explanation report. C.. Use SageMaker Clarify to generate the explanation report. Attach the report to the predicted results. D.. Use custom Amazon CloudWatch metrics to generate the explanation report. Attach the report to the predicted results.
C - SageMaker Clarify can do the work. The model is already trained, so C; Model Debugger is for debugging training, not for explaining production predictions. The solution that meets these requirements with the least development effort is C: use SageMaker Clarify to generate the explanation report and attach it to the predicted results. SageMaker Clarify provides model transparency and explainability; it can produce feature importance scores, which indicate how much each feature contributes to a prediction, and SHAP values, which measure how each feature moves a prediction relative to the average prediction. The company can use these metrics to generate and attach an explanation report stating why a customer was approved or denied a loan. C - SageMaker Clarify can explain the "why," so it is the better option. - https://www.examtopics.com/discussions/amazon/view/112463-exam-aws-certified-machine-learning-specialty-topic-1/
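For reference, a hedged sketch of running a SageMaker Clarify explainability job with SHAP; the model name, role ARN, S3 paths, headers, and baseline values are placeholders:

```python
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

model_config = clarify.ModelConfig(
    model_name="loan-approval-model",  # hypothetical deployed SageMaker model
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# SHAP baseline: one representative record (feature columns only, no label).
shap_config = clarify.SHAPConfig(
    baseline=[[700, 50000, 35]],
    num_samples=100,
    agg_method="mean_abs",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://loan-data/validation.csv",  # hypothetical path
    s3_output_path="s3://loan-data/clarify-output/",     # hypothetical path
    label="approved",
    headers=["credit_score", "income", "age", "approved"],
    dataset_type="text/csv",
)

# Produces a SHAP-based explainability report that can be attached to predictions.
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```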
259
260 - A financial company sends special offers to customers through weekly email campaigns. A bulk email marketing system takes the list of email addresses as an input and sends the marketing campaign messages in batches. Few customers use the offers from the campaign messages. The company does not want to send irrelevant offers to customers. A machine learning (ML) team at the company is using Amazon SageMaker to build a model to recommend specific offers to each customer based on the customer's profile and the offers that the customer has accepted in the past. Which solution will meet these requirements with the MOST operational efficiency? - A.. Use the Factorization Machines algorithm to build a model that can generate personalized offer recommendations for customers. Deploy a SageMaker endpoint to generate offer recommendations. Feed the offer recommendations into the bulk email marketing system. B.. Use the Neural Collaborative Filtering algorithm to build a model that can generate personalized offer recommendations for customers. Deploy a SageMaker endpoint to generate offer recommendations. Feed the offer recommendations into the bulk email marketing system. C.. Use the Neural Collaborative Filtering algorithm to build a model that can generate personalized offer recommendations for customers. Deploy a SageMaker batch inference job to generate offer recommendations. Feed the offer recommendations into the bulk email marketing system. D.. Use the Factorization Machines algorithm to build a model that can generate personalized offer recommendations for customers. Deploy a SageMaker batch inference job to generate offer recommendations. Feed the offer recommendations into the bulk email marketing system.
D - Votes are split between C and D. Both sides agree that a SageMaker batch inference (batch transform) job is more operationally efficient than a real-time endpoint for a weekly bulk email campaign, so the debate is about the algorithm. The case for D: Factorization Machines is a SageMaker built-in algorithm that is a good choice for high-dimensional sparse datasets such as click prediction and item recommendation (https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines.html), and some argue that collaborative filtering leans on other users' preferences, which could surface irrelevant offers. The case for C: Neural Collaborative Filtering is designed for recommendations based on each customer's own past interactions (the offers they accepted), whereas the SageMaker Factorization Machines implementation supports only regression and binary classification. Either way, generate the recommendations in batch and feed the output file into the bulk email marketing system. - https://www.examtopics.com/discussions/amazon/view/114475-exam-aws-certified-machine-learning-specialty-topic-1/
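Illustrative sketch (not part of the original discussion): a weekly SageMaker batch transform job for whichever model is chosen. The model name, S3 paths, and instance type are placeholder assumptions.

import sagemaker
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="offer-recommender",                               # assumed SageMaker model
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/offer-recommendations/",     # assumed output prefix
    sagemaker_session=sagemaker.Session(),
)

# Score the full customer list once per weekly campaign; no always-on endpoint is needed.
transformer.transform(
    data="s3://example-bucket/customer-profiles/weekly.csv",      # assumed input file
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
# The output files in S3 are then fed into the bulk email marketing system.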
260
261 - A social media company wants to develop a machine learning (ML) model to detect inappropriate or offensive content in images. The company has collected a large dataset of labeled images and plans to use the built-in Amazon SageMaker image classification algorithm to train the model. The company also intends to use SageMaker pipe mode to speed up the training. The company splits the dataset into training, validation, and testing datasets. The company stores the training and validation images in folders that are named Training and Validation, respectively. The folders contain subfolders that correspond to the names of the dataset classes. The company resizes the images to the same size and generates two input manifest files named training.lst and validation.lst, for the training dataset and the validation dataset, respectively. Finally, the company creates two separate Amazon S3 buckets for uploads of the training dataset and the validation dataset. Which additional data preparation steps should the company take before uploading the files to Amazon S3? - A.. Generate two Apache Parquet files, training.parquet and validation.parquet, by reading the images into a Pandas data frame and storing the data frame as a Parquet file. Upload the Parquet files to the training S3 bucket. B.. Compress the training and validation directories by using the Snappy compression library. Upload the manifest and compressed files to the training S3 bucket. C.. Compress the training and validation directories by using the gzip compression library. Upload the manifest and compressed files to the training S3 bucket. D.. Generate two RecordIO files, training.rec and validation.rec, from the manifest files by using the im2rec Apache MXNet utility tool. Upload the RecordIO files to the training S3 bucket.
D - The SageMaker built-in image classification algorithm, when used with pipe mode, expects training data in RecordIO format. RecordIO is a binary format that packs images and labels compactly, which makes it efficient to stream data during training and reduces disk I/O. The im2rec utility that ships with Apache MXNet can generate RecordIO files from the existing manifest files (training.lst and validation.lst), producing training.rec and validation.rec, which are then uploaded to the training S3 bucket. Options A, B, and C (Parquet, Snappy, gzip) do not produce an input format that the algorithm accepts in pipe mode. References: https://aws.amazon.com/blogs/machine-learning/classify-your-own-images-using-amazon-sagemaker/ and https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.html - https://www.examtopics.com/discussions/amazon/view/113104-exam-aws-certified-machine-learning-specialty-topic-1/
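Illustrative sketch (not part of the original discussion): invoking MXNet's im2rec.py from Python to pack the existing manifests into RecordIO files. The local paths and the location of im2rec.py are placeholder assumptions.

import subprocess

for prefix, root in [("training", "Training"), ("validation", "Validation")]:
    # im2rec.py reads <prefix>.lst and packs the referenced images into <prefix>.rec / <prefix>.idx
    subprocess.run(
        [
            "python", "im2rec.py",      # ships with Apache MXNet (tools/im2rec.py); path assumed
            "--num-thread", "4",
            "--resize", "224",          # images are already 224x224, so this is only a safeguard
            prefix,                     # prefix of the manifest (training.lst / validation.lst)
            root,                       # directory containing the class subfolders
        ],
        check=True,
    )
# Upload training.rec and validation.rec to the training S3 bucket afterwards.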
261
262 - A media company wants to create a solution that identifies celebrities in pictures that users upload. The company also wants to identify the IP address and the timestamp details from the users so the company can prevent users from uploading pictures from unauthorized locations. Which solution will meet these requirements with LEAST development effort? - A.. Use AWS Panorama to identify celebrities in the pictures. Use AWS CloudTrail to capture IP address and timestamp details. B.. Use AWS Panorama to identify celebrities in the pictures. Make calls to the AWS Panorama Device SDK to capture IP address and timestamp details. C.. Use Amazon Rekognition to identify celebrities in the pictures. Use AWS CloudTrail to capture IP address and timestamp details. D.. Use Amazon Rekognition to identify celebrities in the pictures. Use the text detection feature to capture IP address and timestamp details.
C - Most voters choose C: Amazon Rekognition's celebrity recognition feature identifies celebrities with no model training, and AWS CloudTrail can capture the IP address and timestamp of the API calls. AWS Panorama (A, B) is for computer vision on edge devices, and Rekognition text detection (D) extracts text that appears inside an image, so it cannot capture the uploader's IP address. A strong dissent argues for B on the grounds that CloudTrail records API calls made by AWS identities within an AWS account, not by arbitrary end users submitting pictures through an app, so CloudTrail may not capture the users' details at all. - https://www.examtopics.com/discussions/amazon/view/114477-exam-aws-certified-machine-learning-specialty-topic-1/
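Illustrative sketch (not part of the original discussion): calling Rekognition celebrity recognition on an uploaded image with boto3. The bucket and object key are placeholder assumptions; capturing the uploader's IP address and timestamp would happen in the upload path (for example, in the application or API layer), not in this call.

import boto3

rekognition = boto3.client("rekognition")

response = rekognition.recognize_celebrities(
    Image={"S3Object": {"Bucket": "example-uploads-bucket", "Name": "user-photo.jpg"}}
)

for celebrity in response["CelebrityFaces"]:
    print(celebrity["Name"], celebrity["MatchConfidence"])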
262
263 - A pharmaceutical company performs periodic audits of clinical trial sites to quickly resolve critical findings. The company stores audit documents in text format. Auditors have requested help from a data science team to quickly analyze the documents. The auditors need to discover the 10 main topics within the documents to prioritize and distribute the review work among the auditing team members. Documents that describe adverse events must receive the highest priority. A data scientist will use statistical modeling to discover abstract topics and to provide a list of the top words for each category to help the auditors assess the relevance of the topic. Which algorithms are best suited to this scenario? (Choose two.) - A.. Latent Dirichlet allocation (LDA) B.. Random forest classifier C.. Neural topic modeling (NTM) D.. Linear support vector machine E.. Linear regression
AC - A. Latent Dirichlet allocation (LDA) is designed to discover abstract topics in a collection of documents and is one of the most common topic-modeling techniques. C. Neural topic modeling (NTM) also performs topic modeling; it uses a neural network and can produce results similar to LDA, sometimes with better accuracy. Both are unsupervised algorithms that can surface the top words per topic, which is what the auditors need to assess relevance and prioritize documents about adverse events. The other options are supervised algorithms unsuited to topic discovery: B. random forest is a classifier, D. linear SVM is a classifier, and E. linear regression predicts continuous values. Although both SageMaker NTM and LDA can be used for topic modeling, they are distinct algorithms and can produce different results on the same input data. References: https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html and https://docs.aws.amazon.com/sagemaker/latest/dg/lda.html - https://www.examtopics.com/discussions/amazon/view/114030-exam-aws-certified-machine-learning-specialty-topic-1/
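Illustrative sketch (not part of the original discussion): training the built-in SageMaker NTM algorithm to discover 10 topics. The S3 paths and vocabulary size are placeholder assumptions, and the training channel is expected to contain bag-of-words vectors.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

ntm = Estimator(
    image_uri=image_uris.retrieve("ntm", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://example-bucket/ntm-output/",          # assumed output location
    sagemaker_session=session,
)
ntm.set_hyperparameters(num_topics=10, feature_dim=5000)    # 10 topics; assumed vocabulary size

ntm.fit({"train": "s3://example-bucket/ntm-train/"})        # assumed input location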
263
264 - A company needs to deploy a chatbot to answer common questions from customers. The chatbot must base its answers on company documentation. Which solution will meet these requirements with the LEAST development effort? - A.. Index company documents by using Amazon Kendra. Integrate the chatbot with Amazon Kendra by using the Amazon Kendra Query API operation to answer customer questions. B.. Train a Bidirectional Attention Flow (BiDAF) network based on past customer questions and company documents. Deploy the model as a real-time Amazon SageMaker endpoint. Integrate the model with the chatbot by using the SageMaker Runtime InvokeEndpoint API operation to answer customer questions. C.. Train an Amazon SageMaker Blazing Text model based on past customer questions and company documents. Deploy the model as a real-time SageMaker endpoint. Integrate the model with the chatbot by using the SageMaker Runtime InvokeEndpoint API operation to answer customer questions. D.. Index company documents by using Amazon OpenSearch Service. Integrate the chatbot with OpenSearch Service by using the OpenSearch Service k-nearest neighbors (k-NN) Query API operation to answer customer questions.
A - A. Amazon Kendra is an intelligent, fully managed search service that uses natural language processing and machine learning to understand questions and return the most relevant answers from indexed content, so indexing the company documents and calling the Kendra Query API is the least-effort way to ground the chatbot's answers in company documentation. B. Training a BiDAF network requires deep learning and NLP expertise plus substantial data preparation, training, and integration work. C. SageMaker BlazingText is intended for text classification and word embeddings, not for extracting answers from documents. D. Amazon OpenSearch Service is a search and analytics engine; its k-NN query is built for similarity search, not for answering questions from document content, and it would require more setup than Kendra. - https://www.examtopics.com/discussions/amazon/view/114427-exam-aws-certified-machine-learning-specialty-topic-1/
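Illustrative sketch (not part of the original discussion): the chatbot backend querying an Amazon Kendra index with boto3. The index ID and question are placeholder assumptions.

import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="aaaabbbb-1111-2222-3333-ccccddddeeee",      # assumed Kendra index ID
    QueryText="How do I reset my account password?",
)

for item in response["ResultItems"]:
    if item["Type"] == "ANSWER":
        print(item["DocumentExcerpt"]["Text"])           # excerpt returned from company docs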
264
265 - A company wants to conduct targeted marketing to sell solar panels to homeowners. The company wants to use machine learning (ML) technologies to identify which houses already have solar panels. The company has collected 8,000 satellite images as training data and will use Amazon SageMaker Ground Truth to label the data. The company has a small internal team that is working on the project. The internal team has no ML expertise and no ML experience. Which solution will meet these requirements with the LEAST amount of effort from the internal team? - A.. Set up a private workforce that consists of the internal team. Use the private workforce and the SageMaker Ground Truth active learning feature to label the data. Use Amazon Rekognition Custom Labels for model training and hosting. B.. Set up a private workforce that consists of the internal team. Use the private workforce to label the data. Use Amazon Rekognition Custom Labels for model training and hosting. C.. Set up a private workforce that consists of the internal team. Use the private workforce and the SageMaker Ground Truth active learning feature to label the data. Use the SageMaker Object Detection algorithm to train a model. Use SageMaker batch transform for inference. D.. Set up a public workforce. Use the public workforce to label the data. Use the SageMaker Object Detection algorithm to train a model. Use SageMaker batch transform for inference.
A - The majority answer is A: with a small internal team and no ML experience, the SageMaker Ground Truth active learning feature automatically labels most of the 8,000 images and routes only the uncertain ones to the private workforce, and Amazon Rekognition Custom Labels handles model training and hosting without requiring any ML expertise (see https://aws.amazon.com/blogs/machine-learning/identify-rooftop-solar-panels-from-satellite-imagery-using-amazon-rekognition-custom-labels/ and https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html). B is weaker because it drops active learning, so the team would label everything by hand. C and D rely on the SageMaker Object Detection algorithm and batch transform, which demand more configuration and ML knowledge than Rekognition Custom Labels. A dissenting view argues for D: a public (Mechanical Turk) workforce would take the labeling burden off the internal team entirely, and learning the built-in Object Detection algorithm is arguably a comparable effort to learning Rekognition Custom Labels. A few voters also picked B or C, and at least one commenter switched from a public-workforce answer back to A, but most settled on A. - https://www.examtopics.com/discussions/amazon/view/111820-exam-aws-certified-machine-learning-specialty-topic-1/
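Illustrative sketch (not part of the original discussion): running inference against a trained Rekognition Custom Labels model version. The project version ARN, bucket, and key are placeholder assumptions.

import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/solar-panels/version/1/1234567890123",
    Image={"S3Object": {"Bucket": "example-satellite-images", "Name": "house-001.jpg"}},
    MinConfidence=80,
)

for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])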
265
266 - A company hosts a machine learning (ML) dataset repository on Amazon S3. A data scientist is preparing the repository to train a model. The data scientist needs to redact personally identifiable information (PII) from the dataset. Which solution will meet these requirements with the LEAST development effort? - A.. Use Amazon SageMaker Data Wrangler with a custom transformation to identify and redact the PII. B.. Create a custom AWS Lambda function to read the files, identify the PII, and redact the PII. C.. Use AWS Glue DataBrew to identify and redact the PII. D.. Use an AWS Glue development endpoint to implement the PII redaction from within a notebook.
C - AWS Glue DataBrew has native, no-code capabilities for identifying and handling PII data, so it meets the requirement with the least development effort (https://docs.aws.amazon.com/databrew/latest/dg/personal-information-protection.html and https://aws.amazon.com/blogs/big-data/introducing-pii-data-identification-and-handling-using-aws-glue-databrew/). Option A is close, but a Data Wrangler custom transformation still requires writing custom code, and B and D require even more development work. - https://www.examtopics.com/discussions/amazon/view/114478-exam-aws-certified-machine-learning-specialty-topic-1/
266
267 - A company is deploying a new machine learning (ML) model in a production environment. The company is concerned that the ML model will drift over time, so the company creates a script to aggregate all inputs and predictions into a single file at the end of each day. The company stores the file as an object in an Amazon S3 bucket. The total size of the daily file is 100 GB. The daily file size will increase over time. Four times a year, the company samples the data from the previous 90 days to check the ML model for drift. After the 90-day period, the company must keep the files for compliance reasons. The company needs to use S3 storage classes to minimize costs. The company wants to maintain the same storage durability of the data. Which solution will meet these requirements? - A.. Store the daily objects in the S3 Standard-InfrequentAccess (S3 Standard-IA) storage class. Configure an S3 Lifecycle rule to move the objects to S3 Glacier Flexible Retrieval after 90 days. B.. Store the daily objects in the S3 One Zone-Infrequent Access (S3 One Zone-IA) storage class. Configure an S3 Lifecycle rule to move the objects to S3 Glacier Flexible Retrieval after 90 days. C.. Store the daily objects in the S3 Standard-InfrequentAccess (S3 Standard-IA) storage class. Configure an S3 Lifecycle rule to move the objects to S3 Glacier Deep Archive after 90 days. D.. Store the daily objects in the S3 One Zone-Infrequent Access (S3 One Zone-IA) storage class. Configure an S3 Lifecycle rule to move the objects to S3 Glacier Deep Archive after 90 days.
C - The debate is between C and D. Everyone agrees Glacier Deep Archive is the right destination after 90 days because it is cheaper than Glacier Flexible Retrieval and the question sets no retrieval-time requirement, which eliminates A and B. The argument for D: the question asks only to maintain durability and minimize cost, and S3 Standard, Standard-IA, One Zone-IA, and the Glacier classes are all designed for the same 99.999999999% (11 nines) durability, while One Zone-IA costs roughly 20% less than Standard-IA; availability is never mentioned. The argument for C (the voted answer): One Zone-IA keeps the data in a single Availability Zone (99.5% designed availability versus 99.9% for Standard-IA, per https://aws.amazon.com/s3/storage-classes/), so an AZ-level disaster could destroy data that the company must keep for compliance; One Zone-IA is normally reserved for data that can easily be recreated, such as image thumbnails, and this is production drift-monitoring data. - https://www.examtopics.com/discussions/amazon/view/114479-exam-aws-certified-machine-learning-specialty-topic-1/
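Illustrative sketch (not part of the original discussion): a lifecycle rule that transitions the daily objects to Glacier Deep Archive after 90 days. The bucket name and prefix are placeholder assumptions; the initial storage class (Standard-IA or One Zone-IA) is chosen at upload time, not in the lifecycle rule.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-model-drift-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-after-90-days",
                "Status": "Enabled",
                "Filter": {"Prefix": "daily-predictions/"},
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)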
267
268 - A company wants to enhance audits for its machine learning (ML) systems. The auditing system must be able to perform metadata analysis on the features that the ML models use. The audit solution must generate a report that analyzes the metadata. The solution also must be able to set the data sensitivity and authorship of features. Which solution will meet these requirements with the LEAST development effort? - A.. Use Amazon SageMaker Feature Store to select the features. Create a data flow to perform feature-level metadata analysis. Create an Amazon DynamoDB table to store feature-level metadata. Use Amazon QuickSight to analyze the metadata. B.. Use Amazon SageMaker Feature Store to set feature groups for the current features that the ML models use. Assign the required metadata for each feature. Use SageMaker Studio to analyze the metadata. C.. Use Amazon SageMaker Features Store to apply custom algorithms to analyze the feature-level metadata that the company requires. Create an Amazon DynamoDB table to store feature-level metadata. Use Amazon QuickSight to analyze the metadata. D.. Use Amazon SageMaker Feature Store to set feature groups for the current features that the ML models use. Assign the required metadata for each feature. Use Amazon QuickSight to analyze the metadata.
D - The discussion narrows to B versus D; A and C are out because creating a DynamoDB table and writing custom analysis algorithms add development effort, and feature-level metadata such as data sensitivity and authorship can be assigned directly in SageMaker Feature Store (https://docs.amazonaws.cn/en_us/sagemaker/latest/dg/feature-store-add-metadata.html). The case for D (the voted answer): after setting feature groups and assigning the metadata, Amazon QuickSight can analyze and report on it with essentially no development, especially since the offline store's Parquet files can be queried through Athena, which QuickSight consumes directly; SageMaker Studio, by contrast, is a development environment where the analysis would have to be written in Python. The case for B: QuickSight is primarily a visualization tool rather than an analysis tool, it does not read Parquet files natively (so a conversion step would be needed), and SageMaker Studio is already integrated with Feature Store and supports auditing workflows (https://aws.amazon.com/blogs/machine-learning/controlling-and-auditing-data-exploration-activities-with-amazon-sagemaker-studio-and-aws-lake-formation/); one commenter notes that Copilot switched its recommendation from D to B when challenged on this point. A Portuguese comment, translated: for the least development effort we discard A (standing up a managed service like DynamoDB is usually not the best solution), B (not performant), and C (same reason as A), so by elimination D is correct. Most commenters ultimately agree on D. - https://www.examtopics.com/discussions/amazon/view/114480-exam-aws-certified-machine-learning-specialty-topic-1/
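Illustrative sketch (not part of the original discussion): assigning sensitivity and authorship metadata to a feature directly in SageMaker Feature Store. The feature group, feature name, and parameter values are placeholder assumptions.

import boto3

sm = boto3.client("sagemaker")

sm.update_feature_metadata(
    FeatureGroupName="loan-features",          # assumed feature group
    FeatureName="credit_score",                # assumed feature
    Description="Third-party credit score used by the approval models",
    ParameterAdditions=[
        {"Key": "data-sensitivity", "Value": "confidential"},
        {"Key": "author", "Value": "risk-data-team"},
    ],
)
# The metadata can later be inspected with describe_feature_metadata or searched from Studio.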
268
269 - A machine learning (ML) specialist uploads a dataset to an Amazon S3 bucket that is protected by server-side encryption with AWS KMS keys (SSE-KMS). The ML specialist needs to ensure that an Amazon SageMaker notebook instance can read the dataset that is in Amazon S3. Which solution will meet these requirements? - A.. Define security groups to allow all HTTP inbound and outbound traffic. Assign the security groups to the SageMaker notebook instance. B.. Configure the SageMaker notebook instance to have access to the VPC. Grant permission in the AWS Key Management Service (AWS KMS) key policy to the notebook’s VPC. C.. Assign an IAM role that provides S3 read access for the dataset to the SageMaker notebook. Grant permission in the KMS key policy to the IAM role. D.. Assign the same KMS key that encrypts the data in Amazon S3 to the SageMaker notebook instance.
C - The notebook needs two things: an IAM role with S3 read access to the dataset, and permission in the KMS key policy (or an IAM grant of kms:Decrypt on that key) so the role can decrypt the SSE-KMS-encrypted objects. That is exactly option C (see https://stackoverflow.com/questions/66692579/aws-sagemaker-permissionerror-access-denied-reading-data-from-s3-bucket). The question never mentions a VPC, so B is out; security groups (A) are network controls and do not grant S3 or KMS permissions; and D does not work because the key assigned to a notebook instance encrypts its attached storage volume, it does not grant the notebook permission to decrypt objects in S3. One commenter insists the answer is D, but the large majority point to C. - https://www.examtopics.com/discussions/amazon/view/114481-exam-aws-certified-machine-learning-specialty-topic-1/
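Illustrative sketch (not part of the original discussion): the two policy pieces option C describes, expressed as Python dictionaries. The role name, account ID, and bucket name are placeholder assumptions.

# 1) IAM policy attached to the notebook's execution role: S3 read access to the dataset.
s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-dataset-bucket",
                "arn:aws:s3:::example-dataset-bucket/*",
            ],
        }
    ],
}

# 2) Statement added to the KMS key policy: let the same role decrypt SSE-KMS objects.
kms_key_policy_statement = {
    "Sid": "AllowNotebookRoleToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/SageMakerNotebookRole"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}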
269
270 - A company has a podcast platform that has thousands of users. The company implemented an algorithm to detect low podcast engagement based on a 10-minute running window of user events such as listening to, pausing, and closing the podcast. A machine learning (ML) specialist is designing the ingestion process for these events. The ML specialist needs to transform the data to prepare the data for inference. How should the ML specialist design the transformation step to meet these requirements with the LEAST operational effort? - A.. Use an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster to ingest event data. Use Amazon Kinesis Data Analytics to transform the most recent 10 minutes of data before inference. B.. Use Amazon Kinesis Data Streams to ingest event data. Store the data in Amazon S3 by using Amazon Kinesis Data Firehose. Use AWS Lambda to transform the most recent 10 minutes of data before inference. C.. Use Amazon Kinesis Data Streams to ingest event data. Use Amazon Kinesis Data Analytics to transform the most recent 10 minutes of data before inference. D.. Use an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster to ingest event data. Use AWS Lambda to transform the most recent 10 minutes of data before inference.
C - The simplest pipeline is Kinesis Data Streams for ingestion plus Kinesis Data Analytics for the transformation. Kinesis Data Analytics is fully managed, can read directly from a Kinesis data stream or Firehose as its input (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html), and supports SQL or Apache Flink applications with windowed aggregations, so it can transform the most recent 10 minutes of events on the fly before inference. Option B adds an unnecessary hop through S3, options A and D require running and managing an MSK cluster, and Lambda (B, D) is invoked on a per-record basis, which makes a 10-minute running window awkward to implement. With only two managed services and no custom infrastructure, C has the least operational effort. - https://www.examtopics.com/discussions/amazon/view/114482-exam-aws-certified-machine-learning-specialty-topic-1/
270
271 - A machine learning (ML) specialist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes in the dataset, but it does not achieve an acceptable recall metric. The ML specialist varies the number and size of the MLP's hidden layers, but the results do not improve significantly. Which solution will improve recall in the LEAST amount of time? - A.. Add class weights to the MLP's loss function, and then retrain. B.. Gather more data by using Amazon Mechanical Turk, and then retrain. C.. Train a k-means algorithm instead of an MLP. D.. Train an anomaly detection model instead of an MLP.
A - Adding class weights to the MLP's loss function is the fastest fix. Class weights assign a larger penalty to mistakes on the rare target class, which counteracts the class imbalance and pushes the model to improve recall (true positives divided by true positives plus false negatives) on that class; retraining with weights requires no new data and no new model. B (gathering more data with Mechanical Turk) would take far longer. C and D replace the supervised classifier with k-means or an anomaly detector; a minority initially leaned toward C on the reading that the target class is "unique", but the class is still labeled, so this remains a supervised problem and re-weighting the loss is the least-time option. - https://www.examtopics.com/discussions/amazon/view/114483-exam-aws-certified-machine-learning-specialty-topic-1/
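Illustrative sketch (not part of the original discussion): adding class weights to a multi-class loss in PyTorch. The number of classes, the index of the rare class, and the weight values are placeholder assumptions.

import torch
import torch.nn as nn

# A higher weight on the rare class penalizes its misclassifications more, which typically raises recall.
class_weights = torch.tensor([1.0, 1.0, 1.0, 5.0, 1.0])   # 5 classes; class 3 is the rare target class
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 5)                 # stand-in batch of MLP outputs
targets = torch.randint(0, 5, (8,))        # stand-in labels
print(criterion(logits, targets).item())   # weighted loss used during retraining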
271
272 - A machine learning (ML) specialist uploads 5 TB of data to an Amazon SageMaker Studio environment. The ML specialist performs initial data cleansing. Before the ML specialist begins to train a model, the ML specialist needs to create and view an analysis report that details potential bias in the uploaded data. Which combination of actions will meet these requirements with the LEAST operational overhead? (Choose two.) - A.. Use SageMaker Clarify to automatically detect data bias B.. Turn on the bias detection option in SageMaker Ground Truth to automatically analyze data features. C.. Use SageMaker Model Monitor to generate a bias drift report. D.. Configure SageMaker Data Wrangler to generate a bias report. E.. Use SageMaker Experiments to perform a data check
AD - The key phrase is before training the model. Pre-training data bias can be detected with SageMaker Clarify (A), and SageMaker Data Wrangler can be configured to generate a data bias report (D); Clarify is integrated with Data Wrangler, so the analysis can be done in Studio without writing code (https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-reports-ui.html). B is wrong because Ground Truth is a labeling service, not a bias analyzer; C is wrong because Model Monitor's bias drift report applies to a model that is already deployed for inference, not to raw data before training; E is wrong because SageMaker Experiments is for comparing training runs and model outputs. A few commenters voted AC or AE, but the majority settled on AD. - https://www.examtopics.com/discussions/amazon/view/114426-exam-aws-certified-machine-learning-specialty-topic-1/
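Illustrative sketch (not part of the original discussion): generating a pre-training data bias report with SageMaker Clarify from the SDK (the same analysis Data Wrangler exposes without code). The paths, label, and facet definition are placeholder assumptions.

import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()

processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://example-bucket/cleansed-data/",   # assumed dataset location
    s3_output_path="s3://example-bucket/bias-report/",         # assumed report destination
    label="target",                                            # assumed label column
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],     # assumed positive outcome value
    facet_name="age",                  # assumed sensitive attribute
    facet_values_or_threshold=[40],
)

# Writes pre-training bias metrics (e.g., class imbalance) to S3 for review in Studio.
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)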
272
273 - A network security vendor needs to ingest telemetry data from thousands of endpoints that run all over the world. The data is transmitted every 30 seconds in the form of records that contain 50 fields. Each record is up to 1 KB in size. The security vendor uses Amazon Kinesis Data Streams to ingest the data. The vendor requires hourly summaries of the records that Kinesis Data Streams ingests. The vendor will use Amazon Athena to query the records and to generate the summaries. The Athena queries will target 7 to 12 of the available data fields. Which solution will meet these requirements with the LEAST amount of customization to transform and store the ingested data? - A.. Use AWS Lambda to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using Amazon Kinesis Data Firehose. B.. Use Amazon Kinesis Data Firehose to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using a short-lived Amazon EMR cluster. C.. Use Amazon Kinesis Data Analytics to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using Amazon Kinesis Data Firehose. D.. Use Amazon Kinesis Data Firehose to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using AWS Lambda.
C - The voted answer is C: Amazon Kinesis Data Analytics reads from the Kinesis data stream, performs the hourly aggregation with a windowed query, and delivers the transformed results to Amazon S3 through Kinesis Data Firehose, which can also convert the output to a columnar format such as Parquet for efficient Athena queries over the 7 to 12 targeted fields. Firehose on its own (B, D) is a delivery service, not an aggregation engine, and Lambda (A, D) works record by record, so hourly aggregation would require significant custom code; B also adds a short-lived EMR cluster, which means more customization, not less. Some commenters argued for D because Firehose can buffer data and invoke a Lambda function to transform records before delivery, but several later switched back to C, since the hourly aggregation still has to happen somewhere and Kinesis Data Analytics does it with the least customization. - https://www.examtopics.com/discussions/amazon/view/114425-exam-aws-certified-machine-learning-specialty-topic-1/
273
274 - A medical device company is building a machine learning (ML) model to predict the likelihood of device recall based on customer data that the company collects from a plain text survey. One of the survey questions asks which medications the customer is taking. The data for this field contains the names of medications that customers enter manually. Customers misspell some of the medication names. The column that contains the medication name data gives a categorical feature with high cardinality but redundancy. What is the MOST effective way to encode this categorical feature into a numeric feature? - A.. Spell check the column. Use Amazon SageMaker one-hot encoding on the column to transform a categorical feature to a numerical feature. B.. Fix the spelling in the column by using char-RNN. Use Amazon SageMaker Data Wrangler one-hot encoding to transform a categorical feature to a numerical feature. C.. Use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers. D.. Use Amazon SageMaker Data Wrangler ordinal encoding on the column to encode categories into an integer between 0 and the total number of categories in the column.
C - The voted answer is C, though a few commenters argue for A. The medication column has high cardinality with many misspelled variants of the same drug name. Data Wrangler similarity encoding is designed for exactly this case: it creates embeddings (vectors of real numbers) in which similar strings, such as a drug name and its misspellings, receive nearly identical encodings, so the redundancy collapses without manual cleanup (https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html gives the "California" / "Calfornia" example and recommends similarity encoding when there are many categorical values and noisy data). A and B require first fixing the spelling (manually or with a char-RNN) and then one-hot encoding, which is both more work and explodes the feature space at high cardinality; D's ordinal encoding imposes an arbitrary order on unordered categories. - https://www.examtopics.com/discussions/amazon/view/128599-exam-aws-certified-machine-learning-specialty-topic-1/
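Illustrative sketch (not part of the original discussion) of the idea behind similarity encoding, using plain string similarity rather than the actual Data Wrangler transform: misspelled variants of the same medication end up with nearly identical numeric vectors. The medication names are placeholder assumptions.

from difflib import SequenceMatcher

reference_categories = ["metformin", "lisinopril", "atorvastatin"]

def similarity_encode(value, references):
    # One dimension per reference category; each entry is a string-similarity score in [0, 1].
    return [SequenceMatcher(None, value.lower(), ref).ratio() for ref in references]

print(similarity_encode("Metformin", reference_categories))    # close to [1.0, ...]
print(similarity_encode("metforminn", reference_categories))   # misspelling lands near the same vector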
274
275 - A machine learning (ML) engineer has created a feature repository in Amazon SageMaker Feature Store for the company. The company has AWS accounts for development, integration, and production. The company hosts a feature store in the development account. The company uses Amazon S3 buckets to store feature values offline. The company wants to share features and to allow the integration account and the production account to reuse the features that are in the feature repository. Which combination of steps will meet these requirements? (Choose two.) - A.. Create an IAM role in the development account that the integration account and production account can assume. Attach IAM policies to the role that allow access to the feature repository and the S3 buckets. B.. Share the feature repository that is associated the S3 buckets from the development account to the integration account and the production account by using AWS Resource Access Manager (AWS RAM). C.. Use AWS Security Token Service (AWS STS) from the integration account and the production account to retrieve credentials for the development account. D.. Set up S3 replication between the development S3 buckets and the integration and production S3 buckets. E.. Create an AWS PrivateLink endpoint in the development account for SageMaker.
AB - A: Creating an IAM role in the development account that the integration and production accounts can assume establishes a cross-account trust relationship, and the policies attached to the role grant access to the feature repository and the offline-store S3 buckets. B: AWS Resource Access Manager (AWS RAM) can share the feature repository associated with those buckets with the integration and production accounts so they can reuse the features (https://docs.aws.amazon.com/ram/latest/userguide/shareable.html). Dissenting comments argue for A and C instead, on the grounds that AWS STS is how the other accounts actually obtain temporary credentials to assume the role (https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html), that RAM cannot share ordinary S3 buckets (only S3 on Outposts), and that RAM is typically used with AWS Organizations, which the question does not mention. D (S3 replication) duplicates data rather than sharing the repository, and E (PrivateLink) addresses private connectivity, not cross-account access. - https://www.examtopics.com/discussions/amazon/view/128589-exam-aws-certified-machine-learning-specialty-topic-1/
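Illustrative sketch (not part of the original discussion): the trust policy on the development-account role that lets the other accounts assume it, plus the kind of permissions it would carry. Account IDs, role name, actions, and bucket name are placeholder assumptions.

# Trust policy attached to the role in the development account.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::222222222222:root",   # integration account
                    "arn:aws:iam::333333333333:root",   # production account
                ]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy on the same role: read the feature repository and its offline store.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sagemaker:DescribeFeatureGroup", "sagemaker:ListFeatureGroups"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-offline-store",
                "arn:aws:s3:::example-offline-store/*",
            ],
        },
    ],
}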
275
276 - A company is building a new supervised classification model in an AWS environment. The company's data science team notices that the dataset has a large quantity of variables. All the variables are numeric. The model accuracy for training and validation is low. The model's processing time is affected by high latency. The data science team needs to increase the accuracy of the model and decrease the processing time. What should the data science team do to meet these requirements? - A.. Create new features and interaction variables. B.. Use a principal component analysis (PCA) model. C.. Apply normalization on the feature set. D.. Use a multiple correspondence analysis (MCA) model.
B - PCA is the right tool: it reduces a large set of numeric variables to a smaller set of principal components that retain most of the variance, which helps a model that is struggling with many features and also reduces processing time and latency. It is not MCA (D), because MCA is the counterpart of PCA for categorical variables and all of these variables are numeric. A (creating more features and interaction terms) increases dimensionality and latency, and C (normalization) alone does not reduce the number of variables. - https://www.examtopics.com/discussions/amazon/view/128591-exam-aws-certified-machine-learning-specialty-topic-1/
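Illustrative sketch (not part of the original discussion): applying PCA to a large numeric feature matrix before training. X is a stand-in array; keeping components that explain 95% of the variance is an illustrative choice.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 200)                    # stand-in for the high-dimensional dataset

X_scaled = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                     # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                           # far fewer columns, which lowers training latency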
276
277 - An exercise analytics company wants to predict running speeds for its customers by using a dataset that contains multiple health-related features for each customer. Some of the features originate from sensors that provide extremely noisy values. The company is training a regression model by using the built-in Amazon SageMaker linear learner algorithm to predict the running speeds. While the company is training the model, a data scientist observes that the training loss decreases to almost zero, but validation loss increases. Which technique should the data scientist use to optimally fit the model? - A.. Add L1 regularization to the linear learner regression model. B.. Perform a principal component analysis (PCA) on the dataset. Use the linear learner regression model. C.. Perform feature engineering by including quadratic and cubic terms. Train the linear learner regression model. D.. Add L2 regularization to the linear learner regression model.
A - The symptom (training loss near zero while validation loss rises) is overfitting, and the features include extremely noisy sensor values. The argument for A (the voted answer): L1 regularization performs implicit feature selection by driving the weights of irrelevant, noisy features to exactly zero, removing the noise from the model (https://docs.aws.amazon.com/machine-learning/latest/dg/training-parameters1.html associates L1 with reducing noise). The argument for D: L2 regularization also combats overfitting, but it shrinks all weights proportionally instead of eliminating features, so it is safer when the noisy features still carry some signal and you do not want to oversimplify the model. The practical rule several commenters give: prefer L1 when you suspect some features are irrelevant and should be dropped, prefer L2 when most features matter but the model is too complex; for sensors producing extremely noisy values, the majority lean toward L1. - https://www.examtopics.com/discussions/amazon/view/128608-exam-aws-certified-machine-learning-specialty-topic-1/
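Illustrative sketch (not part of the original discussion): setting regularization on the built-in linear learner. The S3 paths, instance type, and penalty values are placeholder assumptions; l1 and wd are the linear learner hyperparameters for L1 and L2 regularization.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

linear = Estimator(
    image_uri=image_uris.retrieve("linear-learner", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/linear-output/",   # assumed output location
    sagemaker_session=session,
)
linear.set_hyperparameters(
    predictor_type="regressor",
    l1=0.01,          # L1 penalty drives weights of noisy features toward zero
    # wd=0.01,        # alternatively, L2 (weight decay) shrinks all weights proportionally
)
linear.fit({"train": "s3://example-bucket/train/"})     # assumed training channel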
277
278 - A company's machine learning (ML) specialist is building a computer vision model to classify 10 different traffic signs. The company has stored 100 images of each class in Amazon S3, and the company has another 10,000 unlabeled images. All the images come from dash cameras and are a size of 224 pixels × 224 pixels. After several training runs, the model is overfitting on the training data. Which actions should the ML specialist take to address this problem? (Choose two.) - A.. Use Amazon SageMaker Ground Truth to label the unlabeled images. B.. Use image preprocessing to transform the images into grayscale images. C.. Use data augmentation to rotate and translate the labeled images. D.. Replace the activation of the last layer with a sigmoid. E.. Use the Amazon SageMaker k-nearest neighbors (k-NN) algorithm to label the unlabeled images.
AC - C. Data augmentation (rotating and translating the labeled images) creates new training examples from the existing labeled images, increasing the diversity of the training data and making the model more robust, which reduces overfitting without requiring additional labeled data. A. Using Amazon SageMaker Ground Truth to label the 10,000 unlabeled images expands the training dataset, which improves generalization and further reduces overfitting. - https://www.examtopics.com/discussions/amazon/view/128794-exam-aws-certified-machine-learning-specialty-topic-1/
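Illustrative sketch (not part of the original discussion): rotation and translation augmentation for the labeled traffic-sign images with torchvision. The local dataset path and transform parameters are placeholder assumptions.

from torchvision import datasets, transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # small random rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # small random translations
    transforms.ToTensor(),
])

train_dataset = datasets.ImageFolder("data/traffic-signs/train", transform=train_transforms)
# Each epoch now sees randomly perturbed variants of the labeled images,
# which reduces overfitting without collecting new labels.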
278
279 - A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the team wants to automate the workflow for feature transformations. Which solution will meet these requirements with the MOST operational efficiency? - A.. Use Amazon SageMaker Data Wrangler preconfigured transformations to explore feature transformations. Use SageMaker Data Wrangler templates for visualization. Export the feature processing workflow to a SageMaker pipeline for automation. B.. Use an Amazon SageMaker notebook instance to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation. C.. Use AWS Glue Studio with custom code to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation. D.. Use Amazon SageMaker Data Wrangler preconfigured transformations to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package each feature transformation step into a separate AWS Lambda function. Use AWS Step Functions for workflow automation.
A - This solution offers the following advantages: Amazon SageMaker Data Wrangler provides a user-friendly interface to explore and experiment with feature transformations, making it efficient for the data science team to try out different options. SageMaker Data Wrangler templates for visualization can quickly generate visualizations for the resulting distribution of the dataset, streamlining the visualization process. Exporting the feature processing workflow to a SageMaker pipeline for automation automates the feature transformations efficiently within the SageMaker environment. Amazon SageMaker Data Wrangler: provides preconfigured transformations that allow for easy exploration of feature transformations. This simplifies the experimentation process. SageMaker Data Wrangler templates for visualization: allow for visualizing the resulting distribution of the dataset, aiding in understanding the effects of feature transformations. Export the feature processing workflow to a SageMaker pipeline for automation: once an appropriate set of feature transformations is identified, the workflow can be exported to a SageMaker pipeline for automation. This ensures reproducibility and scalability of the feature processing steps. Data Wrangler is an amazing tool that takes EDA to the next level. - https://www.examtopics.com/discussions/amazon/view/128943-exam-aws-certified-machine-learning-specialty-topic-1/
279
280 - A company plans to build a custom natural language processing (NLP) model to classify and prioritize user feedback. The company hosts the data and all machine learning (ML) infrastructure in the AWS Cloud. The ML team works from the company's office, which has an IPsec VPN connection to one VPC in the AWS Cloud. The company has set both the enableDnsHostnames attribute and the enableDnsSupport attribute of the VPC to true. The company's DNS resolvers point to the VPC DNS. The company does not allow the ML team to access Amazon SageMaker notebooks through connections that use the public internet. The connection must stay within a private network and within the AWS internal network. Which solution will meet these requirements with the LEAST development effort? - A.. Create a VPC interface endpoint for the SageMaker notebook in the VPC. Access the notebook through a VPN connection and the VPC endpoint. B.. Create a bastion host by using Amazon EC2 in a public subnet within the VPC. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host. C.. Create a bastion host by using Amazon EC2 in a private subnet within the VPC with a NAT gateway. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host. D.. Create a NAT gateway in the VPC. Access the SageMaker notebook HTTPS endpoint through a VPN connection and the NAT gateway.
A - A bastion host is an outdated method. Since the connection is over an IPsec VPN and internet access is prohibited, NAT gateways and bastion hosts are unnecessary, eliminating B, C, and D. Also, traffic should not leave the AWS network between services, so a SageMaker notebook VPC endpoint is needed. A is the most effective solution. A - Never choose a bastion host; the other answers don't make sense. A has the least development cost compared with B. - https://www.examtopics.com/discussions/amazon/view/128947-exam-aws-certified-machine-learning-specialty-topic-1/
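A hedged boto3 sketch of option A. The VPC, subnet, and security group IDs are placeholders I introduced, and the service name shown follows the documented aws.sagemaker.<region>.notebook form for notebook access; verify it for your Region.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Interface endpoint so notebook traffic stays on the AWS network (reached over the VPN).
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                      # placeholder VPC ID
    ServiceName="aws.sagemaker.eu-west-1.notebook",     # SageMaker notebook endpoint service
    SubnetIds=["subnet-0123456789abcdef0"],             # placeholder private subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],          # must allow HTTPS from the office CIDR
    PrivateDnsEnabled=True,                             # resolves the notebook URL to private IPs
)
print(response["VpcEndpoint"]["VpcEndpointId"])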
280
281 - A data scientist is using Amazon Comprehend to perform sentiment analysis on a dataset of one million social media posts. Which approach will process the dataset in the LEAST time? - A.. Use a combination of AWS Step Functions and an AWS Lambda function to call the DetectSentiment API operation for each post synchronously. B.. Use a combination of AWS Step Functions and an AWS Lambda function to call the BatchDetectSentiment API operation with batches of up to 25 posts at a time. C.. Upload the posts to Amazon S3. Pass the S3 storage path to an AWS Lambda function that calls the StartSentimentDetectionJob API operation. D.. Use an AWS Lambda function to call the BatchDetectSentiment API operation with the whole dataset.
C - This approach uses Amazon Comprehend's asynchronous batch processing. By uploading the data to S3 and using the StartSentimentDetectionJob API, Comprehend can process the entire dataset in parallel. This is the most efficient method for large datasets. I meant answer C. C is an asynchronous method; A and B are not. https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartSentimentDetectionJob.html C is the most efficient approach. This method is the most efficient and scalable for processing a dataset of this size, significantly outperforming the other options in terms of processing time. Since there are a million posts, 15 minutes may not be enough, so Step Functions is needed and BatchDetectSentiment is a good way to go: https://docs.aws.amazon.com/comprehend/latest/APIReference/API_BatchDetectSentiment.html#API_BatchDetectSentiment_RequestParameters It's B. The limit on BatchDetectSentiment is 25 documents; the other endpoints are for individual strings. B. Use a combination of AWS Step Functions and an AWS Lambda function to call the BatchDetectSentiment API operation with batches of up to 25 posts at a time. Batch processing is generally more efficient for large datasets. The BatchDetectSentiment API operation allows you to process multiple items (up to 25) in a single call, which helps in reducing the overall processing time. Additionally, using AWS Step Functions to manage the workflow and AWS Lambda to handle the batch processing can make the implementation scalable and easier to manage. - https://www.examtopics.com/discussions/amazon/view/128795-exam-aws-certified-machine-learning-specialty-topic-1/
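A minimal sketch of option C with boto3 (the bucket paths, role ARN, and job name are placeholders I introduced for illustration):
import boto3

comprehend = boto3.client("comprehend")

# Asynchronous job: Comprehend reads all posts from S3 and writes results back to S3.
comprehend.start_sentiment_detection_job(
    JobName="social-media-sentiment",                                        # placeholder
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",   # placeholder role
    InputDataConfig={
        "S3Uri": "s3://example-bucket/posts/",                               # one post per line
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://example-bucket/sentiment-output/"},
)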
281
282 - A machine learning (ML) specialist at a retail company must build a system to forecast the daily sales for one of the company's stores. The company provided the ML specialist with sales data for this store from the past 10 years. The historical dataset includes the total amount of sales on each day for the store. Approximately 10% of the days in the historical dataset are missing sales data. The ML specialist builds a forecasting model based on the historical dataset. The specialist discovers that the model does not meet the performance standards that the company requires. Which action will MOST likely improve the performance for the forecasting model? - A.. Aggregate sales from stores in the same geographic area. B.. Apply smoothing to correct for seasonal variation. C.. Change the forecast frequency from daily to weekly. D.. Replace missing values in the dataset by using linear interpolation.
D - Could be B or D. The question calls out that 10% of the data is missing, which is a lot. Smoothing would help as well. I'll go with D. Linear interpolation is a widely used imputation method for time series data, where missing values are replaced by values calculated based on adjacent data points. While both B and D will have an effect on performance, the MOST effect will come from B - smoothing of seasonal variations for forecasting. Linear interpolation may even have an adverse effect on performance if the relationship between variables is not linear. Smoothing is more important than missing data in this scenario. After much consideration, I will change my answer to "D". First and foremost, solving missing data is more important. The question clearly states that 10% of the data is missing. I'm going with B. 10% missing data on 10 years of data shouldn't matter too much, so D falls off. Seasonality introduces issues and should be fixed. A and C are wrong for obvious reasons. 10% of the data across 10 years is an entire year of missing data. It's D; we don't even know if this company has seasonality problems. Linear interpolation would help to handle the 10% missing values. B. 10% of the days' data is missing out of 365*10 days. D. Based on the problem, we need to address missing data and not seasonal variance. Answer: D B. Apply smoothing to correct for seasonal variation. Smoothing techniques, such as using moving averages or other time series smoothing methods, can help in reducing noise and capturing the underlying patterns in the sales data. Seasonal variation is a common issue in time series data, especially in retail where sales may exhibit regular patterns based on seasons, holidays, or other recurring events. - https://www.examtopics.com/discussions/amazon/view/128796-exam-aws-certified-machine-learning-specialty-topic-1/
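A minimal sketch of option D (my own illustration, assuming a pandas DataFrame with a date index and an "amount" column; the data and column name are placeholders):
import pandas as pd

# Hypothetical daily sales frame with gaps on roughly 10% of the days.
sales = pd.DataFrame(
    {"amount": [1200.0, None, 1350.0, None, None, 1500.0]},
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)

sales = sales.asfreq("D")                                        # make sure every calendar day is present
sales["amount"] = sales["amount"].interpolate(method="linear")   # fill gaps from adjacent days
print(sales)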
282
283 - A mining company wants to use machine learning (ML) models to identify mineral images in real time. A data science team built an image recognition model that is based on convolutional neural network (CNN). The team trained the model on Amazon SageMaker by using GPU instances. The team will deploy the model to a SageMaker endpoint. The data science team already knows the workload traffic patterns. The team must determine instance type and configuration for the workloads. Which solution will meet these requirements with the LEAST development effort? - A.. Register the model artifact and container to the SageMaker Model Registry. Use the SageMaker Inference Recommender Default job type. Provide the known traffic pattern for load testing to select the best instance type and configuration based on the workloads. B.. Register the model artifact and container to the SageMaker Model Registry. Use the SageMaker Inference Recommender Advanced job type. Provide the known traffic pattern for load testing to select the best instance type and configuration based on the workloads. C.. Deploy the model to an endpoint by using GPU instances. Use AWS Lambda and Amazon API Gateway to handle invocations from the web. Use open-source tools to perform load testing against the endpoint and to select the best instance type and configuration. D.. Deploy the model to an endpoint by using CPU instances. Use AWS Lambda and Amazon API Gateway to handle invocations from the web. Use open-source tools to perform load testing against the endpoint and to select the best instance type and configuration.
B - Offers the least development effort. Changed my mind to option 'B' because, since the traffic is already known, the Advanced job type should be better: https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-recommendation-jobs.html GPT+Claude 3: The Default job type (option A) involves SageMaker running a set of load tests on the recommended instance types, which can provide a quicker result as it takes less time to complete (within 45 minutes). On the other hand, the Advanced job type (option B) involves a custom load test where you have more control over the traffic pattern and requirements for latency and throughput. However, this option may take longer to complete (an average of 2 hours). Given the requirement for the least development effort, option A seems more suitable. It utilizes the Default job type, which is more automated and requires less manual configuration compared to the Advanced job type. Additionally, the shorter completion time aligns better with the goal of minimizing development effort. Inference recommendations (Default job type) run a set of load tests on the recommended instance types. You can also load test for a serverless endpoint. You only need to provide a model package Amazon Resource Name (ARN) to launch this type of recommendation job. Inference recommendation jobs complete within 45 minutes. Endpoint recommendations (Advanced job type) are based on a custom load test where you select your desired ML instances or a serverless endpoint, provide a custom traffic pattern, and provide requirements for latency and throughput based on your production requirements. This job takes an average of 2 hours to complete depending on the job duration set and the total number of inference configurations tested. B. Traffic patterns are known. It's either A or B. Advanced job type recommendations are based on a custom load test where you select your desired ML instances or a serverless endpoint, provide a custom traffic pattern, and provide requirements for latency and throughput based on your production requirements. Since traffic patterns are already known, it should be B. With the Default job type, you only need to provide a model package Amazon Resource Name (ARN) to launch this type of recommendation job; it does not support providing custom traffic patterns. A. Register the model artifact and container to the SageMaker Model Registry. Use the SageMaker Inference Recommender Default job type. Provide the known traffic pattern for load testing to select the best instance type and configuration based on the workloads. Explanation: SageMaker Model Registry allows you to register and organize your trained models. The SageMaker Inference Recommender Default job type simplifies the process of selecting the best instance type and configuration based on the known traffic pattern. It automatically selects the best instance type for the model. Load testing with the known traffic pattern helps in understanding the actual workloads and selecting the most appropriate instance type and configuration. This approach leverages the capabilities provided by SageMaker without the need for additional infrastructure or open-source tools. - https://www.examtopics.com/discussions/amazon/view/128797-exam-aws-certified-machine-learning-specialty-topic-1/
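A hedged boto3 sketch of option B. The model package ARN, role, job name, traffic phases, and candidate instance types are placeholders I made up; check the CreateInferenceRecommendationsJob API reference for the exact fields your job needs.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="cnn-endpoint-recommendation",                                   # placeholder
    JobType="Advanced",                                                      # custom load test with the known traffic pattern
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",         # placeholder role
    InputConfig={
        "ModelPackageVersionArn": "arn:aws:sagemaker:eu-west-1:123456789012:model-package/mineral-cnn/1",
        "TrafficPattern": {
            "TrafficType": "PHASES",
            "Phases": [
                {"InitialNumberOfUsers": 5, "SpawnRate": 2, "DurationInSeconds": 600},
            ],
        },
        "EndpointConfigurations": [
            {"InstanceType": "ml.g4dn.xlarge"},                              # candidate GPU instance types
            {"InstanceType": "ml.g5.xlarge"},
        ],
    },
)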
283
284 - A company is building custom deep learning models in Amazon SageMaker by using training and inference containers that run on Amazon EC2 instances. The company wants to reduce training costs but does not want to change the current architecture. The SageMaker training job can finish after interruptions. The company can wait days for the results. Which combination of resources should the company use to meet these requirements MOST cost-effectively? (Choose two.) - A.. On-Demand Instances B.. Checkpoints C.. Reserved Instances D.. Incremental training E.. Spot instances
BE - https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html To pick up where you left off, checkpointing is important, and Spot Instances save cost. Checkpoints and Spot Instances. Spot Instances and checkpoints. Spot Instances are the cheapest and can be used with checkpoints. To meet the requirements of reducing training costs and being cost-effective in an Amazon SageMaker environment, the company should consider the following combination of resources: E. Spot Instances: Spot Instances are spare EC2 instances that are available at a lower cost compared to On-Demand Instances. By using Spot Instances for training, the company can significantly reduce the cost of running SageMaker training jobs. B. Checkpoints: Checkpoints allow the model training process to save the model's current state during training. If the training job is interrupted (e.g., due to a Spot Instance termination), the model can resume from the last saved checkpoint rather than starting from scratch. - https://www.examtopics.com/discussions/amazon/view/128798-exam-aws-certified-machine-learning-specialty-topic-1/
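A minimal SageMaker Python SDK sketch of B + E. The image URI, role, and S3 paths are placeholders; this is a sketch under those assumptions, not the company's actual training job.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/custom-training:latest",  # placeholder container
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",                     # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,                                # E: managed Spot training for lower cost
    max_run=24 * 3600,                                      # max training time in seconds
    max_wait=72 * 3600,                                     # can wait days for Spot capacity (>= max_run)
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",   # B: resume here after interruptions
)

estimator.fit({"training": "s3://example-bucket/train/"})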
284
285 - A company hosts a public web application on AWS. The application provides a user feedback feature that consists of free-text fields where users can submit text to provide feedback. The company receives a large amount of free-text user feedback from the online web application. The product managers at the company classify the feedback into a set of fixed categories including user interface issues, performance issues, new feature request, and chat issues for further actions by the company's engineering teams. A machine learning (ML) engineer at the company must automate the classification of new user feedback into these fixed categories by using Amazon SageMaker. A large set of accurate data is available from the historical user feedback that the product managers previously classified. Which solution should the ML engineer apply to perform multi-class text classification of the user feedback? - A.. Use the SageMaker Latent Dirichlet Allocation (LDA) algorithm. B.. Use the SageMaker BlazingText algorithm. C.. Use the SageMaker Neural Topic Model (NTM) algorithm. D.. Use the SageMaker CatBoost algorithm.
B - B. BlazingText for text classification. BlazingText implements a supervised multi-class, multi-label text classification algorithm. B. Use the SageMaker BlazingText algorithm. Explanation: BlazingText for Text Classification: SageMaker BlazingText is designed for efficient and scalable text classification tasks. It supports multi-class classification, making it suitable for the scenario where user feedback needs to be classified into fixed categories. BlazingText uses a fast implementation of the Word2Vec algorithm, making it highly performant. - https://www.examtopics.com/discussions/amazon/view/128799-exam-aws-certified-machine-learning-specialty-topic-1/
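A small sketch of how training data for BlazingText supervised mode is typically prepared: one document per line, prefixed with __label__<category>. The category names and feedback text below are placeholders I invented.
labeled_feedback = [
    ("ui_issue", "The settings page does not render on mobile"),
    ("performance_issue", "Checkout takes more than ten seconds to load"),
    ("new_feature_request", "Please add a dark mode option"),
]

# Each training line: "__label__<category> <lower-cased feedback text>"
with open("feedback.train", "w") as f:
    for category, text in labeled_feedback:
        f.write(f"__label__{category} {text.lower()}\n")
# Upload feedback.train to S3 and train BlazingText with the hyperparameter mode="supervised".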
285
286 - A digital media company wants to build a customer churn prediction model by using tabular data. The model should clearly indicate whether a customer will stop using the company's services. The company wants to clean the data because the data contains some empty fields, duplicate values, and rare values. Which solution will meet these requirements with the LEAST development effort? - A.. Use SageMaker Canvas to automatically clean the data and to prepare a categorical model. B.. Use SageMaker Data Wrangler to clean the data. Use the built-in SageMaker XGBoost algorithm to train a classification model. C.. Use SageMaker Canvas automatic data cleaning and preparation tools. Use the built-in SageMaker XGBoost algorithm to train a regression model. D.. Use SageMaker Data Wrangler to clean the data. Use the SageMaker Autopilot to train a regression model
A - B 1.Data Cleaning: SageMaker Data Wrangler is designed for data preparation tasks, including handling missing values, duplicates, and rare values. It provides a visual interface to clean and transform tabular data efficiently. This addresses the data cleaning requirements mentioned in the question. 2.Model Training: Using the built-in SageMaker XGBoost algorithm is a common and effective choice for classification tasks like customer churn prediction. XGBoost is a powerful and widely used algorithm for binary classification problems. B. Use SageMaker Data Wrangler to clean the data. Use the built-in SageMaker XGBoost algorithm to train a classification model. Explanation: SageMaker Data Wrangler: SageMaker Data Wrangler is designed for efficient data cleaning and preparation. It provides a visual interface that simplifies the process of cleaning tabular data, handling missing values, and addressing duplicate or rare values. Data Wrangler can generate the necessary preprocessing code automatically, reducing the development effort. SageMaker XGBoost (for Classification): XGBoost is a popular and powerful algorithm for classification tasks, including customer churn prediction. SageMaker provides a built-in XGBoost algorithm, making it easy to train a classification model without the need for extensive coding. https://aws.amazon.com/it/blogs/machine-learning/predicting-customer-churn-with-no-code-machine-learning-using-amazon-sagemaker-canvas/ Option B involves using SageMaker Data Wrangler to clean the data and the built-in SageMaker XGBoost algorithm to train a classification model. While this is a valid approach, it requires more manual intervention and development effort compared to using SageMaker Canvas. A is correct SageMaker Canvas is an excellent tool for those without ML expertise to build models, but it may not provide the detailed control needed for data cleaning and may not be as robust as Data Wrangler for complex cleaning tasks. https://aws.amazon.com/tw/about-aws/whats-new/2022/05/amazon-sagemaker-canvas-adds-new-data-capabilities-usability-updates/ Answer A- Sagemaker Canvas + categorical model Reason : SageMaker Canvas: SageMaker Canvas is a no-code machine learning tool that allows users to perform data preparation, feature engineering, and model training with minimal technical expertise. It automatically handles tasks like data cleaning, including the removal of duplicates, filling missing values, and managing rare categories. Categorical Model: A categorical (classification) model is the correct type for churn prediction, as it aims to classify whether a customer will stop using the service (churn) or not. SageMaker Canvas provides user-friendly tools to build and evaluate this type of model. While Amazon SageMaker Canvas can perform automatic data cleaning and preparation, it has certain limitations when it comes to handling complex data cleaning tasks. SageMaker Canvas is designed for building machine learning models with minimal code and effort, primarily targeting business analysts and non-technical users. It provides a guided user interface and automates many steps in the machine learning pipeline, including data cleaning and preparation. However, SageMaker Canvas has a set of built-in data cleaning and preparation operations, which may not be sufficient for handling all types of data quality issues or complex data transformations. 
If the data requires more advanced cleaning techniques or custom transformations, SageMaker Data Wrangler (option B) would be a better choice. A is correct; Canvas can do this without writing a single line of code. This can be done without code using SageMaker Canvas: https://aws.amazon.com/blogs/machine-learning/predicting-customer-churn-with-no-code-machine-learning-using-amazon-sagemaker-canvas/ Hence, A is right. The best solution, meeting the requirements with the least development effort and correctly addressing the nature of the problem, is: A. Use SageMaker Canvas to automatically clean the data and to prepare a categorical model. This option leverages the simplicity and automatic features of SageMaker Canvas, ensuring minimal development effort while accurately targeting the need for a classification model in customer churn prediction. See: https://aws.amazon.com/blogs/machine-learning/predicting-customer-churn-with-no-code-machine-learning-using-amazon-sagemaker-canvas/ Canvas also does no-code data cleaning and preparation. So, the least development effort is Canvas. - https://www.examtopics.com/discussions/amazon/view/128698-exam-aws-certified-machine-learning-specialty-topic-1/
286
287 - A data engineer is evaluating customer data in Amazon SageMaker Data Wrangler. The data engineer will use the customer data to create a new model to predict customer behavior. The engineer needs to increase the model performance by checking for multicollinearity in the dataset. Which steps can the data engineer take to accomplish this with the LEAST operational effort? (Choose two.) - A.. Use SageMaker Data Wrangler to refit and transform the dataset by applying one-hot encoding to category-based variables. B.. Use SageMaker Data Wrangler diagnostic visualization. Use principal components analysis (PCA) and singular value decomposition (SVD) to calculate singular values. C.. Use the SageMaker Data Wrangler Quick Model visualization to quickly evaluate the dataset and to produce importance scores for each feature. D.. Use the SageMaker Data Wrangler Min Max Scaler transform to normalize the data. E.. Use SageMaker Data Wrangler diagnostic visualization. Use least absolute shrinkage and selection operator (LASSO) to plot coefficient values from a LASSO model that is trained on the dataset.
BE - B,E https://aws.amazon.com/about-aws/whats-new/2021/08/detect-multicollinearity-amazon-sagemaker-data-wrangler/ Use SageMaker Data Wrangler diagnostic visualization. Use principal components analysis (PCA) and singular value decomposition (SVD) to calculate singular values (Option B). PCA and SVD are effective techniques for identifying multicollinearity by reducing the dimensionality of the data and highlighting the relationships between variables1. Use SageMaker Data Wrangler diagnostic visualization. Use least absolute shrinkage and selection operator (LASSO) to plot coefficient values from a LASSO model that is trained on the dataset (Option E). LASSO helps in identifying and mitigating multicollinearity by shrinking some coefficients to zero, effectively selecting a subset of predictors B. Use SageMaker Data Wrangler diagnostic visualization. Use principal components analysis (PCA) and singular value decomposition (SVD) to calculate singular values. PCA and SVD: These methods help identify multicollinearity by reducing the dataset's dimensionality, revealing relationships among variables. Multicollinear features often become evident through high correlations in principal components or singular values. C. Use the SageMaker Data Wrangler Quick Model visualization to quickly evaluate the dataset and to produce importance scores for each feature. Quick Model Visualization: This feature enables rapid evaluation of feature importance scores, which can help detect multicollinearity by identifying features that may be overly correlated and thus less impactful independently. B and E make sense https://aws.amazon.com/about-aws/whats-new/2021/08/detect-multicollinearity-amazon-sagemaker-data-wrangler/ PCA and SVD calculate singular values, which indicate the contribution of each feature to the overall variance. Features with high singular values have less multicollinearity. LASSO regularization shrinks coefficient values of highly correlated features towards zero, highlighting potential multicollinearity through their relative sizes. B. Use SageMaker Data Wrangler diagnostic visualization. Use principal components analysis (PCA) and singular value decomposition (SVD) to calculate singular values. PCA and SVD can help in identifying multicollinearity by analyzing the correlation structure of the variables. High condition numbers or small singular values may indicate multicollinearity issues. D. Use the SageMaker Data Wrangler Min Max Scaler transform to normalize the data. Normalizing the data using techniques like Min-Max scaling can mitigate the impact of multicollinearity. Normalization helps in bringing the features to a similar scale, reducing the sensitivity to differences in magnitudes. B and E Explanation: Option B: Principal components analysis (PCA) and singular value decomposition (SVD) are techniques used to identify multicollinearity in a dataset. By visualizing the singular values, the data engineer can assess the level of multicollinearity present in the features. This approach is effective for detecting relationships among variables. Option E: LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization technique that can be used to penalize certain coefficients and, in turn, highlight the most important features. By plotting the coefficient values from a LASSO model, the data engineer can identify variables that contribute the most to the model. This can be useful for identifying and mitigating multicollinearity. 
- https://www.examtopics.com/discussions/amazon/view/128699-exam-aws-certified-machine-learning-specialty-topic-1/
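A small sketch of the diagnostics that Data Wrangler automates here (my own illustration with NumPy and scikit-learn on synthetic data; none of the names come from the question): singular values that are much smaller than the rest and LASSO coefficients driven to zero both point at collinear features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=300)   # nearly collinear with x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + rng.normal(scale=0.1, size=300)

singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print("singular values:", np.round(singular_values, 3))   # one value is far smaller -> collinearity

lasso = Lasso(alpha=0.05).fit(X, y)
print("LASSO coefficients:", np.round(lasso.coef_, 3))     # one of the collinear pair shrinks to 0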
287
288 - A company processes millions of orders every day. The company uses Amazon DynamoDB tables to store order information. When customers submit new orders, the new orders are immediately added to the DynamoDB tables. New orders arrive in the DynamoDB tables continuously. A data scientist must build a peak-time prediction solution. The data scientist must also create an Amazon QuickSight dashboard to display near real-time order insights. The data scientist needs to build a solution that will give QuickSight access to the data as soon as new order information arrives. Which solution will meet these requirements with the LEAST delay between when a new order is processed and when QuickSight can access the new order information? - A.. Use AWS Glue to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3. B.. Use Amazon Kinesis Data Streams to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3. C.. Use an API call from QuickSight to access the data that is in Amazon DynamoDB directly. D.. Use Amazon Kinesis Data Firehose to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3.
D - Direct Access. Kinesis Data Streams does not support saving data to S3 directly -- B is out. Kinesis Data Firehose does not access data from S3 directly -- D is out. Direct Access: Using an API call allows QuickSight to access the data in DynamoDB directly, ensuring near real-time insights without the need for intermediate steps like exporting data to S3. Minimal Latency: This approach minimizes latency since it eliminates the delay associated with data transfer and storage in S3. Kinesis Data Streams doesn't support any feature to save data to S3 directly, so the answer is D. If the option mentioned a Lambda after the data stream to consume data into S3, then B would be the LEAST-delay option. ChatGPT: B. Amazon QuickSight dashboard to display near real-time order insights. B provides the most efficient solution for near real-time access to new order information in QuickSight. Option C involves using an API call from QuickSight to access the data directly in Amazon DynamoDB. While this option can provide real-time access to the data, it requires direct integration between QuickSight and DynamoDB, which may involve additional development effort. Additionally, QuickSight's native integration with DynamoDB for real-time data access might be limited compared to its integration with data stored in Amazon S3. Therefore, while option C might offer real-time access, option D with Kinesis Data Firehose to S3 could be a more robust and scalable solution, especially considering the potential limitations of direct DynamoDB integration with QuickSight. D is the best solution given the options. If not directly, QuickSight can connect to DynamoDB via Athena using a connector: https://aws.amazon.com/blogs/big-data/visualize-amazon-dynamodb-insights-in-amazon-quicksight-using-the-amazon-athena-dynamodb-connector-and-aws-glue/ QuickSight doesn't integrate with DynamoDB directly. It can use S3, Redshift, Aurora/RDS, Athena, IoT Analytics, and EC2-hosted databases as data sources. Glue would work as well, but Firehose (D) is the least-delay option. QuickSight provides the ability to connect to various data sources, including DynamoDB, to create visualizations and dashboards. QuickSight supports a direct connection to DynamoDB tables, allowing you to query and visualize data stored in DynamoDB in real time. No need to consume a DynamoDB stream with Firehose. C is right. QuickSight doesn't integrate with DynamoDB directly. The best solution, considering the requirement for the least delay and the ability to handle continuous data flow efficiently, would be: D. Use Amazon Kinesis Data Firehose to export the data from Amazon DynamoDB to Amazon S3, and configure QuickSight to access the data in Amazon S3. This solution leverages the automatic, scalable streaming capture of Kinesis Data Firehose to move data into S3, where it can be readily accessed by QuickSight for analytics and visualization purposes. This approach balances the need for near real-time insights with the capabilities of AWS services to handle streaming data effectively. - https://www.examtopics.com/discussions/amazon/view/133244-exam-aws-certified-machine-learning-specialty-topic-1/
288
289 - A data engineer is preparing a dataset that a retail company will use to predict the number of visitors to stores. The data engineer created an Amazon S3 bucket. The engineer subscribed the S3 bucket to an AWS Data Exchange data product for general economic indicators. The data engineer wants to join the economic indicator data to an existing table in Amazon Athena to merge with the business data. All these transformations must finish running in 30-60 minutes. Which solution will meet these requirements MOST cost-effectively? - A.. Configure the AWS Data Exchange product as a producer for an Amazon Kinesis data stream. Use an Amazon Kinesis Data Firehose delivery stream to transfer the data to Amazon S3. Run an AWS Glue job that will merge the existing business data with the Athena table. Write the result set back to Amazon S3. B.. Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to use Amazon SageMaker Data Wrangler to merge the existing business data with the Athena table. Write the result set back to Amazon S3. C.. Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to run an AWS Glue job that will merge the existing business data with the Athena table. Write the results back to Amazon S3. D.. Provision an Amazon Redshift cluster. Subscribe to the AWS Data Exchange product and use the product to create an Amazon Redshift table. Merge the data in Amazon Redshift. Write the results back to Amazon S3.
C - A - Kinesis adds unnecessary additional cost and complexity and will add to latency. B - Data Wrangler is better suited for data prep and feature engineering, not merging. C - Serverless, so cost-effective; the trigger happens immediately, so it falls within the Lambda 15-minute window, and Glue is made for these use cases. D - Costly setup and maintenance. The hint is 30 to 60 minutes, and Lambda has 15 minutes. Plus, Glue provides a lot of built-in functionality that makes the merge process much easier. A is not needed as we don't need to add Kinesis; it has no purpose here. B is possible, but Data Wrangler is more expensive than C. C is serverless and cost optimized, so C is correct. D is obviously too expensive. C. Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to run an AWS Glue job that will merge the existing business data with the Athena table. Write the results back to Amazon S3. This solution avoids the need for continuous data streams or provisioning a persistent database cluster, which can incur higher costs. AWS Lambda can trigger cost-effective, short-duration tasks, and AWS Glue is a managed ETL service that can handle the data transformation and merging efficiently. The integration with Amazon S3 and Athena also aligns with the existing data flow and tools. Fix C -> B: The most cost-effective solution is to use an S3 event to trigger a Lambda function that uses SageMaker Data Wrangler to merge the data. This solution avoids the need to provision and manage any additional resources, such as Kinesis streams, Firehose delivery streams, Glue jobs, or Redshift clusters. SageMaker Data Wrangler provides a visual interface to import, prepare, transform, and analyze data from various sources, including AWS Data Exchange products. It can also export the data preparation workflow to a Python script that can be executed by a Lambda function. This solution can meet the time requirement of 30-60 minutes, depending on the size and complexity of the data. References: Using Amazon S3 Event Notifications; Prepare ML Data with Amazon SageMaker Data Wrangler; AWS Lambda Function. The most cost-effective and straightforward solution is C. Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to run an AWS Glue job that will merge the existing business data with the Athena table and write the results back to Amazon S3. This approach leverages the serverless architecture of AWS, minimizing operational overhead and cost while ensuring the transformations can be completed within the desired timeframe. - https://www.examtopics.com/discussions/amazon/view/133242-exam-aws-certified-machine-learning-specialty-topic-1/
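A minimal sketch of option C's Lambda handler (the Glue job name and the argument key are placeholders I introduced; the S3 event notification itself is configured on the bucket):
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by the S3 event on the AWS Data Exchange bucket; kick off the merge job in Glue.
    record = event["Records"][0]
    new_object_key = record["s3"]["object"]["key"]

    response = glue.start_job_run(
        JobName="merge-economic-indicators",                 # placeholder Glue job name
        Arguments={"--new_dataset_key": new_object_key},     # pass the new object to the job
    )
    return {"JobRunId": response["JobRunId"]}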
289
290 - A company operates large cranes at a busy port The company plans to use machine learning (ML) for predictive maintenance of the cranes to avoid unexpected breakdowns and to improve productivity. The company already uses sensor data from each crane to monitor the health of the cranes in real time. The sensor data includes rotation speed, tension, energy consumption, vibration, pressure, and temperature for each crane. The company contracts AWS ML experts to implement an ML solution. Which potential findings would indicate that an ML-based solution is suitable for this scenario? (Choose two.) - A.. The historical sensor data does not include a significant number of data points and attributes for certain time periods. B.. The historical sensor data shows that simple rule-based thresholds can predict crane failures. C.. The historical sensor data contains failure data for only one type of crane model that is in operation and lacks failure data of most other types of crane that are in operation. D.. The historical sensor data from the cranes are available with high granularity for the last 3 years. E.. The historical sensor data contains most common types of crane failures that the company wants to predict.
DE - D and E. I simply agree 100%. Conclusion: The findings that indicate an ML-based solution is suitable for predictive maintenance in this scenario are: D. The historical sensor data from the cranes are available with high granularity for the last 3 years. E. The historical sensor data contains the most common types of crane failures that the company wants to predict. These points suggest the availability of comprehensive and relevant data necessary for developing an effective ML model for predictive maintenance. - https://www.examtopics.com/discussions/amazon/view/133245-exam-aws-certified-machine-learning-specialty-topic-1/
290
291 - A company wants to create an artificial intelligence (AI) yoga instructor that can lead large classes of students. The company needs to create a feature that can accurately count the number of students who are in a class. The company also needs a feature that can differentiate students who are performing a yoga stretch correctly from students who are performing a stretch incorrectly. To determine whether students are performing a stretch correctly, the solution needs to measure the location and angle of each student's arms and legs. A data scientist must use Amazon SageMaker to access video footage of a yoga class by extracting image frames and applying computer vision models. Which combination of models will meet these requirements with the LEAST effort? (Choose two.) - A.. Image Classification B.. Optical Character Recognition (OCR) C.. Object Detection D.. Pose estimation E.. Image Generative Adversarial Networks (GANs)
CD - Object detection for the count and Pose detection for posture Object detection + pose detection will do. agrees C. Object Detection: This model can identify and locate multiple objects within an image frame. For the task of counting the number of students in a class, object detection models can recognize and count the number of people present. This is essential for understanding class size and ensuring that each student is accounted for in the analysis. D. Pose Estimation: Pose estimation models are designed to determine the positions and orientations of human bodies in images or videos. They can identify the location and angle of a person's arms, legs, and other body parts. This capability is crucial for analyzing whether a student is performing a yoga stretch correctly by comparing their pose to the desired alignment and form for each yoga pose. - https://www.examtopics.com/discussions/amazon/view/133249-exam-aws-certified-machine-learning-specialty-topic-1/
291
292 - An ecommerce company has used Amazon SageMaker to deploy a factorization machines (FM) model to suggest products for customers. The company’s data science team has developed two new models by using the TensorFlow and PyTorch deep learning frameworks. The company needs to use A/B testing to evaluate the new models against the deployed model. The required A/B testing setup is as follows: • Send 70% of traffic to the FM model, 15% of traffic to the TensorFlow model, and 15% of traffic to the PyTorch model. • For customers who are from Europe, send all traffic to the TensorFlow model. Which architecture can the company use to implement the required A/B testing setup? - A.. Create two new SageMaker endpoints for the TensorFlow and PyTorch models in addition to the existing SageMaker endpoint. Create an Application Load Balancer. Create a target group for each endpoint. Configure listener rules and add weight to the target groups. To send traffic to the TensorFlow model for customers who are from Europe, create an additional listener rule to forward traffic to the TensorFlow target group. B.. Create two production variants for the TensorFlow and PyTorch models. Create an auto scaling policy and configure the desired A/B weights to direct traffic to each production variant. Update the existing SageMaker endpoint with the auto scaling policy. To send traffic to the TensorFlow model for customers who are from Europe, set the TargetVariant header in the request to point to the variant name of the TensorFlow model. C.. Create two new SageMaker endpoints for the TensorFlow and PyTorch models in addition to the existing SageMaker endpoint. Create a Network Load Balancer. Create a target group for each endpoint. Configure listener rules and add weight to the target groups. To send traffic to the TensorFlow model for customers who are from Europe, create an additional listener rule to forward traffic to the TensorFlow target group. D.. Create two production variants for the TensorFlow and PyTorch models. Specify the weight for each production variant in the SageMaker endpoint configuration. Update the existing SageMaker endpoint with the new configuration. To send traffic to the TensorFlow model for customers who are from Europe, set the TargetVariant header in the request to point to the variant name of the TensorFlow model.
D - A is the only option capable of region-based routing to direct traffic from Europe to the TensorFlow model. A is correct. Typical multi-variant deployment workflow. - No additional endpoints required, which eliminates A and C. - B: the question isn't about autoscaling but traffic routing. - D: textbook production-variant deployment method. You need to use both production variants and the TargetVariant header for this requirement: https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html - https://www.examtopics.com/discussions/amazon/view/133519-exam-aws-certified-machine-learning-specialty-topic-1/
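A hedged boto3 sketch of option D. The model names, endpoint and config names, and variant names are placeholders I introduced; the weights follow the 70/15/15 split from the question.
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Endpoint config with three production variants and the required traffic weights.
sm.create_endpoint_config(
    EndpointConfigName="recommender-ab-config",                    # placeholder
    ProductionVariants=[
        {"VariantName": "fm", "ModelName": "fm-model", "InstanceType": "ml.m5.xlarge",
         "InitialInstanceCount": 1, "InitialVariantWeight": 0.70},
        {"VariantName": "tensorflow", "ModelName": "tf-model", "InstanceType": "ml.m5.xlarge",
         "InitialInstanceCount": 1, "InitialVariantWeight": 0.15},
        {"VariantName": "pytorch", "ModelName": "pt-model", "InstanceType": "ml.m5.xlarge",
         "InitialInstanceCount": 1, "InitialVariantWeight": 0.15},
    ],
)

# For customers from Europe, bypass the weights and pin the request to the TensorFlow variant.
runtime.invoke_endpoint(
    EndpointName="recommender-endpoint",                           # placeholder
    TargetVariant="tensorflow",
    ContentType="application/json",
    Body=b'{"customer_id": "eu-12345"}',
)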
292
293 - A data scientist stores financial datasets in Amazon S3. The data scientist uses Amazon Athena to query the datasets by using SQL. The data scientist uses Amazon SageMaker to deploy a machine learning (ML) model. The data scientist wants to obtain inferences from the model at the SageMaker endpoint. However, when the data scientist attempts to invoke the SageMaker endpoint, the data scientist receives SQL statement failures. The data scientist’s IAM user is currently unable to invoke the SageMaker endpoint. Which combination of actions will give the data scientist’s IAM user the ability to invoke the SageMaker endpoint? (Choose three.) - A.. Attach the AmazonAthenaFullAccess AWS managed policy to the user identity. B.. Include a policy statement for the data scientist's IAM user that allows the IAM user to perform the sagemaker:InvokeEndpoint action. C.. Include an inline policy for the data scientist’s IAM user that allows SageMaker to read S3 objects. D.. Include a policy statement for the data scientist’s IAM user that allows the IAM user to perform the sagemaker:GetRecord action. E.. Include the SQL statement "USING EXTERNAL FUNCTION ml_function_name'' in the Athena SQL query. F.. Perform a user remapping in SageMaker to map the IAM user to another IAM user that is on the hosted endpoint.
BCE - Why on earth would you "update the data scientist's IAM user to allow SageMaker to read S3 objects"? You can't update an IAM user to give one service access to another service. If this said "SageMaker IAM role to access S3" then I would understand, but it doesn't say that. At the same time, granting AthenaFullAccess is not best practice and is not least privilege. I'll go with ABE. (Option B): This is essential for invoking the endpoint. (Option C): This ensures that the model can access the necessary data stored in S3. (Option A): This grants the necessary permissions to query datasets using Athena. The data scientist was already querying Athena, so adding Athena access would not solve the problem of invoking the SageMaker endpoint; therefore, A is not a good choice. A: No - not needed, as the user already has Athena access (they are already querying Athena with SQL). B: Yes - the sagemaker:InvokeEndpoint permission is needed to invoke the endpoint. C: Yes - needed for the IAM user context to read the S3 bucket. D: No - sagemaker:GetRecord has no relevance in this question. E: Yes - used to call an external function, in this case the ML function deployed on the SageMaker endpoint, within the Athena SQL query. F: No - irrelevant. ABE. C is wrong. Why would SageMaker need to access S3? SageMaker receives data and requests via the endpoint. https://docs.aws.amazon.com/athena/latest/ug/querying-mlmodel.html https://docs.aws.amazon.com/athena/latest/ug/machine-learning-iam-access.html The correct combination of actions to enable the data scientist's IAM user to invoke the SageMaker endpoint is B, C, and E, because they ensure that the IAM user has the necessary permissions, access, and syntax to query the ML model from Athena. These actions have the following benefits: B: Including a policy statement for the IAM user that allows the sagemaker:InvokeEndpoint action grants the IAM user the permission to call the SageMaker Runtime InvokeEndpoint API, which is used to get inferences from the model hosted at the endpoint. - https://www.examtopics.com/discussions/amazon/view/133250-exam-aws-certified-machine-learning-specialty-topic-1/
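A hedged sketch of the Athena side of option E, wrapped in boto3. The function name, column names, table, SageMaker endpoint name, workgroup, and output location are all placeholders I invented to illustrate the USING EXTERNAL FUNCTION syntax.
import boto3

athena = boto3.client("athena")

query = """
USING EXTERNAL FUNCTION predict_risk(amount DOUBLE, term DOUBLE)
    RETURNS DOUBLE
    SAGEMAKER 'financial-risk-endpoint'               -- placeholder SageMaker endpoint name
SELECT claim_id, predict_risk(amount, term) AS risk_score
FROM financial_claims                                  -- placeholder Athena table
LIMIT 10
"""

athena.start_query_execution(
    QueryString=query,
    WorkGroup="primary",                                                    # placeholder workgroup
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)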
293
294 - A data scientist is building a linear regression model. The scientist inspects the dataset and notices that the mode of the distribution is lower than the median, and the median is lower than the mean. Which data transformation will give the data scientist the ability to apply a linear regression model? - A.. Exponential transformation B.. Logarithmic transformation C.. Polynomial transformation D.. Sinusoidal transformation
B - The distribution described (mode < median < mean) indicates a positively skewed distribution. To normalize such data and make it more suitable for linear regression, a logarithmic transformation is often used. This transformation can help stabilize variance and make the data more normally distributed. The fact that the mode is lower than the median, and the median is lower than the mean, suggests that the data is positively skewed (i.e., has a long right tail). In such cases, a logarithmic transformation is often used to reduce skewness and make the data more symmetric. Therefore, the correct answer is B. Logarithmic transformation. Explanation: A logarithmic transformation is a suitable data transformation for a linear regression model when the data has a skewed distribution, such as when the mode is lower than the median and the median is lower than the mean. A logarithmic transformation can reduce the skewness and make the data more symmetric and normally distributed, which are desirable properties for linear regression. A logarithmic transformation can also reduce the effect of outliers and heteroscedasticity (unequal variance) in the data. An exponential transformation would have the opposite effect of increasing the skewness and making the data more asymmetric. A polynomial transformation may not be able to capture the nonlinearity in the data and may introduce multicollinearity among the transformed variables. A sinusoidal transformation is not appropriate for data that does not have a periodic - https://www.examtopics.com/discussions/amazon/view/133253-exam-aws-certified-machine-learning-specialty-topic-1/
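A tiny numeric check of why the log transform helps here (my own example with synthetic lognormal data, which shows exactly the mode < median < mean pattern described):
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)       # positively skewed: mode < median < mean

print("skewness before log:", round(skew(x), 2))           # strongly positive
print("skewness after log: ", round(skew(np.log(x)), 2))   # close to 0, roughly symmetric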
294
295 - A data scientist receives a collection of insurance claim records. Each record includes a claim ID. the final outcome of the insurance claim, and the date of the final outcome. The final outcome of each claim is a selection from among 200 outcome categories. Some claim records include only partial information. However, incomplete claim records include only 3 or 4 outcome categories from among the 200 available outcome categories. The collection includes hundreds of records for each outcome category. The records are from the previous 3 years. The data scientist must create a solution to predict the number of claims that will be in each outcome category every month, several months in advance. Which solution will meet these requirements? - A.. Perform classification every month by using supervised learning of the 200 outcome categories based on claim contents. B.. Perform reinforcement learning by using claim IDs and dates. Instruct the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month. C.. Perform forecasting by using claim IDs and dates to identify the expected number of claims in each outcome category every month. D.. Perform classification by using supervised learning of the outcome categories for which partial information on claim contents is provided. Perform forecasting by using claim IDs and dates for all other outcome categories.
C - D is the better answer, except for the part that says "for all other outcome categories". It should say "for all outcome categories", including the ones we categorized with partial information. Because of this flaw, I go with C. Should be D: for the outcome categories with partial information (3 or 4 out of 200 categories), supervised learning can be used to classify claims into those categories based on the available claim contents. For the remaining outcome categories without partial information, forecasting techniques using claim IDs and dates can be employed to predict the expected number of claims in each category every month. While the argument for option C is valid in terms of using claim IDs and dates for forecasting, it does not address the scenario where partial information on claim contents is available for some outcome categories. By ignoring this information, option C may miss an opportunity to improve the accuracy of predictions for those categories through classification techniques. Furthermore, various machine learning resources and best practices recommend combining different techniques, such as classification and forecasting, when dealing with complex datasets that contain both structured and unstructured data. This hybrid approach can often lead to more accurate and robust solutions. A: No - not a classification problem. B: No - reinforcement learning does not apply to the situation; adding positive reinforcement/a negative penalty to train the system does not apply. C: Yes - leverages historical data (claim IDs and dates from the previous 3 years) to forecast future claim counts. D: No - not a classification problem. C: this is a forecasting problem. C directly addresses the need to forecast the number of claims in each outcome category on a monthly basis, leveraging historical data patterns without the need for classifying individual claim records based on their content. - https://www.examtopics.com/discussions/amazon/view/133248-exam-aws-certified-machine-learning-specialty-topic-1/
295
296 - A retail company stores 100 GB of daily transactional data in Amazon S3 at periodic intervals. The company wants to identify the schema of the transactional data. The company also wants to perform transformations on the transactional data that is in Amazon S3. The company wants to use a machine learning (ML) approach to detect fraud in the transformed data. Which combination of solutions will meet these requirements with the LEAST operational overhead? (Choose three.) - A.. Use Amazon Athena to scan the data and identify the schema. B.. Use AWS Glue crawlers to scan the data and identify the schema. C.. Use Amazon Redshift to store procedures to perform data transformations. D.. Use AWS Glue workflows and AWS Glue jobs to perform data transformations. E.. Use Amazon Redshift ML to train a model to detect fraud. F.. Use Amazon Fraud Detector to train a model to detect fraud.
BDF - A: No - Athena queries require much more operational overhead. B: Yes - Glue crawlers are made to discover schema. C: No - Redshift has high setup and operational overhead. D: Yes - Glue workflows and jobs are made for this. E: No - Redshift has high setup and operational overhead. F: Yes - Amazon Fraud Detector is designed for this. B, D, and F combined bring a serverless solution with the least operational overhead. I go with B, D, and F. No need to use Redshift; the serverless solution requires the least operational effort. - https://www.examtopics.com/discussions/amazon/view/134301-exam-aws-certified-machine-learning-specialty-topic-1/
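A minimal boto3 sketch for option B (the crawler name, role, catalog database, and S3 path are placeholders of mine):
import boto3

glue = boto3.client("glue")

# The crawler scans the transactional data in S3 and writes the inferred schema to the Data Catalog.
glue.create_crawler(
    Name="daily-transactions-crawler",                        # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="retail_transactions",                       # placeholder catalog database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/transactions/"}]},
    Schedule="cron(0 2 * * ? *)",                             # optional: run daily at 02:00 UTC
)
glue.start_crawler(Name="daily-transactions-crawler")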
296
297 - A data scientist uses Amazon SageMaker Data Wrangler to define and perform transformations and feature engineering on historical data. The data scientist saves the transformations to SageMaker Feature Store. The historical data is periodically uploaded to an Amazon S3 bucket. The data scientist needs to transform the new historic data and add it to the online feature store. The data scientist needs to prepare the new historic data for training and inference by using native integrations. Which solution will meet these requirements with the LEAST development effort? - A.. Use AWS Lambda to run a predefined SageMaker pipeline to perform the transformations on each new dataset that arrives in the S3 bucket. B.. Run an AWS Step Functions step and a predefined SageMaker pipeline to perform the transformations on each new dataset that arrives in the S3 bucket. C.. Use Apache Airflow to orchestrate a set of predefined transformations on each new dataset that arrives in the S3 bucket. D.. Configure Amazon EventBridge to run a predefined SageMaker pipeline to perform the transformations when a new data is detected in the S3 bucket.
D - EventBridge provides the least development effort because we can just configure it and trigger based on dataset drops in S3; Lambda would require some development effort. D requires minimal effort, as it involves configuring EventBridge to monitor the S3 bucket for new data uploads and automatically triggering the SageMaker pipeline to perform the transformations on the new data, leveraging the native EventBridge -> Pipelines integration. https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html Answer: D Explanation: The best solution is to configure Amazon EventBridge to run a predefined SageMaker pipeline to perform the transformations when new data is detected in the S3 bucket. This solution requires the least development effort because it leverages the native integration between EventBridge and SageMaker Pipelines, which allows you to trigger a pipeline execution based on an event rule. EventBridge can monitor the S3 bucket for new data uploads and invoke the pipeline that contains the same transformations and feature engineering steps that were defined in SageMaker Data Wrangler. The pipeline can then ingest the transformed data into the online feature store for training and inference. - https://www.examtopics.com/discussions/amazon/view/133265-exam-aws-certified-machine-learning-specialty-topic-1/
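A hedged boto3 sketch of option D. The bucket, rule, role, and pipeline names are placeholders I introduced, and it assumes EventBridge notifications are enabled on the S3 bucket.
import json
import boto3

events = boto3.client("events")

# Rule fires when a new object lands in the historical-data bucket.
events.put_rule(
    Name="new-historic-data",                                  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["example-historic-data-bucket"]}},
    }),
)

# Target the predefined SageMaker pipeline that reapplies the Data Wrangler transformations.
events.put_targets(
    Rule="new-historic-data",
    Targets=[{
        "Id": "feature-pipeline",
        "Arn": "arn:aws:sagemaker:eu-west-1:123456789012:pipeline/feature-transform-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeSageMakerRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)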
297
298 - An insurance company developed a new experimental machine learning (ML) model to replace an existing model that is in production. The company must validate the quality of predictions from the new experimental model in a production environment before the company uses the new experimental model to serve general user requests. Only one model can serve user requests at a time. The company must measure the performance of the new experimental model without affecting the current live traffic. Which solution will meet these requirements? - A.. A/B testing B.. Canary release C.. Shadow deployment D.. Blue/green deployment
C - https://docs.aws.amazon.com/sagemaker/latest/dg/shadow-tests.html Shadow testing is a capability that SageMaker provides for exactly this scenario.

Answer: C. Explanation: The best solution is shadow deployment, a technique that lets the company run the new experimental model in parallel with the existing model without exposing it to end users. In a shadow deployment, the same user requests are routed to both models, but only the responses from the existing model are returned to users. The responses from the new experimental model are logged and analyzed for quality and performance metrics such as accuracy, latency, and resource consumption. This way, the company can validate the new experimental model in a production environment without affecting the current live traffic or user experience.

Shadow deployment consists of releasing version B alongside version A, forking version A's incoming requests, and sending them to version B without impacting production traffic. This is particularly useful for testing production load on a new feature and measuring model performance on a new version without impacting current live traffic. source: https://aws.amazon.com/blogs/machine-learning/deploy-shadow-ml-models-in-amazon-sagemaker/ - https://www.examtopics.com/discussions/amazon/view/133096-exam-aws-certified-machine-learning-specialty-topic-1/
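As an illustration of how SageMaker expresses this, a hedged sketch of an endpoint configuration with a shadow variant; the model and config names are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="shadow-test-config",
    ProductionVariants=[{
        "VariantName": "production",
        "ModelName": "existing-model",        # its responses are returned to users
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    ShadowProductionVariants=[{
        "VariantName": "shadow",
        "ModelName": "experimental-model",    # receives a copy of the traffic; responses are only logged
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)
```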
298
299 - A company deployed a machine learning (ML) model on the company website to predict real estate prices. Several months after deployment, an ML engineer notices that the accuracy of the model has gradually decreased. The ML engineer needs to improve the accuracy of the model. The engineer also needs to receive notifications for any future performance issues. Which solution will meet these requirements? - A.. Perform incremental training to update the model. Activate Amazon SageMaker Model Monitor to detect model performance issues and to send notifications. B.. Use Amazon SageMaker Model Governance. Configure Model Governance to automatically adjust model hyperparameters. Create a performance threshold alarm in Amazon CloudWatch to send notifications. C.. Use Amazon SageMaker Debugger with appropriate thresholds. Configure Debugger to send Amazon CloudWatch alarms to alert the team. Retrain the model by using only data from the previous several months. D.. Use only data from the previous several months to perform incremental training to update the model. Use Amazon SageMaker Model Monitor to detect model performance issues and to send notifications.
A - One argument for D is that it avoids the model being influenced by outdated patterns that no longer apply; for example, if we enter a recession in a few months, we should not be using pricing data from last year.

Why not D? While option D also uses Amazon SageMaker Model Monitor, it suggests using only the most recent data for incremental training. This could lose valuable information from older data that might still be relevant. Incremental training should ideally update the model with new data while retaining useful insights from the entire dataset, not just the recent months.

Incremental training updates the model with new data over time, and Amazon SageMaker Model Monitor is a suitable choice for monitoring model performance: it can detect drift and anomalies in real-time predictions and send notifications. Option A makes more sense in this case and covers all the requirements. - https://www.examtopics.com/discussions/amazon/view/133097-exam-aws-certified-machine-learning-specialty-topic-1/
299
300 - A university wants to develop a targeted recruitment strategy to increase new student enrollment. A data scientist gathers information about the academic performance history of students. The data scientist wants to use the data to build student profiles. The university will use the profiles to direct resources to recruit students who are likely to enroll in the university. Which combination of steps should the data scientist take to predict whether a particular student applicant is likely to enroll in the university? (Choose two.) - A.. Use Amazon SageMaker Ground Truth to sort the data into two groups named "enrolled" or "not enrolled." B.. Use a forecasting algorithm to run predictions. C.. Use a regression algorithm to run predictions. D.. Use a classification algorithm to run predictions. E.. Use the built-in Amazon SageMaker k-means algorithm to cluster the data into two groups named "enrolled" or "not enrolled."
AD - Correct: A and D. First label the student profiles, then use a classification algorithm to run predictions. K-means is unsupervised clustering, so it cannot produce the labeled "enrolled"/"not enrolled" groups; for labeling, use Ground Truth. It is a classification problem, so A and D are right.

This question is about a yes/no (binary) response, so a classification algorithm works best, as opposed to K-means, which only clusters the data.

D. Use a classification algorithm to run predictions: this approach is suitable for binary outcomes, such as predicting whether a student will enroll ("enrolled") or not ("not enrolled"). A. Use Amazon SageMaker Ground Truth to sort the data into two groups named "enrolled" or "not enrolled": this service can help label the dataset accurately, providing a strong foundation for training the classification model.

The data scientist should use Amazon SageMaker Ground Truth to sort the data into the two groups, which creates a labeled dataset that can be used for supervised learning. The data scientist should then use a classification algorithm to run predictions on the test data. A classification algorithm is a suitable choice for predicting a binary outcome, such as enrollment status, based on input features such as academic performance. A classification algorithm will output a probability for each class label and assign the most likely label to each observation. References: Use Amazon SageMaker Ground Truth to Label Data; Classification Algorithm in Machine Learning.

The question asks for a combination of options: it is a classification problem, and labels will be needed. - https://www.examtopics.com/discussions/amazon/view/133098-exam-aws-certified-machine-learning-specialty-topic-1/
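To make the two steps concrete, a minimal scikit-learn sketch assuming the Ground Truth job produced a labeled file with numeric profile features and an "enrolled" column (the file and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("student_profiles_labeled.csv")   # output of the labeling step
X = df.drop(columns=["enrolled"])                  # academic performance features
y = df["enrolled"]                                 # 1 = enrolled, 0 = not enrolled

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)            # any binary classifier works here
clf.fit(X_train, y_train)

# Probability that each applicant in the test set will enroll
enroll_probability = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, enroll_probability))
```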
300
301 - A machine learning (ML) specialist is using the Amazon SageMaker DeepAR forecasting algorithm to train a model on CPU-based Amazon EC2 On-Demand instances. The model currently takes multiple hours to train. The ML specialist wants to decrease the training time of the model. Which approaches will meet this requirement? (Choose two.) - A.. Replace On-Demand Instances with Spot Instances. B.. Configure model auto scaling dynamically to adjust the number of instances automatically. C.. Replace CPU-based EC2 instances with GPU-based EC2 instances. D.. Use multiple training instances. E.. Use a pre-trained version of the model. Run incremental training.
CD - https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

One comment argued that, given the cost implications of training a DeepAR forecasting model, options B and D would be more cost-effective, with C valid only if cost is not a concern and the algorithm benefits significantly from GPU acceleration. The consensus, however, is C and D.

The best approaches to decrease the training time are C and D, because they improve the computational efficiency and parallelization of the training process. C: Replacing CPU-based EC2 instances with GPU-based EC2 instances can speed up DeepAR training by leveraging the parallel processing power of GPUs to perform matrix operations and gradient computations faster than CPUs; the DeepAR algorithm supports GPU-based EC2 instances such as ml.p2 and ml.p3. D: Using multiple training instances can also reduce training time by distributing the workload across multiple nodes with data parallelism; the DeepAR algorithm supports distributed training with multiple CPU-based or GPU-based EC2 instances. - https://www.examtopics.com/discussions/amazon/view/133266-exam-aws-certified-machine-learning-specialty-topic-1/
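A hedged SageMaker Python SDK sketch combining both approaches (GPU instances and multiple instances); the role ARN, S3 paths, and hyperparameter values are placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
deepar_image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=deepar_image,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=2,               # option D: distribute training across instances
    instance_type="ml.p3.2xlarge",  # option C: GPU-based instances
    output_path="s3://example-bucket/deepar/output/",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    time_freq="H", context_length=72, prediction_length=24, epochs=100
)
estimator.fit({"train": "s3://example-bucket/deepar/train/"})
```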
301
302 - A chemical company has developed several machine learning (ML) solutions to identify chemical process abnormalities. The time series values of independent variables and the labels are available for the past 2 years and are sufficient to accurately model the problem. The regular operation label is marked as 0 The abnormal operation label is marked as 1. Process abnormalities have a significant negative effect on the company’s profits. The company must avoid these abnormalities. Which metrics will indicate an ML solution that will provide the GREATEST probability of detecting an abnormality? - A.. Precision = 0.91 - Recall = 0.6 B.. Precision = 0.61 - Recall = 0.98 C.. Precision = 0.7 - Recall = 0.9 D.. Precision = 0.98 - Recall = 0.8
B - It is B To maximize the probability of detecting an abnormality, the focus should be on high recall (the ability of the model to find all actual positives), especially in scenarios where missing an abnormality could have significant negative effects. Between the given options: B. Precision = 0.61 - Recall = 0.98 This option has the highest recall, meaning it is best at identifying actual abnormalities (label 1), which is crucial for minimizing the risk of undetected process abnormalities. Although precision is lower (indicating more false positives), in this context, ensuring abnormalities are detected (even at the cost of investigating more false alarms) is more critical. if abnormality is not detected then higher cost. Therefore FN (false negatives) should be minimized. Which means recall should have the highest value - https://www.examtopics.com/discussions/amazon/view/133089-exam-aws-certified-machine-learning-specialty-topic-1/
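A tiny worked example of why recall is the number that matters here (toy labels, 1 = abnormal operation):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]  # misses one abnormality, raises two false alarms

# Recall = TP / (TP + FN): the share of real abnormalities the model catches.
print("recall:   ", recall_score(y_true, y_pred))     # 0.8
print("precision:", precision_score(y_true, y_pred))  # ~0.67
```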
302
303 - An online delivery company wants to choose the fastest courier for each delivery at the moment an order is placed. The company wants to implement this feature for existing users and new users of its application. Data scientists have trained separate models with XGBoost for this purpose, and the models are stored in Amazon S3. There is one model for each city where the company operates. Operation engineers are hosting these models in Amazon EC2 for responding to the web client requests, with one instance for each model, but the instances have only a 5% utilization in CPU and memory. The operation engineers want to avoid managing unnecessary resources. Which solution will enable the company to achieve its goal with the LEAST operational overhead? - A.. Create an Amazon SageMaker notebook instance for pulling all the models from Amazon S3 using the boto3 library. Remove the existing instances and use the notebook to perform a SageMaker batch transform for performing inferences offline for all the possible users in all the cities. Store the results in different files in Amazon S3. Point the web client to the files. B.. Prepare an Amazon SageMaker Docker container based on the open-source multi-model server. Remove the existing instances and create a multi-model endpoint in SageMaker instead, pointing to the S3 bucket containing all the models. Invoke the endpoint from the web client at runtime, specifying the TargetModel parameter according to the city of each request. C.. Keep only a single EC2 instance for hosting all the models. Install a model server in the instance and load each model by pulling it from Amazon S3. Integrate the instance with the web client using Amazon API Gateway for responding to the requests in real time, specifying the target resource according to the city of each request. D.. Prepare a Docker container based on the prebuilt images in Amazon SageMaker. Replace the existing instances with separate SageMaker endpoints, one for each city where the company operates. Invoke the endpoints from the web client, specifying the URL and EndpointName parameter according to the city of each request.
B - Multi-model endpoints provide a scalable and cost-effective solution to deploying large numbers of models. https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html By preparing a SageMaker Docker container based on the open-source multi-model server, the company can host all models in a single endpoint and dynamically select the appropriate model based on the city of each request. This approach optimizes resource utilization and avoids managing unnecessary resources, as opposed to having separate instances for each city A multi-model endpoint in Amazon SageMaker is an endpoint that can host multiple machine learning models simultaneously. This allows you to deploy and manage multiple models on a single endpoint, reducing operational costs and simplifying deployment and management tasks. Each model is associated with a specific container image and can be invoked using a unique model name or endpoint name. This feature is useful when you have multiple models that need to be deployed together or when you want to reduce the number of endpoints that need to be managed. - https://www.examtopics.com/discussions/amazon/view/134338-exam-aws-certified-machine-learning-specialty-topic-1/
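A hedged sketch of how the web client (or a thin backend) would call such a multi-model endpoint, routing by city through the TargetModel parameter; the endpoint name, archive naming scheme, and payload format are assumptions:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

def predict_fastest_courier(payload_csv: str, city: str) -> str:
    # Each city's XGBoost model is stored as <city>.tar.gz under the endpoint's
    # S3 model prefix; the multi-model endpoint loads it on demand.
    response = runtime.invoke_endpoint(
        EndpointName="courier-multi-model-endpoint",
        TargetModel=f"{city}.tar.gz",
        ContentType="text/csv",
        Body=payload_csv,
    )
    return response["Body"].read().decode("utf-8")

print(predict_fastest_courier("3.2,1,0,7.5", city="madrid"))
```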
303
304 - A company builds computer-vision models that use deep learning for the autonomous vehicle industry. A machine learning (ML) specialist uses an Amazon EC2 instance that has a CPU:GPU ratio of 12:1 to train the models. The ML specialist examines the instance metric logs and notices that the GPU is idle half of the time. The ML specialist must reduce training costs without increasing the duration of the training jobs. Which solution will meet these requirements? - A.. Switch to an instance type that has only CPUs. B.. Use a heterogeneous cluster that has two different instances groups. C.. Use memory-optimized EC2 Spot Instances for the training jobs. D.. Switch to an instance type that has a CPU:GPU ratio of 6:1.
C - https://aws.amazon.com/it/blogs/machine-learning/improve-price-performance-of-your-model-training-using-amazon-sagemaker-heterogeneous-clusters/

The discussion is split. One view: the GPU is idle because the CPU cannot feed it data in time, which means more CPU capacity is needed relative to the GPU; the current ratio is 12:1, so D is wrong because 6:1 provides less CPU per GPU, and a heterogeneous cluster (B) addresses the bottleneck instead.

The opposing view: A: No - removing the GPU could significantly increase training time. B: No - doesn't solve the issue of GPU underutilization. C: No - doesn't solve the issue of GPU underutilization and may take longer. D: Yes - reducing the number of CPUs should address the cost of GPU underutilization without causing training delays. Per this view, switching to an instance type with a CPU:GPU ratio of 6:1 reduces training costs by using fewer CPUs while maintaining the same level of performance, giving more consistent GPU utilization and less overhead for inter-process communication and synchronization between CPU and GPU processes. References: Optimizing GPU utilization for AI/ML workloads on Amazon EC2; Analyze CPU vs. GPU Performance for AWS Machine Learning. - https://www.examtopics.com/discussions/amazon/view/133268-exam-aws-certified-machine-learning-specialty-topic-1/
304
305 - A company wants to forecast the daily price of newly launched products based on 3 years of data for older product prices, sales, and rebates. The time-series data has irregular timestamps and is missing some values. Data scientist must build a dataset to replace the missing values. The data scientist needs a solution that resamples the data daily and exports the data for further modeling. Which solution will meet these requirements with the LEAST implementation effort? - A.. Use Amazon EMR Serverless with PySpark. B.. Use AWS Glue DataBrew. C.. Use Amazon SageMaker Studio Data Wrangler. D.. Use Amazon SageMaker Studio Notebook with Pandas.
C - Some argued for B: the daily resampling can be scheduled and automated in DataBrew more easily than in Data Wrangler, and DataBrew can be slightly easier for users who are not familiar with the SageMaker ecosystem.

The consensus is C. With B, you would still have to feed the prepared data into SageMaker, which brings more operational effort than Data Wrangler. Data Wrangler is built for ML work and integrates tightly with SageMaker, which suits this scenario because the resampled data is used for further modeling; AWS Glue DataBrew is a more general-purpose data preparation service.

Answer: C. Explanation: Amazon SageMaker Data Wrangler is a visual data preparation tool that enables users to clean and normalize data without writing any code. Using Data Wrangler, the data scientist can easily import the time-series data from sources such as Amazon S3, Amazon Athena, or Amazon Redshift. Data Wrangler can automatically generate data insights and quality reports, which help identify and fix missing values, outliers, and anomalies in the data. It also provides over 250 built-in transformations, such as resampling, interpolation, aggregation, and filtering, applied through a point-and-click interface. Data Wrangler can export the prepared data to destinations such as Amazon S3, SageMaker Feature Store, or SageMaker Pipelines for further modeling and analysis. - https://www.examtopics.com/discussions/amazon/view/133269-exam-aws-certified-machine-learning-specialty-topic-1/
305
306 - A data scientist is building a forecasting model for a retail company by using the most recent 5 years of sales records that are stored in a data warehouse. The dataset contains sales records for each of the company’s stores across five commercial regions. The data scientist creates a working dataset with StoreID, Region, Date, and Sales Amount as columns. The data scientist wants to analyze yearly average sales for each region. The scientist also wants to compare how each region performed compared to average sales across all commercial regions. Which visualization will help the data scientist better understand the data trend? - A.. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each store. Create a bar plot, faceted by year, of average sales for each store. Add an extra bar in each facet to represent average sales. B.. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each store. Create a bar plot, colored by region and faceted by year, of average sales for each store. Add a horizontal line in each facet to represent average sales. C.. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot of average sales for each region. Add an extra bar in each facet to represent average sales. D.. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot, faceted by year, of average sales for each region. Add a horizontal line in each facet to represent average sales.
D - D is the best choice: the visualization provides insights into regional sales trends over time and allows comparisons between each region and the overall average.

D. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot, faceted by year, of average sales for each region. Add a horizontal line in each facet to represent average sales. This lets the data scientist compare yearly average sales across regions and see how each region's performance relates to the overall average, providing clear insights into trends and deviations.

Explanation: The best visualization is a bar plot, faceted by year, of average sales for each region, with a horizontal line in each facet for the overall average. This way, the data scientist can easily compare the yearly average sales for each region with the overall average sales and see the trends over time. The bar plot also shows the relative performance of each region within each year and across years. The other options are less effective because they either do not show the yearly trends, do not show the overall average sales, or do not group the data by region. References: pandas.DataFrame.groupby - pandas documentation; pandas.DataFrame.plot.bar - pandas documentation; Matplotlib - Bar Plot - Online Tutorials Library. - https://www.examtopics.com/discussions/amazon/view/133255-exam-aws-certified-machine-learning-specialty-topic-1/
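A hedged pandas/seaborn sketch of option D, using the column names from the question (the CSV file name is hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv", parse_dates=["Date"])
df["Year"] = df["Date"].dt.year

# Yearly average sales per region
region_avg = df.groupby(["Year", "Region"], as_index=False)["Sales Amount"].mean()
# Yearly average across all commercial regions (the reference line)
overall_avg = df.groupby("Year")["Sales Amount"].mean()

g = sns.catplot(data=region_avg, x="Region", y="Sales Amount",
                col="Year", kind="bar", col_wrap=3)
for year, ax in g.axes_dict.items():
    ax.axhline(overall_avg.loc[year], linestyle="--", color="gray")
plt.show()
```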
306
307 - A company uses sensors on devices such as motor engines and factory machines to measure parameters such as temperature and pressure. The company wants to use the sensor data to predict equipment malfunctions and reduce service outages. A machine learning (ML) specialist needs to gather the sensor data to train a model to predict device malfunctions. The ML specialist must ensure that the data does not contain outliers before training the model. How can the ML specialist meet these requirements with the LEAST operational overhead? - A.. Load the data into an Amazon SageMaker Studio notebook. Calculate the first and third quartile. Use a SageMaker Data Wrangler data flow to remove only values that are outside of those quartiles. B.. Use an Amazon SageMaker Data Wrangler bias report to find outliers in the dataset. Use a Data Wrangler data flow to remove outliers based on the bias report. C.. Use an Amazon SageMaker Data Wrangler anomaly detection visualization to find outliers in the dataset. Add a transformation to a Data Wrangler data flow to remove outliers. D.. Use Amazon Lookout for Equipment to find and remove outliers from the dataset.
C - One comment (arguing for D) noted that Amazon Lookout for Equipment is specifically designed for predictive maintenance and can automatically detect anomalies in sensor data. The consensus, however, is C: Data Wrangler handles this directly as part of preparing the training dataset.

https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-analyses.html#data-wrangler-time-series-anomaly-detection Data Wrangler can do it all. The anomaly detection visualization in SageMaker Data Wrangler is designed to identify outliers in the dataset based on sensor parameters such as temperature and pressure. By visually inspecting the anomalies, the ML specialist can easily identify and remove outliers using transformations within Data Wrangler data flows, minimizing operational overhead.

Amazon SageMaker Data Wrangler helps data scientists and ML developers prepare data for ML. Its anomaly detection visualization uses an unsupervised ML algorithm to identify outliers in the dataset based on statistical properties. The ML specialist can use this feature to quickly explore the sensor data and find anomalous values that may affect model performance, then add a transformation to a Data Wrangler data flow to remove the outliers. The data flow can be exported as a script or pipeline to automate the data preparation process. This option requires the least operational overhead compared to the other options. References: Amazon SageMaker Data Wrangler; Anomaly Detection Visualization - Amazon SageMaker; Transform Data - Amazon SageMaker. - https://www.examtopics.com/discussions/amazon/view/133256-exam-aws-certified-machine-learning-specialty-topic-1/
307
308 - A data scientist obtains a tabular dataset that contains 150 correlated features with different ranges to build a regression model. The data scientist needs to achieve more efficient model training by implementing a solution that minimizes impact on the model’s performance. The data scientist decides to perform a principal component analysis (PCA) preprocessing step to reduce the number of features to a smaller set of independent features before the data scientist uses the new features in the regression model. Which preprocessing step will meet these requirements? - A.. Use the Amazon SageMaker built-in algorithm for PCA on the dataset to transform the data. B.. Load the data into Amazon SageMaker Data Wrangler. Scale the data with a Min Max Scaler transformation step. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data. C.. Reduce the dimensionality of the dataset by removing the features that have the highest correlation. Load the data into Amazon SageMaker Data Wrangler. Perform a Standard Scaler transformation step to scale the data. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data. D.. Reduce the dimensionality of the dataset by removing the features that have the lowest correlation. Load the data into Amazon SageMaker Data Wrangler. Perform a Min Max Scaler transformation step to scale the data. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data.
B - Scaling is called for before PCA. https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-data-wrangler-for-dimensionality-reduction/ With support for PCA in Data Wrangler, you can reduce the dimensionality of a high-dimensional dataset in only a few clicks; you can access PCA by selecting Dimensionality Reduction from the "Add step" workflow. https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-sagemaker-data-wrangler-reduce-dimensionality-pca/

PCA requires scaling, so use the Min Max Scaler. A: No - PCA requires feature scaling to remove the dominance of high-value variables. B: Yes - scaling addresses the issue of features with different ranges, and PCA performs the feature reduction. C: No - manual removal may drop important features. D: No - manual removal may drop important features.

One dissenting comment noted that a standard scaler is often preferred before PCA; still, C applies the standard-scaler transformation only after manually removing the most correlated features, and D removes the variables with the lowest correlation, both of which risk deleting important features. - https://www.examtopics.com/discussions/amazon/view/133088-exam-aws-certified-machine-learning-specialty-topic-1/
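A hedged scikit-learn sketch of the same preprocessing order (scale first, then PCA), mirroring what the Data Wrangler steps in option B do; the file name and the 95% variance target are assumptions:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X = pd.read_csv("features.csv")   # 150 correlated features with different ranges

# Scale first so wide-range features do not dominate the components,
# then keep enough components to explain ~95% of the variance.
pipeline = make_pipeline(MinMaxScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)

print(X_reduced.shape)  # a smaller set of independent features for the regression model
```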
308
309 - An online retailer collects the following data on customer orders: demographics, behaviors, location, shipment progress, and delivery time. A data scientist joins all the collected datasets. The result is a single dataset that includes 980 variables. The data scientist must develop a machine learning (ML) model to identify groups of customers who are likely to respond to a marketing campaign. Which combination of algorithms should the data scientist use to meet this requirement? (Choose two.) - A.. Latent Dirichlet Allocation (LDA) B.. K-means C.. Semantic segmentation D.. Principal component analysis (PCA) E.. Factorization machines (FM)
BD - K-means and PCA: a classic clustering problem. A: No - LDA is for topic modelling. B: Yes - K-means is a clustering algorithm. C: No - semantic segmentation applies to images. D: Yes - PCA makes sure only relevant features are kept. E: No - FM is a supervised regression/classification/recommendation algorithm for sparse data.

B. K-means: this algorithm is effective for clustering customers into distinct groups based on similarities across their features, which can reveal segments more likely to respond to marketing campaigns. D. Principal component analysis (PCA): given the high dimensionality of the dataset, PCA can reduce the number of variables to a manageable size while retaining most of the variance, making the dataset more tractable for clustering algorithms like K-means. K-means is the choice for clustering because nothing in the question mentions labels in the data, and PCA reduces the number of features. - https://www.examtopics.com/discussions/amazon/view/133087-exam-aws-certified-machine-learning-specialty-topic-1/
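A hedged sketch of the B + D combination with scikit-learn, assuming the joined dataset is numeric; the component count and cluster count are arbitrary choices for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("customer_orders_joined.csv")      # 980 variables after the join

reducer = make_pipeline(StandardScaler(), PCA(n_components=50))
X_reduced = reducer.fit_transform(X)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_reduced)            # customer groups to target
print(pd.Series(segments).value_counts())
```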
309
310 - A machine learning engineer is building a bird classification model. The engineer randomly separates a dataset into a training dataset and a validation dataset. During the training phase, the model achieves very high accuracy. However, the model did not generalize well during validation of the validation dataset. The engineer realizes that the original dataset was imbalanced. What should the engineer do to improve the validation accuracy of the model? - A.. Perform stratified sampling on the original dataset. B.. Acquire additional data about the majority classes in the original dataset. C.. Use a smaller, randomly sampled version of the training dataset. D.. Perform systematic sampling on the original dataset.
A - https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-sagemaker-data-wrangler-supports-random-sampling-stratified-sampling/ A. Balanced Class Representation. Stratified sampling divides the original dataset into strata (groups) based on the class labels. It then selects instances from each stratum in a proportional manner, ensuring that the class distribution in the training and validation datasets reflects the original class distribution. Improved Generalization. By having a balanced representation of all classes in the training and validation datasets, the model is exposed to a diverse range of instances during training. This helps the model learn the distinguishing features of each class more effectively, leading to better generalization performance on the validation dataset. Addressing Imbalanced Data. Stratified sampling directly addresses the issue of imbalanced data, which was identified as the root cause of the model's poor generalization performance on the validation dataset. Stratified sampling A: Yes - Stratified sampling ensures that each class is proportionally represented and mitigates the impact of class imbalance on model performance B: No - additional data about the majority classes does not solve class imbalance issue C: No - Does not solve class imbalance issue and may worsen the situation D: No - selecting data points at regular intervals does not solve class imbalance issue - https://www.examtopics.com/discussions/amazon/view/135608-exam-aws-certified-machine-learning-specialty-topic-1/
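A small runnable illustration of what the stratified split guarantees (toy, imbalanced labels):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 "sparrow" images vs. 10 "kingfisher" images
y = ["sparrow"] * 90 + ["kingfisher"] * 10
X = [[i] for i in range(len(y))]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(Counter(y_train))  # Counter({'sparrow': 72, 'kingfisher': 8})
print(Counter(y_val))    # Counter({'sparrow': 18, 'kingfisher': 2})
```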
310
311 - A data engineer wants to perform exploratory data analysis (EDA) on a petabyte of data. The data engineer does not want to manage compute resources and wants to pay only for queries that are run. The data engineer must write the analysis by using Python from a Jupyter notebook. Which solution will meet these requirements? - A.. Use Apache Spark from within Amazon Athena. B.. Use Apache Spark from within Amazon SageMaker. C.. Use Apache Spark from within an Amazon EMR cluster. D.. Use Apache Spark through an integration with Amazon Redshift.
A - https://aws.amazon.com/it/blogs/aws/new-amazon-athena-for-apache-spark/ Amazon Athena makes it easy to interactively run data analytics and exploration using Apache Spark without the need to plan for, configure, or manage resources. Reference: https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html

Some comments argued for B, claiming that Athena does not natively support running Python and that SageMaker provides a serverless Spark notebook where you pay only for the resources you consume (as long as you remember to shut the notebook down). However, Athena for Apache Spark does support PySpark notebooks, and a SageMaker notebook keeps running and costing money, whereas the requirement is to pay only for the queries that are run.

A and not B also because of paying only for the queries that run; notebooks continue to run and cost money. https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-working-with-notebooks.html https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-editor.html Correct Answer: A. Using Apache Spark on Amazon Athena: https://aws-sdk-pandas.readthedocs.io/en/3.2.1/tutorials/041%20-%20Apache%20Spark%20on%20Amazon%20Athena.html Serverless, Python, and notebook are the key elements for making the decision. One commenter initially voted B, then changed their mind because Athena supports Spark: https://docs.amazonaws.cn/en_us/athena/latest/ug/notebooks-spark-getting-started.html

Thinking out loud: why not Redshift? The question mentions paying per query and handling a petabyte of data, Spark integration is possible with Amazon Redshift, and Redshift has a serverless version too: https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/ Still, Athena for Apache Spark is the fit with the least to manage. - https://www.examtopics.com/discussions/amazon/view/135610-exam-aws-certified-machine-learning-specialty-topic-1/
311
312 - A data scientist receives a new dataset in .csv format and stores the dataset in Amazon S3. The data scientist will use the dataset to train a machine learning (ML) model. The data scientist first needs to identify any potential data quality issues in the dataset. The data scientist must identify values that are missing or values that are not valid. The data scientist must also identify the number of outliers in the dataset. Which solution will meet these requirements with the LEAST operational effort? - A.. Create an AWS Glue job to transform the data from .csv format to Apache Parquet format. Use an AWS Glue crawler and Amazon Athena with appropriate SQL queries to retrieve the required information. B.. Leave the dataset in .csv format. Use an AWS Glue crawler and Amazon Athena with appropriate SQL queries to retrieve the required information. C.. Create an AWS Glue job to transform the data from .csv format to Apache Parquet format. Import the data into Amazon SageMaker Data Wrangler. Use the Data Quality and Insights Report to retrieve the required information. D.. Leave the dataset in .csv format. Import the data into Amazon SageMaker Data Wrangler. Use the Data Quality and Insights Report to retrieve the required information.
D - Data Wrangler is designed for ML ETL with built-in functions; it is the best tool for the job and covers everything needed to prepare data for ML. A: No - more manual/operational overhead. B: No - more manual/operational overhead. C: No - transforming the data to Parquet adds unnecessary operational overhead. D: Yes - least operational effort; Data Wrangler's Data Quality and Insights Report has built-in identification of missing/invalid values and outliers. - https://www.examtopics.com/discussions/amazon/view/135611-exam-aws-certified-machine-learning-specialty-topic-1/
312
313 - An ecommerce company has developed a XGBoost model in Amazon SageMaker to predict whether a customer will return a purchased item. The dataset is imbalanced. Only 5% of customers return items. A data scientist must find the hyperparameters to capture as many instances of returned items as possible. The company has a small budget for compute. How should the data scientist meet these requirements MOST cost-effectively? - A.. Tune all possible hyperparameters by using automatic model tuning (AMT). Optimize on {"HyperParameterTuningJobObjective": {"MetricName": "validation:accuracy", "Type": "Maximize"}}. B.. Tune the csv_weight hyperparameter and the scale_pos_weight hyperparameter by using automatic model tuning (AMT). Optimize on {"HyperParameterTuningJobObjective": {"MetricName": "validation'll", "Type": "Maximize"}}. C.. Tune all possible hyperparameters by using automatic model tuning (AMT). Optimize on {"HyperParameterTuningJobObjective": {"MetricName": "validation:f1", "Type": "Maximize"}}. D.. Tune the csv_weight hyperparameter and the scale_pos_weight hyperparameter by using automatic model tuning (AMT). Optimize on {"HyperParameterTuningJobObjective": {"MetricName": "validation:f1", "Type": "Minimize"}}.
B - Correct: B. The options as shown contain a typographical error: "validation'll" is not a valid metric name. Commenters reconstruct option B as optimizing on {"HyperParameterTuningJobObjective": {"MetricName": "validation:recall", "Type": "Maximize"}}, or possibly "validation:f1" with "Maximize"; some characters were evidently lost.

Given the imbalanced dataset, where only 5% of customers return items, the focus should be on maximizing the model's ability to identify returned items, which corresponds to maximizing recall or the F1 score. Tuning only the csv_weight and scale_pos_weight hyperparameters is also cheaper than tuning all possible hyperparameters, which suits the small compute budget. A: No - tuning all hyperparameters requires more compute and is not cost-effective. C: No - same compute concern as A. D: No - it tunes the right hyperparameters but specifies minimizing the F1 score; the F1 metric combines precision and recall and should be maximized. The goal is to capture as many instances of the minority class (returned items) as possible, even at the expense of some false positives. - https://www.examtopics.com/discussions/amazon/view/135612-exam-aws-certified-machine-learning-specialty-topic-1/
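A short sketch of the two ideas behind option B: weighting the rare positive class and declaring a maximize objective for AMT. The counts are illustrative, and the exact metric name ("validation:recall" vs. "validation:f1") depends on what the XGBoost container emits:

```python
# Weight the positive (returned-item) class, which is only 5% of the data.
n_negative, n_positive = 95_000, 5_000          # illustrative counts
scale_pos_weight = n_negative / n_positive      # = 19.0

# Objective block as it would appear in the AMT tuning-job request: maximize, never minimize.
tuning_objective = {
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:f1",   # or "validation:recall", per the container's metrics
        "Type": "Maximize",
    }
}
print(scale_pos_weight, tuning_objective)
```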
313
314 - A data scientist is trying to improve the accuracy of a neural network classification model. The data scientist wants to run a large hyperparameter tuning job in Amazon SageMaker. However, previous smaller tuning jobs on the same model often ran for several weeks. The ML specialist wants to reduce the computation time required to run the tuning job. Which actions will MOST reduce the computation time for the hyperparameter tuning job? (Choose two.) - A.. Use the Hyperband tuning strategy. B.. Increase the number of hyperparameters. C.. Set a lower value for the MaxNumberOfTrainingJobs parameter. D.. Use the grid search tuning strategy. E.. Set a lower value for the MaxParallelTrainingJobs parameter.
AC - Reducing parallelism (option E) increases sequential execution time, making the job take longer overall. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-considerations.html#automatic-model-tuning-num-hyperparameters

Option A (Hyperband): efficiently utilizes computational resources and reduces computation time by early-stopping unpromising training jobs, allowing a broader search of the hyperparameter space in a shorter time. Option C (lower MaxNumberOfTrainingJobs): reduces the total number of training jobs, which directly decreases computation time, at the risk of limiting exploration of the hyperparameter space, so it should be used with care.

A: Yes - Hyperband early-stops badly performing training jobs and reallocates resources to better-performing ones. B: No - increasing the number of hyperparameters will increase the time taken. C: Yes - limiting the number of training jobs means the tuner does not run through the entire list of jobs, reducing time. D: No - grid search is computationally expensive and will take longer. E: No - lowering MaxParallelTrainingJobs reduces concurrent resource usage but increases the total time taken; E would be fine only if the objective were not to reduce time.

Some commenters voted A and E, reasoning that lowering MaxParallelTrainingJobs limits the compute used simultaneously, and one voted A and D on the (incorrect) premise that grid search is more efficient than exhaustive search; on reflection, A and C make the most sense. - https://www.examtopics.com/discussions/amazon/view/135662-exam-aws-certified-machine-learning-specialty-topic-1/
314
315 - A machine learning (ML) specialist needs to solve a binary classification problem for a marketing dataset. The ML specialist must maximize the Area Under the ROC Curve (AUC) of the algorithm by training an XGBoost algorithm. The ML specialist must find values for the eta, alpha, min_child_weight, and max_depth hyperparameters that will generate the most accurate model. Which approach will meet these requirements with the LEAST operational overhead? - A.. Use a bootstrap script to install scikit-learn on an Amazon EMR cluster. Deploy the EMR cluster. Apply k-fold cross-validation methods to the algorithm. B.. Deploy Amazon SageMaker prebuilt Docker images that have scikit-learn installed. Apply k-fold cross-validation methods to the algorithm. C.. Use Amazon SageMaker automatic model tuning (AMT). Specify a range of values for each hyperparameter. D.. Subscribe to an AUC algorithm that is on AWS Marketplace. Specify a range of values for each hyperparameter.
C - https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html automated model Tuning will be the best solution here Amazon SageMaker automatic model tuning (AMT) for sure Automated model tuning minimizes operational overhead because it automates the entire process of hyperparameter tuning, including setting up and managing the training jobs, tracking performance metrics, and selecting the best model configuration C. Use Amazon SageMaker automatic model tuning (AMT). Specify a range of values for each hyperparameter. - https://www.examtopics.com/discussions/amazon/view/135663-exam-aws-certified-machine-learning-specialty-topic-1/
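A hedged SageMaker Python SDK sketch of option C for the four XGBoost hyperparameters with AUC as the tuning objective; the container version, role ARN, S3 paths, ranges, and job counts are placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
xgb = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/xgb-output/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=200)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "alpha": ContinuousParameter(0.0, 10.0),
        "min_child_weight": ContinuousParameter(1.0, 10.0),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)
tuner.fit({
    "train": TrainingInput("s3://example-bucket/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://example-bucket/validation/", content_type="text/csv"),
})
```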
315
316 - A machine learning (ML) developer for an online retailer recently uploaded a sales dataset into Amazon SageMaker Studio. The ML developer wants to obtain importance scores for each feature of the dataset. The ML developer will use the importance scores to feature engineer the dataset. Which solution will meet this requirement with the LEAST development effort? - A.. Use SageMaker Data Wrangler to perform a Gini importance score analysis. B.. Use a SageMaker notebook instance to perform principal component analysis (PCA). C.. Use a SageMaker notebook instance to perform a singular value decomposition analysis. D.. Use the multicollinearity feature to perform a lasso feature selection to perform an importance scores analysis.
A - The Quick model of Data Wrangler calculates feature importance for each feature using the Gini importance method. https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-analyses.html Wrangler allows to calculate feature importance scores using Gini importance A. Use SageMaker Data Wrangler to perform a Gini importance score analysis. By using the Gini importance score analysis in Data Wrangler, the ML developer can obtain importance scores for each feature of the sales dataset with minimal development effort, as it is a built-in functionality with a visual interface. This approach requires no coding or additional setup, making it the least effort-intensive solution compared to the other options involving custom coding or separate analyses. - https://www.examtopics.com/discussions/amazon/view/135664-exam-aws-certified-machine-learning-specialty-topic-1/
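The Data Wrangler analysis itself is point-and-click, but for intuition about what a Gini importance score measures, here is a hedged scikit-learn sketch using a tree ensemble's impurity-based importances (the file and target column are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("sales_dataset.csv")
X = df.drop(columns=["purchased"])     # hypothetical target column
y = df["purchased"]

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# Gini (impurity-based) importance: how much each feature reduces impurity across the trees.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```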
316
317 - A company is setting up a mechanism for data scientists and engineers from different departments to access an Amazon SageMaker Studio domain. Each department has a unique SageMaker Studio domain. The company wants to build a central proxy application that data scientists and engineers can log in to by using their corporate credentials. The proxy application will authenticate users by using the company's existing Identity provider (IdP). The application will then route users to the appropriate SageMaker Studio domain. The company plans to maintain a table in Amazon DynamoDB that contains SageMaker domains for each department. How should the company meet these requirements? - A.. Use the SageMaker CreatePresignedDomainUrl API to generate a presigned URL for each domain according to the DynamoDB table. Pass the presigned URL to the proxy application. B.. Use the SageMaker CreateHumanTaskUi API to generate a UI URL. Pass the URL to the proxy application. C.. Use the Amazon SageMaker ListHumanTaskUis API to list all UI URLs. Pass the appropriate URL to the DynamoDB table so that the proxy application can use the URL. D.. Use the SageMaker CreatePresignedNotebooklnstanceUrl API to generate a presigned URL. Pass the presigned URL to the proxy application.
A - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreatePresignedDomainUrl.html - Domain specific presigned URL's can be generated through dynamodb to route users to correct domain - central proxy app can authenticate existing users with company IDP -> retrieve presigned url for users dept from dynamodb -> redirect user to sagemake domain without needing direct auth on AWS A. Use the SageMaker CreatePresignedDomainUrl API to generate a presigned URL for each domain according to the DynamoDB table. Pass the presigned URL to the proxy application. The AWS documentation mentions the CreatePresignedDomainUrl API, which generates a presigned URL that authenticates a user to a specified Amazon SageMaker Domain. By using this API, the company can generate presigned URLs for each department's SageMaker Studio domain based on the information stored in the DynamoDB table. These presigned URLs can then be passed to the central proxy application, which can authenticate users using the company's existing Identity Provider (IdP) and provide them with the appropriate presigned URL for their department's SageMaker Studio domain. When the user accesses the presigned URL, they will be automatically authenticated and routed to the corresponding SageMaker Studio domain. - https://www.examtopics.com/discussions/amazon/view/135665-exam-aws-certified-machine-learning-specialty-topic-1/
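A hedged sketch of what the proxy application would do after the IdP authenticates a user: look up the department's domain in the DynamoDB table and request a presigned Studio URL. The table name, key schema, and user-profile naming are assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
sm = boto3.client("sagemaker")

def studio_url_for(department: str, username: str) -> str:
    # Hypothetical table mapping department -> SageMaker Studio domain ID
    table = dynamodb.Table("department-sagemaker-domains")
    domain_id = table.get_item(Key={"department": department})["Item"]["domain_id"]

    response = sm.create_presigned_domain_url(
        DomainId=domain_id,
        UserProfileName=username,                 # assumes a Studio profile exists per user
        SessionExpirationDurationInSeconds=1800,
    )
    return response["AuthorizedUrl"]              # the proxy redirects the user here
```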
317
318 - An insurance company is creating an application to automate car insurance claims. A machine learning (ML) specialist used an Amazon SageMaker Object Detection - TensorFlow built-in algorithm to train a model to detect scratches and dents in images of cars. After the model was trained, the ML specialist noticed that the model performed better on the training dataset than on the testing dataset. Which approach should the ML specialist use to improve the performance of the model on the testing data? - A.. Increase the value of the momentum hyperparameter. B.. Reduce the value of the dropout_rate hyperparameter. C.. Reduce the value of the learning_rate hyperparameter D.. Increase the value of the L2 hyperparameter.
D - D If your model is overfitting the training data, it makes sense to take actions that reduce model flexibility. To reduce model flexibility, try the following: Feature selection: consider using fewer feature combinations, decrease n-grams size, and decrease the number of numeric attribute bins. Increase the amount of regularization used. https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html Increase the value of L2 regularization A- No - momentum is for SGD with momentum, N/A in this case B: No - reducing dropout may or may not help C: No - reducing learning rate may increase overfitting even further D: Yes - L2 regularization penalizes large weights in the model. Increasing can help prevent overfitting by encouraging smaller weights. - https://www.examtopics.com/discussions/amazon/view/135988-exam-aws-certified-machine-learning-specialty-topic-1/
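The built-in Object Detection - TensorFlow algorithm exposes L2 as a hyperparameter; purely as a conceptual illustration (not the built-in algorithm's code), increasing the L2 coefficient in Keras looks like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A larger L2 coefficient penalizes large weights more strongly, reducing overfitting.
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-3),   # e.g. raised from 1e-4
                  input_shape=(224, 224, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax",
                 kernel_regularizer=regularizers.l2(1e-3)),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```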
318
319 - A developer at a retail company is creating a daily demand forecasting model. The company stores the historical hourly demand data in an Amazon S3 bucket. However, the historical data does not include demand data for some hours. The developer wants to verify that an autoregressive integrated moving average (ARIMA) approach will be a suitable model for the use case. How should the developer verify the suitability of an ARIMA approach? - A.. Use Amazon SageMaker Data Wrangler. Import the data from Amazon S3. Impute hourly missing data. Perform a Seasonal Trend decomposition. B.. Use Amazon SageMaker Autopilot. Create a new experiment that specifies the S3 data location. Choose ARIMA as the machine learning (ML) problem. Check the model performance. C.. Use Amazon SageMaker Data Wrangler. Import the data from Amazon S3. Resample data by using the aggregate daily total. Perform a Seasonal Trend decomposition. D.. Use Amazon SageMaker Autopilot. Create a new experiment that specifies the S3 data location. Impute missing hourly values. Choose ARIMA as the machine learning (ML) problem. Check the model performance.
C - https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html#data-wrangler-transform-handle-missing-time-series

One comment argued for A: ARIMA models require a complete time series without missing values, so imputing the missing hourly data makes the dataset suitable for ARIMA, and a seasonal trend decomposition then reveals the seasonality and trend patterns needed to judge ARIMA's suitability.

I think it's C: resampling is needed. https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html#data-wrangler-resample-time-series The goal is a daily demand forecast, so aggregating to a daily total is required anyway, and the aggregation also takes care of the missing hourly values. Seasonal trend decomposition on the daily aggregated data helps in understanding the underlying patterns, trends, and seasonality, which is essential for determining whether an ARIMA model would be appropriate.

A counterpoint: daily aggregation and understanding seasonal trends do matter, but the key issue is the missing hourly data, and aggregating to daily totals might lose information that could affect the accuracy of the ARIMA model. - https://www.examtopics.com/discussions/amazon/view/135990-exam-aws-certified-machine-learning-specialty-topic-1/
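A hedged pandas/statsmodels sketch of what the option C workflow amounts to: aggregate the irregular hourly records to daily totals, then inspect trend and seasonality. The file, column names, and weekly period are assumptions:

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv("hourly_demand.csv", parse_dates=["timestamp"])
daily = (
    df.set_index("timestamp")["demand"]
      .resample("D").sum()     # irregular hourly records -> daily totals
      .interpolate()           # fill any remaining gaps
)

# Decompose into trend / seasonal / residual to judge whether ARIMA is a reasonable fit.
decomposition = seasonal_decompose(daily, model="additive", period=7)
decomposition.plot()
```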
319
320 - A company decides to use Amazon SageMaker to develop machine learning (ML) models. The company will host SageMaker notebook instances in a VPC. The company stores training data in an Amazon S3 bucket. Company security policy states that SageMaker notebook instances must not have internet connectivity. Which solution will meet the company’s security requirements? - A.. Connect the SageMaker notebook instances that are in the VPC by using AWS Site-to-Site VPN to encrypt all internet-bound traffic. Configure VPC flow logs. Monitor all network traffic to detect and prevent any malicious activity. B.. Configure the VPC that contains the SageMaker notebook instances to use VPC interface endpoints to establish connections for training and hosting. Modify any existing security groups that are associated with the VPC interface endpoint to allow only outbound connections for training and hosting. C.. Create an IAM policy that prevents access the internet. Apply the IAM policy to an IAM role. Assign the IAM role to the SageMaker notebook instances in addition to any IAM roles that are already assigned to the instances. D.. Create VPC security groups to prevent all incoming and outgoing traffic. Assign the security groups to the SageMaker notebook instances.
B - https://docs.aws.amazon.com/sagemaker/latest/dg/inter-network-privacy.html VPC interface endpoints will do the trick. VPC interface endpoints allow notebook instances to communicate with SageMaker services without public internet traffic, and the security groups allow outbound connections for training and hosting while blocking all other traffic. - https://www.examtopics.com/discussions/amazon/view/135991-exam-aws-certified-machine-learning-specialty-topic-1/
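A hedged boto3 sketch of the endpoints involved: interface endpoints for the SageMaker API and runtime, plus a gateway endpoint so the notebook can reach the S3 training bucket privately. All IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for service in ("com.amazonaws.us-east-1.sagemaker.api",
                "com.amazonaws.us-east-1.sagemaker.runtime"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service,
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],  # allow only the required outbound traffic
        PrivateDnsEnabled=True,
    )

# Gateway endpoint for the S3 training data, so no internet route is needed.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```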
320
321 - A machine learning (ML) engineer uses Bayesian optimization for a hyperparameter tuning job in Amazon SageMaker. The ML engineer uses precision as the objective metric. The ML engineer wants to use recall as the objective metric. The ML engineer also wants to expand the hyperparameter range for a new hyperparameter tuning job. The new hyperparameter range will include the range of the previously performed tuning job. Which approach will run the new hyperparameter tuning job in the LEAST amount of time? - A.. Use a warm start hyperparameter tuning job. B.. Use a checkpointing hyperparameter tuning job. C.. Use the same random seed for the hyperparameter tuning job. D.. Use multiple jobs in parallel for the hyperparameter tuning job.
A - https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html A warm start lets the new job reuse the results of the previous tuning run. A: Yes - warm start reuses the results from a previously performed hyperparameter tuning job. B: No - checkpointing relates to saving intermediate model state during training. C: No - a fixed random seed won't directly reduce the tuning job time. D: No - more parallel jobs increases computational resources but won't necessarily reduce time. - https://www.examtopics.com/discussions/amazon/view/135994-exam-aws-certified-machine-learning-specialty-topic-1/
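A hedged sketch of how a warm-start tuning job might look with the SageMaker Python SDK; the container image, role, metric regex, ranges, and parent job name are all placeholders. Because the objective metric changes from precision to recall and the ranges expand, the TRANSFER_LEARNING warm-start type is the one that applies:

from sagemaker.estimator import Estimator
from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             WarmStartConfig, WarmStartTypes)

# Placeholder estimator standing in for the existing training setup.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# TRANSFER_LEARNING allows a different objective metric and wider ranges than the parent job.
warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.TRANSFER_LEARNING,
    parents={"previous-precision-tuning-job"},
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:recall",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(0.0001, 0.5)},
    metric_definitions=[{"Name": "validation:recall", "Regex": "recall=([0-9\\.]+)"}],
    max_jobs=20,
    max_parallel_jobs=4,
    warm_start_config=warm_start,
)
tuner.fit({"train": "s3://bucket/train", "validation": "s3://bucket/validation"})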
321
322 - A news company is developing an article search tool for its editors. The search tool should look for the articles that are most relevant and representative for particular words that are queried among a corpus of historical news documents. The editors test the first version of the tool and report that the tool seems to look for word matches in general. The editors have to spend additional time to filter the results to look for the articles where the queried words are most important. A group of data scientists must redesign the tool so that it isolates the most frequently used words in a document. The tool also must capture the relevance and importance of words for each document in the corpus. Which solution meets these requirements? - A.. Extract the topics from each article by using Latent Dirichlet Allocation (LDA) topic modeling. Create a topic table by assigning the sum of the topic counts as a score for each word in the articles. Configure the tool to retrieve the articles where this topic count score is higher for the queried words. B.. Build a term frequency for each word in the articles that is weighted with the article's length. Build an inverse document frequency for each word that is weighted with all articles in the corpus. Define a final highlight score as the product of both of these frequencies. Configure the tool to retrieve the articles where this highlight score is higher for the queried words. C.. Download a pretrained word-embedding lookup table. Create a titles-embedding table by averaging the title's word embedding for each article in the corpus. Define a highlight score for each word as inversely proportional to the distance between its embedding and the title embedding. Configure the tool to retrieve the articles where this highlight score is higher for the queried words. D.. Build a term frequency score table for each word in each article of the corpus. Assign a score of zero to all stop words. For any other words, assign a score as the word’s frequency in the article. Configure the tool to retrieve the articles where this frequency score is higher for the queried words.
B - This approach is TF-IDF, which effectively captures the relevance and importance of words in each document, meeting the stated requirements. TF captures the importance of a term within an individual document, while IDF captures its importance across the entire corpus. Multiplying TF by IDF gives higher weights to terms that are frequent within a document but rare across the corpus, highlighting terms that are both relevant and distinctive. - https://www.examtopics.com/discussions/amazon/view/136022-exam-aws-certified-machine-learning-specialty-topic-1/
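A tiny, self-contained illustration of the TF-IDF idea behind option B using scikit-learn; the corpus and query word are made up:

from sklearn.feature_extraction.text import TfidfVectorizer

# Term frequency weighted by inverse document frequency highlights words that are
# important within an article but rare across the corpus.
corpus = [
    "election results announced in the capital",
    "the capital hosts a major sports final",
    "sports final draws record television audience",
]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)          # rows = articles, columns = words

# Rank articles for a queried word by its TF-IDF "highlight" score.
query = "sports"
col = vectorizer.vocabulary_[query]
ranking = scores[:, col].toarray().ravel().argsort()[::-1]
print([corpus[i] for i in ranking])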
322
323 - A growing company has a business-critical key performance indicator (KPI) for the uptime of a machine learning (ML) recommendation system. The company is using Amazon SageMaker hosting services to develop a recommendation model in a single Availability Zone within an AWS Region. A machine learning (ML) specialist must develop a solution to achieve high availability. The solution must have a recovery time objective (RTO) of 5 minutes. Which solution will meet these requirements with the LEAST effort? - A.. Deploy multiple instances for each endpoint in a VPC that spans at least two Regions. B.. Use the SageMaker auto scaling feature for the hosted recommendation models. C.. Deploy multiple instances for each production endpoint in a VPC that spans at least two subnets that are in a second Availability Zone. D.. Frequently generate backups of the production recommendation model. Deploy the backups in a second Region.
C - Deploying multiple instances for each production endpoint across subnets in a second Availability Zone provides high availability: if one Availability Zone goes down, the other continues to serve requests, achieving the 5-minute RTO. It is also the least effort because it avoids managing resources across multiple Regions or maintaining frequent backups. A: No - a multi-Region setup is unnecessary and adds complexity. B: No - auto scaling handles capacity, not failover; it does not directly address uptime or recovery time objectives. C: Yes - multiple instances in separate subnets in a second Availability Zone provide quick failover. D: No - backups are for disaster recovery, not production failover. - https://www.examtopics.com/discussions/amazon/view/136023-exam-aws-certified-machine-learning-specialty-topic-1/
323
324 - A global company receives and processes hundreds of documents daily. The documents are in printed .pdf format or .jpg format. A machine learning (ML) specialist wants to build an automated document processing workflow to extract text from specific fields from the documents and to classify the documents. The ML specialist wants a solution that requires low maintenance. Which solution will meet these requirements with the LEAST operational effort? - A.. Use a PaddleOCR model in Amazon SageMaker to detect and extract the required text and fields. Use a SageMaker text classification model to classify the document. B.. Use a PaddleOCR model in Amazon SageMaker to detect and extract the required text and fields. Use Amazon Comprehend to classify the document. C.. Use Amazon Textract to detect and extract the required text and fields. Use Amazon Rekognition to classify the document. D.. Use Amazon Textract to detect and extract the required text and fields. Use Amazon Comprehend to classify the document.
D - Textract extracts the text; Comprehend classifies the documents. Amazon Textract automatically extracts text, handwriting, and data from scanned .pdf and .jpg documents and requires minimal setup and maintenance. Amazon Comprehend can classify the documents with minimal ongoing management. The most common use cases for Amazon Textract include: importing documents and forms into business applications; creating smart search indexes; building automated document processing workflows; maintaining compliance in document archives; extracting text for natural language processing (NLP); extracting text for document classification. https://aws.amazon.com/tw/textract/faqs/?nc1=h_ls - https://www.examtopics.com/discussions/amazon/view/136015-exam-aws-certified-machine-learning-specialty-topic-1/
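A hedged boto3 sketch of option D; the bucket, document key, and the Comprehend custom-classifier endpoint ARN are placeholders, and a custom classifier would have to be trained separately before it can be called:

import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")

# Extract text and form fields (single-page document assumed; multi-page PDFs
# would go through the asynchronous StartDocumentAnalysis API instead).
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "incoming-docs", "Name": "invoice-001.pdf"}},
    FeatureTypes=["FORMS"],
)
text = " ".join(b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE")

# Classify the extracted text with a separately trained Comprehend custom classifier.
classification = comprehend.classify_document(
    Text=text,
    EndpointArn="arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/docs",
)
print(classification["Classes"])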
324
325 - A company wants to detect credit card fraud. The company has observed that an average of 2% of credit card transactions are fraudulent. A data scientist trains a classifier on a year's worth of credit card transaction data. The classifier needs to identify the fraudulent transactions. The company wants to accurately capture as many fraudulent transactions as possible. Which metrics should the data scientist use to optimize the classifier? (Choose two.) - A.. Specificity B.. False positive rate C.. Accuracy D.. F1 score E.. True positive rate
DE - D (F1 score) and E (true positive rate, i.e., recall). The true positive rate measures the proportion of actual fraudulent transactions that are correctly identified; maximizing it ensures as many fraudulent transactions as possible are captured and false negatives are minimized. The F1 score is the harmonic mean of precision and recall, balancing the ability to capture fraud (high recall) against false alarms (high precision), which is especially useful on an imbalanced dataset like this one (2% fraud). Note that true positive rate, recall, and sensitivity are the same metric. For guidance on balancing the true positive rate against the false positive rate using a confusion matrix, see https://docs.aws.amazon.com/frauddetector/latest/ug/training-performance-metrics.html Specificity (A) focuses on correctly identifying non-fraudulent transactions and accuracy (C) measures overall correctness, so neither prioritizes catching fraud; one dissenting commenter argued for B and E (false positive rate and true positive rate) instead. - https://www.examtopics.com/discussions/amazon/view/136025-exam-aws-certified-machine-learning-specialty-topic-1/
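A small scikit-learn example of the two chosen metrics on made-up, imbalanced labels (1 = fraud), just to show how they are computed:

from sklearn.metrics import recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Recall (true positive rate) = captured fraud / all actual fraud.
print("recall (TPR):", recall_score(y_true, y_pred))   # 0.5 here
# F1 balances precision and recall in a single score.
print("F1 score:", f1_score(y_true, y_pred))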
325
326 - A data scientist is designing a repository that will contain many images of vehicles. The repository must scale automatically in size to store new images every day. The repository must support versioning of the images. The data scientist must implement a solution that maintains multiple immediately accessible copies of the data in different AWS Regions. Which solution will meet these requirements? - A.. Amazon S3 with S3 Cross-Region Replication (CRR) B.. Amazon Elastic Block Store (Amazon EBS) with snapshots that are shared in a secondary Region C.. Amazon Elastic File System (Amazon EFS) Standard storage that is configured with Regional availability D.. AWS Storage Gateway Volume Gateway
A - Cross-Region Replication. Amazon S3 provides scalable, versioned object storage, and CRR automatically replicates objects from the source bucket to a destination bucket in a different AWS Region, keeping immediately accessible copies in multiple Regions. https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html#crr-scenario - https://www.examtopics.com/discussions/amazon/view/136011-exam-aws-certified-machine-learning-specialty-topic-1/
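A hedged boto3 sketch of enabling Cross-Region Replication; it assumes versioning is already enabled on both buckets (a CRR prerequisite) and that the replication IAM role exists. All names and ARNs are placeholders:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="vehicle-images-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-all-images",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},                       # replicate every object
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::vehicle-images-eu-west-1"},
        }],
    },
)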
326
327 - An ecommerce company wants to update a production real-time machine learning (ML) recommendation engine API that uses Amazon SageMaker. The company wants to release a new model but does not want to make changes to applications that rely on the API. The company also wants to evaluate the performance of the new model in production traffic before the company fully rolls out the new model to all users. Which solution will meet these requirements with the LEAST operational overhead? - A.. Create a new SageMaker endpoint for the new model. Configure an Application Load Balancer (ALB) to distribute traffic between the old model and the new model. B.. Modify the existing endpoint to use SageMaker production variants to distribute traffic between the old model and the new model. C.. Modify the existing endpoint to use SageMaker batch transform to distribute traffic between the old model and the new model. D.. Create a new SageMaker endpoint for the new model. Configure a Network Load Balancer (NLB) to distribute traffic between the old model and the new model.
B - Deploy both models behind the same endpoint and control the percentage of traffic with production variants. Production variants let you host multiple models behind a single endpoint and distribute traffic between them without making changes to the applications that rely on the API. https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-model-validation.html - https://www.examtopics.com/discussions/amazon/view/136010-exam-aws-certified-machine-learning-specialty-topic-1/
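A hedged boto3 sketch of option B; model, endpoint, and config names are placeholders and the 90/10 split is illustrative:

import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="recommender-ab-config",
    ProductionVariants=[
        {"VariantName": "old-model", "ModelName": "recommender-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.9},
        {"VariantName": "new-model", "ModelName": "recommender-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.1},
    ],
)
# Point the existing endpoint at the new config; callers keep using the same API.
sm.update_endpoint(EndpointName="recommender-endpoint",
                   EndpointConfigName="recommender-ab-config")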
327
328 - A machine learning (ML) specialist at a manufacturing company uses Amazon SageMaker DeepAR to forecast input materials and energy requirements for the company. Most of the data in the training dataset is missing values for the target variable. The company stores the training dataset as JSON files. The ML specialist must develop a solution by using Amazon SageMaker DeepAR to account for the missing values in the training dataset. Which approach will meet these requirements with the LEAST development effort? - A.. Impute the missing values by using the linear regression method. Use the entire dataset and the imputed values to train the DeepAR model. B.. Replace the missing values with not a number (NaN). Use the entire dataset and the encoded missing values to train the DeepAR model. C.. Impute the missing values by using a forward fill. Use the entire dataset and the imputed values to train the DeepAR model. D.. Impute the missing values by using the mean value. Use the entire dataset and the imputed values to train the DeepAR model.
B - It is best to let DeepAR handle the missing values during training, which allows it to model them directly rather than relying on an external imputation that forces a single fit. DeepAR is a supervised RNN-based forecasting algorithm that can handle missing values natively: instead of pre-processing the data to impute values externally, DeepAR can work directly with missing values encoded as NaN. From the documentation, the target is an array of floating-point values or integers that represent the time series; you can encode missing values as null literals, as "NaN" strings in JSON, or as nan floating-point values in Parquet. https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html - https://www.examtopics.com/discussions/amazon/view/136002-exam-aws-certified-machine-learning-specialty-topic-1/
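A tiny example of the JSON Lines training format the DeepAR documentation describes, with missing target values written as null; the series values are made up:

import json

series = {
    "start": "2024-01-01 00:00:00",
    "target": [12.0, None, 15.5, None, 14.2],   # None serializes to null = missing value
}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(series) + "\n")
# -> {"start": "2024-01-01 00:00:00", "target": [12.0, null, 15.5, null, 14.2]}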
328
329 - A law firm handles thousands of contracts every day. Every contract must be signed. Currently, a lawyer manually checks all contracts for signatures. The law firm is developing a machine learning (ML) solution to automate signature detection for each contract. The ML solution must also provide a confidence score for each contract page. Which Amazon Textract API action can the law firm use to generate a confidence score for each page of each contract? - A.. Use the AnalyzeDocument API action. Set the FeatureTypes parameter to SIGNATURES. Return the confidence scores for each page. B.. Use the Prediction API call on the documents. Return the signatures and confidence scores for each page. C.. Use the StartDocumentAnalysis API action to detect the signatures. Return the confidence scores for each page. D.. Use the GetDocumentAnalysis API action to detect the signatures. Return the confidence scores for each page.
A - https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html The AnalyzeDocument API action analyzes various features within a document, including signatures. Setting the FeatureTypes parameter to SIGNATURES instructs Textract to detect and extract signatures within the contracts, and each detected signature is returned with a confidence score (ranging from 0 to 100). By using this action with that parameter, the law firm can obtain confidence scores for each page of each contract and automate signature detection. - https://www.examtopics.com/discussions/amazon/view/135917-exam-aws-certified-machine-learning-specialty-topic-1/
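A hedged boto3 sketch of option A; the bucket and document key are placeholders, and a single-page image is assumed (multi-page contracts would use the asynchronous StartDocumentAnalysis/GetDocumentAnalysis pair):

import boto3

textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "contracts", "Name": "contract-page-1.png"}},
    FeatureTypes=["SIGNATURES"],
)
# Each detected signature block carries its own confidence score.
for block in response["Blocks"]:
    if block["BlockType"] == "SIGNATURE":
        print("signature found, confidence:", block["Confidence"])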
329
330 - A company that operates oil platforms uses drones to photograph locations on oil platforms that are difficult for humans to access to search for corrosion. Experienced engineers review the photos to determine the severity of corrosion. There can be several corroded areas in a single photo. The engineers determine whether the identified corrosion needs to be fixed immediately, scheduled for future maintenance, or requires no action. The corrosion appears in an average of 0.1% of all photos. A data science team needs to create a solution that automates the process of reviewing the photos and classifying the need for maintenance. Which combination of steps will meet these requirements? (Choose three.) - A.. Use an object detection algorithm to train a model to identify corrosion areas of a photo. B.. Use Amazon Rekognition with label detection on the photos. C.. Use a k-means clustering algorithm to train a model to classify the severity of corrosion in a photo. D.. Use an XGBoost algorithm to train a model to classify the severity of corrosion in a photo. E.. Perform image augmentation on photos that contain corrosion. F.. Perform image augmentation on photos that do not contain corrosion.
ADE - The marked answer is A, D, E, but the discussion is split. Rationale for ADE: an object detection algorithm (A) can be trained to locate the corroded areas in a photo; an XGBoost model (D) can then classify the severity of the detected corrosion from extracted features; and because corrosion appears in only 0.1% of photos, image augmentation on the photos that contain corrosion (E), such as rotation, scaling, and flipping, is needed to give the model enough positive examples. K-means (C) is unsupervised and not suited to classification, and augmenting the photos without corrosion (F) is unnecessary because they already dominate the dataset. An AWS blog on rust detection discusses the object detection/semantic segmentation approach as well as a color-based classification approach using XGBoost: https://aws.amazon.com/blogs/machine-learning/rust-detection-using-machine-learning-on-aws/ Dissenting commenters argued for A, B, E instead: XGBoost is designed for tabular data and cannot process images directly without a CNN feature extractor (compare Question 258, where ResNet-50 was chosen over XGBoost), and Amazon Rekognition Custom Labels (B) could handle the identification. Others countered that detecting corrosion is too specialized for Rekognition's general label detection and would effectively require retraining it. - https://www.examtopics.com/discussions/amazon/view/136026-exam-aws-certified-machine-learning-specialty-topic-1/
330
331 - A company maintains a 2 TB dataset that contains information about customer behaviors. The company stores the dataset in Amazon S3. The company stores a trained model container in Amazon Elastic Container Registry (Amazon ECR). A machine learning (ML) specialist needs to score a batch model for the dataset to predict customer behavior. The ML specialist must select a scalable approach to score the model. Which solution will meet these requirements MOST cost-effectively? - A.. Score the model by using AWS Batch managed Amazon EC2 Reserved Instances. Create an Amazon EC2 instance store volume and mount it to the Reserved Instances. B.. Score the model by using AWS Batch managed Amazon EC2 Spot Instances. Create an Amazon FSx for Lustre volume and mount it to the Spot Instances. C.. Score the model by using an Amazon SageMaker notebook on Amazon EC2 Reserved Instances. Create an Amazon EBS volume and mount it to the Reserved Instances. D.. Score the model by using Amazon SageMaker notebook on Amazon EC2 Spot Instances. Create an Amazon Elastic File System (Amazon EFS) file system and mount it to the Spot Instances.
B - The most cost-effective option. AWS Batch with Amazon EC2 Spot Instances provides scalable, low-cost compute for batch scoring, and Amazon FSx for Lustre provides high-performance access to the 2 TB dataset. Spot Instances can be interrupted, so they are best suited to flexible workloads such as this batch job. - https://www.examtopics.com/discussions/amazon/view/136031-exam-aws-certified-machine-learning-specialty-topic-1/
331
332 - A data scientist is implementing a deep learning neural network model for an object detection task on images. The data scientist wants to experiment with a large number of parallel hyperparameter tuning jobs to find hyperparameters that optimize compute time. The data scientist must ensure that jobs that underperform are stopped. The data scientist must allocate computational resources to well-performing hyperparameter configurations. The data scientist is using the hyperparameter tuning job to tune the stochastic gradient descent (SGD) learning rate, momentum, epoch, and mini-batch size. Which technique will meet these requirements with LEAST computational time? - A.. Grid search B.. Random search C.. Bayesian optimization D.. Hyperband
D - Efficient resource allocation: Hyperband is a bandit-based approach that runs many configurations in parallel with varying budgets, dynamically stops underperforming configurations early, and reallocates resources to the more promising ones. It is designed for a large number of parallel experiments, which makes it the choice with the least computational time; grid search, random search, and Bayesian optimization do not provide this early-stopping and resource-reallocation behavior. - https://www.examtopics.com/discussions/amazon/view/136027-exam-aws-certified-machine-learning-specialty-topic-1/
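A hedged sketch of selecting the Hyperband strategy with the SageMaker Python SDK; the container image, role, metric definition, and hyperparameter ranges are placeholders:

from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Placeholder estimator standing in for the object detection training job.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/detector:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:mAP",
    metric_definitions=[{"Name": "validation:mAP", "Regex": "mAP=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "momentum": ContinuousParameter(0.5, 0.99),
        "mini_batch_size": IntegerParameter(8, 64),
        "epochs": IntegerParameter(5, 50),
    },
    strategy="Hyperband",        # early-stops weak configurations, reallocates budget
    max_jobs=50,
    max_parallel_jobs=10,
)
tuner.fit({"train": "s3://bucket/train", "validation": "s3://bucket/val"})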
332
333 - An agriculture company wants to improve crop yield forecasting for the upcoming season by using crop yields from the last three seasons. The company wants to compare the performance of its new scikit-learn model to the benchmark. A data scientist needs to package the code into a container that computes both the new model forecast and the benchmark. The data scientist wants AWS to be responsible for the operational maintenance of the container. Which solution will meet these requirements? - A.. Package the code as the training script for an Amazon SageMaker scikit-learn container. B.. Package the code into a custom-built container. Push the container to Amazon Elastic Container Registry (Amazon ECR). C.. Package the code into a custom-built container. Push the container to AWS Fargate. D.. Package the code by extending an Amazon SageMaker scikit-learn container.
D - D is right: the company wants AWS to be responsible for the operational maintenance of the container, and extending the prebuilt Amazon SageMaker scikit-learn container lets the data scientist package the custom code (the new model plus the benchmark) while AWS maintains and patches the base image. https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-scikit-learn-spark.html https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html Some commenters argued that A (using the code as the training script for the SageMaker scikit-learn container) would be enough to compare the new and benchmark scikit-learn models and that D adds overhead. C is not viable because you cannot push a container directly to AWS Fargate; it must first be pushed to Amazon ECR. - https://www.examtopics.com/discussions/amazon/view/147149-exam-aws-certified-machine-learning-specialty-topic-1/
333
334 - A cybersecurity company is collecting on-premises server logs, mobile app logs, and IoT sensor data. The company backs up the ingested data in an Amazon S3 bucket and sends the ingested data to Amazon OpenSearch Service for further analysis. Currently, the company has a custom ingestion pipeline that is running on Amazon EC2 instances. The company needs to implement a new serverless ingestion pipeline that can automatically scale to handle sudden changes in the data flow. Which solution will meet these requirements MOST cost-effectively? - A.. Create two Amazon Data Firehose delivery streams to send data to the S3 bucket and OpenSearch Service. Configure the data sources to send data to the delivery streams. B.. Create one Amazon Kinesis data stream. Create two Amazon Data Firehose delivery streams to send data to the S3 bucket and OpenSearch Service. Connect the delivery streams to the data stream. Configure the data sources to send data to the data stream. C.. Create one Amazon Data Firehose delivery stream to send data to OpenSearch Service. Configure the delivery stream to back up the raw data to the S3 bucket. Configure the data sources to send data to the delivery stream. D.. Create one Amazon Kinesis data stream. Create one Amazon Data Firehose delivery stream to send data to OpenSearch Service. Configure the delivery stream to back up the data to the S3 bucket. Connect the delivery stream to the data stream. Configure the data sources to send data to the data stream.
C - A single Amazon Data Firehose delivery stream is the most cost-effective serverless option: it can deliver the data to OpenSearch Service while backing up the raw source records to the S3 bucket, and it scales automatically. Some commenters argued for D on the grounds that Firehose alone lacks the buffering of Kinesis Data Streams and that a data stream is needed to absorb sudden changes in data flow, but Firehose scales automatically to match throughput, and adding Kinesis Data Streams increases cost without being required here. - https://www.examtopics.com/discussions/amazon/view/147148-exam-aws-certified-machine-learning-specialty-topic-1/
334
335 - A bank has collected customer data for 10 years in CSV format. The bank stores the data in an on-premises server. A data science team wants to use Amazon SageMaker to build and train a machine learning (ML) model to predict churn probability. The team will use the historical data. The data scientists want to perform data transformations quickly and to generate data insights before the team builds a model for production. Which solution will meet these requirements with the LEAST development effort? - A.. Upload the data into the SageMaker Data Wrangler console directly. Perform data transformations and generate insights within Data Wrangler. B.. Upload the data into an Amazon S3 bucket. Allow SageMaker to access the data that is in the bucket. Import the data from the S3 bucket into SageMaker Data Wrangler. Perform data transformations and generate insights within Data Wrangler. C.. Upload the data into the SageMaker Data Wrangler console directly. Allow SageMaker and Amazon QuickSight to access the data that is in an Amazon S3 bucket. Perform data transformations in Data Wrangler and save the transformed data into a second S3 bucket. Use QuickSight to generate data insights. D.. Upload the data into an Amazon S3 bucket. Allow SageMaker to access the data that is in the bucket. Import the data from the bucket into SageMaker Data Wrangler. Perform data transformations in Data Wrangler. Save the data into a second S3 bucket. Use a SageMaker Studio notebook to generate data insights.
B - Options A and C involve uploading the data directly into the SageMaker Data Wrangler console, which is not scalable or efficient for a large dataset; 10 years of data is too much for a local upload from on premises, so intermediate storage in S3 is needed. Option D adds the extra step of generating insights in a SageMaker Studio notebook, which increases complexity and development effort. Option B keeps both the transformations and the insight generation inside Data Wrangler, making it the most straightforward way to transform the data and generate insights quickly before building the production model. - https://www.examtopics.com/discussions/amazon/view/146687-exam-aws-certified-machine-learning-specialty-topic-1/
335
336 - A media company wants to deploy a machine learning (ML) model that uses Amazon SageMaker to recommend new articles to the company’s readers. The company's readers are primarily located in a single city. The company notices that the heaviest reader traffic predictably occurs early in the morning, after lunch, and again after work hours. There is very little traffic at other times of day. The media company needs to minimize the time required to deliver recommendations to its readers. The expected amount of data that the API call will return for inference is less than 4 MB. Which solution will meet these requirements in the MOST cost-effective way? - A.. Real-time inference with auto scaling B.. Serverless inference with provisioned concurrency C.. Asynchronous inference D.. A batch transform task
B - Best of both worlds: elastic serverless capacity plus provisioned warm capacity, and cheaper than A. On-demand serverless inference is ideal for workloads with idle periods between traffic spurts, and provisioned concurrency is cost-effective when the bursts are predictable, as they are here (morning, lunch, and evening peaks). https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html Provisioned resources carry minimal cost, so the company keeps latency low during peaks while paying mainly for actual inference requests. - https://www.examtopics.com/discussions/amazon/view/147146-exam-aws-certified-machine-learning-specialty-topic-1/
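A hedged boto3 sketch of option B; the model name and the memory, concurrency, and provisioned-concurrency values are illustrative:

import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="article-recs-serverless",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "article-recommender",
        "ServerlessConfig": {
            "MemorySizeInMB": 4096,        # memory allocated to the serverless endpoint
            "MaxConcurrency": 20,
            "ProvisionedConcurrency": 5,   # kept warm for the predictable peaks
        },
    }],
)
sm.create_endpoint(EndpointName="article-recs",
                   EndpointConfigName="article-recs-serverless")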
336
337 - A machine learning (ML) engineer is using Amazon SageMaker automatic model tuning (AMT) to optimize a model's hyperparameters. The ML engineer notices that the tuning jobs take a long time to run. The tuning jobs continue even when the jobs are not significantly improving against the objective metric. The ML engineer needs the training jobs to optimize the hyperparameters more quickly. How should the ML engineer configure the SageMaker AMT data types to meet these requirements? - A.. Set Strategy to the Bayesian value. B.. Set RetryStrategy to a value of 1. C.. Set ParameterRanges to the narrow range Inferred from previous hyperparameter jobs. D.. Set TrainingJobEarlyStoppingType to the AUTO value.
D - Early stopping ends training jobs that are not significantly improving the objective metric. If you are using the AWS SDK for Python (Boto3), set the TrainingJobEarlyStoppingType field of the HyperParameterTuningJobConfig object that you use to configure the tuning job to AUTO. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-early-stopping.html (One commenter voted A, but choosing the Bayesian strategy does not by itself stop jobs that are no longer improving.) - https://www.examtopics.com/discussions/amazon/view/146689-exam-aws-certified-machine-learning-specialty-topic-1/
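A hedged sketch of where the TrainingJobEarlyStoppingType field sits; the objective metric and resource limits are illustrative, and the dict would be supplied to boto3's create_hyper_parameter_tuning_job together with the existing training job definition and parameter ranges:

# Tuning job configuration with early stopping enabled.
tuning_job_config = {
    "Strategy": "Bayesian",
    "TrainingJobEarlyStoppingType": "Auto",   # stop training jobs that are unlikely to improve
    "HyperParameterTuningJobObjective": {"Type": "Maximize",
                                         "MetricName": "validation:auc"},
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 30,
                       "MaxParallelTrainingJobs": 3},
}
print(tuning_job_config)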
337
338 - A global bank requires a solution to predict whether customers will leave the bank and choose another bank. The bank is using a dataset to train a model to predict customer loss. The training dataset has 1,000 rows. The training dataset includes 100 instances of customers who left the bank. A machine learning (ML) specialist is using Amazon SageMaker Data Wrangler to train a churn prediction model by using a SageMaker training job. After training, the ML specialist notices that the model returns only false results. The ML specialist must correct the model so that it returns more accurate predictions. Which solution will meet these requirements? - A.. Apply anomaly detection to remove outliers from the training dataset before training. B.. Apply Synthetic Minority Oversampling Technique (SMOTE) to the training dataset before training. C.. Apply normalization to the features of the training dataset before training. D.. Apply undersampling to the training dataset before training.
B - B is the right choice. The model returns only false results because just 100 of the 1,000 training rows are positive; SMOTE oversamples the minority (churn) class with synthetic examples so the classifier can actually learn it. D is not a good option: undersampling would discard valuable data from the majority class, and since the dataset is already small (only 1,000 rows), that could lose important information. SMOTE is the most effective fix here. - https://www.examtopics.com/discussions/amazon/view/147147-exam-aws-certified-machine-learning-specialty-topic-1/
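A minimal SMOTE sketch with the imbalanced-learn library; the synthetic dataset below simply stands in for the bank's 1,000-row, 10%-positive training data:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Stand-in for the bank's data: 1,000 rows with roughly 10% positive (churn) labels.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# Synthesize minority-class examples until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))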
338
339 - A banking company provides financial products to customers around the world. A machine learning (ML) specialist collected transaction data from internal customers. The ML specialist split the dataset into training, testing, and validation datasets. The ML specialist analyzed the training dataset by using Amazon SageMaker Clarify. The analysis found that the training dataset contained fewer examples of customers in the 40 to 55 year-old age group compared to the other age groups. Which type of pretraining bias did the ML specialist observe in the training dataset? - A.. Difference in proportions of labels (DPL) B.. Class imbalance (CI) C.. Conditional demographic disparity (CDD) D.. Kolmogorov-Smirnov (KS)
B - Class imbalance (CI) occurs when certain classes or groups are underrepresented in the dataset; here the 40 to 55 year-old age group has fewer examples than the other age groups. Class imbalance can lead to biased models that perform poorly on the underrepresented group, because the model does not see enough examples to learn that group's patterns and characteristics effectively. Difference in proportions of labels (DPL) refers to differences in the proportions of outcome labels between groups, not in the representation of an input feature such as age. Conditional demographic disparity (CDD) refers to differences in outcomes for demographic groups conditioned on other factors, not the raw distribution of a demographic feature. Kolmogorov-Smirnov (KS) is a statistical test for comparing distributions, not a type of bias. (One commenter voted C.) - https://www.examtopics.com/discussions/amazon/view/146688-exam-aws-certified-machine-learning-specialty-topic-1/
339
340 - A tourism company uses a machine learning (ML) model to make recommendations to customers. The company uses an Amazon SageMaker environment and set hyperparameter tuning completion criteria to MaxNumberOfTrainingJobs. An ML specialist wants to change the hyperparameter tuning completion criteria. The ML specialist wants to stop tuning immediately after an internal algorithm determines that tuning job is unlikely to improve more than 1% over the objective metric from the best training job. Which completion criteria will meet this requirement? - A.. MaxRuntimeInSeconds B.. TargetObjectiveMetricValue C.. CompleteOnConvergence D.. MaxNumberOfTrainingJobsNotImproving
C - CompleteOnConvergence is a flag that stops tuning after an internal algorithm determines that the tuning job is unlikely to improve more than 1% over the objective metric from the best training job, which is exactly the stated requirement. Convergence detection is a completion criteria that lets automatic model tuning decide when to stop; generally, automatic model tuning stops when it estimates that no significant improvement can be achieved. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-progress.html - https://www.examtopics.com/discussions/amazon/view/146705-exam-aws-certified-machine-learning-specialty-topic-1/
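A hedged sketch of the completion-criteria configuration (the field names follow the HyperParameterTuningJobConfig API as I understand it; the objective metric and limits are illustrative), passed to create_hyper_parameter_tuning_job instead of relying on MaxNumberOfTrainingJobs alone:

# Tuning job configuration that stops on convergence rather than on job count.
tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {"Type": "Maximize",
                                         "MetricName": "validation:ndcg"},
    "TuningJobCompletionCriteria": {
        # Stop when the internal algorithm decides further tuning is unlikely
        # to improve the best objective metric by more than 1%.
        "ConvergenceDetected": {"CompleteOnConvergence": "Enabled"}
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 100,
                       "MaxParallelTrainingJobs": 5},
}
print(tuning_job_config)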
340
341 - A car company has dealership locations in multiple cities. The company uses a machine learning (ML) recommendation system to market cars to its customers. An ML engineer trained the ML recommendation model on a dataset that includes multiple attributes about each car. The dataset includes attributes such as car brand, car type, fuel efficiency, and price. The ML engineer uses Amazon SageMaker Data Wrangler to analyze and visualize data. The ML engineer needs to identify the distribution of car prices for a specific type of car. Which type of visualization should the ML engineer use to meet these requirements? - A.. Use the SageMaker Data Wrangler scatter plot visualization to inspect the relationship between the car price and type of car. B.. Use the SageMaker Data Wrangler quick model visualization to quickly evaluate the data and produce importance scores for the car price and type of car. C.. Use the SageMaker Data Wrangler anomaly detection visualization to Identify outliers for the specific features. D.. Use the SageMaker Data Wrangler histogram visualization to inspect the range of values for the specific feature.
D - D is the correct option. A histogram shows the range and frequency of values for a single feature, so the SageMaker Data Wrangler histogram visualization is the right way to inspect the distribution of car prices for the specific type of car. - https://www.examtopics.com/discussions/amazon/view/147369-exam-aws-certified-machine-learning-specialty-topic-1/
341
342 - A media company is building a computer vision model to analyze images that are on social media. The model consists of CNNs that the company trained by using images that the company stores in Amazon S3. The company used an Amazon SageMaker training job in File mode with a single Amazon EC2 On-Demand Instance. Every day, the company updates the model by using about 10,000 images that the company has collected in the last 24 hours. The company configures training with only one epoch. The company wants to speed up training and lower costs without the need to make any code changes. Which solution will meet these requirements? - A.. Instead of File mode, configure the SageMaker training job to use Pipe mode. Ingest the data from a pipe. B.. Instead of File mode, configure the SageMaker training job to use FastFile mode with no other changes. C.. Instead of On-Demand Instances, configure the SageMaker training job to use Spot Instances. Make no other changes, D.. Instead of On-Demand Instances, configure the SageMaker training job to use Spot Instances, implement model checkpoints.
D - Spot Instances with model checkpoints. D addresses cost, and checkpoints preserve training progress: if a Spot Instance is interrupted, the job resumes from the last saved state instead of starting over, so the combination gives the cost savings of Spot with the reliability the daily one-epoch update needs. B does not address cost, and checkpointing can usually be enabled with minimal changes to the existing training setup. Dissenting views in the discussion: one commenter preferred B because FastFile mode speeds up data access with no other changes, another argued that C (Spot without checkpoints) risks slowing training when capacity is reclaimed while B does not, a Copilot-sourced comment picked C outright, and a final comment claimed FastFile mode provides the best balance. - https://www.examtopics.com/discussions/amazon/view/147150-exam-aws-certified-machine-learning-specialty-topic-1/
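A hedged sketch of option D with the SageMaker Python SDK; the container image, role, bucket, and instance type are placeholders. Spot capacity cuts cost, and the checkpoint location lets an interrupted job resume:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/cnn-training:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,          # Spot pricing for the daily one-epoch update
    max_run=3600,
    max_wait=7200,                    # must be >= max_run when using Spot
    checkpoint_s3_uri="s3://training-bucket/checkpoints/",  # resume point after interruption
)
estimator.fit({"train": "s3://training-bucket/images/"})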
342
343 - A telecommunications company has deployed a machine learning model using Amazon SageMaker. The model identifies customers who are likely to cancel their contract when calling customer service. These customers are then directed to a specialist service team. The model has been trained on historical data from multiple years relating to customer contracts and customer service interactions in a single geographic region. The company is planning to launch a new global product that will use this model. Management is concerned that the model might incorrectly direct a large number of calls from customers in regions without historical data to the specialist service team. Which approach would MOST effectively address this issue? - A.. Enable Amazon SageMaker Model Monitor data capture on the model endpoint. Create a monitoring baseline on the training dataset. Schedule monitoring jobs. Use Amazon CloudWatch to alert the data scientists when the numerical distance of regional customer data fails the baseline drift check. Reevaluate the training set with the larger data source and retrain the model. B.. Enable Amazon SageMaker Debugger on the model endpoint. Create a custom rule to measure the variance from the baseline training dataset. Use Amazon CloudWatch to alert the data scientists when the rule is invoked. Reevaluate the training set with the larger data source and retrain the model. C.. Capture all customer calls routed to the specialist service team in Amazon S3. Schedule a monitoring job to capture all the true positives and true negatives, correlate them to the training dataset, and calculate the accuracy. Use Amazon CloudWatch to alert the data scientists when the accuracy decreases. Reevaluate the training set with the additional data from the specialist service team and retrain the model. D.. Enable Amazon CloudWatch on the model endpoint. Capture metrics using Amazon CloudWatch Logs and send them to Amazon S3. Analyze the monitored results against the training data baseline. When the variance from the baseline exceeds the regional customer variance, reevaluate the training set and retrain the model.
A - Option A is the most comprehensive and proactive approach: enable Model Monitor data capture on the endpoint, create a baseline from the training dataset, schedule monitoring jobs, and let CloudWatch alert the data scientists when the numerical distance of regional customer data fails the baseline drift check, then reevaluate the training set with the larger data source (the existing training data plus the newly captured regional data) and retrain. This provides continuous monitoring and timely alerts for data drift, which is crucial for maintaining model accuracy across regions. Several commenters argued for C instead: it records the calls routed to the specialist team, tracks true positives and true negatives to detect falling accuracy, and retrains with real-world interactions from the new regions; they also questioned what "larger data source" in options A and B refers to (presumably the past training data plus the new data) and why C would need S3. - https://www.examtopics.com/discussions/amazon/view/147978-exam-aws-certified-machine-learning-specialty-topic-1/
343
344 - A machine learning (ML) engineer is creating a binary classification model. The ML engineer will use the model in a highly sensitive environment. There is no cost associated with missing a positive label. However, the cost of making a false positive inference is extremely high. What is the most important metric to optimize the model for in this scenario? - A.. Accuracy B.. Precision C.. Recall D.. F1
B - When the cost of a false positive is higher, optimize precision; when the cost of a false negative is higher, optimize recall. Precision measures the proportion of true positives among all positive predictions the model makes. Since the cost of a false positive inference is extremely high and there is no cost for missing a positive label, optimizing for precision minimizes the number of false positives. - https://www.examtopics.com/discussions/amazon/view/147979-exam-aws-certified-machine-learning-specialty-topic-1/
344
345 - An ecommerce company discovers that the search tool for the company's website is not presenting the top search results to customers. The company needs to resolve the issue so the search tool will present results that customers are most likely to want to purchase. Which solution will meet this requirement with the LEAST operational effort? - A.. Use the Amazon SageMaker BlazingText algorithm to add context to search results through query expansion. B.. Use the Amazon SageMaker XGBoost algorithm to improve candidate ranking. C.. Use Amazon CloudSearch and sort results by the search relevance score. D.. Use Amazon CloudSearch and sort results by the geographic location.
C - This requires the least operational effort: Amazon CloudSearch is a managed service that makes it easy to set up, manage, and scale a search solution for a website or application, and sorting results by the search relevance score ensures that the most relevant results are presented to customers, which addresses the complaint that the tool is not surfacing the top results. - https://www.examtopics.com/discussions/amazon/view/147980-exam-aws-certified-machine-learning-specialty-topic-1/
345
346 - A machine learning (ML) specialist collected daily product usage data for a group of customers. The ML specialist appended customer metadata such as age and gender from an external data source. The ML specialist wants to understand product usage patterns for each day of the week for customers in specific age groups. The ML specialist creates two categorical features named day_of_week and binned_age, respectively. Which approach should the ML specialist use to discover the relationship between the two new categorical features? - A.. Create a scatterplot for day_of_week and binned_age. B.. Create crosstabs for day_of_week and binned_age. C.. Create word clouds for day_of_week and binned_age. D.. Create a boxplot for day_of_week and binned_age.
B - Although Data Wrangler can generate scatterplots, they are not as effective for understanding the relationship between two categorical variables as crosstabs, which clearly show the frequency count for each combination; the Data Wrangler scatterplot analysis also requires numeric columns ("Both of these columns must be numeric typed columns"). https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-analyses.html Crosstabs (cross-tabulation) let the ML specialist observe the frequency distribution of product usage across days of the week and age groups, making it easier to identify patterns or relationships between the two categorical features. - https://www.examtopics.com/discussions/amazon/view/147981-exam-aws-certified-machine-learning-specialty-topic-1/
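A small pandas example of option B on a made-up data frame, showing the frequency count for each day_of_week / binned_age combination:

import pandas as pd

df = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Tue", "Tue", "Sat", "Sat", "Sat"],
    "binned_age":  ["18-25", "26-40", "18-25", "41-55", "26-40", "26-40", "41-55"],
})
# Each cell counts how often a day/age-group combination occurs.
print(pd.crosstab(df["day_of_week"], df["binned_age"]))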
346
347 - A company needs to develop a model that uses a machine learning (ML) model for risk analysis. An ML engineer needs to evaluate the contribution each feature of a training dataset makes to the prediction of the target variable before the ML engineer selects features. How should the ML engineer predict the contribution of each feature? - A.. Use the Amazon SageMaker Data Wrangler multicollinearity measurement features and the principal component analysis (PCA) algorithm to calculate the variance of the dataset along multiple directions in the feature space. B.. Use an Amazon SageMaker Data Wrangler quick model visualization to find feature importance scores that are between 0.5 and 1. C.. Use the Amazon SageMaker Data Wrangler bias report to identify potential biases in the data related to feature engineering. D.. Use an Amazon SageMaker Data Wrangler data flow to create and modify a data preparation pipeline. Manually add the feature scores.
B - The Amazon SageMaker Data Wrangler quick model visualization trains a lightweight model and reports a feature importance score for each feature, which is the most direct way to evaluate the contribution each feature makes to predicting the target variable before selecting features. One commenter objected that the question asks for the contribution of every feature, not only features with scores between 0.5 and 1, and initially favored D, but D introduces extra manual steps and complexity (manually adding feature scores to a data flow) compared with B. Option A measures multicollinearity: Data Wrangler can detect it via principal component analysis, which measures the variance of the data along different directions in the feature space (https://aws.amazon.com/blogs/machine-learning/detect-multicollinearity-target-leakage-and-feature-correlation-with-amazon-sagemaker-data-wrangler/), but that does not directly assess each feature's contribution to the target. Option C identifies potential bias, not feature contribution. - https://www.examtopics.com/discussions/amazon/view/147603-exam-aws-certified-machine-learning-specialty-topic-1/
347
348 - A company is building a predictive maintenance system using real-time data from devices on remote sites. There is no AWS Direct Connect connection or VPN connection between the sites and the company's VPC. The data needs to be ingested in real time from the devices into Amazon S3. Transformation is needed to convert the raw data into clean .csv data to be fed into the machine learning (ML) model. The transformation needs to happen during the ingestion process. When transformation fails, the records need to be stored in a specific location in Amazon S3 for human review. The raw data before transformation also needs to be stored in Amazon S3. How should an ML specialist architect the solution to meet these requirements with the LEAST effort? - A.. Use Amazon Data Firehose with Amazon S3 as the destination. Configure Firehose to invoke an AWS Lambda function for data transformation. Enable source record backup on Firehose. B.. Use Amazon Managed Streaming for Apache Kafka. Set up workers in Amazon Elastic Container Service (Amazon ECS) to move data from Kafka brokers to Amazon S3 while transforming it. Configure workers to store raw and unsuccessfully transformed data in different S3 buckets. C.. Use Amazon Data Firehose with Amazon S3 as the destination. Configure Firehose to invoke an Apache Spark job in AWS Glue for data transformation. Enable source record backup and configure the error prefix. D.. Use Amazon Kinesis Data Streams in front of Amazon Data Firehose. Use Kinesis Data Streams with AWS Lambda to store raw data in Amazon S3. Configure Firehose to invoke a Lambda function for data transformation with Amazon S3 as the destination.
A - A & D both work, D is real time, A is less effort but near real time. it is very hard to choose Firehose can use record backup and preserve original records to S3, in addition of transformation using Lambda A. (For the "real-time", well, Firehose = near real time, compared to other options which have significant issues, using Firehose here does not have big problem. I assume this question is calling real-time and near real time interchangeably :( ) C is incorrect - Reason #1: This is Real-Time Data Ingestion (The data needs to be ingested in real time from the devices into Amazon S3), which is not suitable to use Glue. Reason #2: Glue cannot be used for transformation, only Lambda could be used here. Although Glue could be used for record format conversion, but only ORC and Parquet are supported in that case, which is irrelevant with this scenario. D - Kinesis Data Streams is unnecessary D. firehose can not store data and not real time, it is near real time firehose cant invoke glue, you can use firehose with lambda for transformation. And spark job requires more effort than lambda. Also C does not store raw data before transformation least operational effort copilot - This solution meets the requirements by: Ingesting data in real-time using Amazon Kinesis Data Firehose. Transforming the data using an AWS Lambda function during the ingestion process. Storing the transformed data in Amazon S3. Enabling source record backup to store raw data and failed transformation records in specific S3 locations. If you have any more questions or need further assistance, feel free to ask! C, A -X firehose can't store data. - https://www.examtopics.com/discussions/amazon/view/147689-exam-aws-certified-machine-learning-specialty-topic-1/
348
349 - A company wants to use machine learning (ML) to improve its customer churn prediction model. The company stores data in an Amazon Redshift data warehouse. A data science team wants to use Amazon Redshift machine learning (Amazon Redshift ML) to build a model and run predictions for new data directly within the data warehouse. Which combination of steps should the company take to use Amazon Redshift ML to meet these requirements? (Choose three.) - A.. Define the feature variables and target variable for the churn prediction model. B.. Use the SQL EXPLAIN_MODEL function to run predictions. C.. Write a CREATE MODEL SQL statement to create a model. D.. Use Amazon Redshift Spectrum to train the model. E.. Manually export the training data to Amazon S3. F.. Use the SQL prediction function to run predictions.
ACF - The correct combination of steps to use Amazon Redshift ML to build the churn model and run predictions directly within the data warehouse is: A. Define the feature variables and target variable for the churn prediction model. C. Write a CREATE MODEL SQL statement to create a model. F. Use the SQL prediction function to run predictions. Redshift ML exports the training data and trains the model through Amazon SageMaker automatically, so manually exporting to S3, training with Redshift Spectrum, or running predictions with EXPLAIN_MODEL (B, D, E) are not required. A, C, F are the correct steps. - https://www.examtopics.com/discussions/amazon/view/147151-exam-aws-certified-machine-learning-specialty-topic-1/
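As a rough illustration (not part of the original discussion), the three steps might look like the sketch below when driven from Python through the Amazon Redshift Data API. The table, columns, model name, cluster identifier, IAM role, and S3 bucket are hypothetical placeholders.

```python
# Sketch of the Redshift ML workflow: CREATE MODEL over the defined features/target,
# then the SQL prediction function, executed via the Redshift Data API.
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

create_model_sql = """
CREATE MODEL customer_churn_model
FROM (SELECT age, tenure_months, monthly_charges, churn   -- feature variables and target
      FROM customer_activity)
TARGET churn
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'   -- placeholder role
SETTINGS (S3_BUCKET 'redshift-ml-artifacts');               -- placeholder bucket
"""

predict_sql = """
SELECT customer_id, predict_churn(age, tenure_months, monthly_charges) AS will_churn
FROM customer_activity;
"""

for sql in (create_model_sql, predict_sql):
    client.execute_statement(
        ClusterIdentifier="analytics-cluster",   # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```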
349
350 - A company’s machine learning (ML) team needs to build a system that can detect whether people in a collection of images are wearing the company’s logo. The company has a set of labeled training data. Which algorithm should the ML team use to meet this requirement? - A.. Principal component analysis (PCA) B.. Recurrent neural network (RNN) C.. K-nearest neighbors (k-NN) D.. Convolutional neural network (CNN)
D - A. PCA: No, it's for dimensionality reduction, not image classification. B. RNN: No, it's designed for sequential data, not for images. C. k-NN: No, it's a basic classifier that doesn't effectively extract image features. D. CNN: Yes, because convolutional neural networks are designed to extract spatial features from images for classification tasks. CNNs are highly effective for image recognition, including logo detection: they automatically learn and extract features from images, making them well suited to identifying specific visual patterns such as a logo. - https://www.examtopics.com/discussions/amazon/view/147984-exam-aws-certified-machine-learning-specialty-topic-1/
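For illustration only, a minimal CNN of the kind option D describes, written in PyTorch; the layer sizes, the 64x64 input resolution, and the two-class (logo / no logo) head are arbitrary choices for this sketch, not part of the question.

```python
# A minimal CNN sketch for binary "contains logo / no logo" image classification.
import torch
import torch.nn as nn

class LogoCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges/colors
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level shapes
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)      # logo vs. no logo

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = LogoCNN()
dummy = torch.randn(4, 3, 64, 64)          # a batch of 4 fake RGB images
print(model(dummy).shape)                  # torch.Size([4, 2])
```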
350
351 - A data scientist uses Amazon SageMaker Data Wrangler to obtain a feature summary from a dataset that the data scientist imported from Amazon S3. The data scientist notices that the prediction power for a dataset feature has a score of 1. What is the cause of the score? - A.. Target leakage occurred in the imported dataset. B.. The data scientist did not fine-tune the training and validation split. C.. The SageMaker Data Wrangler algorithm that the data scientist used did not find an optimal model fit for each feature to calculate the prediction power. D.. The data scientist did not process the features enough to accurately calculate prediction power.
A - A prediction power score of 1 in Amazon SageMaker Data Wrangler indicates perfect predictive ability for that feature. This usually suggests target leakage: the feature is so highly correlated with the target variable that it likely contains information that directly or indirectly reveals the target. This is target leakage. - https://www.examtopics.com/discussions/amazon/view/147155-exam-aws-certified-machine-learning-specialty-topic-1/
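To make target leakage concrete, here is a small synthetic sketch (plain scikit-learn, not SageMaker Data Wrangler): a feature that effectively copies the target produces a near-perfect validation score, the same signal a prediction power of 1 conveys.

```python
# Synthetic illustration of target leakage: a feature derived from the target
# yields near-perfect validation performance, just as a prediction power of 1 would.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
legit_feature = rng.normal(size=n)                  # a normal, weakly informative feature
y = (legit_feature + rng.normal(scale=2.0, size=n) > 0).astype(int)
leaky_feature = y + rng.normal(scale=0.01, size=n)  # effectively a copy of the target

X = np.column_stack([legit_feature, leaky_feature])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("AUC with leaky feature:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # ~1.0
```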
351
352 - A data scientist is conducting exploratory data analysis (EDA) on a dataset that contains information about product suppliers. The dataset records the country where each product supplier is located as a two-letter text code. For example, the code for New Zealand is "NZ." The data scientist needs to transform the country codes for model training. The data scientist must choose the solution that will result in the smallest increase in dimensionality. The solution must not result in any information loss. Which solution will meet these requirements? - A.. Add a new column of data that includes the full country name. B.. Encode the country codes into numeric variables by using similarity encoding. C.. Map the country codes to continent names. D.. Encode the country codes into numeric variables by using one-hot encoding.
B - B is the right answer. D (one-hot encoding) would cause a large increase in dimensionality, while B (similarity encoding) would not. One commenter suggested target encoding as the best way to transform the country codes with the smallest increase in dimensionality and no information loss, but that is not one of the options; among the given choices, similarity encoding (B) is the one that meets both requirements. - https://www.examtopics.com/discussions/amazon/view/147037-exam-aws-certified-machine-learning-specialty-topic-1/
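As a rough sketch of the idea behind similarity encoding (hand-rolled with character bigrams rather than any particular library), each country code becomes a short vector of string similarities to a few reference codes, so only a handful of columns are added instead of one column per distinct code. The reference codes below are arbitrary.

```python
# Hand-rolled character-bigram similarity encoding for two-letter country codes.
# Each code is encoded as its Jaccard similarity to a few reference codes, so the
# added dimensionality equals the number of reference codes, not the number of codes.
def bigrams(code):
    padded = f"^{code.upper()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

reference_codes = ["NZ", "US", "DE"]          # illustrative prototypes (assumption)

def similarity_encode(code):
    return [round(jaccard(bigrams(code), bigrams(ref)), 2) for ref in reference_codes]

for code in ["NZ", "NL", "US", "DK"]:
    print(code, similarity_encode(code))
# Each of these codes gets a distinct small vector, and only 3 columns are added
# instead of one column per country as one-hot encoding would require.
```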
352
353 - A data scientist is building a new model for an ecommerce company. The model will predict how many minutes it will take to deliver a package. During model training, the data scientist needs to evaluate model performance. Which metrics should the data scientist use to meet this requirement? (Choose two.) - A.. InferenceLatency B.. Mean squared error (MSE) C.. Root mean squared error (RMSE) D.. Precision E.. Accuracy
BC - This is a regression problem (predicting delivery time in minutes), so B (mean squared error) and C (root mean squared error) are the most appropriate metrics. InferenceLatency measures serving speed rather than model quality, and precision and accuracy are classification metrics. B and C are the correct options. - https://www.examtopics.com/discussions/amazon/view/147154-exam-aws-certified-machine-learning-specialty-topic-1/
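A quick sketch of how MSE and RMSE would be computed for predicted delivery minutes; the numbers are made up.

```python
# Computing MSE and RMSE for a regression model that predicts delivery time in minutes.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([30, 45, 25, 60, 40])   # actual delivery minutes (illustrative)
y_pred = np.array([28, 50, 27, 55, 42])   # model predictions (illustrative)

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is in the same units (minutes) as the target
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f} minutes")
```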
353
354 - A machine learning (ML) specialist is developing a model for a company. The model will classify and predict sequences of objects that are displayed in a video. The ML specialist decides to use a hybrid architecture that consists of a convolutional neural network (CNN) followed by a classifier three-layer recurrent neural network (RNN). The company developed a similar model previously but trained the model to classify a different set of objects. The ML specialist wants to save time by using the previously trained model and adapting the model for the current use case and set of objects. Which combination of steps will accomplish this goal with the LEAST amount of effort? (Choose two.) - A.. Reinitialize the weights of the entire CNN. Retrain the CNN on the classification task by using the new set of objects. B.. Reinitialize the weights of the entire network. Retrain the entire network on the prediction task by using the new set of objects. C.. Reinitialize the weights of the entire RNN. Retrain the entire model on the prediction task by using the new set of objects. D.. Reinitialize the weights of the last fully connected layer of the CNN. Retrain the CNN on the classification task by using the new set of objects. E.. Reinitialize the weights of the last layer of the RNN. Retrain the entire model on the prediction task by using the new set of objects.
DE - A, B, and C reinitialize entire networks, which throws away the benefit of transfer learning, leaving D and E. D: Reinitialize the weights of the last fully connected layer of the CNN and retrain the CNN on the classification task with the new set of objects. E: Reinitialize the weights of the last layer of the RNN and retrain the model on the prediction task with the new set of objects. D and E are correct; see the sketch below. - https://www.examtopics.com/discussions/amazon/view/147153-exam-aws-certified-machine-learning-specialty-topic-1/
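A hedged PyTorch sketch of the D/E idea: keep the previously trained layers, reinitialize only the last layer for the new set of objects, and train just that layer. The backbone, checkpoint path, and class counts are placeholders, not the company's actual architecture.

```python
# Transfer-learning sketch: reuse the trained feature extractor, reinitialize only
# the final layer for the new classes, and train only that layer (least effort).
import torch
import torch.nn as nn

# Stand-in for the previously trained backbone (in practice, load its saved weights).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
old_head = nn.Linear(16, 10)                 # head trained on the old 10 classes
# state = torch.load("previous_model.pt")    # hypothetical checkpoint restore

# Freeze the transferred layers so only the new last layer is trained.
for param in backbone.parameters():
    param.requires_grad = False

num_new_classes = 7                          # illustrative
new_head = nn.Linear(16, num_new_classes)    # reinitialized last layer for the new objects
model = nn.Sequential(backbone, new_head)

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
print(model(torch.randn(2, 3, 32, 32)).shape)    # torch.Size([2, 7])
```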
354
355 - A company distributes an online multiple-choice survey to several thousand people. Respondents to the survey can select multiple options for each question. A machine learning (ML) engineer needs to comprehensively represent every response from all respondents in a dataset. The ML engineer will use the dataset to train a logistic regression model. Which solution will meet these requirements? - A.. Perform one-hot encoding on every possible option for each question of the survey. B.. Perform binning on all the answers each respondent selected for each question. C.. Use Amazon Mechanical Turk to create categorical labels for each set of possible responses. D.. Use Amazon Textract to create numeric features for each set of possible responses.
A - Perform one-hot encoding on every possible option for each question of the survey. Because respondents can select multiple options per question, each option becomes its own binary column (a multi-hot representation), which comprehensively captures every response and can be fed directly into a logistic regression model. - https://www.examtopics.com/discussions/amazon/view/147152-exam-aws-certified-machine-learning-specialty-topic-1/
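One reasonable way to implement option A for multi-select answers is scikit-learn's MultiLabelBinarizer, sketched below on made-up survey responses.

```python
# Encoding multi-select survey answers as one binary column per possible option,
# so every respondent's full set of selections is represented. Data is made up.
from sklearn.preprocessing import MultiLabelBinarizer

# Each inner list holds the options a respondent selected for one question.
responses_q1 = [
    ["email", "sms"],
    ["phone"],
    ["email", "phone", "sms"],
    [],                              # respondent skipped the question
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(responses_q1)
print(mlb.classes_)   # ['email' 'phone' 'sms']
print(encoded)
# [[1 0 1]
#  [0 1 0]
#  [1 1 1]
#  [0 0 0]]
```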
355
356 - A manufacturing company stores production volume data in a PostgreSQL database. The company needs an end-to-end solution that will give business analysts the ability to prepare data for processing and to predict future production volume based on the previous year's production volume. The solution must not require the company to have coding knowledge. Which solution will meet these requirements with the LEAST effort? - A.. Use AWS Database Migration Service (AWS DMS) to transfer the data from the PostgreSQL database to an Amazon S3 bucket. Create an Amazon EMR cluster to read the S3 bucket and perform the data preparation. Use Amazon SageMaker Studio for the prediction modeling. B.. Use AWS Glue DataBrew to read the data that is in the PostgreSQL database and to perform the data preparation. Use Amazon SageMaker Canvas for the prediction modeling. C.. Use AWS Database Migration Service (AWS DMS) to transfer the data from the PostgreSQL database to an Amazon S3 bucket. Use AWS Glue to read the data in the S3 bucket and to perform the data preparation. Use Amazon SageMaker Canvas for the prediction modeling. D.. Use AWS Glue DataBrew to read the data that is in the PostgreSQL database and to perform the data preparation. Use Amazon SageMaker Studio for the prediction modeling.
B - This solution meets the requirements with the least effort by providing a no-code environment for both data preparation and prediction modeling. AWS Glue DataBrew lets business analysts read the PostgreSQL data and clean and prepare it without coding, and Amazon SageMaker Canvas lets them build and run machine learning models without any coding knowledge. B is the correct choice: use SageMaker Canvas. - https://www.examtopics.com/discussions/amazon/view/147156-exam-aws-certified-machine-learning-specialty-topic-1/
356
357 - A data scientist needs to create a model for predictive maintenance. The model will be based on historical data to identify rare anomalies in the data. The historical data is stored in an Amazon S3 bucket. The data scientist needs to use Amazon SageMaker Data Wrangler to ingest the data. The data scientist also needs to perform exploratory data analysis (EDA) to understand the statistical properties of the data. Which solution will meet these requirements with the LEAST amount of compute resources? - A.. Import the data by using the None option. B.. Import the data by using the Stratified option. C.. Import the data by using the First K option. Infer the value of K from domain knowledge. D.. Import the data by using the Randomized option. Infer the random size from domain knowledge.
C - The discussion is split. Supporters of D (Randomized, with the sample size inferred from domain knowledge) note that a random sample is representative of the whole dataset while using fewer compute resources than importing everything, with only a small chance of missing rare anomalies (https://aws.amazon.com/it/about-aws/whats-new/2022/04/amazon-sagemaker-data-wrangler-supports-random-sampling-stratified-sampling/). Supporters of B (Stratified) argue that stratified sampling ensures rare cases are included in the sample while still using fewer resources, that A (None) imports the entire dataset and consumes the most resources, that C (First K) can bias the sample and miss anomalies that do not appear in the first K rows, and that random sampling can also miss rare anomalies and depends on an arbitrary sample size. The rationale for the listed answer C: importing only the first K rows minimizes compute resources while still providing a sample for exploratory data analysis, and domain knowledge is used to choose a K that is relevant and sufficient for meaningful analysis. - https://www.examtopics.com/discussions/amazon/view/150532-exam-aws-certified-machine-learning-specialty-topic-1/
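To illustrate the compute trade-off outside of Data Wrangler, a small pandas sketch (the file name and sample size are hypothetical): reading only the first K rows avoids parsing the rest of the file, while a random sample requires reading the full dataset first.

```python
# Comparing the "First K" and "Randomized" import strategies with pandas.
# "sensor_history.csv" and the sample size K are hypothetical placeholders.
import pandas as pd

K = 50_000

# First K: only the first K rows are parsed, so compute and memory cost is minimal,
# but rare anomalies that appear later in the file are never seen.
first_k = pd.read_csv("sensor_history.csv", nrows=K)

# Randomized: the whole file is read first, then a K-row sample is drawn, which is
# more representative but costs the full read.
full = pd.read_csv("sensor_history.csv")
randomized = full.sample(n=K, random_state=42)

print(first_k.shape, randomized.shape)
```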
357
358 - An ecommerce company has observed that customers who use the company's website rarely view items that the website recommends to customers. The company wants to recommend items to customers that customers are more likely to want to purchase. Which solution will meet this requirement in the SHORTEST amount of time? - A.. Host the company's website on Amazon EC2 Accelerated Computing instances to increase the website response speed. B.. Host the company's website on Amazon EC2 GPU-based instances to increase the speed of the website's search tool. C.. Integrate Amazon Personalize into the company's website to provide customers with personalized recommendations. D.. Use Amazon SageMaker to train a Neural Collaborative Filtering (NCF) model to make product recommendations.
C - C: Yes, because Amazon Personalize is a managed service that can be integrated quickly to generate personalized recommendations. A: No, improving website response speed does not affect the relevance of the recommendations. B: No, GPU instances only speed up computation; they do not fix the quality of the recommendations. D: No, training an NCF model in SageMaker takes longer than using a ready-made solution like Personalize. This use case is a perfect fit for Personalize. - https://www.examtopics.com/discussions/amazon/view/150533-exam-aws-certified-machine-learning-specialty-topic-1/
358
359 - A machine learning (ML) engineer is preparing a dataset for a classification model. The ML engineer notices that some continuous numeric features have a significantly greater value than most other features. A business expert explains that the features are independently informative and that the dataset is representative of the target distribution. After training, the model's inference accuracy is lower than expected. Which preprocessing technique will result in the GREATEST increase of the model's inference accuracy? - A.. Normalize the problematic features. B.. Bootstrap the problematic features. C.. Remove the problematic features. D.. Extrapolate synthetic features.
A - Definitely A. (A stray pasted comment about DeepAR belongs to question 360, not here.) The features are independently informative, so removing them would lose signal; normalizing the problematic features puts them on a scale comparable to the rest, so the large-magnitude features no longer dominate training, which gives the greatest increase in inference accuracy. - https://www.examtopics.com/discussions/amazon/view/150534-exam-aws-certified-machine-learning-specialty-topic-1/
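A short scikit-learn sketch of option A, using made-up data in which one feature has a much larger magnitude than the other.

```python
# Normalizing features so that a large-magnitude feature no longer dominates.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [0.2,  52_000.0],   # feature 2 (e.g., an amount) dwarfs feature 1
    [0.5, 105_000.0],
    [0.1,  73_500.0],
    [0.9,  98_000.0],
])

scaler = StandardScaler()                 # zero mean, unit variance per feature
X_scaled = scaler.fit_transform(X)
print(X_scaled.round(2))                  # both features now on a comparable scale
```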
359
360 - A manufacturing company produces 100 types of steel rods. The rod types have varying material grades and dimensions. The company has sales data for the steel rods for the past 50 years. A data scientist needs to build a machine learning (ML) model to predict future sales of the steel rods. Which solution will meet this requirement in the MOST operationally efficient way? - A.. Use the Amazon SageMaker DeepAR forecasting algorithm to build a single model for all the products. B.. Use the Amazon SageMaker DeepAR forecasting algorithm to build separate models for each product. C.. Use Amazon SageMaker Autopilot to build a single model for all the products. D.. Use Amazon SageMaker Autopilot to build separate models for each product.
A - One commenter argued for C as the most operationally efficient option (https://www.amazonaws.cn/en/sagemaker/autopilot/), conceding that A makes sense but requires some operational work. The prevailing view is A: a single DeepAR model trained on all of the related product time series leverages DeepAR's strength at learning across many related series, providing accurate and efficient sales predictions for the company's diverse product range without maintaining 100 separate models. - https://www.examtopics.com/discussions/amazon/view/150535-exam-aws-certified-machine-learning-specialty-topic-1/
360
361 - A machine learning (ML) specialist is building a credit score model for a financial institution. The ML specialist has collected data for the previous 3 years of transactions and third-party metadata that is related to the transactions. After the ML specialist builds the initial model, the ML specialist discovers that the model has low accuracy for both the training data and the test data. The ML specialist needs to improve the accuracy of the model. Which solutions will meet this requirement? (Choose two.) - A.. Increase the number of passes on the existing training data. Perform more hyperparameter tuning. B.. Increase the amount of regularization. Use fewer feature combinations. C.. Add new domain-specific features. Use more complex models. D.. Use fewer feature combinations. Decrease the number of numeric attribute bins. E.. Decrease the amount of training data examples. Reduce the number of passes on the existing training data.
AC - A and C. Low accuracy on both the training data and the test data indicates underfitting, so these solutions focus on enhancing the model's learning process (more passes and hyperparameter tuning) and giving it richer information (new domain-specific features and more complex models), which are the key steps for improving accuracy in this situation. - https://www.examtopics.com/discussions/amazon/view/150536-exam-aws-certified-machine-learning-specialty-topic-1/
361
362 - A data scientist uses Amazon SageMaker to perform hyperparameter tuning for a prototype machine learning (ML) model. The data scientist's domain knowledge suggests that the hyperparameter is highly sensitive to changes. The optimal value, x, is in the 0.5 < x < 1.0 range. The data scientist's domain knowledge suggests that the optimal value is close to 1.0. The data scientist needs to find the optimal hyperparameter value with a minimum number of runs and with a high degree of consistent tuning conditions. Which hyperparameter scaling type should the data scientist use to meet these requirements? - A.. Auto scaling B.. Linear scaling C.. Logarithmic scaling D.. Reverse logarithmic scaling
D - A: Auto scaling does not exploit the domain knowledge for this case. B: Linear scaling spreads the search uniformly across the range and does not focus on values close to 1.0. C: Logarithmic scaling is not ideal for a range this close to 1.0. D: Reverse logarithmic scaling concentrates the search on values close to 1.0, taking advantage of the domain knowledge. This allocates more of the search effort near the upper end of the range, so values close to 1.0 are explored more thoroughly, meeting the need for a minimum number of runs under consistent tuning conditions. - https://www.examtopics.com/discussions/amazon/view/150537-exam-aws-certified-machine-learning-specialty-topic-1/
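A rough numpy sketch of why reverse logarithmic scaling helps here: candidates are drawn uniformly in log(1 - x), which packs far more of them near 1.0 than linear scaling does. The upper bound is capped at 0.999 in this sketch because 1.0 itself is excluded from the range.

```python
# Illustration of linear vs. reverse logarithmic hyperparameter scaling over (0.5, 1.0).
# Reverse log scaling samples uniformly in log(1 - x), concentrating candidates near 1.0.
import numpy as np

rng = np.random.default_rng(0)
lo, hi = 0.5, 0.999            # 1.0 itself is excluded, so cap just below it

# Linear scaling: uniform over the raw range.
linear = rng.uniform(lo, hi, size=10_000)

# Reverse logarithmic scaling: uniform in log(1 - x), then mapped back.
u = rng.uniform(np.log(1 - hi), np.log(1 - lo), size=10_000)
reverse_log = 1 - np.exp(u)

print("fraction of candidates above 0.95:")
print("  linear:      ", np.mean(linear > 0.95).round(3))       # roughly 0.10
print("  reverse log: ", np.mean(reverse_log > 0.95).round(3))  # roughly 0.63
```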
362
363 - A data scientist uses Amazon SageMaker Data Wrangler to analyze and visualize data. The data scientist wants to refine a training dataset by selecting predictor variables that are strongly predictive of the target variable. The target variable correlates with other predictor variables. The data scientist wants to understand the variance in the data along various directions in the feature space. Which solution will meet these requirements? - A.. Use the SageMaker Data Wrangler multicollinearity measurement features with a variance inflation factor (VIF) score. Use the VIF score as a measurement of how closely the variables are related to each other. B.. Use the SageMaker Data Wrangler Data Quality and Insights Report quick model visualization to estimate the expected quality of a model that is trained on the data. C.. Use the SageMaker Data Wrangler multicollinearity measurement features with the principal component analysis (PCA) algorithm to provide a feature space that includes all of the predictor variables. D.. Use the SageMaker Data Wrangler Data Quality and Insights Report feature to review features by their predictive power.
C - Principal component analysis (PCA) measures the variance of the data along different directions in the feature space. The feature space consists of all the predictor variables that you use to predict the target variable in your dataset. https://aws.amazon.com/blogs/machine-learning/detect-multicollinearity-target-leakage-and-feature-correlation-with-amazon-sagemaker-data-wrangler/ One commenter was unsure whether A or C is right but leaned toward C. This approach lets the data scientist refine the training dataset by selecting the most predictive variables while understanding the variance of the data along the various directions of the feature space. - https://www.examtopics.com/discussions/amazon/view/150538-exam-aws-certified-machine-learning-specialty-topic-1/
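Outside Data Wrangler, the same "variance along directions" idea looks like the scikit-learn sketch below, run on synthetic correlated predictors.

```python
# PCA on correlated predictors: the explained-variance ratios show how much of the
# data's variance lies along each direction of the feature space. Data is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)    # strongly collinear with x1
x3 = rng.normal(size=500)                      # independent predictor
X = np.column_stack([x1, x2, x3])

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))
# A tiny variance on the last component signals multicollinearity among the predictors.
```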
363
364 - A business to business (B2B) ecommerce company wants to develop a fair and equitable risk mitigation strategy to reject potentially fraudulent transactions. The company wants to reject fraudulent transactions despite the possibility of losing some profitable transactions or customers. Which solution will meet these requirements with the LEAST operational effort? - A.. Use Amazon SageMaker to approve transactions only for products the company has sold in the past. B.. Use Amazon SageMaker to train a custom fraud detection model based on customer data. C.. Use the Amazon Fraud Detector prediction API to approve or deny any activities that Fraud Detector identifies as fraudulent. D.. Use the Amazon Fraud Detector prediction API to identify potentially fraudulent activities so the company can review the activities and reject fraudulent transactions.
C - While D may look like the more balanced approach, the question asks for the least operational effort, and manually reviewing every flagged transaction is costly; the answer is C. A: No, approving only products the company has sold in the past does not analyze the risk of the current transaction. B: No, training a custom fraud detection model in SageMaker requires more operational effort. C: Yes, the Amazon Fraud Detector prediction API can automatically approve or deny activities, minimizing operational overhead. D: No, identifying fraud for manual review increases operational effort, even though it would give the company more control over which transactions to reject. Because the company accepts losing some profitable transactions, automatically rejecting everything Fraud Detector identifies as fraudulent aligns with its risk mitigation strategy. - https://www.examtopics.com/discussions/amazon/view/150539-exam-aws-certified-machine-learning-specialty-topic-1/
364
365 - A data scientist needs to develop a model to detect fraud. The data scientist has less data for fraudulent transactions than for legitimate transactions. The data scientist needs to check for bias in the model before finalizing the model. The data scientist needs to develop the model quickly. Which solution will meet these requirements with the LEAST operational overhead? - A.. Process and reduce bias by using the synthetic minority oversampling technique (SMOTE) in Amazon EMR. Use Amazon SageMaker Studio Classic to develop the model. Use Amazon Augmented Al (Amazon A2I) to check the model for bias before finalizing the model. B.. Process and reduce bias by using the synthetic minority oversampling technique (SMOTE) in Amazon EMR. Use Amazon SageMaker Clarify to develop the model. Use Amazon Augmented AI (Amazon A2I) to check the model for bias before finalizing the model. C.. Process and reduce bias by using the synthetic minority oversampling technique (SMOTE) in Amazon SageMaker Studio. Use Amazon SageMaker JumpStart to develop the model. Use Amazon SageMaker Clarify to check the model for bias before finalizing the model. D.. Process and reduce bias by using an Amazon SageMaker Studio notebook. Use Amazon SageMaker JumpStart to develop the model. Use Amazon SageMaker Model Monitor to check the model for bias before finalizing the model.
C - A: No, because using EMR and Amazon A2I adds unnecessary complexity, and A2I is not designed to check models for bias. B: No, because EMR and A2I involve more operational overhead and A2I is not the right tool for evaluating bias. C: Yes, because SageMaker Studio with JumpStart and Clarify allows SMOTE to be applied and bias to be evaluated in an integrated, fast way. D: No, because Model Monitor is for monitoring models in production, not for checking bias before finalizing a model. This approach leverages the integrated tools within SageMaker to streamline the process, reduce operational overhead, and ensure the model is both effective and unbiased. - https://www.examtopics.com/discussions/amazon/view/150540-exam-aws-certified-machine-learning-specialty-topic-1/
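For the SMOTE step itself, a minimal sketch using the imbalanced-learn library (assuming it is available in the Studio environment); the synthetic dataset stands in for the real transaction data.

```python
# Oversampling the minority (fraud) class with SMOTE using imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
print("before:", Counter(y))     # heavily imbalanced: few fraud (class 1) examples

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res)) # classes balanced by synthetic minority samples
```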
365
366 - A company has 2,000 retail stores. The company needs to develop a new model to predict demand based on holidays and weather conditions. The model must predict demand in each geographic area where the retail stores are located. Before deploying the newly developed model, the company wants to test the model for 2 to 3 days. The model needs to be robust enough to adapt to supply chain and retail store requirements. Which combination of steps should the company take to meet these requirements with the LEAST operational overhead? (Choose two.) - A.. Develop the model by using the Amazon Forecast Prophet model. B.. Develop the model by using the Amazon Forecast holidays featurization and weather index. C.. Deploy the model by using a canary strategy that uses Amazon SageMaker and AWS Step Functions. D.. Deploy the model by using an A/B testing strategy that uses Amazon SageMaker Pipelines. E.. Deploy the model by using an A/B testing strategy that uses Amazon SageMaker and AWS Step Functions.
BC - A/B testing requires maintaining two models simultaneously, which increases operational complexity. One ChatGPT-sourced comment argued for A and B on the grounds that Amazon Forecast's built-in backtesting on historical data already provides the requested evaluation without setting up a separate deployment strategy. The prevailing reasoning: A: No, using the Prophet model without the built-in holidays featurization and weather index does not optimize the forecast for this case. B: Yes, Amazon Forecast's holidays featurization and weather index incorporate those factors natively and reduce the modeling effort. C: Yes, a canary strategy with SageMaker and Step Functions lets the company test the model gradually for the 2 to 3 days it wants, with little operational effort. D: No, A/B testing with SageMaker Pipelines adds unnecessary complexity for a test of a few days. E: No, A/B testing with SageMaker and Step Functions involves more operational overhead than a canary deployment. Some commenters preferred B (relevant features) and E (simplicity) given the 2-to-3-day test window. Overall, B and C develop the model with the relevant features and deploy it in a way that minimizes operational risk and overhead. - https://www.examtopics.com/discussions/amazon/view/150542-exam-aws-certified-machine-learning-specialty-topic-1/
366
367 - A finance company has collected stock return data for 5,000 publicly traded companies. A financial analyst has a dataset that contains 2,000 attributes for each company. The financial analyst wants to use Amazon SageMaker to identify the top 15 attributes that are most valuable to predict future stock returns. Which solution will meet these requirements with the LEAST operational overhead? - A.. Use the linear learner algorithm in SageMaker to train a linear regression model to predict the stock returns. Identify the most predictive features by ranking absolute coefficient values. B.. Use random forest regression in SageMaker to train a model to predict the stock returns. Identify the most predictive features based on Gini importance scores. C.. Use an Amazon SageMaker Data Wrangler quick model visualization to predict the stock returns. Identify the most predictive features based on the quick model's feature importance scores. D.. Use Amazon SageMaker Autopilot to build a regression model to predict the stock returns. Identify the most predictive features based on an Amazon SageMaker Clarify report.
D - Several commenters argued for C: SageMaker Data Wrangler's quick model visualization provides feature importance scores without requiring full model training, which fits "least operational overhead," and importance computed after training only has meaning for that model type, which has not been chosen yet; C also needs minimal setup and coding. The reasoning for the listed answer D: A: No, a linear model requires manually ranking coefficients and may miss complex relationships. B: No, random forest provides Gini importances but involves more configuration and tuning. C: No, the quick model visualization is useful but does not automate selection across 2,000 attributes. D: Yes, Autopilot with a SageMaker Clarify report automates training and extracts feature importance with minimal effort, leveraging Autopilot's automation and Clarify's analysis to identify the most valuable attributes for predicting stock returns. - https://www.examtopics.com/discussions/amazon/view/150544-exam-aws-certified-machine-learning-specialty-topic-1/
367
368 - A company is using a machine learning (ML) model to recommend products to customers. An ML specialist wants to analyze the data for the most popular recommendations in four dimensions. The ML specialist will visualize the first two dimensions as coordinates. The third dimension will be visualized as color. The ML specialist will use size to represent the fourth dimension in the visualization. Which solution will meet these requirements? - A.. Use the Amazon SageMaker Data Wrangler bar chart feature. Use Group By to represent the third and fourth dimensions. B.. Use the Amazon SageMaker Canvas box plot visualization. Use color and fill pattern to represent the third and fourth dimensions. C.. Use the Amazon SageMaker Data Wrangler histogram feature. Use color and fill pattern to represent the third and fourth dimensions. D.. Use the Amazon SageMaker Canvas scatter plot visualization. Use scatter point size and color to represent the third and fourth dimensions.
D - A: No, because a bar chart cannot plot coordinates and represent additional dimensions with color and size. B: No, because a box plot shows distributions, not coordinates with extra visual attributes. C: No, because a histogram cannot map values to axes, color, and size at the same time. D: Yes, because the scatter plot visualization in SageMaker Canvas can map two axes (coordinates) and use color and size to represent the two additional dimensions. - https://www.examtopics.com/discussions/amazon/view/156769-exam-aws-certified-machine-learning-specialty-topic-1/
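The same four-dimensional encoding, sketched with matplotlib on randomly generated recommendation data: x and y carry the first two dimensions, color the third, and point size the fourth.

```python
# Visualizing four dimensions in one chart: x/y coordinates, color, and point size.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)           # dimension 1: x coordinate
y = rng.uniform(0, 10, 50)           # dimension 2: y coordinate
popularity = rng.uniform(0, 1, 50)   # dimension 3: encoded as color
revenue = rng.uniform(10, 300, 50)   # dimension 4: encoded as point size

scatter = plt.scatter(x, y, c=popularity, s=revenue, cmap="viridis", alpha=0.7)
plt.colorbar(scatter, label="popularity (dimension 3)")
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.title("Four dimensions: x, y, color, size")
plt.show()
```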
368
369 - A clothing company is experimenting with different colors and materials for its products. The company stores the entire sales history of all its products in Amazon S3. The company is using custom-built exponential smoothing (ETS) models to forecast demand for its current products. The company needs to forecast the demand for a new product variation that the company will launch soon. Which solution will meet these requirements? - A.. Train a custom ETS model. B.. Train an Amazon SageMaker DeepAR model. C.. Train an Amazon SageMaker K-means clustering model. D.. Train a custom XGBoost model.
B - A: No, because a custom ETS model requires sales history for the new product, which does not exist yet. B: Yes, because DeepAR in SageMaker learns from related time series and can forecast demand even for new products. C: No, because K-means is for clustering data, not for time series forecasting. D: No, because XGBoost is not optimized for time series forecasting in this case. - https://www.examtopics.com/discussions/amazon/view/156768-exam-aws-certified-machine-learning-specialty-topic-1/