amazon-certified-machine-learning-specialty Flashcards
(368 cards)
1 - A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incentive. The model produces the following confusion matrix after evaluating on a test dataset of 100 customers: Based on the model evaluation results, why is this a viable model for production?
[https://www.examtopics.com/assets/media/exam-media/04145/0000200001.jpg] - A.. The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives.
B.. The precision of the model is 86%, which is less than the accuracy of the model.
C.. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.
D.. The precision of the model is 86%, which is greater than the accuracy of the model.
C - The Answer is A.
Reasons:
1. Accuracy is 86%.
2. FN = 4, FP = 10. The question asks why this is a feasible model, i.e. why it is working. It is not asking for an explanation that the unit cost of churn (FN) is greater than the cost of the incentive (FP). It is asking, from the matrix result itself, that FN (4) is less than FP (10): the model successfully keeps the number of FNs smaller than the number of FPs.
Such a question cannot be answered, because we do not know how much greater the cost of churn is than the cost of the incentive.
CoC - Cost of Churn
CoI - Cost of Incentive
cost incurred by the company as a result of false positives = CoI * 10
cost incurred by the company as a result of false negatives = CoC * 4
So is it the case that CoI * 10 > CoC * 4 => CoI > 0.4 * CoC, or rather CoI < 0.4 * CoC? We don't know, because we don't know what "far greater" means: is it 100% greater, 500% greater, or some other number?
The answer is A
The Answer is c
Even though there are 10 false positives compared to 4 false negatives, the cost incurred by offering an incentive unnecessarily (false positive) is significantly less than the cost of losing a customer (false negative). This risk management aligns well with the company’s strategy to minimize expensive churn events.
Thus, the model is viable for production because it achieves 86% accuracy and, importantly, the cost of false positives (incentives given) is much lower than the cost associated with false negatives (lost customers).
Option C is indeed the correct choice. The model is 86% accurate, and the cost of false positives (offering incentives) is less than the cost of false negatives (losing customers). This makes the model viable for production.
Changing my earlier Answer from A to C. Cost of FP(10) is lower than Cost of FN(4)
some people that voted A have the right idea, but they chose the wrong option because they need to read the question again. We all agree that the cost of churn is much higher. So a false-negative means a customer churned and you didn’t do anything about it (because your model said “churn=no”) . A false positive means you tried to keep a customer that was not going to leave anyway (because your model said “churn=yes”). As you can see, false-negative is way costlier and should be avoided, therefore answer is C.
Cost incurred for churn higher than incentive. Cost of FN is higher than FP. And accuracy is 86%.
what tomatoteacher said
Accuracy is 86%, so it should be A or C. The cost of losing a customer is very high compared to the incentive, which means it is okay to give the incentive to customers who are not going to leave, i.e. the false-positive portion.
Should be A. Since the cost of churn is much higher, the priority should be focused on minimizing FN and a viable model should be one with FN < FP, isn’t it?
Definitely C. If you look at the same question in https://aws.amazon.com/blogs/machine-learning/predicting-customer-churn-with-amazon-machine-learning/. Same question, but the confusion matrix is flipped in this case( TP top left, Tn bottom right) . When you miss an actual churn (FN) this would cost the company more. Therefore the answer is C 100%. I will die on this hill. I spent 20 minutes researching this to be certain. Most people who put A are incorrectly saying FPs are actual churns that are stated as no churn.. that is what a FN is. You can trust me on this.
There are more FP’s than FN’s, however the costs of FN’s are far larger than that of FP’s. So:
numberof(FP) > numberof(FN), costperunit(FP) << costperunit(FN). This by itself could suggest that totalcosts(FP) < totalcosts(FN), but that would be somewhat subjective, since it is not stated how far apart the unit costs are.
What is suggested, however, is that the model is indeed viable (question asks WHY the model is viable, and not WHETHER it’s viable).
If the model didn’t exist, there would be no way that there are FP’s or FN’s, but churns would still exist, which have the same cost as FN’s.
So it means the total costs with FP’s must be less than the total costs with FN’s (churns).
Correct Answer C.
Explanation: The model’s accuracy is calculated as (True Positives + True Negatives) / Total predictions, which is (10 + 76) / 100 = 0.86, or 86%. The cost of false positives (customers predicted to churn but don’t) is less than the cost of false negatives (customers who churn but were not predicted to). Offering incentives to the false positives incurs less cost than losing customers due to false negatives. Therefore, this model is viable for production.
A. NO - accuracy is TP+TN / Total = (76+10)/100 = 86%; we know the model is working, so the cost of giving incentives to the wrong customers (FP) is less than the cost of customers we missed (FN), cost(FP) < cost(FN)
B. NO - accuracy is 86%, precision is TP / (TP+FP) = 10 /(10+10) = 50%
C. YES - accuracy is TP+TN / Total = (76+10)/100 = 86%; we know the model is working, so the cost of giving incentives to the wrong customers (FP) is less than the cost of customers we missed (FN), cost(FP) < cost(FN)
D. NO - accuracy is 86%, precision is TP / (TP+FP) = 10 /(10+10) = 50%
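For a quick sanity check of the numbers used in the breakdown above (confusion matrix read as TP=10, TN=76, FP=10, FN=4), a few lines of Python:

```python
# Metric check for the confusion matrix in the question:
# TP = 10, TN = 76, FP = 10, FN = 4
tp, tn, fp, fn = 10, 76, 10, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (10 + 76) / 100 = 0.86
precision = tp / (tp + fp)                   # 10 / 20 = 0.50
recall = tp / (tp + fn)                      # 10 / 14 ≈ 0.71

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```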
C is the correct answer - https://www.examtopics.com/discussions/amazon/view/43814-exam-aws-certified-machine-learning-specialty-topic-1/
2 - A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users’ behavior and product preferences to predict which products users would like based on the users’ similarity to other users. What should the Specialist do to meet this objective? - A.. Build a content-based filtering recommendation engine with Apache Spark ML on Amazon EMR
B.. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
C.. Build a model-based filtering recommendation engine with Apache Spark ML on Amazon EMR
D.. Build a combinative filtering recommendation engine with Apache Spark ML on Amazon EMR
B - B
see https://en.wikipedia.org/wiki/Collaborative_filtering#Model-based
Content-based filtering relies on similarities between features of items, whereas collaborative filtering relies on preferences from other users and how they respond to similar items.
Answer is B : Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
Collaborative filtering focuses on user behavior and preferences therefore it is perfect for predicting products based on user similarities.
B. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
Collaborative filtering is a technique used to recommend products to users based on their similarity to other users. It is a widely used method for building recommendation engines. Apache Spark ML is a distributed machine learning library that provides scalable implementations of collaborative filtering algorithms. Amazon EMR is a managed cluster platform that provides easy access to Apache Spark and other distributed computing frameworks.
Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR. (TRUE)
Collaborative filtering is a commonly used method for recommendation systems that aims to predict the preferences of a user based on the behavior of similar users. In the case described, the objective is to use users’ behavior and product preferences to predict which products they want, making collaborative filtering a good fit.
Apache Spark ML is a machine learning library that provides scalable, efficient algorithms for building recommendation systems, while Amazon EMR provides a cloud-based platform for running Spark applications.
You can find more detail in https://www.udemy.com/course/aws-certified-machine-learning-specialty-2023
collaborative filtering
‘Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.’
Source: https://realpython.com/build-recommendation-engine-collaborative-filtering/#what-is-collaborative-filtering
A. NO - content-based filtering looks at similarities with items the user already looked at, not activities of other users
B. YES - state of the art
C. NO - too generic a term; everything is a model
D. NO - combinative filtering does not exist
Collaborative filtering is a technique used by recommendation engines to make predictions about the interests of a user by collecting preferences or taste information from many users. The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.
B. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
I think it should be b
Content-based recommendations rely on product similarity. If a user likes a product, products that are similar to that one will be recommended. Collaborative recommendations are based on user similarity. If you and other users have given similar reviews to a range of products, the model assumes it is likely that other products those other people have liked but that you haven’t purchased should be a good recommendation for you.
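For illustration, a minimal collaborative-filtering sketch with Spark ML's ALS algorithm, the kind of job that would run on an EMR cluster (the column names and S3 path are assumptions, not from the question):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("collab-filtering").getOrCreate()

# Hypothetical dataset of explicit user-product ratings
ratings = spark.read.parquet("s3://example-bucket/user-product-ratings/")

als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    coldStartStrategy="drop",   # skip users/items unseen at training time
)
model = als.fit(ratings)

# Top 10 product recommendations per user, driven by similar users' preferences
model.recommendForAllUsers(10).show(truncate=False)
```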
feature engineering is required, use model based
Answer is “B”
go for B
B is correct
https://aws.amazon.com/blogs/big-data/building-a-recommendation-engine-with-spark-ml-on-amazon-emr-using-zeppelin/ - https://www.examtopics.com/discussions/amazon/view/11248-exam-aws-certified-machine-learning-specialty-topic-1/
3 - A Mobile Network Operator is building an analytics platform to analyze and optimize a company’s operations using Amazon Athena and Amazon S3. The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement? - A.. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet
B.. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet.
C.. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet.
D.. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.
D - Answer is B
you cannot use AWS glue for streaming data. Clearly B is incorrect.
Even if the exam's answer is based on a solution from before AWS implemented the capability of AWS Glue to process streaming data, this answer is still correct, as Kinesis would output the data to S3 and Glue would pick it up from there and convert it to Parquet. The question does not say the data must be converted to Parquet in real time, only that the CSV data is received as a stream in real time.
Actually the question says "The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3", which is the same as saying the data must be converted in real time.
AWS Glue can do it now (2020 May)
https://aws.amazon.com/jp/blogs/news/new-serverless-streaming-etl-with-aws-glue/
This link is in Japanese
In approval of B:
https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
D is wrong as kinesis firehose can convert from JSON to parquet but here we have CSV.
B is correct and here is another proof link: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
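As a rough illustration of option B, a minimal PySpark sketch of the CSV-to-Parquet conversion a Glue job would perform (bucket names are hypothetical; a Glue streaming job wraps similar logic around a Kinesis source):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV records landed from the stream (hypothetical path)
df = spark.read.option("header", "true").csv("s3://example-raw-bucket/csv/")

# Write them back out as Parquet for Athena to query (hypothetical path)
df.write.mode("append").parquet("s3://example-analytics-bucket/parquet/")
```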
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
You are right.
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first
But there is no Lambda in D
But there’s a D in Lambda
Kinesis Data Firehose supports real-time streaming ingestion and can automatically convert CSV to Parquet before storing it in S3.
Amazon Kinesis Data Streams + Amazon Kinesis Data Firehose
Effort: Lowest effort
Why?
Amazon Kinesis Data Firehose natively supports real-time CSV ingestion and automatic conversion to Parquet.
Fully managed, serverless, and directly integrates with Amazon S3.
Requires zero infrastructure management compared to other solutions.
I take this back; the answer should be B. On researching further, it is only JSON that Kinesis Data Firehose can convert to Parquet or ORC. So the answer is B - not optimal, but the closest suitable option.
Amazon Kinesis Data Streams + AWS Glue: AWS Glue can batch-process CSV and convert it to Parquet for S3. However, Glue is batch-oriented, not real-time.
Although I’d go with Glue and option B I’m pretty sure that this is one of those “15 unscored questions that do not affect your score. AWS collects information about performance on these unscored questions to evaluate these questions for future use as scored questions”
Just for fun I asked perplexity, chatgpt, gemini, deepseek and claude: all gave D as first response
When I pointed out that "according to this https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html Kinesis can't convert CSV directly to Parquet; it needs a Lambda", each model responded in a different way (some of them contradictory).
My reasoning is that D (Kinesis + Firehose) is incorrect because Firehose does not support direct CSV-to-Parquet conversion and needs a Lambda that is not mentioned in the option. But discussing questions like this one is nothing but a big waste of time ;-P
D
Kinesis Data Firehose is designed specifically for streaming data delivery to destinations like S3. It has built-in support for data format conversion, including CSV to Parquet. This eliminates the need for managing separate transformation services like Glue or Spark. The setup is significantly simpler: you configure a Firehose delivery stream, specify the data format conversion, and point it to your S3 bucket.
Therefore, option D requires the least implementation effort because it leverages a fully managed service (Kinesis Data Firehose) with built-in functionality for data format conversion.
Amazon Kinesis Data Firehose can only convert from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3.
Answer B,
Yes, Amazon Kinesis Data Firehose can get you from CSV to Apache Parquet, but you need to use a Lambda function to transform the CSV to JSON first. Here the question is about the least effort to build, so B is the right answer.
Use Amazon Kinesis Data Streams to ingest customer data and configure a Kinesis Data Firehose delivery stream as a consumer to convert the data into Apache Parquet is incorrect. Although this could be a valid solution, it entails more development effort as Kinesis Data Firehose does not support converting CSV files directly into Apache Parquet, unlike JSON.
Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first.
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
Between B and D chose D.
Because Firehose can’t handle csv directly.
Between B and D chose B.
Because Firehose can’t handle csv directly.
Answer is B.
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
“If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first.”
u need glue to convert to parquet
D for sure, Firehose can convert csv to parquet
The answer is unfortunately B. Firehose cannot convert comma-separated CSV to Parquet directly.
B is not good, but given the context of "finding the solution that requires the least effort to implement," option D is the most suitable choice. Ingesting data from Amazon Kinesis Data Streams and using Amazon Kinesis Data Firehose to convert the data to Parquet format is a serverless approach. It allows for automatic data transformation and storage in Amazon S3 without the need for additional development or management of data conversion logic. Therefore, under the given conditions, option D is considered the solution that requires the "least effort" to implement
Kinesis Data Firehose doesn’t convert anything, it rather calls a lambda function to do so which is the overhead we want to avoid. B is the correct answer.
Amazon Kinesis Data Streams is a service that can capture, store, and process streaming data in real time. Amazon Kinesis Data Firehose is a service that can deliver streaming data to various destinations, such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Amazon Kinesis Data Firehose can also transform the data before delivering it, such as converting the data format, compressing the data, or encrypting the data. One of the supported data formats that Amazon Kinesis Data Firehose can convert to is Apache Parquet, which is a columnar storage format that can improve the performance and cost-efficiency of analytics queries. By using Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose, the Mobile Network Operator can ingest the .CSV data from the source systems and use Amazon Kinesis Data Firehose to convert the data into Parquet before storing it on Amazon S3.
Firehose cannot natively do the conversion. It requires a Lambda function for that purpose. - https://www.examtopics.com/discussions/amazon/view/8303-exam-aws-certified-machine-learning-specialty-topic-1/
4 - A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminates for the next 2 days in the city. As this is a prototype, only daily data from the last year is available. Which model is MOST likely to provide the best results in Amazon SageMaker? - A.. Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
B.. Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.
C.. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
D.. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier.
C - answer should be C
go for C
Amazon SageMaker Linear Learner (Regressor)
Why?
The Linear Learner algorithm can be used for time series regression.
Using predictor_type=regressor, it learns trends and patterns in historical data and extrapolates future values.
Given limited historical data (only 1 year), a simple linear regression model might perform well as a baseline.
While deep learning models (like Amazon Forecast) may be more advanced, Linear Learner is easier to implement and train for a prototype.
A. NO - kNN is not for forecasting; it is based on similarities
B. NO - RCF is for anomaly detection
C. YES - Linear Regression good for forecasting
D. NO - we don’t want to classify
The reason for this choice is that the Linear Learner algorithm is a versatile algorithm that can be used for both regression and classification tasks1. Regression is a type of supervised learning that predicts a continuous numeric value, such as the air quality in parts per million2. The predictor_type parameter specifies whether the algorithm should perform regression or classification3. Since the goal is to forecast a numeric value, the predictor_type should be set to regressor.
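A hedged sketch of that setup with the SageMaker Python SDK v2 (the role ARN and bucket names are placeholders, not from the question):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
linear_image = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=linear_image,
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/linear-learner/output/",    # hypothetical bucket
    sagemaker_session=session,
)
# The key choice for this question: regression, not classification
estimator.set_hyperparameters(predictor_type="regressor", mini_batch_size=32)
# estimator.fit({"train": "s3://example-bucket/air-quality/train/"})
```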
A. Managing Kafka on EC2 is not compatible with least effort requirement
B. Doable (in 2024) as Glue supports streaming ETL to consumes streams and supports CSV records -> https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html
C. Managing an EMR cluster imo is no compatible with least effort requirement
D. Firehose supports kinesis data stream as source and it can use lambda to convert CSV records into parquet -> https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
I guess this is a bit old question, pre Glue streaming ETL support (2023) -> https://aws.amazon.com/about-aws/whats-new/2023/03/aws-glue-4-0-streaming-etl/
Thus I’ll go for D
This blog is written in Japanese, but it describes using Linear Learner for air pollution prediction.
https://aws.amazon.com/jp/blogs/news/build-a-model-to-predict-the-impact-of-weather-on-urban-air-quality-using-amazon-sagemaker/
The hyperparameter predictor_type is either "binary_classifier", "multiclass_classifier", or "regressor"; there is no plain "classifier", so the answer is C
Ans should be c
a kNN will require a large value of k to avoid overfitting and we only have 1 year’s worth of data - kNNs also face a difficult time extrapolating if the air quality series contains a trend
If we had assurances there is no trend in the air quality series (no extrapolation), and we had enough data, then kNN should beat a linear model … I am inclined to go for C just going off of the cue that “only daily data from last year is available”
Agree with your analysis; to expand on it: we have no info about the dataset features. "Only daily data from last year is available" makes me think we could be in a situation where our dataset is made up of just a timestamp and a pollution value, so kNN would be pretty useless here.
Random cut forests in timeseries are used for anomaly detection, and not for forecasting. KNN’s are classification algorithms. You would use the Linear Learner as a regressor, since forecasting falls into the domain of regression.
I mean, you could use KNN’s for regression, but for forecasting I don’t think so
KNN isn’t for time series predicting, go for A!
Im sorry, I wanted to say go for C!
Creating a machine learning model to predict air quality
To start small, we will follow the second approach, where we will build a model that will predict the NO2 concentration of any given day based on wind speed, wind direction, maximum temperature, pressure values of that day, and the NO2 concentration of the previous day. For this we will use the Linear Learner algorithm provided in Amazon SageMaker, enabling us to quickly build a model with minimal work.
Our model will consist of taking all of the variables in our dataset and using them as features of the Linear Learner algorithm available in Amazon SageMaker
Answer should be A.
k-Nearest-Neighbors (kNN) algorithm will provide the best results for this use case as it is a good fit for time series data, especially for predicting continuous values. The predictor_type of regressor is also appropriate for this task, as the goal is to forecast a continuous value (air quality in parts per million of contaminants). The other options are also viable, but may not provide as good of results as the kNN algorithm, especially with limited data.
using the Amazon SageMaker Linear Learner algorithm with a predictor_type of regressor, may still provide reasonable results, but it assumes a linear relationship between the input features and the target variable (air quality), which may not always hold in practice, especially with complex time series data. In such cases, non-linear models like kNN may perform better. Furthermore, the kNN algorithm can handle irregular patterns in the data, which may be present in the air quality data, and provide more accurate predictions.
Answer is “C” !!!
answer C
I go with A. Linear regression is not suitable for time series data. there is a library that implements knn for time-series https://cran.r-project.org/web/packages/tsfknn/vignettes/tsfknn.html
I mean the air quality has many feature correlations that are not linear. - https://www.examtopics.com/discussions/amazon/view/12382-exam-aws-certified-machine-learning-specialty-topic-1/
5 - A Data Engineer needs to build a model using a dataset containing customer credit card information How can the Data Engineer ensure the data remains encrypted and the credit card information is secure? - A.. Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC. Use the SageMaker DeepAR algorithm to randomize the credit card numbers.
B.. Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers.
C.. Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC. Use the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers.
D.. Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.
D - Why not D? When the data is encrypted on S3 and SageMaker with the same AWS KMS key, SageMaker can work with the encrypted data there.
should be D
Should be D.
Use Glue to do ETL to Hash the card number
Answer would be D
D is correct
https://aws.amazon.com/blogs/big-data/detect-and-process-sensitive-data-using-aws-glue-studio/
AWS Glue can be used for detecting and processing sensitive data.
Use AWS KMS for encryption and AWS Glue to redact credit card numbers
Reasoning:
AWS KMS (Key Management Service) encrypts data at rest in Amazon S3 and during processing in Amazon SageMaker.
AWS Glue can be used to redact sensitive data before processing, ensuring that credit card numbers are removed from datasets before being used for ML.
Complies with PCI DSS requirements for handling payment information securely.
The reason for this choice is that AWS KMS is a service that allows you to easily create and manage encryption keys and control the use of encryption across a wide range of AWS services and in your applications1. By using AWS KMS, you can encrypt the data on Amazon S3, which is a durable, scalable, and secure object storage service2, and on Amazon SageMaker, which is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly3. This way, you can protect the data at rest and in transit.
A. NO - no need for custom encryption
B. NO - IAM Policies are not to encrypt
C. NO - launch configuration is not to encrypt
D. YES
I think d is correct
It’s D, KMS key can be used for encrypting the data at rest!
agreed with D
IMHO, the problem with the question is that it is not clear whether the credit card number is used in the model. If it is, discarding it is never a good option; hashing would be a safer way to keep it in the learning path.
It’s gotta be D but C is a clever fake answer. Use PCA to reduce the length of the credit card number? That’s a clever joke, as if reducing the length of a character string is the same as reducing dimensionality in a feature set.
Can Glue do redaction?
Just have the Glue job remove the credit card column.
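A minimal PySpark sketch of that kind of Glue redaction step, assuming a column named credit_card_number and hypothetical S3 paths; it keeps only a hash and drops the raw value:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

spark = SparkSession.builder.appName("redact-cards").getOrCreate()

df = spark.read.parquet("s3://example-secure-bucket/customers/")

redacted = (
    df.withColumn("card_hash", sha2(col("credit_card_number"), 256))  # keep only a hash
      .drop("credit_card_number")                                     # drop the raw number
)
redacted.write.mode("overwrite").parquet("s3://example-secure-bucket/customers_redacted/")
```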
Encryption on AWS can be done using KMS so D is the answer
D is correct
D..KMS fully managed and other options are too whacky..
D is correct
Ans D is correct - https://www.examtopics.com/discussions/amazon/view/9818-exam-aws-certified-machine-learning-specialty-topic-1/
6 - A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance’s Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance’s EBS volume or Amazon EC2 instance within the VPC. Why is the ML Specialist not seeing the instance visible in the VPC? - A.. Amazon SageMaker notebook instances are based on the EC2 instances within the customer account, but they run outside of VPCs.
B.. Amazon SageMaker notebook instances are based on the Amazon ECS service within customer accounts.
C.. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
D.. Amazon SageMaker notebook instances are based on AWS ECS instances running within AWS service accounts.
C - I think the answer should be C
The correct answer HAS TO be A
The instances are running in customer accounts, but in an AWS managed VPC, exposing an ENI to the customer VPC if one was chosen.
See explanation at https://aws.amazon.com/blogs/machine-learning/understanding-amazon-sagemaker-notebook-instance-networking-configurations-and-advanced-routing-options/
Can’t be A because A says “but they run outside of VPCs”, which is not correct. They are attached to VPC, but it can either be AWS Service VPC or Customer VPC, or Both, as per the explanation url you provided.
This is exactly right. According to that document, if the notebook instance is not in a customer VPC, then it has to be in the Sagemaker managed VPC. See Option 1 in that document.
Actually your link says: The notebook instance is running in an Amazon SageMaker managed VPC as shown in the above diagram. That means the correct answer is C. An Amazon SageMaker managed VPC can only be created in an Amazon managed Account.
C. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
Why?
Amazon SageMaker does use EC2 instances, but they are not directly managed within the customer’s AWS account.
Instead, these instances are provisioned within AWS-managed service accounts, which is why they do not appear within the customer’s VPC or EC2 console.
The only way to access the underlying EBS volume is via SageMaker APIs, rather than the EC2 console.
Amazon SageMaker notebook instances are indeed based on EC2 instances, but they are managed by the SageMaker service and do not appear as standard EC2 instances in the customer’s VPC. Instead, they run in a managed environment that abstracts away the underlying EC2 instances, which is why the ML Specialist cannot see the instance in the VPC.
The explanation for this choice is that Amazon SageMaker notebook instances are fully managed by AWS and run on EC2 instances that are not visible to customers. These EC2 instances are launched in AWS-owned accounts and are isolated from customer accounts by using AWS PrivateLink1. This means that customers cannot access or manage these EC2 instances directly, nor can they see the EBS volumes attached to them.
A. NO - EC2 instances within the customer account are necessarily in a VPC
B. NO - Amazon ECS service is not within customer accounts
C. YES - EC2 instances running within AWS service accounts are not visible to customer account
D. NO - SageMaker manages EC2 instance, not ECS
A. NO. If the EC2 instance of the notebook was in the customer account, customer would be able to see it. Also, “they run outside VPCs” isn’t true as they run in service managed VPC or can be also attached to customer provided VPC -> https://aws.amazon.com/blogs/machine-learning/understanding-amazon-sagemaker-notebook-instance-networking-configurations-and-advanced-routing-options/
B. NO, Notebooks are based on EC2 + EBS
C. YES -> https://aws.amazon.com/blogs/machine-learning/understanding-amazon-sagemaker-notebook-instance-networking-configurations-and-advanced-routing-options/
D. NO, Notebooks are based on EC2 + EBS
I also actually tested it in my account: I created a Notebook and attached it to my VPC; I was not able to see the EC2 instance behind the Notebook, but I was able to see its ENI with the following description: "[Do not delete] Network Interface created to access resources in your VPC for SageMaker Notebook Instance …"
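A small boto3 sketch of how you could reproduce that observation; the description filter value is an assumption based on the text quoted above:

```python
import boto3

ec2 = boto3.client("ec2")
# The notebook's EC2 instance lives in an AWS service account, but its ENI is
# visible in the customer VPC and can be found by its description.
resp = ec2.describe_network_interfaces(
    Filters=[{"Name": "description",
              "Values": ["*SageMaker Notebook Instance*"]}]
)
for eni in resp["NetworkInterfaces"]:
    print(eni["NetworkInterfaceId"], eni["Description"])
```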
already given below
I am pretty sure the answer is A : Amazon SageMaker notebook instances are indeed based on EC2 instances, and these instances are within your AWS customer account. However, by default, SageMaker notebook instances run outside of your VPC (Virtual Private Cloud), which is why they may not be visible within your VPC. SageMaker instances are designed to be easily accessible for data science and machine learning tasks, which is why they typically do not reside within a VPC. If you need them to operate within a VPC, you can configure them accordingly, but this is not the default behavior.
I think it should be c
Per https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html it’s C
Notebooks can run inside AWS managed VPC or customer managed VPC
C, check the digram in https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html
When a SageMaker notebook instance is launched in a VPC, it creates an Elastic Network Interface (ENI) in the subnet specified, but the underlying EC2 instance is not visible in the VPC. This is because the EC2 instance is managed by AWS, and it is outside of the VPC. The ENI acts as a bridge between the VPC and the notebook instance, allowing network connectivity between the notebook instance and other resources in the VPC. Therefore, the EBS volume of the notebook instance is also not visible in the VPC, and you cannot take a snapshot of the volume using VPC-based tools. Instead, you can create a snapshot of the EBS volume directly from the SageMaker console, AWS CLI, or SDKs.
what you described is C
“This is because the EC2 instance is managed by AWS, and it is outside of the VPC.”
Notebooks run inside a VPC not outside!
Definitely C - https://www.examtopics.com/discussions/amazon/view/11559-exam-aws-certified-machine-learning-specialty-topic-1/
7 - A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant. Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test? - A.. Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced.
B.. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker.
C.. Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the log data as it is generated by Amazon SageMaker.
D.. Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.
B - Agreed. Ans is B
Generate an Amazon CloudWatch dashboard to create a single view for latency, memory utilization, and CPU utilization
Why?
Amazon SageMaker automatically pushes latency and instance utilization metrics to CloudWatch.
CloudWatch dashboards provide a single real-time view of these key metrics during load testing.
You can configure custom CloudWatch alarms to trigger auto scaling based on the load.
the question clearly states that the specialist is looking for latency, memory utilization, and CPU utilization during the load test, and the ideal answer for all of these is Amazon CloudWatch, which gives you all these metrics
https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html
The reason for this choice is that Amazon CloudWatch is a service that monitors and manages your cloud resources and applications. It collects and tracks metrics, which are variables you can measure for your resources and applications1. Amazon SageMaker automatically reports metrics such as latency, memory utilization, and CPU utilization to CloudWatch2. You can use these metrics to monitor the performance and health of your SageMaker endpoint during the load test.
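A hedged boto3 sketch of pulling one of those metrics during the load test (endpoint and variant names are hypothetical; ModelLatency is published in the AWS/SageMaker namespace, while CPUUtilization and MemoryUtilization appear under /aws/sagemaker/Endpoints):

```python
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "forecast-endpoint"},  # hypothetical
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average", "Maximum"],
)
print(resp["Datapoints"])
```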
I think it should be b
It's B; even the metrics that aren't visible at first become visible if you use the CloudWatch agent.
Should be B
agreed with B
B is the ans
Should be C right, as Cloudwatch does not have metrics for memory utilization.
After further research, I think the answer is B. While it is true that CloudWatch does not have memory utilization metrics by default, you can get them by installing the CloudWatch agent on the EC2 instance. The EC2 instances used by SageMaker come with the CloudWatch agent pre-installed.
I do not think that CloudWatch, by default, logs memory utilization. It does log CPU utilization. If memory utilization is required, then a separate agent needs to be installed to watch for memory. Hence, in this case, we have to write an agent if the answer has to be B. Else, C looks to be a better solution.
answer is B
Answer is B 100%; very straightforward method
B is correct. Don’t need to use Kibana or QuickSight.
ans is B
B is correct - https://www.examtopics.com/discussions/amazon/view/11560-exam-aws-certified-machine-learning-specialty-topic-1/
8 - A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data? - A.. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
B.. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
C.. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.
D.. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.
B - B is correct
The correct answer HAS TO be B
Using AWS Glue to catalogue the data and Amazon Athena to run queries against data on S3 are very typical use cases for those services.
D is not ideal, Lambda can surely do many things but it requires development/testing effort, and Amazon Kinesis Data Analytics is not ideal for ad-hoc queries.
B. Use AWS Glue to catalog the data and Amazon Athena to run queries.
Why is this the best choice?
AWS Glue can automatically catalog both structured and unstructured data in S3.
Amazon Athena is a serverless SQL query service that allows direct SQL queries on S3 data without moving it.
No infrastructure setup is required—just define a Glue Data Catalog and start querying with Athena.
SQL queries on S3 === Athena; to catalog the data, use Glue
AWS Glue is a fully managed ETL service that makes it easy to move data between data stores. It can automatically crawl, catalogue, and classify data stored in Amazon S3, and make it available for querying and analysis. With AWS Glue, you don’t have to worry about the underlying infrastructure and can focus on your data.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It integrates with AWS Glue, so you can use the catalogued data directly in Athena without any additional data movement or transformation.
The reason for this choice is that AWS Glue is a fully managed service that provides a data catalogue to make your data in S3 searchable and queryable1. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations1. You can use AWS Glue to catalogue both structured and unstructured data, such as relational data, JSON, XML, CSV files, images, or media files2.
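Once a Glue crawler has catalogued the bucket, an ad-hoc query is a single Athena call; a hedged boto3 sketch (database, table, and result-bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT product_line, COUNT(*) AS n FROM sensor_data GROUP BY product_line",
    QueryExecutionContext={"Database": "manufacturing_catalog"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(resp["QueryExecutionId"])
```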
I think it should be b
B is the easiest. We can use Glue crawler.
Answer B
Querying data in S3 with SQL is almost always Athena.
If AWS asks the question of querying unstructured data in an efficient manner, it is almost always Athena
B. I don’t think that you even need Glue to transform anything. Just use Glue to define the schemas and then use Athena to query based on those schemas.
answer is B
SQL on S3 is Athena so answer is B for sure
B is right
Answer is B.
Queries Against an Amazon S3 Data Lake
Data lakes are an increasingly popular way to store and analyze both structured and unstructured data. If you want to build your own custom Amazon S3 data lake, AWS Glue can make all your data immediately available for analytics without moving the data.
https://aws.amazon.com/glue/
Correct Ans is D…Kinesis Data Analytics can use Lambda to transform and then run the SQL queries..
May I know why you are taking complex route? - https://www.examtopics.com/discussions/amazon/view/11771-exam-aws-certified-machine-learning-specialty-topic-1/
9 - A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance. Which approach allows the Specialist to use all the data to train the model? - A.. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
B.. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset
C.. Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
D.. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train the full dataset.
A - Answer is A. This question is about Pipe mode from S3, so the only candidates are A and C, and AWS Glue cannot be used to train models, which rules out option C.
The correct answer is A
Answer is A.
Training locally on a small dataset ensures the training script and model parameters are working correctly.
Amazon SageMaker training jobs allow direct access to S3 data without downloading everything.
Pipe input mode efficiently streams data from S3 to the training instance, reducing disk space requirements and speeding up training.
Only Pipe mode can stream data from S3
The reason for this choice is that Pipe input mode is a feature of Amazon SageMaker that allows you to stream data directly from an Amazon S3 bucket to your training instances without downloading it first1. This way, you can avoid the time and space limitations of loading a large dataset onto your notebook instance. Pipe input mode also offers faster start times and better throughput than File input mode, which downloads the entire dataset before training1.
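A hedged SageMaker SDK sketch of launching the full training job in Pipe mode (the training image URI, role ARN, and bucket names are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image-uri>",                   # placeholder
    role="arn:aws:iam::123456789012:role/ExampleRole",       # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",                                        # stream from S3, no local copy
    output_path="s3://example-bucket/video-recs/output/",     # hypothetical bucket
)
# estimator.fit({"train": "s3://example-bucket/video-recs/train/"})
```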
A. YES - Pipe mode is best because training can start before the entire dataset is transferred; the only drawback is that if multiple training jobs are run in sequence (e.g. with different hyperparameters), the data is streamed again each time
B. NO - we want to use SageMaker first for initial training
C. NO - We first want to test things in SageMaker
D. NO - the SageMaker notebook will not use the AMI so the testing done is useless
B. Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team.
This solution leverages QuickSight’s managed service capabilities for both data processing and visualization, which should minimize the coding effort required to provide the Business team with the necessary insights. However, it’s important to note that QuickSight’s ability to calculate the precision-recall data depends on its support for the necessary statistical functions or the availability of such calculations in the dataset. If QuickSight cannot perform these calculations directly, option C might be necessary, despite the increased effort.
I think it should be a
It’s A, pipe mode is for dealing with very big data.
A, PIPE is to do that sort of modeling
When the data is already in S3 and needs to go straight to SageMaker, option A is suitable
Answer is A. B, C & D can be dropped because there is no integration from/to Sage Maker train job (model).
Gotta be A. You need to use Pipe mode but Glue cannot train a model.
AAAAAAAAAAa
ans is A
Will you run AWS Deep Learning AMI for all cases where the data is very large in S3? Also what role is Glue playing here? Is there a transformation? These are the two issues for options B C and D. I believe they do not represent what is required to satisfy the requirements in the question. The answer definitely requires the pipe mode, but not with Glue. I go with A https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
go for A - https://www.examtopics.com/discussions/amazon/view/9656-exam-aws-certified-machine-learning-specialty-topic-1/
10 - A Machine Learning Specialist has completed a proof of concept for a company using a small data sample, and now the Specialist is ready to implement an end- to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS. Which approach should the Specialist use for training a model using that data? - A.. Write a direct connection to the SQL database within the notebook and pull data in
B.. Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.
C.. Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in.
D.. Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the notebook to pull data in for fast access.
B - Answer is B as the data for a SageMaker notebook needs to be from S3 and option B is the only option that says it. The only thing with option B is that it is talking of moving data from MS SQL Server not RDS
https://www.slideshare.net/AmazonWebServices/train-models-on-amazon-sagemaker-using-data-not-from-amazon-s3-aim419-aws-reinvent-2018
Please look at slide 14 of that link: although the data source is DynamoDB or RDS, you still need to use AWS Glue to move the data to S3 for SageMaker to use.
So, the right answer should be B.
I agree. From the ML developer guide I just read, it is MySQL on RDS that can be used as the SQL data source.
Amazon SageMaker does not natively connect to Amazon RDS. Instead, training jobs work best with data stored in Amazon S3.
Amazon S3 is the preferred data source for SageMaker because:
It integrates seamlessly with SageMaker’s training job infrastructure.
It supports distributed training for large datasets.
It is cost-effective and decouples storage from compute.
Best practice → Export RDS data to Amazon S3 and train using SageMaker.
B is the correct answer.
Official AWS Documentation:
“Amazon ML allows you to create a datasource object from data stored in a MySQL database in Amazon Relational Database Service (Amazon RDS). When you perform this action, Amazon ML creates an AWS Data Pipeline object that executes the SQL query that you specify, and places the output into an S3 bucket of your choice. Amazon ML uses that data to create the datasource.”
In Option B approach, the Specialist can use AWS Data Pipeline to automate the movement of data from Amazon RDS to Amazon S3. This allows for the creation of a reliable and scalable data pipeline that can handle large amounts of data and ensure the data is available for training.
In the Amazon SageMaker notebook, the Specialist can then access the data stored in Amazon S3 and use it for training the model. Using Amazon S3 as the source of training data is a common and scalable approach, and it also provides durability and high availability of the data.
This approach is the most scalable and reliable way to train a model using data stored in Amazon RDS. Amazon S3 is a highly scalable and durable object storage service, and Amazon Data Pipeline is a managed service that makes it easy to move data between different AWS services. By pushing the data to Amazon S3, the Specialist can ensure that the data is available for training the model even if the Amazon RDS instance is unavailable.
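Once the Data Pipeline export has landed in S3, the notebook only needs the S3 location; a minimal sketch (hypothetical bucket/key, and reading s3:// paths with pandas assumes s3fs is installed):

```python
import pandas as pd

# Hypothetical location written by the AWS Data Pipeline export job
train_df = pd.read_csv("s3://example-bucket/exports/customers/part-00000.csv")
print(train_df.shape)
```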
A. NO - SageMaker can only read from S3
B. YES - AWS Data Pipeline can moved from SQL Server to S3
C. NO - SageMaker can only read from S3 and not DynamoDB
D. NO - SageMaker can only read from S3 and not ElastiCache
Option B (exporting to S3) is typically more flexible and cost-effective for large-scale or complex data needs (Which is our case - production), while Option A (direct connection) can be simpler and more immediate for real-time or smaller-scale scenarios like testing.
A. NO. It is doable, but this is not the best approach.
B. YES
C. NO. Pushing data to DynamoDB would not make it easier to access data
D. NO. Pushing data to ElastiCache would not make it easier to access data
For Amazon S3, you can import data from an Amazon S3 bucket as long as you have permissions to access the bucket.
For Amazon Athena, you can access databases in your AWS Glue Data Catalog as long as you have permissions through your Amazon Athena workgroup.
For Amazon RDS, if you have the AmazonSageMakerCanvasFullAccess policy attached to your user’s role, then you’ll be able to import data from your Amazon RDS databases into Canvas.
https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-connecting-external.html
https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-sagemaker-studio-notebooks-data-sql-query/
I think it should be b
It's B; even if Microsoft SQL Server is a strange way to refer to RDS, it is a possible engine to use there, and the data for SageMaker needs to be in S3!
While B is a valid answer, it is also possible to make a SQL connection in a notebook and create a data object, so A could be a valid answer too
https://stackoverflow.com/questions/36021385/connecting-from-python-to-sql-server
https://www.mssqltips.com/sqlservertip/6120/data-exploration-with-python-and-sql-server-using-jupyter-notebooks/
you need to choose the best answer, not any valid answer. Often, many of the answers are valid solutions, but are not best practice.
B is correct. MS SQL Server is also under RDS.
B is right
B it is
I’ll go with B - https://www.examtopics.com/discussions/amazon/view/11376-exam-aws-certified-machine-learning-specialty-topic-1/
11 - A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences, and trends to enhance the website for better service and smart recommendations. Which solution should the Specialist recommend? - A.. Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.
B.. A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database.
C.. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database.
D.. Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database.
C - answer should be C
Collaborative filtering is for recommendation, LDA is for topic modeling
In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set
Neural network is used for image detection
Answer is C
Collab filtering it is..
Collaborative filtering is the most widely used approach for recommendation systems.
It uses customer interactions (purchases, clicks, ratings) to determine preferences based on similar users or items.
Implicit collaborative filtering (based on user behavior) and explicit collaborative filtering (based on ratings) can effectively personalize recommendations.
A. NO - LDA is for topic modeling
B. NO - NN is too generic a term; you would want neural collaborative filtering
C. YES - Collaborative filtering is the best fit
D. NO - Random Cut Forest (RCF) is for anomaly detection
Collaborative filtering is a machine learning technique that recommends products or services to users based on the ratings or preferences of other users. This technique is well-suited for identifying customer shopping patterns and preferences because it takes into account the interactions between users and products.
From the doc: “You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music.”
https://docs.aws.amazon.com/sagemaker/latest/dg/lda-how-it-works.html
I think it should be c
C; whenever the question talks about recommendations, think collaborative filtering!
A
LDA used before collaborative filtering is largely adopted.
collaborative
C. Easy question.
its a appropriate use case of Collaborative filtering
this is C
I’m thinking that it is A because:
1) the input data that we have doesn’t lend itself to collaborative filtering - it requires a set of items and a set of users who have reacted to some of the items, which is NOT what we have
2) recommendation is just one thing that we want to do. What about trends?
3) collaborative filtering isn’t one of the pre-built algorithms (weak argument, admittedly)
Answer is C, demographics, past visits, and locality information data, LDA is appropriate
Collaborative filtering is appropriate
Answer A might be more suitable than the others
https://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/lda-how-it-works.html
Not convinced by A. Answer C seems to be a better fit than A for a recommendation model (LDA appears to be a topic model for unlabeled data with similar patterns)
https://aws.amazon.com/blogs/machine-learning/extending-amazon-sagemaker-factorization-machines-algorithm-to-predict-top-x-recommendations/ - https://www.examtopics.com/discussions/amazon/view/8304-exam-aws-certified-machine-learning-specialty-topic-1/
12 - A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist. Which machine learning model type should the Specialist use to accomplish this task? - A.. Linear regression
B.. Classification
C.. Clustering
D.. Reinforcement learning
B - B seems to be okay
CLASSIFICATION - Binary Classification - Supervised Learning to be precise
The company wants to predict customer churn (whether a customer will leave or stay).
The data is labeled, meaning we have historical outcomes (churn or no churn).
The task involves categorizing customers into two groups:
Customers who will churn (leave)
Customers who will not churn (stay)
This means the problem is a Supervised Learning problem, specifically a binary classification problem.
The reason for this choice is that classification is a type of supervised learning that predicts a discrete categorical value, such as yes or no, spam or not spam, or churn or not churn. Classification models are trained using labeled data, which means that the input data has a known target attribute that indicates the correct class for each instance. For example, a classification model that predicts customer churn would use data that has a label indicating whether the customer churned or not in the past.
Classification models can be used for various applications, such as sentiment analysis, image recognition, fraud detection, and customer segmentation. Classification models can also handle both binary and multiclass problems, depending on the number of possible classes in the target attribute.
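A minimal illustration of the supervised binary classification setup described above, using synthetic data in place of the company's labeled churn records:

```python
# Minimal supervised binary classification sketch on synthetic "churn" data;
# the dataset here is generated and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)  # y: 1 = churn, 0 = no churn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # any classifier works here
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```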
Option B. This is a scenario for a supervised learning model since the data is labeled, and only A and B are supervised learning approaches among the options. Linear regression is used to predict continuous values (such as time series), while classification selects which class the input belongs to. Hence the most suitable choice here is a binary classification model.
A. NO - Linear regression is not best for classification
B. YES - Classification
C. NO - we want supervised classification
D. NO - there is nothing to Reinforce from
The question is not clear. Actually, we have two tasks here: grouping into categories (clustering) and predicting whether customers will churn or not (classification). If we simply had to do classification, why was grouping into categories mentioned?
This is definitely a classification problem
B is correct
B - it’s a Binary Classification problem. Will the customer churn: Yes or No
100% is B since it is about labelled data
i think the key is “the company has labeled the data” so this is classification, so it’s B
B is okay
B is correct - https://www.examtopics.com/discussions/amazon/view/10005-exam-aws-certified-machine-learning-specialty-topic-1/
13 - The displayed graph is from a forecasting model for testing a time series. Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?
[https://www.examtopics.com/assets/media/exam-media/04145/0000900001.jpg] - A.. The model predicts both the trend and the seasonality well
B.. The model predicts the trend well, but not the seasonality.
C.. The model predicts the seasonality well, but not the trend.
D.. The model does not predict the trend or the seasonality well.
A - A is correct answer.
Please Refer: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
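For anyone who wants to see trend and seasonality in practice, here is a small sketch in the spirit of the linked article, using statsmodels on a synthetic series (all values illustrative):

```python
# Sketch: decompose a synthetic series into trend + seasonality + residual,
# which is how you would check what a forecaster is (or isn't) capturing.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2023-01-01", periods=120, freq="D")
series = pd.Series(
    0.5 * np.arange(120)                            # upward trend
    + 10 * np.sin(2 * np.pi * np.arange(120) / 7)   # weekly seasonality
    + np.random.normal(0, 1, 120),                  # noise
    index=idx,
)

result = seasonal_decompose(series, model="additive", period=7)
print(result.trend.dropna().head())    # estimated trend component
print(result.seasonal.head())          # estimated seasonal component
```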
A; the problem is bias, not trends
B. The model predicts the trend well, but not the seasonality.
Here’s what we can observe:
The predicted mean line closely follows the general upward trend of the observed line.
The predicted mean line does not capture the high frequency up and down changes of the observed line.
Agreed, this seems to be A. There is similarity between the blue and green lines as far as capturing trend and seasonality is concerned. It just seems that, if the model is assumed to be a linear regression model, the intercept is off by a few units.
A. The model predicts both the trend and the seasonality well
The problem is bias, not trend or seasonality!
A is right; both the trend (rising) and the seasonality are there
C is correct answer
A is correct answer. Not C
The trend is up, so isn't it correctly predicted? And the seasonality is also in sync; it's the amplitude that is wrong.
A is right. Trend and seasonality are fine; the level is what the model gets wrong
Should be C
Should be A - https://www.examtopics.com/discussions/amazon/view/45385-exam-aws-certified-machine-learning-specialty-topic-1/
14 - A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST accuracy?
[https://www.examtopics.com/assets/media/exam-media/04145/0001000001.jpg] - A.. Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)
B.. Logistic regression
C.. Support vector machine (SVM) with non-linear kernel
D.. Single perceptron with tanh activation function
C - Answer is C. A classic SVM use case is to project the data into a higher-dimensional space where a hyperplane can separate it. Seeing how separable the classes are, an SVM can be used here.
Support Vector Machine (SVM) with Non-Linear Kernel –> Non-linear Data
Why?
SVM is powerful for classification and works well even with small datasets.
If the data has a non-linear decision boundary, using an SVM with a non-linear kernel (like RBF or polynomial) can improve accuracy.
Works well in low-dimensional feature spaces (since we have only 2 features: age of account & transaction month).
Optimal choice if the data has a non-linear decision boundary.
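A small sketch of the point above: on data with a non-linear boundary (synthetic concentric circles here, as a stand-in for the figure), an RBF-kernel SVM should clearly beat a linear kernel:

```python
# Sketch: an SVM with a non-linear (RBF) kernel separating classes that are
# not linearly separable in the original 2-D feature space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)  # non-linear boundary

linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(f"linear kernel: {linear_acc:.2f}  rbf kernel: {rbf_acc:.2f}")  # RBF should win
```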
SVMs are particularly effective for binary classification tasks and can handle non-linear relationships between features.
You can use a support vector machine (SVM) when your data has exactly two classes. An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM means the one with the largest margin between the two classes. Margin means the maximal width of the slab parallel to the hyperplane that has no interior data points.
Well, C is the correct answer. This is a classic example of where to use an SVM.
SVM with an RBF kernel is the answer!
Answer is C
Textbook C
C. more reading for using non-linear kernel and separate samples with a hyperplane in a higher dimension space: https://medium.com/pursuitnotes/day-12-kernel-svm-non-linear-svm-5fdefe77836c
C seems right
answer is C
Agree. The answer is A.
https://www.surveypractice.org/article/2715-using-support-vector-machines-for-survey-research
This is a good explanation of SVM
https://uk.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html - https://www.examtopics.com/discussions/amazon/view/43907-exam-aws-certified-machine-learning-specialty-topic-1/
15 - A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset: ✑ Must be accessible from a VPC only. ✑ Must not traverse the public internet. How can these requirements be satisfied? - A.. Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.
B.. Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance.
C.. Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance.
D.. Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance
A - Important things to note here:
- “The Data in S3 Needs to be Accessible from VPC”
- “Traffic should not Traverse internet”
To fulfill Requirement #2 we need a VPC endpoint
To RESTRICT the access to S3/Bucket
- Access allowed only from VPC via VPC Endpoint
Even though Sagemaker uses EC2 - we are NOT asked to secure the EC2 :)
So the answer is A
Between A & B, the answer should be A. From here:
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html#vpc-endpoints-s3-bucket-policies
We can see that we restrict access using DENY if sourceVpce (vpc endpoint), or sourceVpc (vpc) is not equal to our VPCe/VPC. So we are using a DENY (choice A) and not an ALLOW policy (choice B).
Choices C, D we eliminate because they don’t address S3 access at all.
Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.
Why is this correct?
VPC endpoint for S3 allows private connectivity between Amazon S3 and the VPC without using the public internet.
Bucket access policy can be written to allow access only from this VPC endpoint.
This ensures maximum security by:
Preventing access from outside the VPC.
Blocking public access.
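A sketch of the bucket-policy pattern described here, following the AWS example referenced elsewhere in this thread (the bucket name and endpoint ID are the documentation placeholders):

```python
# Sketch: deny all S3 access to the bucket unless the request arrives through a
# specific VPC endpoint (aws:SourceVpce condition), then attach the policy.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAccessUnlessFromVpcEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::awsexamplebucket1",
            "arn:aws:s3:::awsexamplebucket1/*",
        ],
        "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-1a2b3c4d"}},
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="awsexamplebucket1", Policy=json.dumps(policy))
```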
In Option A, the Machine Learning Specialist would create a VPC endpoint for Amazon S3, which would allow traffic to flow directly between the VPC and Amazon S3 without traversing the public internet. Access to the S3 bucket containing PII can then be restricted to the VPC endpoint and the VPC using a bucket access policy. This would ensure that only instances within the VPC can access the data, and that the data does not traverse the public internet.
Options B and D, allowing access from an Amazon EC2 instance, would not meet the requirement of not traversing the public internet, as the EC2 instance would be accessible from the internet. Option C, using Network Access Control Lists (NACLs) to allow traffic between only the VPC endpoint and an EC2 instance, would also not meet the requirement of not traversing the public internet, as the EC2 instance would still be accessible from the internet.
A. YES - We first create a S3 endpoint in the VPC subnet so traffic does not flow through the Internet, then on the S3 bucket create an access policy that restricts access to the given VPC based on its ID
B. NO - we don’t want to be specific to an instance
C. NO - the S3 bucket is on AWS network, you cannot change the NACL for it
D. NO - not all instances in a VPC will necessarily have the same principal that can be specified in the policy
Definitely A
Well, by process of elimination only A remains: the question never mentions EC2
Per https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies-vpc-endpoint.html it’s A
The question does not mention EC2 at all, so it should be A
I think it should be B. The training instance is an EC2 instance and needs an endpoint to load the data from S3.
AWS security follows a conservative model, which implies that access is denied by default rather than granted by default. We have to explicitly allow access to an AWS resource. Additionally, B talks about allowing access FROM the VPC to S3, while A talks about allowing access from S3 to the VPC (which is not what we need).
So, B.
Um, no. A VPC endpoint is outbound from the VPC to a supported AWS service.
Will go with B
Betting on B here, we should control access from VPC, not to VPC.
A!
Restricting access to a specific VPC endpoint
The following is an example of an Amazon S3 bucket policy that restricts access to a specific bucket, awsexamplebucket1, only from the VPC endpoint with the ID vpce-1a2b3c4d. The policy denies all access to the bucket if the specified endpoint is not being used. The aws:SourceVpce condition is used to specify the endpoint.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies-vpc-endpoint.html
Can't be B. You simply cannot enable access to an endpoint for only some selected instance. So A.
We shouldn't use private IPs in a bucket policy.
B does not say enable access TO the VPC endpoint. It says to allow access FROM the endpoint. So B is the correct answer. A talks about restricting access TO the VPC endpoint, so that option is irrelevant. We’re worried about access TO the S3 bucket, not access to the VPC. The question is not poorly-worded, but it is tricky and you need to read it carefully.
I also vote A.
A
found here
“You can control which VPCs or VPC endpoints have access to your buckets by using Amazon S3 bucket policies. For examples of this type of bucket policy access control, see the following topics on restricting access.”
https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies-vpc-endpoint.html - https://www.examtopics.com/discussions/amazon/view/11279-exam-aws-certified-machine-learning-specialty-topic-1/
16 - During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue? - A.. The class distribution in the dataset is imbalanced.
B.. Dataset shuffling is disabled.
C.. The batch size is too big.
D.. The learning rate is very high.
D - Answer is D.
Should the weight be increased or reduced so that the error becomes smaller than its current value? You need to examine the gradient to know that: differentiate, check whether the slope of the tangent is positive or negative, and update the weight in the direction that reduces the error. The operation is repeated over and over to approach the optimal solution. The size of each update step is what matters here, and it is determined by the learning rate.
maybe D ?
D. The learning rate is very high.
Explanation:
When the learning rate is too high, the optimization process may overshoot the optimal weights in parameter space. Instead of gradually converging, the model updates weights in a highly unstable manner, causing fluctuations in training accuracy. The network fails to settle into a minimum because the updates are too aggressive.
A high learning rate can cause oscillations in the training accuracy because the optimizer makes large updates to the model parameters in each iteration, which can cause overshooting the optimal values. This can result in the model oscillating back and forth across the optimal solution.
If the learning rate is too high, the model weights may overshoot the optimal values and bounce back and forth around the minimum of the loss function. This can cause the training accuracy to oscillate and prevent the model from converging to a stable solution. The training accuracy is the proportion of correct predictions made by the model on the training data.
When the learning rate is set too high, it can lead to oscillations or divergence during training. Here’s why:
High Learning Rate: A high learning rate means that the model’s parameters are updated by a large amount in each training step. This can cause the model to overshoot the optimal parameter values, leading to instability in training.
Oscillations: If the learning rate is excessively high, the model’s updates can become unstable, causing it to oscillate back and forth between parameter values. This oscillation can prevent the model from converging to an optimal solution.
To address this issue, you can try reducing the learning rate. It’s often necessary to experiment with different learning rates to find the one that works best for your specific problem and dataset. Learning rate scheduling techniques, such as reducing the learning rate over time, can also help stabilize training.
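A tiny numerical sketch of why a high learning rate causes oscillation, using plain gradient descent on the simple loss f(w) = w² (values chosen only for illustration):

```python
# Sketch: gradient descent on f(w) = w^2. A small learning rate converges
# smoothly; a learning rate that is too high overshoots the minimum and
# oscillates with growing magnitude (diverges).
def gradient_descent(lr, steps=10, w=1.0):
    path = [w]
    for _ in range(steps):
        grad = 2 * w          # derivative of w^2
        w = w - lr * grad
        path.append(round(w, 3))
    return path

print("lr=0.1 :", gradient_descent(0.1))   # smooth convergence toward 0
print("lr=1.1 :", gradient_descent(1.1))   # overshoots and oscillates/diverges
```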
Answer is A.
A high learning rate means that the model parameters are being updated by large magnitudes in each iteration. As a result, the optimization process may struggle to converge to the optimal solution, leading to erratic behavior and fluctuations in training accuracy.
If the learning rate is high, the accuracy fluctuates because the loss value moves back and forth over the global minimum.
A learning rate that is too big overshoots the true minimum.
D - the learning rate is too high. This is a textbook example of the learning rate being too high. A lower learning rate will take more iterations, or longer to train, but will settle into place.
12-sep exam
D: of course
A company sells thousands of products on a public website and wants to automatically identify products with potential durability problems. The company has 1,000 reviews with date, star rating, review text, review summary, and customer email fields, but many reviews are incomplete and have empty fields. Each review has already been labeled with the correct durability result. A machine learning specialist must train a model to identify reviews expressing concerns over product durability. The first model needs to be trained and ready to review in 2 days. What is the MOST direct approach to solve this problem within 2 days?
A. Train a custom classifier by using Amazon Comprehend.
B. Build a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet.
C. Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker.
D. Use a built-in seq2seq model in Amazon SageMaker.
Is A valid option?
D is correct. A batch size that is too big leads to local minima, not to oscillation.
It is a multiple-answer question, and the answer should be both A and D
Answer is D 100%; learning rate too high will cause such an event
The answer is D, from the Coursera deep learning specialization (course 2 - improving Deep NN) - https://www.examtopics.com/discussions/amazon/view/12378-exam-aws-certified-machine-learning-specialty-topic-1/
17 - An employee found a video clip with audio on a company’s social media feed. The language used in the video is Spanish. English is the employee’s first language, and they do not understand Spanish. The employee wants to do a sentiment analysis. What combination of services is the MOST efficient to accomplish the task? - A.. Amazon Transcribe, Amazon Translate, and Amazon Comprehend
B.. Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq
C.. Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)
D.. Amazon Transcribe, Amazon Translate and Amazon SageMaker BlazingText
A - "MOST efficient" means you don't need to write code or build infrastructure.
Having all of the services fully managed by AWS is good:
Amazon Transcribe, Amazon Translate, and Amazon Comprehend
Answer is A
Agree, Answer is A
A is not 100% correct. You don’t need to translate Spanish. Amazon Comprehend supports Spanish.
Arguably, you still need a translation since the person doesn’t speak Spanish.
I think there is no need to use Amazon translate because sometimes the translation is not accurate.
It means some information gets lost.
Given the question, I believe it is necessary: look at the emphasis on not understanding Spanish. Besides that, even with some information lost, you will at least understand something.
A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend
Explanation of the Process:
Amazon Transcribe – Converts the Spanish audio in the video into text.
Amazon Translate – Translates the Spanish text to English.
Amazon Comprehend – Performs sentiment analysis on the translated English text.
It’s A:
1.Amazon Transcribe - to convert Spanish speech to Spanish text.
2.Amazon Translate - to translate Spanish text to English text
3.Amazon Comprehend - to analyze text for sentiments
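A rough boto3 sketch of that three-step pipeline; the bucket URI, job name, and sample transcript are placeholders, and in practice you would poll Transcribe until the job completes and fetch the transcript from the returned TranscriptFileUri:

```python
# Sketch of the Transcribe -> Translate -> Comprehend pipeline for option A.
import boto3

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="spanish-clip",                      # placeholder name
    Media={"MediaFileUri": "s3://example-bucket/clip.mp4"},   # placeholder URI
    MediaFormat="mp4",
    LanguageCode="es-ES",
)

# ...once the job is complete and the Spanish transcript text is retrieved:
spanish_text = "Me encanta este producto, funciona de maravilla."

translate = boto3.client("translate")
english_text = translate.translate_text(
    Text=spanish_text, SourceLanguageCode="es", TargetLanguageCode="en"
)["TranslatedText"]

comprehend = boto3.client("comprehend")
sentiment = comprehend.detect_sentiment(Text=english_text, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])
```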
A. YES - Comprehend is supervised so user must understand through Translate
B. NO - seq2seq is for generation and not classification
C. NO - Amazon SageMaker Neural Topic Model is unsupervised topic extraction, will not give sentiment against user-defined classes
D. NO - BlazingText is word2vec, does not give sentiment classes
It’s A 100%
Transcribe: Speech to text
Translate: Any language to any language
Comprehend: offers a range of capabilities for extracting insights and meaning from unstructured text data. Ex: Sentiment analysis, entity recognition, KeyPhrase Extraction, Language Detection, Document Classification
You absolutely need STT (Transcribe), translation (Translate), and sentiment analysis (Comprehend)
A - confirmed by ACG
I agree that the answer is A
answer is a
A; D is wrong because The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms.
The question/answer is not poorly worded, contrary to what someone mentioned.
Even though Comprehend can do the analysis directly on Spanish (no need for Translate), if Comprehend does the analysis and the resulting words are still in Spanish, it will not help the employee, who doesn't know Spanish. So translating after transcribing helps the employee understand what is being analyzed by Comprehend in the next step.
So read the question carefully before jumping to conclusions; it will save you an exam :)
I don't get this question. Comprehend supports Spanish natively. There is no need for Translate, and translating would actually reduce the effectiveness of the sentiment analysis. However, B, C, and D are all invalid choices.
A
because Comprehend can provide sentiment analysis
A,
https://aws.amazon.com/getting-started/hands-on/analyze-sentiment-comprehend/ - https://www.examtopics.com/discussions/amazon/view/8306-exam-aws-certified-machine-learning-specialty-topic-1/
18 - A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs. What does the Specialist need to do? - A.. Bundle the NVIDIA drivers with the Docker image.
B.. Build the Docker container to be NVIDIA-Docker compatible.
C.. Organize the Docker container’s file structure to execute on GPU instances.
D.. Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body.
B - https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf
page 55:
If you plan to use GPU devices, make sure that your containers are nvidia-docker compatible. Only the
CUDA toolkit should be included on containers. Don’t bundle NVIDIA drivers with the image. For more
information about nvidia-docker, see NVIDIA/nvidia-docker.
So the answer is B
Yeah, it’s B. But the page in the developer guide is page number 201 (209 in pdf). Second bullet point at the top.
Answer is B. below is from AWS documentation,
If you plan to use GPU devices for model training, make sure that your containers are nvidia-docker compatible. Only the CUDA toolkit should be included on containers; don’t bundle NVIDIA drivers with the image.
https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html
When using Amazon SageMaker with GPU-based EC2 instances (e.g., P3 instances), you must ensure that your custom Docker container can leverage NVIDIA GPUs. NVIDIA-Docker (now part of Docker with nvidia-container-runtime) allows containers to access GPU resources without needing to bundle NVIDIA drivers inside the container.
To make a custom Docker container GPU-compatible, the Machine Learning Specialist should:
Use NVIDIA CUDA and cuDNN in the Dockerfile.
Ensure the container is built using the NVIDIA Container Toolkit (nvidia-docker).
Use nvidia-container-runtime as the runtime.
To leverage the NVIDIA GPUs on Amazon EC2 P3 instances for training with Amazon SageMaker, the Docker container must be built to be compatible with NVIDIA-Docker.
NVIDIA-Docker is a wrapper around Docker that makes it easier to use GPUs in containers by providing GPU-aware functionality.
To build a Docker container that is compatible with NVIDIA-Docker, the Specialist should include only the CUDA toolkit in the container (not the NVIDIA GPU drivers) and rely on the NVIDIA-Docker runtime on the host EC2 instances.
NVIDIA-Docker is a Docker container runtime plugin that allows the Docker container to access the GPU resources on the host machine. By building the Docker container to be NVIDIA-Docker compatible, the Docker container will have access to the NVIDIA GPU resources on the Amazon EC2 P3 instances, allowing for accelerated training of the ResNet model.
The reason for this choice is that NVIDIA-Docker is a tool that enables GPU-accelerated containers by automatically configuring the container runtime to use NVIDIA GPUs. NVIDIA-Docker allows you to build and run Docker containers that can fully access the GPUs on your host system. This way, you can run GPU-intensive applications, such as deep learning frameworks, inside containers without any performance loss or compatibility issues.
A. NO - the drivers are not necessary (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html)
B. YES - it is about using the CUDA library, need to use proper base image (https://medium.com/@jgleeee/building-docker-images-that-require-nvidia-runtime-environment-1a23035a3a58)
C. NO - the file structure is irrelevant to GPU usage
D. NO - SageMaker config, irrelevant to Docker
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf
page 55
page 570
On a GPU instance, the image is run with the --gpus option. Only the CUDA toolkit should be included in the image, not the NVIDIA drivers. For more information, see the NVIDIA User Guide.
Answer B
Load the CUDA toolkit only, not the drivers. Ref GPU section : https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi-specs.html
I think it should be b
B is correct!
As per aws documentation, answer is B, and A is even explicitly not recommended
As noted in other comments, the answer is B
ANS B
As mentioned by other users
In my view, the answer is B
The answer is for sure B - as mentioned by others. And this is clearly stated in the docs
Ans. is B. - https://www.examtopics.com/discussions/amazon/view/9805-exam-aws-certified-machine-learning-specialty-topic-1/
19 - A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model’s performance? - A.. Receiver operating characteristic (ROC) curve
B.. Misclassification rate
C.. Root Mean Square Error (RMSE)
D.. L1 norm
A - Ans. A is correct
Answer is A.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds
Explanation:
The ROC curve is the best technique to evaluate how different classification thresholds impact the model’s performance. It plots True Positive Rate (TPR) against False Positive Rate (FPR) at various threshold values.
Why the ROC Curve?
Logistic regression outputs probabilities, and we need to select a classification threshold to decide between “order pizza” (1) and “not order pizza” (0).
Changing the threshold impacts the trade-off between sensitivity (recall) and specificity.
The ROC curve helps visualize this trade-off and select the best threshold based on the business goal (e.g., maximizing recall vs. minimizing false positives).
The Area Under the ROC Curve (AUC-ROC) is a useful metric to measure the model’s discrimination ability.
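A short sketch of how the ROC curve exposes the threshold trade-off, using scikit-learn on synthetic data (all names and data illustrative):

```python
# Sketch: roc_curve returns one (FPR, TPR) point per candidate threshold,
# which is exactly the "how does the threshold affect performance" question.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)
print("AUC:", roc_auc_score(y_te, probs))
for f, t, th in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```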
A is indeed correct see https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
• True Positive Rate
• False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
The reason for this choice is that a ROC curve is a graphical plot that illustrates the performance of a binary classifier across different values of the classification threshold. A ROC curve plots the true positive rate (TPR) or sensitivity against the false positive rate (FPR) or 1-specificity for various threshold values. The TPR is the proportion of positive instances that are correctly classified, while the FPR is the proportion of negative instances that are incorrectly classified.
ROC curve is for defining the threshold.
A surely
Question is about classification so confusion matrix would come into mind; A is the answer
It is A.
obviously A
Root Mean Square Error (RMSE) Ans. c
I think RMSE is for regression model - https://www.examtopics.com/discussions/amazon/view/10011-exam-aws-certified-machine-learning-specialty-topic-1/
20 - An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget. What should the Specialist do to meet these requirements? - A.. Create one-hot word encoding vectors.
B.. Produce a set of synonyms for every word using Amazon Mechanical Turk.
C.. Create word embedding vectors that store edit distance with every other word.
D.. Download word embeddings pre-trained on a large corpus.
D - the solution is word embeddings. As it is an interactive online dictionary, we need pre-trained word embeddings, thus the answer is D. In addition, there is no mention that the online dictionary is so unique that pre-trained word embeddings would not apply.
Thus I strongly feel the answer is D
D is correct. It is not a specialized dictionary, so embeddings trained on an existing word corpus can be used for the model
D. Download word embeddings pre-trained on a large corpus.
Reason :
For a nearest neighbor model that finds words used in similar contexts, word embeddings are the best choice. Pre-trained word embeddings capture semantic relationships and contextual similarity between words based on a large text corpus (e.g., Wikipedia, Common Crawl).
The Specialist should:
Use pre-trained word embeddings like Word2Vec, GloVe, or FastText.
Load the embeddings into the model for efficient similarity comparisons.
Use a nearest neighbor search algorithm (e.g., FAISS, k-d tree, Annoy) to quickly find similar words.
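A minimal sketch of the approach described above, using pre-trained GloVe vectors through gensim's downloader (assuming the "glove-wiki-gigaword-100" model is available to download in your environment):

```python
# Sketch: load pre-trained word embeddings and query nearest neighbors,
# i.e. words used in similar contexts.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")      # downloads the model on first use
print(vectors.most_similar("dictionary", topn=5))  # contextually similar words
```

The key point for the exam question is that no training is needed: the embeddings already encode contextual similarity, and the downstream nearest-neighbor model just searches that vector space.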
D. Download word embeddings pre-trained on a large corpus.
Word embeddings are a type of dense representation of words, which encode semantic meaning in a vector form. These embeddings are typically pre-trained on a large corpus of text data, such as a large set of books, news articles, or web pages, and capture the context in which words are used. Word embeddings can be used as features for a nearest neighbor model, which can be used to find words used in similar contexts.
Downloading pre-trained word embeddings is a good way to get started quickly and leverage the strengths of these representations, which have been optimized on a large amount of data. This is likely to result in more accurate and reliable features than other options like one-hot encoding, edit distance, or using Amazon Mechanical Turk to produce synonyms.
A. NO - one-hot encoding is a very early featurization stage
B. NO - we don’t want human labelling
C. NO - too costly to do from scratch
D. YES - leverage existing training; the word embeddings will provide vectors that can be used to measure distance in the downstream nearest neighbor model
Pre-trained word embeddings, such as Word2Vec, GloVe, or FastText, capture the semantic and contextual meaning of words based on a large corpus of text data. By downloading pre-trained word embeddings, the Specialist can leverage the semantic relationships between words to provide meaningful word features for the nearest neighbor model powering the widget. Utilizing pre-trained word embeddings allows the model to understand and display words used in similar contexts effectively.
A. One-hot word encoding vectors: These vectors represent words by marking them as present or absent in a fixed-length binary vector. However, they don’t capture relationships between words or their meanings.
B. Producing synonyms: This would involve generating similar words for each word manually, which could be time-consuming and might not cover all possible contexts.
C. Word embedding vectors based on edit distance: This approach focuses on how similar words are in terms of their spelling or characters, not necessarily their meaning or context in sentences.
D. Downloading pre-trained word embeddings: These are vectors that represent words based on their contextual usage in a large dataset, capturing relationships between words and their meanings.
Correct: D.
words that are used in similar contexts will have vectors that are close in the embedding space
D is correct
I also believe that D is the correct answer. No reason to create word embeddings from scratch
- One-hot encoding will blow up the feature space - it is not recommended for a high cardinality problem domain.
- One still needs to train the word features on large bodies of text to map context to each word
12-sep exam
DDDDDDDDDDDDD
D for sure
Definitely D.
A)It requires that document text be cleaned and prepared such that each word is one-hot encoded.
Ref:https://machinelearningmastery.com/what-are-word-embeddings/ - https://www.examtopics.com/discussions/amazon/view/9825-exam-aws-certified-machine-learning-specialty-topic-1/
21 - A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked. Which services are integrated with Amazon SageMaker to track this information? (Choose two.) - A.. AWS CloudTrail
B.. AWS Health
C.. AWS Trusted Advisor
D.. Amazon CloudWatch
E.. AWS Config
AD - AD is correct
CloudTrail is used to track how often the scientists deploy a model
CloudWatch is for monitoring GPU and CPU utilization
so the answer is A & D
To monitor SageMaker model deployments, track resource utilization, and log errors, the Machine Learning Specialist should use:
AWS CloudTrail – Tracks API activity, such as:
Model deployments (e.g., CreateModel, CreateEndpoint)
Notebook access and actions
SageMaker job executions
Amazon CloudWatch – Monitors and logs operational metrics, such as:
CPU & GPU utilization of SageMaker endpoints
Invocation errors and latencies
Custom metrics from deployed models
Logs from training jobs and inference endpoints (via CloudWatch Logs)
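A boto3 sketch of pulling endpoint hardware metrics from CloudWatch; the endpoint and variant names are placeholders, and the namespaces shown reflect my understanding of where SageMaker publishes these metrics (instance metrics under "/aws/sagemaker/Endpoints", invocation metrics under "AWS/SageMaker"):

```python
# Sketch: query CPU utilization for a SageMaker endpoint from CloudWatch.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",          # assumed namespace for hardware metrics
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},     # placeholder
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in resp["Datapoints"]:
    print(point["Timestamp"], point["Average"])
```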
I think AWS Config is still not the service designed to track how often Data Scientists are deploying models, nor does it track operational performance metrics like GPU and CPU utilization or the invocation errors of SageMaker endpoints.
and AWS CloudTrail continues to be the service that will track and record user activity and API usage, which includes deploying models in Amazon SageMaker.
So the answers are still A and D - CloudTrail and CloudWatch.
AD is correct
A. YES - to track deployments
B. NO - AWS Health is to track AWS Cloud itself (eg. is a zone down ? )
C. NO - AWS Trusted Advisor to give recommendations on infra
D. YES - for errors
E. NO - AWS Config records resource configuration changes, not deployment frequency or utilization metrics
I also believe that A and D are correct. Can someone please explain to me the main differences between CloudWatch and CloudTrail? I find the documentation a bit confusing about it
Option E: AWS Config can record all resource types, so new resources are automatically recorded in your account.
Option A: CloudTrail is used to track how often the scientists deploy a model.
Option D: CloudWatch is for monitoring GPU and CPU utilization.
Log Amazon Sagemaker API Calls with AWS CloudTrail - https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html
I wouldn't be so sure about CloudTrail; AWS Config also tracks SageMaker and the resource "AWS::Sagemaker::Model".
Just saw that this was released 4 days ago…
https://aws.amazon.com/about-aws/whats-new/2022/06/aws-config-15-new-resource-types/
A&D
CloudWatch and CloudTrail
AD Are Correct.
absolutely
CloudTrail and CloudWatch, no need to think twice - https://www.examtopics.com/discussions/amazon/view/10012-exam-aws-certified-machine-learning-specialty-topic-1/
22 - A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily. Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort? - A.. Require that the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation.
B.. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3.
C.. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3.
D.. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
D - D is correct. The question says "simple transformations, and some attributes will be combined" and asks for the LEAST development effort. Kinesis Data Analytics can get data from Firehose, transform it, and write to S3.
https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-s3.html
Best explanation here, kudos.
I can't find any information that indicates Kinesis Data Analytics can take data from Firehose.
The best place to transform the data is before it arrives in S3, so D should be the best answer. But D is not complete: it should have another Firehose delivery stream to deliver the results to S3.
D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
Explanation:
Since the data is already flowing through Amazon Kinesis Data Firehose, the least development effort solution is to use Amazon Kinesis Data Analytics, which supports SQL-based transformations on streaming data without requiring new infrastructure.
Why is this the best choice?
No major architectural changes – Data continues flowing from stores into Kinesis Data Firehose and then to Amazon S3.
Simple SQL transformations – Since the changes are simple (e.g., attribute combinations), SQL is sufficient.
Low operational overhead – No need to manage clusters or instances.
Real-time processing – Transformed records immediately enter Amazon S3 for training.
Ans is D
Amazon Kinesis Data Analytics provides a serverless option for real-time data processing using SQL queries. In this case, by inserting a Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream, the retail chain can easily perform the required simple transformations on the ingested purchasing records.
The best answer would be to use a Lambda function, but option D works very well too in the absence of a Lambda option.
I go with D. A tough question, though. A and C are definitely out. The key to the question is that it does not say that the transformed data needs to be stored again in S3. It just needs to be sent to the model for training after being transformed. So a Kinesis Data Analytics stream is appropriate to do the transformation.
Legacy data – Firehose – Kinesis Analytics – S3. This happens in near real time before the data ends up in S3.
Legacy data – Firehose – S3 is already happening (mentioned in the first line of the question); adding Kinesis Data Analytics to do simple transformations and joins using SQL on the incoming data is the LEAST amount of work needed.
Kinesis Data Analytics can write to S3; here is the AWS link with a working example, even though the Udemy tutorial said it cannot write directly to S3 :)
https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-s3.html
It seems that this is the LEAST development effort:
https://aws.amazon.com/fr/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
and this is the GREATEST development effort:
https://aws.amazon.com/fr/blogs/big-data/optimizing-downstream-data-processing-with-amazon-kinesis-data-firehose-and-amazon-emr-running-apache-spark/
It’s D
https://aws.amazon.com/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
“In some scenarios, you may need to enhance your streaming data with additional information, before you perform your SQL analysis. Kinesis Analytics gives you the ability to use data from Amazon S3 in your Kinesis Analytics application, using the Reference Data feature. However, you cannot use other data sources from within your SQL query.”
I believe Kinesis should be used only in the case of a live data stream, and that is not the case here. So, in my view, D shouldn't be the answer. I think A should be the answer, as AWS Storage Gateway is used along with on-premises applications to move data to S3; then Glue can be used to transform the data.
With option A, you would be changing the legacy data ingestion, a huge development effort. Remember, you’re talking about 20,000 stores.
It is D.
I think the answer is D, because require the LEAST amount of development effort.
It's D; Kinesis Data Analytics can easily connect with Firehose
why not A. it seems good to me
“require stores to capture data locally using S3 gateway” - for 20k stores this creates a HUUUGE operational overhead and development effort, definitely wrong
D is correct… the rest all need some kind of manual intervention and are not simple. Firehose allows transformation as well as moving the data into S3.
I think the answer is B. D would be correct if they didn’t want to transform the legacy data from before the switch, but it seems like they do. Choosing D would mean that you’d have to use an EC2 instance or something else to transform the legacy data along with adding the Kinesis data analytics functionality. Also, there is no real-time requirement so daily transformation is fine.
Its D, because with KDA you can transform the data with SQL while with EMR you need to write code, considering the requirement of “least development effort”, so D
"LEAST amount of development effort": EMR is too complicated to be the least.
If the question were "least cost," then B; but the question is "least development effort," so you want to keep the original architecture. I agree that for daily ETL (instead of real-time) on a large dataset, B would be the better option.
You can use Lambda instead of EC2. So D should be OK.
https://aws.amazon.com/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
can be B - https://www.examtopics.com/discussions/amazon/view/9826-exam-aws-certified-machine-learning-specialty-topic-1/
23 - A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output? - A.. Dropout
B.. Smooth L1 loss
C.. Softmax
D.. Rectified linear units (ReLU)
C - C is the most suitable.
Softmax is used to turn numbers into probabilities.
https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
C is right. The softmax function is used for multi-class predictions
In a multiclass classification problem (such as classifying an image into one of 10 animal categories), the model should output a probability distribution over the classes. The Softmax function achieves this by:
Taking the raw scores (logits) from the final dense layer (10 nodes, one per class).
Exponentiating each score and normalizing so they sum to 1, effectively turning them into probabilities.
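A quick NumPy sketch of exactly that: ten raw logits from the final dense layer turned into a probability distribution over the ten animal classes (logit values are made up for illustration):

```python
# Sketch: softmax maps raw logits to a probability distribution.
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 1.5, -0.5, 0.3, 0.8])
probs = softmax(logits)
print(probs)        # ten class probabilities
print(probs.sum())  # sums to 1.0
```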
A. NO - Dropout is to prevent overfitting
B. NO - Smooth L1 is a loss function to be minimized, not an activation that outputs probabilities
C. YES - Softmax will give probabilities for each class
D. NO - Rectified linear units (ReLU) is an activation function
Softmax is the correct answer.
Multiclassification with probabilities is about softmax!
Softmax is for probability distribution
it should be C. Softmax
Softmax converts outputs to Probabilites of each classification
absolutely C
Absolute C.
This is as easy a question as you will likely see on the exam, Everyone has the right answer here.
C –> Softmax.
Let’s go over the alternatives:
A. Dropout –> Not really a function, but rather a method to avoid overfitting. It consists of dropping some neurons during the training process, so that the performance of our algorithm does not become very dependent on any single neuron.
B. Smooth L1 loss –> It’s a loss function, thus a function to be minimized by the entire neural network. It’s not an activation function.
C. Softmax –> This is the traditional function used for multi-class classification problems (such as classifying an animal into one of 10 categories)
D. Rectified linear units (ReLU) –> This activation function is often used on the first and intermediate (hidden) layers, not on the final layer. In any case, it wouldn’t make sense to use it for classification because its values can exceed 1 (and probabilities can’t)
C, Softmax is the best suitable answer
Ref: The softmax function, also known as softargmax or the normalized exponential function, is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce's choice axiom.
You guys are right, the answer is C since it directly provides the output as a probability distribution…
ReLU could be used in the earlier layers, but extra code would be needed to turn the outputs into probabilities
https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e
Definitely C
Definitely softmax.
Are you sure it is C?
The output should be “[the probability that] the input image belongs to each of the 10 classes.” And not the most likely class with the highest probability, which would be the result of softmax layer.
Yes, softmax returns indeed a vector of probabilities. - https://www.examtopics.com/discussions/amazon/view/8307-exam-aws-certified-machine-learning-specialty-topic-1/
24 - A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target. What option can the Specialist use to determine whether it is overestimating or underestimating the target value? - A.. Root Mean Square Error (RMSE)
B.. Residual plots
C.. Area under the curve
D.. Confusion matrix
B - RMSE tells you about the magnitude of the error but not its sign. The question is to find whether the model overestimates or underestimates; residual plots clearly show that
answer B
Answer is B. Residual plot distribution indicates over or under-estimations
A residual plot helps determine whether a regression model is overestimating or underestimating the target value.
Residual = Actual Value - Predicted Value
Positive residual → The model underestimated the target.
Negative residual → The model overestimated the target.
By plotting residuals, the Machine Learning Specialist can see patterns that indicate bias:
More positive residuals → The model is underestimating.
More negative residuals → The model is overestimating.
Randomly scattered residuals around zero → The model is well-calibrated.
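A small sketch of building a residual plot on synthetic data where the model is deliberately biased low, so the residuals come out mostly positive (all values illustrative):

```python
# Sketch: residuals (actual - predicted) reveal whether a regression model is
# systematically over- or under-estimating the target.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
actual = rng.normal(100, 10, 200)
predicted = actual - 3 + rng.normal(0, 2, 200)   # model biased low -> underestimates

residuals = actual - predicted
print("mean residual:", residuals.mean())        # > 0 means mostly underestimating

plt.scatter(predicted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residual plot")
plt.show()
```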
Residual plots show you each individual mistake!
B - Residual plots it is - https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html
Residual Plots (B).
AUC and Confusion Matrices are used for classification problems, not regression.
And RMSE does not tell us if the target is being over or underestimated, because residuals are squared! So we actually have to look at the residuals themselves. And that’s B.
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.
1) Square the residuals. 2) Find the average of the squared residuals. 3) Take the square root of the result.
Residual Plots (B). would have to be my answer
residual plot
https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html
https://stattrek.com/statistics/dictionary.aspx?definition=residual%20plot#:~:text=A%20residual%20plot%20is%20a,nonlinear%20model%20is%20more%20appropriate.
Answer is B
without a second thought residual plot
The answer is B. Refer to Exercise 7.2.1.A
https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Book%3A_OpenIntro_Statistics_(Diez_et_al)./07%3A_Introduction_to_Linear_Regression/7.02%3A_Line_Fitting%2C_Residuals%2C_and_Correlation
Residual plot it is Option B
Residual plot
B is the correct answer!!!!
RMSE has the S in it, which stands for square… squaring removes the sign, so you lose whether the prediction was above or below the target.
Answers C and D are for other types of problems
It should be B. The residual plot will be give whether the target value is overestimated or underestimated.
Answer is C.
https://www.youtube.com/watch?v=MrjWcywVEiU
Answer is B.
Your vid shows a technique that is useful for approximating integrals and has NOTHING to do with linear regression. Also, it over-/underestimates the area under the curve, NOT the target value.
Good grief, AUC is used for classification, not regression.
B. Residual plots help to find out whether the model is underestimating or overestimating
answer is B - https://www.examtopics.com/discussions/amazon/view/8308-exam-aws-certified-machine-learning-specialty-topic-1/