PAST QUESTIONS Flashcards

1
Q

Which Python libraries are best for transforming raw feature vectors into a format suited to a SageMaker batch transform job used to generate a forecast?

A

Pandas + Scikit-learn
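As a sketch of how the pair combine (the column names and values are invented for illustration): Pandas loads and wrangles the raw CSV, Scikit-learn's StandardScaler transforms the feature vectors, and the result is written as the headerless, index-free rows that batch transform expects for CSV input.

```python
from io import StringIO

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data standing in for a CSV pulled from S3.
raw_csv = StringIO("age,income,clicks\n34,52000,3\n29,48000,5\n41,61000,2\n")

df = pd.read_csv(raw_csv)                 # Pandas: load and wrangle tabular data
scaler = StandardScaler()                 # Scikit-learn: transform raw feature vectors
features = scaler.fit_transform(df[["age", "income", "clicks"]])

# Batch transform with text/csv input expects headerless, index-free rows.
payload = pd.DataFrame(features).to_csv(header=False, index=False)
```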

2
Q

What is the best Python library for wrangling and manipulating tabular data such as CSV files?

A

Pandas

3
Q

Which Python library is best for transforming raw feature vectors into a format suitable for downstream estimators?

A

Scikit-learn

4
Q

Which Python libraries would you use for data visualisation (with no data transformation)?

A

Matplotlib + Plotly
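For instance, a minimal Matplotlib line chart (the monthly figures are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures to visualise (no transformation applied).
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
fig.savefig("sales.png")
```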

5
Q

Which Python library is used to interface with AWS services such as S3, DynamoDB, SQS, etc.?

A

Boto3

6
Q

Does Boto3 have data transformation functions?

A

No, it merely interfaces with AWS services

7
Q

Which Python library is best used for text tagging, classification and tokenisation, but not for manipulating data?

A

Natural Language Toolkit (NLTK)

8
Q

What is the best Python library for crawling websites to gather structured data?

A

Scrapy

9
Q

What hyperparameter setting would you use to get SageMaker Linear Learner algorithm to produce discrete results?

A

Set predictor_type to binary_classifier
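As a sketch, the hyperparameter is a plain key/value pair handed to the estimator; `mini_batch_size` below is an illustrative extra, and with the SageMaker Python SDK the dict would be applied via `estimator.set_hyperparameters(**hyperparameters)`.

```python
# Linear Learner hyperparameters: predictor_type switches the output between
# discrete classes and continuous values.
hyperparameters = {
    "predictor_type": "binary_classifier",  # discrete 0/1 predictions
    "mini_batch_size": 100,                 # illustrative value
}
```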

9
Q

When using XGBoost, which hyperparameter would you set, and to what value, to produce a logistic regression?

A

Set objective to reg:logistic

9
Q

What hyperparameter setting would you use to get SageMaker Linear Learner algorithm to produce quantitative results?

A

Set predictor_type to regressor

9
Q

What would you use Kinesis Data Streams Naive Bayes Classifier for?

A

You wouldn’t, as it does not exist. KDS has no machine-learning capabilities.

9
Q

When using XGBoost, which hyperparameter would you set, and to what value, to produce quantitative answers?

A

Set objective to reg:linear
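A sketch of both XGBoost settings side by side (note that recent XGBoost releases rename `reg:linear` to `reg:squarederror`; `num_round` is an illustrative extra):

```python
# objective selects the learning task for XGBoost.
regression_params = {"objective": "reg:linear", "num_round": 100}   # quantitative output
logistic_params = {"objective": "reg:logistic", "num_round": 100}   # logistic regression
```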

9
Q

Does Kinesis Data Analytics provide a nearest-neighbour algorithm?

A

No, but it does provide the Hotspots function on streams, which detects higher-than-normal activity using the distance between a hotspot and its nearest neighbour. It does not provide ML model updates.

9
Q

Which algorithm would work well for near-real-time updates to the model?

A

Kinesis Data Analytics Random Cut Forest

9
Q

When would you use SageMaker Random Cut Forest?

A

Large batch data sets where you don’t need to update the model frequently.

10
Q

What is the best way to use AWS Glue to build a data schema?

A

Use Glue crawlers to crawl your data; the crawler infers and builds the schema automatically.

11
Q

A Rekognition model is not able to recognise visitors to a building. What might be the issue?

A

The face collection contents. Store multiple images of the same person in different positions and poses, with and without glasses, to improve recognition.

12
Q

What are Face landmarks?

A

Face landmarks are a set of salient points, usually located at the corners, tips and midpoints of key facial components like the eyes, lips and nose, which Amazon Rekognition uses.

13
Q

Could the face landmarks' filter sharpness impact Rekognition's success?

A

No. Face landmarks have no sharpness parameter.

14
Q

How could setting the confidence threshold tolerance too low impact Rekognition performance?

A

It could cause a failure in Rekognition

15
Q

What Amazon service could you use to produce a dashboard instead of coding a React or Angular UI?

A

Amazon QuickSight

16
Q

You are using a regression decision tree. As you train your model you see it is overfitting to your training data. How can you improve your situation and get better training results more efficiently?

A

Use a random forest by building multiple randomised decision trees and averaging their outputs to get the predictions.
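A minimal Scikit-learn sketch of the idea on invented data: a single deep tree memorises the noise, while the forest averages many randomised trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Invented data: y = 3x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)  # tends to overfit the noise
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

pred = forest.predict([[5.0]])  # averaged over 100 randomised trees
```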

17
Q

What technique could you use with a neural network to improve overfitting?

A

Use the “dropout” technique, which randomly deactivates a fraction of the network’s units during training, preventing co-adaptation and reducing overfitting.
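A from-scratch NumPy sketch of inverted dropout (real frameworks provide this as a layer with a rate parameter): each unit is zeroed with probability p during training, and the survivors are rescaled so the expected activation is unchanged.

```python
import numpy as np

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p during training and
    scale survivors by 1/(1-p) so the expected activation is unchanged."""
    if p == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(42)
h = np.ones((4, 8))                   # hypothetical hidden-layer activations
h_train = dropout(h, p=0.5, rng=rng)  # at inference time, no dropout is applied
```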

18
Q

What type of chart would you use to show the size of customer recommendations?

A

A distribution scatter chart

19
Q

What chart would you use to show how many leads were converted to customers?

A

Use a Conversion Rate KPI chart

20
Q

You need to predict the sales levels of each of the potential next products and select the one with the highest predicted purchase rate. Which type of machine learning approach should you use?

A

You are trying to solve for the greatest number of sales across the potential next products. Therefore you are solving a regression problem and should use a linear regression model.

21
Q

What strategy would you use to deal with missing data point values while attempting to maximise the accuracy of your model without introducing bias?

A

Impute the missing values using a deep learning strategy.

22
Q

What is the issue with using the Most Frequent strategy to replace missing values in categorical features?

A

It can introduce bias into the data

23
Q

Which metric is best for deciding which model predicts true positives best when you have few positive cases?

A

PR Curve.

24
Q

Which metric do you use when evaluating models where both Positive and Negative are of equal importance?

A

ROC curve

25
Q

When do you use Precision?

A

When you want to know how many predicted positives are actually positive: TP / (TP + FP). Use it when false positives are very costly.

26
Q

When do you use Recall/sensitivity?

A

When missing positives is very costly: TP / (TP + FN).

27
Q

When do you use Specificity?

A

When you want to know how many actual negatives the model captures: TN / (TN + FP).

28
Q

When do you use F1 score?

A

When a balance of precision and recall is needed: predicted positives should actually be positive, and actual positives should be captured.
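The three metrics in Scikit-learn, on an invented confusion pattern (TP=2, FP=1, FN=1, TN=2):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```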

29
Q

What is the problem with recall, precision and the normal F1 score?

A

They can be misleading when the data is imbalanced, for example cancer cases where there are few positives.

30
Q

What is a PR curve?

A

The curve of precision against recall for various threshold values. The target is the top-right corner of the graph.

31
Q

What is an ROC graph?

A

A Receiver Operating Characteristic graph plots the true positive rate (recall) against the false positive rate (1 − specificity). The top-left corner is the target.
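Both curves can be computed with Scikit-learn from classifier scores (the labels and scores below are invented):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# Invented true labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

fpr, tpr, _ = roc_curve(y_true, scores)                        # ROC: aim for the top left
precision, recall, _ = precision_recall_curve(y_true, scores)  # PR: aim for the top right
roc_auc = auc(fpr, tpr)  # area under the ROC curve
```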

32
Q

When should you use an ROC graph or a PR curve?

A

When there is a majority of negative items.

33
Q

When should you use the ROC graph?

A

When both negative and positive classifications are important and the data is unbalanced.

34
Q

When should you use the PR curve?

A

When correctly classifying Positive items is more important than negative and the data is unbalanced.

35
Q

You have created a Glue crawler that you have configured to crawl the data on S3 and have written a custom classifier. Unfortunately the crawler failed to create a schema. Why might the Glue crawler have failed in this way?

A

All classifiers returned a certainty of 0.0

36
Q

How would you stop your Glue Crawler from crawling objects in a set directory?

A

Add an exclude pattern when you configure the data store. Give the path of the objects to be ignored relative to the include path.

37
Q

How would you tell the Glue Crawler to group compatible schemas?

A

Enable the crawler option to create a single schema for each S3 path.

38
Q

Your training data set is imbalanced. What would you use as a preprocessing step before you create your SageMaker training job?

A

Run your training data through a preprocessing script that uses SMOTE (Synthetic Minority Over-sampling Technique), which uses the k-NN algorithm to create synthetic observations and balance the data set.
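In practice you would reach for the imbalanced-learn library's SMOTE class; the from-scratch sketch below (on invented data) shows the k-NN interpolation idea behind it.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        j = idx[i][rng.integers(1, k + 1)]  # pick a real neighbour, skipping self at position 0
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 3))  # invented minority-class features
X_new = smote_oversample(X_min, n_new=30)
```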

39
Q

How does SageMaker Ground Truth work?

A

It uses an active learning model that is trained from human-labeled data. Any image it understands is automatically labeled. Any ambiguous data is sent to human labellers for annotation then sent back to the active learning model to retrain the model to improve accuracy incrementally.

40
Q

You want to produce real-time analysis of streaming data from IoT devices in the field, where events are analysed in real time. You also need to retain the data from the IoT devices for 7 days, since you cannot fail to process any events. Which approach would give the best solution for processing your streaming data?

A

Use Amazon Kinesis Data Streams and its Kinesis API PutRecords call to pass events from your producers to your Kinesis stream, and increase the stream's retention period from the default 24 hours to 7 days so no events are lost.

41
Q

Why would you not use the Amazon Kinesis Producer Library (KPL) for real-time processing of event data?

A

It can incur an additional processing delay of up to RecordMaxBufferedTime within the library, so it is not suited to real-time processing.

42
Q

Why would you not use the Amazon Kinesis Client Library (KCL) for real-time processing?

A

It consumes data produced with the Kinesis Producer Library, so it also suffers from the same buffering delay.

43
Q

How long can Amazon Kinesis Data Firehose store data for?

A

Amazon Kinesis Data Firehose attempts to redeliver data for a maximum of 24 hours. If data must be retained for longer, it is not a suitable service.

44
Q

What is the Amazon Elastic Transcoder service used for?

A

To convert video files from one format to another.

45
Q

Can the AWS Rekognition service send output directly to your SageMaker model?

A

No, it requires an additional component.

46
Q

What is the most efficient way of taking streams from Amazon Kinesis Data Streams and transforming them with SQL or Apache Flink?

A

Amazon Kinesis Data Analytics

47
Q

Which formats does AWS Glue currently support as output?

A

CSV, Parquet, Avro

48
Q

Which formats does AWS Glue currently support as input?

A

CSV, Parquet, XML, Avro, Grok Log, ORC

49
Q

What is the most efficient format to convert data into with Glue for use with Hive?

A

ORC

50
Q

You stream your data through Kinesis Data Firehose to Lambda and then to S3, but notice no data is arriving in your S3 bucket. What might be the issue?

A

Your Lambda timeout is set to the default of 3 seconds, which is not enough time to execute transformation functions when using Kinesis Data Firehose.

51
Q

What are valid put_record request parameters?

A

Data