ML Fundamentals Flashcards

1
Q

allows people to store objects (files) in “buckets”
(directories)

A

Amazon S3

2
Q

What pathway is this called: <my_bucket>/my_folder1/another_folder/my_file.txt

A

S3 Bucket Key
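To make the bucket/key split concrete, a minimal boto3 sketch (assumes configured AWS credentials; bucket, key, and payload are illustrative):

```python
import boto3

s3 = boto3.client("s3")
# The Key is the full path of the object inside the bucket
s3.put_object(
    Bucket="my_bucket",
    Key="my_folder1/another_folder/my_file.txt",
    Body=b"hello world",
)
```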

3
Q
  • Pattern for speeding up range queries (ex: AWS Athena)
  • By Date: s3://bucket/my-dataset/year/month/day/hour/data_00.csv
  • By Product: s3://bucket/my-data-set/product-id/data_32.csv
A

Amazon S3 Data Partitioning
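A small sketch of building a date-partitioned key like the pattern above (dataset name is illustrative):

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
# s3://bucket/my-dataset/year/month/day/hour/data_00.csv
key = f"my-dataset/{now:%Y}/{now:%m}/{now:%d}/{now:%H}/data_00.csv"
```

Queries that filter on year/month/day/hour can then skip irrelevant prefixes entirely.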

4
Q

Durability or availability:
* If you store 10,000,000 objects with Amazon S3, you can on average
expect to incur a loss of a single object once every 10,000 years
* Same for all storage classes

A

Durability

5
Q

Durability or availability:
* Measures how readily available a service is
* Varies depending on storage class

A

Availability

6
Q

What S3 storage class is the below:
* 99.99% Availability
* Used for frequently accessed data
* Low latency and high throughput
* Sustain 2 concurrent facility failures
* Use Cases: Big Data analytics, mobile & gaming applications,
content distribution…

A

S3 Standard – General Purpose

7
Q

What S3 Storage class:
* For data that is less frequently accessed, but requires rapid access when needed
* Lower cost than S3 Standard
* 99.9% Availability
* Use cases: Disaster Recovery, backups

A
  • Amazon S3 Standard-Infrequent Access (S3 Standard-IA)
8
Q

What S3 Storage class:
* For data that is less frequently accessed, but requires rapid access when needed
* Lower cost than S3 Standard
* High durability (99.999999999%) in a single AZ; data lost when AZ is destroyed
* 99.5% Availability
* Use Cases: Storing secondary backup copies of on-premises data, or data you can recreate

A
  • Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA)
9
Q

What S3 Storage class:
* Small monthly monitoring and auto-tiering fee
* Moves objects automatically between Access Tiers based on usage
* There are no retrieval charges in S3 Intelligent-Tiering

A

S3 Intelligent-Tiering

10
Q

Describe the S3 storage Intelligent Tiering classes below:
*__________: default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* ______: objects not accessed for 90 days
* _________: configurable from 90 days to 700+ days
* ________: config. from 180 days to 700+ days

A

Frequent Access tier (automatic): default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* Archive Instant Access tier (automatic): objects not accessed for 90 days
* Archive Access tier (optional): configurable from 90 days to 700+ days
* Deep Archive Access tier (optional): config. from 180 days to 700+ days

11
Q
  • Help you decide when to transition objects
    to the right storage class
  • Recommendations for Standard and
    Standard IA
  • Does NOT work for One-Zone IA or Glacier
  • Report is updated daily
  • 24 to 48 hours to start seeing data analysis
  • Good first step to put together Lifecycle
    Rules (or improve them)!
A

Amazon S3 Analytics

12
Q

Bucket-wide rules from the S3 console; allows cross-account access

A

S3 Bucket policies

13
Q

_____ is a managed alternative to Apache Kafka
* Great for application logs, metrics, IoT, clickstreams
* Great for “real-time” big data
* Great for streaming processing frameworks (Spark, NiFi, etc…)
* Data is automatically replicated synchronously to 3 AZs

A

Amazon Kinesis

14
Q

__________ low latency streaming ingest at scale

A

Kinesis Streams
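A minimal boto3 sketch of streaming ingest (stream name and payload are illustrative); the partition key determines which shard receives the record:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="my-clickstream",
    Data=json.dumps({"user": "u1", "event": "click"}).encode("utf-8"),
    PartitionKey="u1",  # same key -> same shard, preserving per-key order
)
```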

15
Q

________ perform real-time analytics on streams using SQL

A

Kinesis Analytics

16
Q

_________ load streams into S3, Redshift, ElasticSearch & Splunk

A

Kinesis Firehose

17
Q

______ meant for streaming video in real-time

A

Kinesis Video Streams

18
Q

Kinesis Streams are divided into ordered ______

A

Shards

19
Q

What are the two capacity modes for Kinesis Data streams?

A

Provisioned and On-Demand modes

20
Q

What Kinesis data stream capacity mode is below:
* You choose the number of shards provisioned, scale manually or using API
* Each shard gets 1MB/s in (or 1000 records per second)
* Each shard gets 2MB/s out (classic or enhanced fan-out consumer)
* You pay per shard provisioned per hour

A

Provisioned

21
Q

What Kinesis data stream capacity mode is below:
* No need to provision or manage the capacity
* Default capacity provisioned (4 MB/s in or 4000 records per second)
* Scales automatically based on observed throughput peak during the last 30
days
* Pay per stream per hour & data in/out per GB

A

On-demand mode

22
Q

What Kinesis service is this:
* Fully Managed Service, no administration
* Near Real Time (60-second minimum latency for non-full batches)
* Data Ingestion into Redshift / Amazon S3 / ElasticSearch / Splunk
* Automatic scaling
* Supports many data formats
* Data Conversions from CSV / JSON to Parquet / ORC (only for S3)
* Data Transformation through AWS Lambda (ex: CSV => JSON)
* Supports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)

A

Kinesis Data Firehose

23
Q

What's the difference between Kinesis Data Streams and Firehose?

A

* Streams
  * Going to write custom code (producer / consumer)
  * Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
  * Automatic scaling with On-demand Mode
  * Data Storage for 1 to 365 days, replay capability, multi consumers
* Firehose
  * Fully managed, send to S3, Splunk, Redshift, ElasticSearch
  * Serverless data transformations with Lambda
  * Near real time (lowest buffer time is 1 minute)
  * Automated Scaling
  * No data storage

24
Q

What Kinesis tool is this:

Use cases:
* Streaming ETL: select columns, make simple transformations, on streaming data
* Continuous metric generation: live leaderboard for a mobile game
* Responsive analytics: look for certain criteria and build alerting (filtering)

Features:
* Pay only for resources consumed (but it's not cheap)
* Serverless; scales automatically
* Use IAM permissions to access streaming source and destination(s)
* SQL or Flink to write the computation
* Schema discovery
* Lambda can be used for pre-processing

A

Kinesis Data Analytics

25
For Kinesis Analytics, you pay only for ______ (but it's not cheap)
resources consumed
26
Is Amazon Kinesis serverless?
Yes
27
What Amazon data product has the below characteristics:
* Producers: security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds, images, RADAR data, RTSP camera
* One producer per video stream
* Video playback capability
* Consumers:
  * Build your own (MXNet, TensorFlow)
  * AWS SageMaker
  * Amazon Rekognition Video
* Keep data for 1 hour to 10 years
Kinesis Video Streams
28
__________ create real-time machine learning applications
Kinesis Data Stream
29
_____ ingest massive data near-real time
Kinesis Data Firehose
30
___________ real-time ETL / ML algorithms on streams
Kinesis Data Analytics
31
___________ real-time video stream to create ML applications
Kinesis Video Stream
32
* Metadata repository for all your tables
* Automated Schema Inference
* Schemas are versioned
* Integrates with Athena or Redshift Spectrum (schema & data discovery)
Glue Data Catalog
33
____ go through your data to infer schemas and partitions
* Works with JSON, Parquet, CSV, relational stores
Glue crawlers
34
Transform data, Clean Data, Enrich Data (before doing analysis)
* Generate ETL code in Python or Scala, you can modify the code
* Can provide your own Spark or PySpark scripts
* Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
* Fully managed, cost effective, pay only for the resources consumed
* Jobs are run on a serverless Spark platform
Glue ETL
35
What type of data store is this: Data Warehousing, SQL analytics (OLAP - Online analytical processing)
Redshift
36
What type of data store is this: Relational Store, SQL (OLTP - Online Transaction Processing) * Must provision servers in advance
RDS, Aurora
37
What type of data store is this: NoSQL data store, serverless, provision read/write capacity * Useful to store a machine learning model served by your application
DynamoDB
38
What type of data store is this: Object storage * Serverless, infinite storage * Integration with most AWS Services
S3
39
What type of data store is this: * Indexing of data * Search amongst data points * Clickstream Analytics
OpenSearch (previously ElasticSearch)
40
What type of data store is this: * Caching mechanism * Not really used for Machine Learning
ElastiCache
41
What AWS data service do the below features identify?
* Destinations include S3, RDS, DynamoDB, Redshift and EMR
* Manages task dependencies
* Retries and notifies on failures
* Data sources may be on-premises
* Highly available
AWS Data Pipeline
42
What are the differences between AWS Data Pipeline and AWS Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift Spectrum

Data Pipeline:
* Orchestration service
* More control over the environment, compute resources that run code, & code
* Allows access to EC2 or EMR instances (creates resources in your own account)
43
What AWS data service is below:
* Run batch jobs as Docker images
* Dynamic provisioning of the instances (EC2 & Spot Instances)
* Optimal quantity and type based on volume and requirements
* No need to manage clusters, fully serverless
* You just pay for the underlying EC2 instances
AWS Batch
44
What is the difference between AWS Batch and Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift Spectrum

Batch:
* For any computing job regardless of the job (must provide Docker image)
* Resources are created in your account, managed by Batch
* For any non-ETL related work, Batch is probably better
45
What AWS data service has the below features:
* Quickly and securely migrate databases to AWS, resilient, self healing
* The source database remains available during the migration
* Supports:
  * Homogeneous migrations: ex Oracle to Oracle
  * Heterogeneous migrations: ex Microsoft SQL Server to Aurora
* Continuous Data Replication using CDC
* You must create an EC2 instance to perform the replication tasks
AWS Database Migration Service - DMS
46
What is the difference between AWS DMS and Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift Spectrum

AWS DMS:
* Continuous Data Replication
* No data transformation
* Once the data is in AWS, you can use Glue to transform it
47
What AWS Data service has the below features:
* For data migrations from on-premises to AWS storage services
* A DataSync Agent is deployed as a VM and connects to your internal storage (NFS, SMB, HDFS)
* Encryption and data validation
AWS DataSync
48
* An Internet of Things (IoT) thing
* Standard messaging protocol
* Think of it as how lots of sensor data might get transferred to your machine learning model
* The AWS IoT Device SDK can connect via ____
MQTT
49
What are the three major types of data?
* Numerical * Categorical * Ordinal
50
______ Represents some sort of quantitative measurement * Heights of people, page load times, stock prices, etc.
Numerical
51
_______ is Integer based; often counts of some event. * How many purchases did a customer make in a year? * How many times did I flip “heads”?
Discrete data
52
__________ * Has an infinite number of possible values * How much time did it take for a user to check out? * How much rain fell on a given day?
Continuous Data
53
___________ is Qualitative data that has no inherent mathematical meaning * Gender, Yes/no (binary data), Race, State of Residence, Product Category, Political Party, etc.
Categorical data
54
A mixture of numerical and categorical
* Categorical data that has mathematical meaning
* Example: movie ratings on a 1-5 scale
  * Ratings must be 1, 2, 3, 4, or 5
  * But these values have mathematical meaning; 1 means it's a worse movie than a 2
Ordinal data
55
What AWS service has the below characteristics:
* Interactive query service for S3 (SQL)
* No need to load data, it stays in S3
* Presto under the hood
* Serverless!
* Supports many data formats:
  * CSV (human readable)
  * JSON (human readable)
  * ORC (columnar, splittable)
  * Parquet (columnar, splittable)
  * Avro (splittable)
* Unstructured, semi-structured, or structured
Amazon Athena
56
What AWS service uses the below scenarios?
* Ad-hoc queries of web logs
* Querying staging data before loading to Redshift
* Analyze CloudTrail / CloudFront / VPC / ELB etc logs in S3
* Integration with Jupyter, Zeppelin, RStudio notebooks
* Integration with QuickSight
* Integration via ODBC / JDBC with other visualization tools
Amazon Athena
57
What AWS service has the below cost model? Pay-as-you-go
* $5 per TB scanned
* Successful or cancelled queries count, failed queries do not
* No charge for DDL (CREATE/ALTER/DROP etc.)
* Save LOTS of money by using columnar formats (ORC, Parquet)
  * Save 30-90%, and get better performance
Athena
58
What AWS Service has the below characteristics:
* Fast, easy, cloud-powered business analytics service
* Allows all employees in an organization to:
  * Build visualizations
  * Perform ad-hoc analysis
  * Quickly get business insights from data
  * Anytime, on any device (browsers, mobile)
* Serverless
QuickSight
59
What is the in memory database that is used by quicksight?
SPICE
60
What QuickSight service is below: Machine learning-powered
* Answers business questions with Natural Language Processing
  * "What are the top-selling items in Florida?"
* Offered as an add-on for given regions
* Personal training on how to use it is required
* Must set up topics associated with datasets
  * Datasets and their fields must be NLP-friendly
  * How to handle dates must be defined
QuickSight Q
61
What QuickSight service is below: Reports designed to be printed
* May span many pages
* Can be based on existing QuickSight dashboards
* New in Nov 2022
Paginated Reports
62
What AWS Service is this:
* Managed Hadoop framework on EC2 instances
* Includes Spark, HBase, Presto, Flink, Hive & more
* EMR Notebooks
* Several integration points with AWS
Amazon EMR (Elastic Map Reduce)
63
What is this called: Applying your knowledge of the data – and the model you're using – to create better features to train your model with.
* Which features should I use?
* Do I need to transform these features in some way?
* How do I handle missing data?
* Should I create new features from the existing ones?
Feature engineering
64
What is The Curse of Dimensionality ?
Too many features can be a problem – leads to sparse data
* Every feature is a new dimension
* Much of feature engineering is selecting the features most relevant to the problem at hand
* This often is where domain knowledge comes into play
65
What AI data cleansing concept is below: Replace missing values with the mean value from the rest of the column
* Columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature
* Fast & easy, won't affect mean or sample size of overall data set
* Median may be a better choice than mean when outliers are present
Mean replacement
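A minimal scikit-learn sketch of mean replacement (the array is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
# Each NaN is replaced by the mean of its own column (feature)
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
```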
66
What are the cons of mean replacement?
Only works on column level, misses correlations between features * Can’t use on categorical features (imputing with most frequent value can work in this case, though) * Not very accurate
67
What solution to missing data is this:
* If not many rows contain missing data…
* …and dropping those rows doesn't bias your data…
* …and you don't have a lot of time…
* …maybe it's a reasonable thing to do
* But, it's never going to be the right answer for the "best" approach
Dropping data
68
What are the three ways to solve missing data with machine learning techniques?
* KNN: Find K "nearest" (most similar) rows and average their values
  * Assumes numerical data, not categorical
  * There are ways to handle categorical data (Hamming distance), but categorical data is probably better served by…
* Deep Learning
  * Build a machine learning model to impute data for your machine learning model!
  * Works well for categorical data. Really well. But it's complicated.
* Regression
  * Find linear or non-linear relationships between the missing feature and other features
  * Most advanced technique: MICE (Multiple Imputation by Chained Equations)
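For the KNN option, a minimal scikit-learn sketch (illustrative data):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
# Each NaN is filled with the average of the k most similar rows
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```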
69
What kind of data is this: Large discrepancy between "positive" and "negative" cases
* i.e., fraud detection. Fraud is rare, and most rows will be not-fraud
* Don't let the terminology confuse you; "positive" doesn't mean "good"
  * It means the thing you're testing for is what happened
  * If your machine learning model is made to detect fraud, then fraud is the positive case
* Mainly a problem with neural networks
unbalanced data
70
To improve AI data quality, what is the term below: Artificially generate new samples of the minority class using nearest neighbors
* Run K-nearest-neighbors of each sample of the minority class
* Create a new sample from the KNN result (mean of the neighbors)
* Both generates new samples and undersamples majority class
* Generally better than just oversampling
SMOTE (Synthetic Minority Over-sampling TEchnique)
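A minimal sketch using the imbalanced-learn library (a separate package, not an AWS service; the synthetic dataset is illustrative):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# 95% / 5% class split to mimic an unbalanced problem
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
# y_res now has roughly balanced classes
```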
71
If you have too many false positives, one way to fix that is to simply increase that _________
threshold
72
_____ is simply the average of the squared differences from the mean
Variance
73
_____ is just the square root of the variance.
Standard Deviation (σ)
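Both definitions in a short NumPy sketch (values are illustrative):

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 4.0, 8.0])
variance = np.mean((x - x.mean()) ** 2)  # same as np.var(x)
sigma = np.sqrt(variance)                # same as np.std(x)
```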
74
Bucket observations together based on ranges of values. * Example: estimated ages of people * Put all 20-somethings in one classification, 30-somethings in another, etc
Binning
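A minimal pandas sketch of the ages example (bin edges are illustrative):

```python
import pandas as pd

ages = pd.Series([23, 35, 29, 41, 38])
decades = pd.cut(ages, bins=[20, 30, 40, 50], labels=["20s", "30s", "40s"])
```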
75
Applying some function to a feature to make it better suited for training
Transforming
76
Transforming data into some new representation required by the model
encoding
77
Some models prefer feature data to be normally distributed around 0 (most neural nets)
* Most models require feature data to at least be scaled to comparable values
* Otherwise features with larger magnitudes will have more weight than they should
* Example: modeling age and income as features – incomes will be much higher values than ages
Scaling/normalization
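A minimal scikit-learn sketch of the age/income example (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 40_000], [35, 85_000], [45, 120_000]], dtype=float)
# Each feature is rescaled to zero mean / unit variance, so income's
# large magnitude no longer dominates age
X_scaled = StandardScaler().fit_transform(X)
```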
78
Many algorithms benefit from _____ their training data * Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
shuffling
79
What is Ground Truth?
* Ground Truth manages humans who will label your data for training purposes
* Ground Truth creates its own model as images are labeled by people
* As this model learns, only images the model isn't sure about are sent to human labelers
80
Turnkey solution
* "Our team of AWS Experts" manages the workflow and team of labelers
* You fill out an intake form
* They contact you and discuss pricing
Ground Truth Plus
81
* AWS service for image recognition * Automatically classify images
Rekognition
82
* AWS service for text analysis and topic modeling * Automatically classify text by topics, sentiment
Comprehend
83
* Important data for search – figures out what terms are most relevant for a document
TF-IDF (stands for Term Frequency and Inverse Document Frequency)
84
* just measures how often a word occurs in a document * A word that occurs frequently is probably important to that document’s meaning
Term Frequency
85
_____ is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page * This tells us about common words that just appear everywhere no matter what the topic, like "a", "the", "and", etc.
Document Frequency
86
Can you explain bi grams and tri grams?
An extension of TF-IDF is to not only compute relevancy for individual words (terms) but also for bi-grams or, more generally, n-grams.
* "I love certification exams"
* Unigrams: "I", "love", "certification", "exams"
* Bi-grams: "I love", "love certification", "certification exams"
* Tri-grams: "I love certification", "love certification exams"
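A minimal scikit-learn sketch of TF-IDF with unigrams and bi-grams (documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love certification exams", "I love machine learning"]
vec = TfidfVectorizer(
    ngram_range=(1, 2),            # unigrams + bi-grams
    token_pattern=r"(?u)\b\w+\b",  # keep one-letter tokens like "I"
)
tfidf = vec.fit_transform(docs)    # sparse document-term matrix
print(vec.get_feature_names_out())
```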
87
What are the three types of neural networks?
* Feedforward Neural Network * Convolutional Neural Networks (CNN) * Recurrent Neural Networks (RNNs)
88
What kind of activation function is this: It doesn’t really *do* anything * Can’t do backpropagation
Linear
89
What kind of activation function is this: * It’s on or off * Can’t handle multiple classification – it’s binary after all * Vertical slopes don’t work well with calculus!
Binary step function
90
What kind of activation function is this: * These can create complex mappings between inputs and outputs * Allow backpropagation (because they have a useful derivative) * Allow for multiple layers (linear functions degenerate to a single layer)
Non linear activation function
91
What kind of activation function is this:
* Nice & smooth
* Scales everything from 0-1 (Sigmoid / Logistic) or -1 to 1 (tanh / hyperbolic tangent)
* But: changes slowly for high or low values
  * The "Vanishing Gradient" problem
* Computationally expensive
* Tanh generally preferred over sigmoid
Sigmoid / Logistic / TanH
92
What kind of activation function is this: Now we're talking
* Very popular choice
* Easy & fast to compute
* But, when inputs are zero or negative, we have a linear function and all of its problems
Rectified Linear Unit (ReLU)
93
What kind of activation function is this: Solves “dying ReLU” by introducing a negative slope below 0 (usually not as steep as this)
Leaky ReLU
94
What kind of activation function is this: * ReLU, but the slope in the negative part is learned via backpropagation * Complicated and YMMV
Parametric ReLU (PReLU)
95
What kind of activation function is this: * From Google, performs really well * But it’s from Google, not Amazon… * Mostly a benefit with very deep networks (40+ layers)
Swish
96
What kind of activation function is this: * Outputs the max of the inputs * Technically ReLU is a special case of maxout * But doubles parameters that need to be trained, not often practical.
Maxout
97
* Used on the final output layer of a multi-class classification problem * Basically converts outputs to probabilities of each classification * Can’t produce more than one label for something (sigmoid can)
Softmax
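Minimal NumPy sketches of several of the activations above, for reference:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small negative slope below 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract max for numerical stability
    return e / e.sum()                    # outputs sum to 1 (probabilities)
```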
98
What are convolutional neural networks used for?
When you have data that doesn’t neatly align into columns * Images that you want to find features within * Machine translation * Sentence classification * Sentiment analysis * They can find features that aren’t in a specific spot * Like a stop sign in a picture * Or words within a sentence * They are “feature-location invariant”
99
_________ They can find features that aren’t in a specific spot * Like a stop sign in a picture * Or words within a sentence
convolutional neural network
100
True or false: CNNs are very resource-intensive (CPU, GPU, and RAM)
true
101
What are recurrent neural networks used for?
Time-series data:
* When you want to predict future behavior based on past behavior
* Web logs, sensor logs, stock trades
* Where to drive your self-driving car based on past trajectories
Data that consists of sequences of arbitrary length:
* Machine translation
* Image captions
* Machine-generated music
102
What neural network should you use:
* Time-series data
* When you want to predict future behavior based on past behavior
* Web logs, sensor logs, stock trades
* Where to drive your self-driving car based on past trajectories
recurrent neural network
103
________ deep learning architectures are what's hot
* Adopts mechanism of "self-attention"
* Weighs significance of each part of the input data
* Processes sequential data (like words, like an RNN), but processes entire input all at once
* The attention mechanism provides context, so no need to process one word at a time
* BERT, RoBERTa, T5, GPT-2, DistilBERT, etc.
* DistilBERT: uses knowledge distillation to reduce model size by 40%
Transformer
104
What is it called when the below things are used in AI?
* NLP models (and others) are too big and complex to build from scratch and re-train every time
* The latest may have hundreds of billions of parameters!
* Model zoos such as Hugging Face offer pre-trained models to start from
* Integrated with SageMaker via Hugging Face Deep Learning Containers
* You can fine-tune these models for your own use cases
transfer learning
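The simplest form of this, as a hedged sketch with the Hugging Face transformers library (the default model downloaded by pipeline() is illustrative):

```python
from transformers import pipeline

# Reuse a pre-trained model instead of training from scratch
clf = pipeline("sentiment-analysis")
print(clf("This certification course is great!"))
```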
105
Neural networks are trained by ________ (or similar means)
gradient descent
106
* Too high a learning rate means you might _________
overshoot the optimal solution!
107
* Too small a learning rate will _____
take too long to find the optimal solution
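A toy sketch tying these cards together: gradient descent on f(w) = (w - 3)^2, whose gradient is 2(w - 3):

```python
w, lr = 0.0, 0.1           # lr is the learning rate (a hyperparameter)
for _ in range(100):
    w -= lr * 2 * (w - 3)  # step against the gradient
# w converges to ~3.0; too high an lr overshoots, too low converges slowly
```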
108
Learning rate is an example of a ___________
hyperparameter
109
Smaller batch sizes can work their way out of _________
“local minima” more easily
110
* Batch sizes that are too large can ________
end up getting stuck in the wrong solution
111
* Regularization techniques are intended to prevent ________.
overfitting
112
true or false: Overfitted models have learned patterns in the training data that don’t generalize to the real world
true
113
* Models that are good at making predictions on the data they were trained on, but not on new data it hasn’t seen before
overfitting
114
What is the vanishing gradient problem?
When the slope of the learning curve approaches zero, things can get stuck
115
_ regularization: sum of weights
* Performs feature selection – entire features go to 0
* Computationally inefficient
* Sparse output
L1 regularization
116
__ regularization: sum of square of weights
* All features remain considered, just weighted
* Computationally efficient
* Dense output
L2 regularization
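A scikit-learn sketch of the contrast (synthetic data; alpha values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, random_state=0)
l1 = Lasso(alpha=1.0).fit(X, y)  # L1: some coefficients become exactly 0
l2 = Ridge(alpha=1.0).fit(X, y)  # L2: all coefficients shrink but stay nonzero
```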
117
What matrix does the below show? * A test for a rare disease can be 99.9% accurate by just guessing “no” all the time * We need to understand true positives and true negative, as well as false positives and false negatives.
the confusion matrix
118
____ = AKA Sensitivity, True Positive rate, Completeness * Percent of positives rightly predicted * Good choice of metric when you care a lot about false negatives
recall
119
What is the formula for recall?
True Positives / (True Positives + False Negatives)
120
____ = AKA Correct Positives * Percent of relevant results * Good choice of metric when you care a lot about false positives * i.e., medical screening, drug testing
precision
121
What is the formula for precision?
True Positives / (True Positives + False Positives)
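Both formulas computed from confusion-matrix counts (numbers are illustrative):

```python
tp, fp, fn = 80, 10, 20
recall = tp / (tp + fn)     # 0.80  -- penalizes false negatives
precision = tp / (tp + fp)  # ~0.89 -- penalizes false positives
```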
122
* Plot of true positive rate (recall) vs. false positive rate at various threshold settings
* Points above the diagonal represent good classification (better than random)
* Ideal curve would just be a point in the upper-left corner
* The more it's "bent" toward the upper-left, the better
ROC Curve (Receiver Operating Characteristic Curve)
123
Equal to probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
* ROC AUC of 0.5 is a useless classifier, 1.0 is perfect
* Commonly used metric for comparing classifiers
Area Under the Curve (AUC)
124
Good = higher area under curve
* Similar to ROC curve
* But better suited for information retrieval problems
* ROC can result in very small values if you are searching a large number of documents for a tiny number that are relevant
Precision / Recall curve
125
__________ = Generate N new training sets by random sampling with replacement * Each resampled model can be trained in parallel
bagging
126
_____ = * Observations are weighted * Some will take part in new training sets more often * Training is sequential; each classifier takes into account the previous one’s success.
boosting
127
What type of SageMaker built-in algorithm is this: Linear regression
* Fit a line to your training data
* Predictions based on that line
* Can handle both regression (numeric) predictions and classification predictions
* For classification, a linear threshold function is used
* Can do binary or multi-class
Linear learner
128
For linear learner, it can handle both regression (numeric) predictions and _______ predictions
classification predictions
129
Linear Learner: What training input does it expect?
* RecordIO-wrapped protobuf * CSV * File or Pipe mode both supported
130
Linear learner: Preprocessing
* Training data must be ______ (so all features are weighted the same)
* Linear Learner can do this for you automatically
normalized
131
What does sagemaker linear learner use in training?
Uses stochastic gradient descent
132
What type of SageMaker built-in algorithm is this: Boosted group of decision trees
* New trees made to correct the errors of previous trees
* Uses gradient descent to minimize loss as new trees are added
XGBoost
133
What type of training input does xgboost expect?
It takes CSV or libsvm input.
134
With xgboost, Models are serialized/deserialized with ___
Pickle
135
What type of SageMaker built-in algorithm is this:
* Input is a sequence of tokens, output is a sequence of tokens
* Machine Translation
* Text summarization
* Speech to text
* Implemented with RNNs and CNNs with attention
Seq2Seq
136
What SageMaker built-in algorithm maps to the below training inputs:
* RecordIO-Protobuf
  * Tokens must be integers (this is unusual, since most algorithms want floating point data)
* Start with tokenized text files
* Convert to protobuf using sample code
  * Packs into integer tensors with vocabulary files
  * A lot like the TF/IDF lab we did earlier
* Must provide training data, validation data, and vocabulary files
Seq2Seq
137
Seq2Seq can optimize on:
* Accuracy (vs. provided validation dataset)
* ____ score (compares against multiple reference translations)
* Perplexity (cross-entropy)
BLEU score
138
Seq2Seq: Instance Types
* Can only use ____ instance types (P3 for example)
* Can only use a single machine for training
* But can use multiple GPUs on one machine
GPU instance types
139
What SageMaker algorithm has the below characteristics?
* Forecasting one-dimensional time series data
* Uses RNNs
* Allows you to train the same model over several related time series
* Finds frequencies and seasonality
DeepAR
140
What SageMaker algorithm has the below training input needs?
* JSON Lines format (Gzip or Parquet)
* Each record must contain:
  * Start: the starting time stamp
  * Target: the time series values
* Each record can contain:
  * Dynamic_feat: dynamic features (such as, was a promotion applied to a product in a time series of product purchases)
  * Cat: categorical features
DeepAR
141
For DeepAR, always include the entire _____ for training, testing, and inference
time series
142
For DeepAR, start with ___, move up to __ if necessary.
CPU, GPU
143
What SageMaker algorithm has the below characteristics:
* Text classification
  * Predict labels for a sentence
  * Useful in web searches, information retrieval
  * Supervised
* Word2vec
  * Creates a vector representation of words
  * Semantically similar words are represented by vectors close to each other
  * This is called a word embedding
  * It is useful for NLP, but is not an NLP algorithm in itself!
  * Used in machine translation, sentiment analysis
  * Remember it only works on individual words, not sentences or documents
BlazingText
144
BlazingText: What training input does it expect?
* For supervised mode (text classification):
  * One sentence per line
  * First "word" in the sentence is the string __label__ followed by the label
  * Also, "augmented manifest text format"
* Word2vec just wants a text file with one training sentence per line
145
What type of sagemaker algorithm is below: * It creates low-dimensional dense embeddings of high-dimensional objects * It is basically word2vec, generalized to handle things other than words. * Compute nearest neighbors of objects * Visualize clusters * Genre prediction * Recommendations (similar items or users)
Object2Vec
146
What type of algorithm has the below training requirements: * Data must be tokenized into integers * Training data consists of pairs of tokens and/or sequences of tokens * Sentence – sentence * Labels-sequence (genre to description?) * Customer-customer * Product-product * User-item
Object2Vec
147
For Object2Vec, you process data into ___ and shuffle it
JSON Lines
148
What are important hyperparameters for Object2Vec?
* The usual deep learning ones…
  * Dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
* Enc1_network, enc2_network
  * Choose hcnn, bilstm, pooled_embedding
149
What sagemaker algorithm is below: * Identify all objects in an image with bounding boxes * Detects and classifies objects with a single deep neural network * Classes are accompanied by confidence scores * Can train from scratch, or use pretrained models based on ImageNet
object detection
150
What are the two variants of sagemaker object detection?
MXNet and TensorFlow
* Takes an image as input, outputs all instances of objects in the image with categories and confidence scores
* MXNet:
  * Uses a CNN with the Single Shot multibox Detector (SSD) algorithm
  * The base CNN can be VGG-16 or ResNet-50
  * Transfer learning mode / incremental training
  * Use a pre-trained model for the base network weights, instead of random initial weights
  * Uses flip, rescale, and jitter internally to avoid overfitting
* TensorFlow:
  * Uses ResNet, EfficientNet, MobileNet models from the TensorFlow Model Garden
151
What training input does object detection expect?
* MXNet: RecordIO or image format (jpg or png) * With image format, supply a JSON file for annotation data for each image
152
What's the difference between object detection and image classification?
Object detection will show the specific point in the image where the object is. Image classification will classify the image and tell you what it is, not where it is
153
Image Classification: What’s it for?
* Assign one or more labels to an image * Doesn’t tell you where objects are, just what objects are in the image
154
For image classification, there are separate algorithms for ________ and _____
MXNet and Tensorflow
155
Semantic Segmentation: What’s it for?
* Pixel-level object classification
* Different from image classification – that assigns labels to whole images
* Different from object detection – that assigns labels to bounding boxes
* Useful for self-driving vehicles, medical imaging diagnostics, robot sensing
156
* Useful for self-driving vehicles, medical imaging diagnostics, robot sensing
semantic segmentation
157
Semantic Segmentation: What training input does it expect?
* JPG Images and PNG annotations
* For both training and validation
* Label maps to describe annotations
* Augmented manifest image format supported for Pipe mode
* JPG images accepted for inference
158
What form of SageMaker algorithm tool has the below choices? Choice of 3 algorithms:
* Fully-Convolutional Network (FCN)
* Pyramid Scene Parsing (PSP)
* DeepLabV3
semantic segmentation
159
Random Cut Forest is used for ________
anomaly detection
160
Neural Topic Model: What’s it for?
* Organize documents into topics
* Classify or summarize documents based on topics
* It's not just TF/IDF
* "bike", "car", "train", "mileage", and "speed" might classify a document as "transportation" for example (although it wouldn't know to call it that)
161
What are the four data channels for neural topic model?
* "train" is required
* "validation", "test", and "auxiliary" are optional
162
Neural Topic Model: How is it used?
* You define how many topics you want * These topics are a latent representation based on top ranking words * One of two topic modeling algorithms in SageMaker – you can try them both!
163
Another topic modeling algorithm
* Not deep learning
* Unsupervised
* The topics themselves are unlabeled; they are just groupings of documents with a shared subset of words
* Can be used for things other than words
  * Cluster customers based on purchases
  * Harmonic analysis in music
Latent Dirichlet Allocation (LDA)
164
What SageMaker algorithm: Unsupervised; generates however many topics you specify
* Optional test channel can be used for scoring results (per-word log likelihood)
* Functionally similar to NTM, but CPU-based
  * Therefore maybe cheaper / more efficient
Latent Dirichlet Allocation (LDA)
165
Simple classification or regression algorithm
* Classification: find the K closest points to a sample point and return the most frequent label
* Regression: find the K closest points to a sample point and return the average value
K-Nearest-Neighbors (KNN)
166
For KNN: SageMaker includes a ___________ stage
* Avoid sparse data ("curse of dimensionality")
* At cost of noise / accuracy
* "sign" or "fjlt" methods
dimensionality reduction
167
These are important hyperparameters for what algorithm: * K! * Sample_size
KNN
168
What SageMaker algorithm:
* Unsupervised clustering
* Divide data into K groups, where members of a group are as similar as possible to each other
  * You define what "similar" means
  * Measured by Euclidean distance
* Web-scale K-Means clustering
K-Means
169
These are important hyperparameters for what algorithm:
* K!
  * Choosing K is tricky
  * Plot within-cluster sum of squares as function of K
  * Use "elbow method"
  * Basically optimize for tightness of clusters
* Mini_batch_size
* Extra_center_factor
* Init_method
K-Means
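A scikit-learn sketch of the elbow method mentioned above (synthetic blobs; the range of K is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
# inertia_ is the within-cluster sum of squares for a fitted model
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 9)]
# Pick the K where wcss stops dropping sharply (the "elbow")
```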
170
What is the below SageMaker algorithm:
* Dimensionality reduction
  * Project higher-dimensional data (lots of features) into lower-dimensional (like a 2D plot) while minimizing loss of information
* The reduced dimensions are called components
  * First component has largest possible variability
  * Second component has the next largest…
* Unsupervised
Principal Component Analysis (PCA)
171
PCA: What training input does it expect?
* recordIO-protobuf or CSV * File or Pipe on either
172
What SageMaker algorithm:
* Covariance matrix is created, then singular value decomposition (SVD)
* Two modes:
  * Regular: for sparse data and moderate number of observations and features
  * Randomized: for large number of observations and features; uses approximation algorithm
Principal Component Analysis (PCA)
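A scikit-learn sketch of PCA itself (synthetic data; 10 features reduced to 2 components):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=10, random_state=0)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
# The first component explains the most variance, the second the next most
print(pca.explained_variance_ratio_)
```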
173
What SageMaker algorithm: Dealing with sparse data
* Click prediction
* Item recommendations
* Since an individual user doesn't interact with most pages / products, the data is sparse
* Supervised (classification or regression)
* Limited to pair-wise interactions (user -> item, for example)
factorization machines
174
What SageMaker algorithm: Finds factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a matrix representing some pair of things (users & items?)
* Usually used in the context of recommender systems
factorization machines
175
What SageMaker algorithm:
* Unsupervised learning of IP address usage patterns
* Identifies suspicious behavior from IP addresses
  * Identify logins from anomalous IPs
  * Identify accounts creating resources from anomalous IPs
IP Insights
176
What SageMaker algorithm:
* Uses a neural network to learn latent vector representations of entities and IP addresses
* Entities are hashed and embedded
  * Need sufficiently large hash size
* Automatically generates negative samples during training by randomly pairing entities and IPs
IP Insights
177
What SageMaker algorithm:
* You have some sort of agent that "explores" some space
* As it goes, it learns the value of different state changes in different conditions
* Those values inform subsequent behavior of the agent
* Examples: Pac-Man, Cat & Mouse game (game AI), supply chain management, HVAC systems, industrial robotics, dialog systems, autonomous vehicles
* Yields fast on-line performance once the space has been explored
reinforcement learning
178
What SageMaker algorithm: A specific implementation of reinforcement learning
* You have:
  * A set of environmental states s
  * A set of possible actions in those states a
  * A value of each state/action Q
* Start off with Q values of 0
* Explore the space
* As bad things happen after a given state/action, reduce its Q
* As rewards happen after a given state/action, increase its Q
Q-learning
179
Reinforcement Learning in SageMaker * Uses a deep learning framework with ____ and ________
Tensorflow and MXNet
180
What is this called:
* SageMaker spins up a "HyperParameter Tuning Job" that trains as many combinations as you'll allow
  * Training instances are spun up as needed, potentially a lot of them
* The set of hyperparameters producing the best results can then be deployed as a model
* It learns as it goes, so it doesn't have to try every possible combination
Automatic Model Tuning
181
* Visual IDE for machine learning!
SageMaker Studio
182
Create and share Jupyter notebooks with SageMaker Studio * Switch between hardware configurations (no infrastructure to manage)
SageMaker Notebooks
183
* Organize, capture, compare, and search your ML jobs
SageMaker Experiments
184
* Saves internal model state at periodical intervals
  * Gradients / tensors over time as a model is trained
* Define rules for detecting unwanted conditions while training
* A debug job is run for each rule you configure
* Logs & fires a CloudWatch event when the rule is hit
SageMaker Debugger
185
* Automates:
  * Algorithm selection
  * Data preprocessing
  * Model tuning
  * All infrastructure
* It does all the trial & error for you
* More broadly this is called AutoML
SageMaker Autopilot
186
* Integrates with SageMaker Clarify * Transparency on how models arrive at predictions * Feature attribution
autopilot explainability
187
* Get alerts on quality deviations on your deployed models (via CloudWatch)
* Visualize data drift
  * Example: loan model starts giving people more credit due to drifting or missing input features
* Detect anomalies & outliers
* Detect new features
* No code needed
SageMaker Model Monitor
188
* _________ detects potential bias
  * i.e., imbalances across different groups / ages / income brackets
* With Model Monitor, you can monitor for bias and be alerted to new potential bias via CloudWatch
* SageMaker Clarify also helps explain model behavior
  * Understand which features contribute the most to your predictions
SageMaker Clarify
189
* A "feature" is just a property used to train a machine learning model
* Like, you might predict someone's political party based on "features" such as their address, income, age, etc.
* Machine learning models require fast, secure access to feature data for training
* It's also a challenge to keep it organized and share features across different models
SageMaker Feature Store
190
* Creates & stores your ML workflow (MLOps)
* Keep a running history of your models
* Tracking for auditing and compliance
* Automatically or manually-created tracking entities
* Integrates with AWS Resource Access Manager for cross-account lineage
SageMaker ML Lineage Tracking
191
* Visual interface (in SageMaker Studio) to prepare data for machine learning
* Import data
* Visualize data
* Transform data (300+ transformations to choose from)
  * Or integrate your own custom transforms with pandas, PySpark, PySpark SQL
* "Quick Model" to train your model with your data and measure its results
SageMaker Data Wrangler
192
* No-code machine learning for business analysts
* Upload csv data (csv only for now), select a column to predict, build it, and make predictions
* Can also join datasets
* Classification or regression
SageMaker Canvas
193
________
* For asynchronous or real-time inference endpoints
* Controls shifting traffic to new models ("Blue/Green Deployments")
  * All at once: shift everything, monitor, terminate blue fleet
  * Canary: shift a small portion of traffic and monitor
  * Linear: shift traffic in linearly spaced steps
* Auto-rollbacks
Deployment Guardrails
194
________ * Compare performance of shadow variant to production * You monitor in SageMaker console and decide when to promote it
Shadow Tests
195
One facet (demographic group) has fewer training values than another
* Class Imbalance (CI)
196
* Imbalance of positive outcomes between facet values
* Difference in Proportions of Labels (DPL)
197
* How much outcome distributions of facets diverge
* Kullback-Leibler Divergence (KL), Jensen-Shannon Divergence(JS)
198
* P-norm difference between distributions of outcomes from facets
* Lp-norm (LP)
199
* L1-norm difference between distributions of outcomes from facets
* Total Variation Distance (TVD)
200
* Maximum divergence between outcomes in distributions from facets
* Kolmogorov-Smirnov (KS)
201
* Disparity of outcomes between facets as a whole, and by subgroups
* Conditional Demographic Disparity (CDD)
202
* Integrated into AWS Deep Learning Containers (DLCs)
  * Can't bring your own container
* Compile & optimize training jobs on GPU instances
* Can accelerate training up to 50%
* Converts models into hardware-optimized instructions
* Tested with Hugging Face transformers library, or bring your own model
SageMaker Training Compiler
203
What AI Service:
* Natural Language Processing and Text Analytics
* Input social media, emails, web pages, documents, transcripts, medical records (Comprehend Medical)
* Extract key phrases, entities, sentiment, language, syntax, topics, and document classifications
* Events detection
* PII Identification & Redaction
* Targeted sentiment (for specific entities)
* Can train on your own data
Amazon Comprehend
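A minimal boto3 sketch of one Comprehend call (assumes configured AWS credentials; the text is illustrative):

```python
import boto3

comprehend = boto3.client("comprehend")
resp = comprehend.detect_sentiment(Text="I love this product!", LanguageCode="en")
print(resp["Sentiment"])  # e.g. "POSITIVE"
```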
204
What AI Service:
* Uses deep learning for translation
* Supports custom terminology
  * In CSV or TMX format
  * Appropriate for proper names, brand names, etc.
Amazon Translate
205
What AI service:
* Speech to text
* Input in FLAC, MP3, MP4, or WAV, in a specified language
* Streaming audio supported (HTTP/2 or WebSocket)
  * French, English, Spanish only
* Speaker Identification
  * Specify number of speakers
* Channel Identification
  * i.e., two callers could be transcribed separately
  * Merging based on timing of "utterances"
* Automatic Language Identification
  * You don't have to specify a language; it can detect the dominant one spoken
* Custom Vocabularies
  * Vocabulary Lists (just a list of special words – names, acronyms)
  * Vocabulary Tables (can include "SoundsLike", "IPA", and "DisplayAs")
Amazon Transcribe
206
What AI Service:
* Neural Text-To-Speech, many voices & languages
* Lexicons
  * Customize pronunciation of specific words & phrases
  * Example: "World Wide Web Consortium" instead of "W3C"
* SSML (Speech Synthesis Markup Language)
  * Alternative to plain text
  * Gives control over emphasis, pronunciation, breathing, whispering, speech rate, pitch, pauses
* Speech Marks
  * Can encode when sentence / word starts and ends in the audio stream
  * Useful for lip-synching animation
Amazon Polly
207
What AI Service:
* Computer vision
* Object and scene detection
  * Can use your own face collection
* Image moderation
* Facial analysis
* Celebrity recognition
* Face comparison
* Text in image
* Video analysis
  * Objects / people / celebrities marked on timeline
  * People Pathing
* Image and video libraries
Rekognition
208
What AI Service:
* Fully-managed service to deliver highly accurate forecasts with ML
* "AutoML" chooses best model for your time series data
  * ARIMA, DeepAR, ETS, NPTS, CNN-QR, Prophet
* Works with any time series
  * Price, promotions, economic performance, etc.
* Can combine with associated data to find relationships
* Inventory planning, financial planning, resource planning
* Based on "dataset groups," "predictors," and "forecasts"
Amazon Forecast
209
What AI Tool:
* Billed as the inner workings of Alexa
* Natural-language chatbot engine
* A Bot is built around Intents
  * Utterances invoke intents ("I want to order a pizza")
  * Lambda functions are invoked to fulfill the intent
  * Slots specify extra information needed by the intent (pizza size, toppings, crust type, when to deliver, etc.)
* Can deploy to AWS Mobile SDK, Facebook Messenger, Slack, and Twilio
Amazon Lex
210
What AI Service:
* Fully-managed recommender engine (same one Amazon uses)
* API access
  * Feed in data (purchases, ratings, impressions, cart adds, catalog, user demographics, etc.) via S3 or API integration
  * You provide an explicit schema in Avro format
  * JavaScript or SDK
* GetRecommendations
  * Recommended products, content, etc.
  * Similar items
* GetPersonalizedRanking
  * Rank a list of items provided
  * Allows editorial control / curation
Amazon Personalize
211
What AI Service: Equipment, metrics, vision
* Detects abnormalities from sensor data automatically to detect equipment issues
* Monitors metrics from S3, RDS, Redshift, 3rd party SaaS apps
* Vision uses computer vision to detect defects in silicon wafers, circuit boards, etc.
Amazon Lookout
212
What AI Service: * End to end system for monitoring industrial equipment & predictive maintenance
Amazon Monitron
213
What AI Service: * Computer Vision at the edge * Brings computer vision to your existing IP cameras
AWS Panorama
214
What AI Tool: * Upload your own historical fraud data * Builds custom models from a template you choose * Exposes an API for your online application
Amazon Fraud Detector
215
What AI Service:
* Automated code reviews!
* Finds lines of code that hurt performance
* Resource leaks, race conditions
* Fix security vulnerabilities
CodeGuru
216
What AI Service:
* For customer support call centers
* Ingests audio data from recorded calls
* Allows search on calls / chats
* Sentiment analysis
* Find "utterances" that correlate with successful calls
* Categorize calls automatically
* Measure talk speed and interruptions
* Theme detection: discovers emerging issues
Contact Lens for Amazon Connect
217
What AI Service:
* Enterprise search with natural language
  * For example, "Where is the IT support desk?" "How do I connect to my VPN?"
* Combines data from file systems, SharePoint, intranet, sharing services (JDBC, S3) into one searchable repository
* ML-powered (of course) – uses thumbs up / down feedback
* Relevance tuning – boost strength of document freshness, view counts, etc.
Amazon Kendra
218
What AI Service:
* Human review of ML predictions
* Builds workflows for reviewing low-confidence predictions
* Access the Mechanical Turk workforce or vendors
* Integrated into Amazon Textract and Rekognition
* Integrates with SageMaker
* Very similar to Ground Truth
Amazon Augmented AI (A2I)
219
* All models in SageMaker are hosted in ________
Docker containers
220
* Docker containers are created from ______
images
221
* Images are built from a _______
Dockerfile
222
* Images are saved in a ________
repository
223
* Train once, run anywhere
  * Edge devices
  * ARM, Intel, Nvidia processors
  * Embedded in whatever – your car?
* Optimizes code for specific devices
  * Tensorflow, MXNet, PyTorch, ONNX, XGBoost, DarkNet, Keras
* Consists of a compiler and a runtime
SageMaker Neo