Domain 1 Flashcards

(98 cards)

1
Q

Explain the AI relationship Venn diagram

A

Nested circles: Artificial Intelligence contains Machine Learning, which contains Deep Learning

2
Q

Predictions that AI makes based on historical data

A

Inference

3
Q

When AI recognizes a deviation from the patterns it has seen in the past

A

Anomaly detection

4
Q

What are some AWS services that could provide structured input data for training ML models?

A

RDS, Redshift

5
Q

What are some AWS services that could provide semi-structured input data for training ML models?

A

DynamoDB, DocumentDB (with MongoDB compatibility)

6
Q

For structured, semi-structured, unstructured, and time-series data, where should you export the data for training models?

A

S3

7
Q

In machine learning, what describes the relationship between inputs and outputs?

A

An algorithm

8
Q

Describe the machine learning training process

A

Known data -> features -> algorithm -> output

9
Q

Describe the machine learning inference process, which comes after training

A

new data -> features -> model -> output
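
The two flows above (training, then inference) can be sketched in a few lines. This is only a minimal illustration: it assumes a one-feature least-squares line fit as the algorithm, and the function names `train` and `predict` are made up for the example.

```python
def train(xs, ys):
    """Training: known inputs and outputs go through an algorithm (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # The returned numbers play the role of the model artifacts.
    return {"slope": slope, "intercept": intercept}


def predict(model, x):
    """Inference: new data goes through the trained model to produce an output."""
    return model["slope"] * x + model["intercept"]


# Known data -> features -> algorithm -> model
model = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# New data -> features -> model -> output
prediction = predict(model, 4.0)
```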

10
Q

What are the two artifacts produced that create a model?

A

Inference code + model artifacts

11
Q

What type of inferencing provides low-latency, high throughput, and a persistent endpoint (also usually more expensive)?

A

Real-time

12
Q

What type of inferencing is performed offline, uses large datasets, and happens on an infrequent schedule?

A

Batch transform

13
Q

Training your model with data that is pre-labeled (pictures with fish/not fish)

A

Supervised Learning

14
Q

What is the challenge with supervised learning?

A

You need a lot of data, people to label…takes time and money

15
Q

What is Amazon SageMaker Ground Truth?

A

A service that helps you label data

16
Q

What process uses data that has features but is not labeled and is good for pattern recognition, anomaly detection, and grouping data into categories?

A

Unsupervised learning

17
Q

What process uses both supervised and unsupervised learning, provides rewards to an agent when criteria are met, uses trial and error, allows the agent to make mistakes in order to learn, and has an end goal?

A

Reinforcement learning

18
Q

What service does Ground Truth use to crowdsource labeling via affordable labor?

A

Amazon Mechanical Turk

19
Q

A model telling you a fish is not a fish because it is out of water, a result of training being too specific and not having enough varied examples, is called what?

A

Overfitting

20
Q

What is it called when a model cannot determine a meaningful relationship between the input and output data? It happens when you haven’t trained the model long enough or with a large enough dataset.

A

Underfitting

21
Q

What is bias?

A

When a model discriminates against a specific group because of a lack of fair representation in the data used to train the model

22
Q

Also, if a model is showing bias, what can be done with features?

A

The weight of features that are introducing noise can be directly adjusted by the data scientists. For example, they could completely remove gender from consideration.

23
Q

Items such as age and sex discrimination should be identified at the beginning, before creating a model. What are these called?

A

Fairness constraints

24
Q

A type of machine learning that uses algorithmic structures called neural networks.

A

Deep learning

25
The three layers of deep neural networks
input layer, several hidden layers, and an output layer of nodes
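
As a rough sketch of how data flows through those three kinds of layers, the snippet below runs a forward pass through a tiny network. The weights are hand-picked, hypothetical numbers (a real network learns them during training), and the sigmoid activation is an assumption made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One layer of nodes: weighted sum of the inputs plus a bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input values through each hidden layer and then the output layer."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 2.0], layers)
```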
26
Deep learning can excel at tasks like
image classification and natural language processing where there is a need to identify the complex relationship between data objects
27
A big advantage of deep learning models for computer vision is that
they don't need the relevant features given to them.
28
Traditional machine learning algorithms will generally perform well and be efficient when
it comes to identifying patterns from structured data and labeled data. Examples include classification and recommendation systems.
29
On the other hand, deep learning solutions are more suitable for
unstructured data like images, videos, and text. Tasks for deep learning include image classification and natural language processing, where there is a need to identify the complex relationships between pixels and words. Both approaches learn from data, but only deep learning uses neural networks to simulate human intelligence.
30
Gen AI uses transformer neural networks, which change an input sequence, known in Gen AI as a prompt, into an output sequence, which is the response to your prompt. Earlier neural networks, such as recurrent networks, process the elements of a sequence sequentially, one word at a time. Transformers process the sequence in parallel, which speeds up training and allows much bigger datasets to be used. They outperform other ML approaches to natural language processing. They excel at understanding human language, so they can read long articles and summarize them. They are also great at generating text that is similar to what a human would write. As a result, they are good at language translation and even at writing original stories, letters, articles, and poetry. They can also work with computer programming languages and write code for software developers.
Gen AI Notes
31
Consider these use cases for AI/ML
Increasing business efficiency; solving complex problems; making better decisions
32
Consider AI/ML alternatives when
Costs outweigh benefits; models cannot meet interpretability requirements (you can't know how a neural network made a decision, so use a rules-based system instead); systems must be deterministic (produce the same output for the same input) rather than probabilistic
33
If your dataset consists of features or attributes as inputs with labeled target values as outputs, then you have a supervised learning problem. In this type of problem, you train your model with data containing known inputs and outputs.
supervised learning problem
34
If your target values are categorical, for example, one or more discrete values, then you have a
classification problem (a type of supervised learning).
35
If these target values you're trying to predict are mathematically continuous, then you have a
regression problem.
36
Binary classification
assigns an input to one of two mutually exclusive classes based on the input attributes.
37
Multiclass classification
assigns an input to one of several classes based on the input attributes. An example is the prediction of the topic most relevant to a tax document.
38
When your target values are mathematically continuous, then you have a
regression problem. Regression estimates the value of a dependent target variable based on one or more other variables.
39
If we have multiple independent variables, such as weight and age, then we have a
multiple linear regression problem.
40
Cluster analysis is
a class of techniques that are used to classify data objects into groups, called clusters. It attempts to find discrete groupings within data. Members of a group are as similar as possible to one another, and as different as possible from members of other groups.
41
In this technique, you define the features or attributes that you want the algorithm to use to determine similarity. Then you select a distance function to measure similarity and specify the number of clusters, or groups, you want for the analysis.
Cluster analysis
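
The three choices named on this card (features, a distance function, and the number of clusters) map directly onto a clustering routine. Below is a minimal k-means sketch, assuming Euclidean distance and a naive initialization from the first k points; k-means is one common clustering algorithm, not the only one, and the sample points are invented.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters, using Euclidean distance as the distance function."""
    centroids = [list(p) for p in points[:k]]  # naive init: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Each point has two features; we ask for two clusters.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.1), (0.9, 1.1), (8.2, 7.8)]
assignments, centroids = k_means(points, k=2)
```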
42
The identification of rare items, events, or observations in the data that raise suspicion because they differ significantly from the rest of the data
Anomaly detection
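
One simple way to flag observations that "differ significantly from the rest" is a z-score test. The sketch below is an illustration only: the threshold of 2 standard deviations and the sample readings are assumptions, and production anomaly detection usually relies on more robust methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values that lie more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


readings = [10, 11, 9, 10, 12, 10, 50]
anomalies = find_anomalies(readings)
```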
43
This service provides facial recognition, object detection, text detection, and content moderation
Amazon Rekognition
44
Extracts text, handwriting, and other data from scanned documents
Amazon Textract
45
Extracts key phrases, entities, and sentiment
Amazon Comprehend
46
This service is pretrained to find PII
Amazon Comprehend
47
Converts Text to Speech
Amazon Polly
48
Converts Speech (Live and recorded) to Text
Amazon Transcribe
49
This AWS service provides intelligent document search and responds to questions with appropriate context
Amazon Kendra
50
Personalized product recommendations
Amazon Personalize
51
Translates between 75 languages, built on a neural network
Amazon Translate
52
Provided with historical time series data, this AWS service predicts future points in time series
Amazon Forecast
53
Detects fraud through checking online transactions, product reviews, checkout and payments, new accounts, and account takeover
Amazon Fraud Detector
54
What is the first step in the AI/ML process?
Identify the business goal
55
When identifying the business goal, what two things should you do?
Define success criteria; align stakeholders
56
What is the second step in the AI/ML process?
Frame the ML problem
57
When framing the ML problem, what four things should you do?
Define the ML task, including inputs, outputs, and metrics; determine feasibility; start with the simplest model options; do a cost-benefit analysis
58
When approaching model selection, what should you do?
Start with the simplest options, such as AI/ML hosted services and pre-trained models. Fully customize only if needed.
59
To collect training data, you need to know these three things
Data sources; data ingestion, including ETL; labels
60
ETL includes
Extracting (gathering), transforming, and loading (storing) data in a new central location
61
What is likely one of the most time intensive parts of processing data?
Labeling, as you likely don't already have the data labeled and need to do that
62
When pre-processing data, what types of things are you doing?
Looking for missing data, masking PII data, cleaning it, and splitting it.
63
What are the recommended splits for data?
80% for training the model; 10% for model evaluation; 10% for final testing before production deployment
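
A minimal sketch of that 80/10/10 split, assuming the records can be shuffled randomly (time-series data, for instance, usually needs a chronological split instead). The function name `split_dataset` and the fixed seed are choices made for the example.

```python
import random


def split_dataset(records, train=0.8, validation=0.1, seed=42):
    """Shuffle the records, then cut them into training, evaluation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train_end = round(len(shuffled) * train)
    validation_end = train_end + round(len(shuffled) * validation)
    return (
        shuffled[:train_end],                # 80%: training the model
        shuffled[train_end:validation_end],  # 10%: model evaluation
        shuffled[validation_end:],           # 10%: final testing before deployment
    )


train_set, validation_set, test_set = split_dataset(range(100))
```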
64
Feature engineering
Selecting which characteristics of the dataset should be used as features to train the model. This is the subset that is relevant and contributes to minimizing the error rate of a trained model. You should reduce the features in your training data to only those that are needed for inference. Features can be combined to further reduce the number of features. Reducing the number of features reduces the amount of memory and computing power required for training.
65
What service is a cloud-optimized ETL service, contains its own data catalog, and has built-in transformations (dropping duplicate records, splitting data, etc.)?
AWS Glue
66
Describe the AWS Glue Data Catalog
Crawls source systems, discovers metadata and schemas, understands the source data. Only metadata is stored in the data catalog
67
For AWS Glue ETL jobs, what is a common destination location for transformed data?
S3
68
What service has data quality rules, visualization, and data preparation?
AWS Glue DataBrew
69
What service helps you prepare a well-labeled dataset for use in supervised learning? It uses machine learning to label the items it can, then Amazon Mechanical Turk for those it can't.
Amazon SageMaker Ground Truth
70
What service can you use to simplify the feature engineering process, to import/prepare/transform/visualize and analyze features?
Amazon SageMaker Canvas
71
Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a centralized store for features and associated metadata, so features can be easily discovered and reused. Feature Store makes it easy to create, share, and manage features for ML development. Feature Store accelerates this process by reducing repetitive data processing and curation work required to convert raw data into features for training an ML algorithm. You can create workflow pipelines that convert raw data into features and add them to feature groups.
72
A machine learning algorithm updates a set of numbers in such a way that the inference matches an expected output. These numbers are
Parameters
73
True or false: The training process requires you to run one training run.
False. This can't be done in one iteration, because the algorithm has not learned yet. It has no knowledge of how changing weights will shift the output closer toward the expected value. Therefore, it watches the weights and outputs from previous iterations, and shifts the weights in a direction that lowers the error in the generated output. This iterative process stops either when a defined number of iterations have been run, or when the change in error is below a target value.
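
The iterative process described above, including its two stopping conditions, can be sketched with plain gradient descent on a single weight. This is an illustration under assumptions: the model is `y = w * x`, the error is mean squared error, and the learning rate, iteration cap, and tolerance are arbitrary example values.

```python
def train_weight(xs, ys, learning_rate=0.01, max_iterations=1000, tolerance=1e-9):
    """Fit y ~ w * x by repeatedly shifting w in the direction that lowers the error."""
    w = 0.0
    previous_error = float("inf")
    for iteration in range(1, max_iterations + 1):  # stop 1: iteration budget
        error = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if abs(previous_error - error) < tolerance:  # stop 2: error barely changes
            break
        # The gradient tells us which direction of change in w lowers the error.
        gradient = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
        previous_error = error
    return w, iteration


# The data follows y = 3x, so the learned weight should approach 3.
weight, iterations_run = train_weight([1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 9.0, 12.0])
```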
74
What is known as running experiments?
There are usually multiple algorithms to consider for a model. The best practice is to run many training jobs in parallel, by using different algorithms and settings. This is known as running experiments, which helps you land on the best-performing solution
75
Each algorithm has a set of external parameters that affect its performance. These are set by the data scientists before training the model. These include adjusting things like how many neural layers and nodes there will be in a deep learning model. The optimal values can only be determined by running multiple experiments with different settings.
known as hyperparameters
76
To run a training job, what do you give SageMaker?
The URL of the S3 bucket containing your training data. You also specify the compute resources you want to use for training and the output bucket for the model artifacts. You specify the algorithm by giving SageMaker the path to a Docker container image that contains the training algorithm. This path points to Amazon Elastic Container Registry (Amazon ECR), where you can reference SageMaker-provided algorithms and deep learning containers, or your own custom container containing a custom algorithm. You also need to set the hyperparameters required by the algorithm.
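
To make those inputs concrete, the sketch below assembles them into a dictionary shaped like the request that boto3's SageMaker `create_training_job` call expects, as far as I understand that API. It only builds the dictionary and never contacts AWS. Every bucket, image URI, role ARN, and job name is a made-up placeholder, and the role and stopping condition are extra fields the API asks for that the card does not mention.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, hyperparameters,
                               instance_type="ml.m5.xlarge", instance_count=1):
    """Collect the training-job inputs into one request dictionary (no AWS call made)."""
    return {
        "TrainingJobName": job_name,
        # The algorithm: path to a Docker image in Amazon ECR.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,  # not on the card; permissions for the job
        # Where the training data lives in S3.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Where the model artifacts are written.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Compute resources used for training.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Hyperparameters are passed as strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


# All values below are hypothetical placeholders.
request = build_training_job_request(
    job_name="example-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algorithm:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
    hyperparameters={"epochs": 10, "learning_rate": 0.1},
)
```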
77
A capability of Amazon SageMaker that lets you create, manage, analyze, and compare your machine learning experiments. An experiment is a group of training runs, each with different inputs, parameters, and configurations. It features a visual interface to browse your active and past experiments, compare runs on key performance metrics, and identify the best-performing models.
Amazon SageMaker Experiments
78
Amazon SageMaker automatic model tuning (AMT)
also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create the model that performs best, as measured by a metric that you choose.
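
The idea of trying many hyperparameter combinations and keeping the one with the best metric can be sketched as a plain grid search. This is a stand-in for illustration and not how AMT itself is implemented or called; `toy_score` is a hypothetical substitute for "run a training job and return its validation metric".

```python
import itertools


def tune(score_fn, ranges):
    """Try every combination of hyperparameter values and keep the best-scoring one."""
    names = list(ranges)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(ranges[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # one "training job" per combination
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, hidden_layers):
    # Hypothetical metric that peaks at learning_rate=0.1 and hidden_layers=3.
    return -((learning_rate - 0.1) ** 2) - (hidden_layers - 3) ** 2


best_params, best_score = tune(
    toy_score,
    {"learning_rate": [0.01, 0.1, 1.0], "hidden_layers": [1, 3, 5]},
)
```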
79
What is the most cost efficient way to run your model?
Batch inference
80
What are the ways you can deploy your model?
Batch inference; real-time inference; self-managed; hosted (SageMaker inference)
81
What options are available for Amazon SageMaker inference?
Batch transform (offline inference, large datasets); asynchronous (long processing times, large payloads); serverless (intermittent traffic, periods of no traffic); real-time (live predictions, sustained traffic, low latency, consistent performance)
82
What service can you use to monitor your model and be notified of suspected drift in your deployed model?
Amazon SageMaker Model Monitor
83
What is MLOps?
Infrastructure as code (IaC); rapid experimentation; version control; active performance monitoring; automatic model retraining and validation when there are data and code changes
84
What are the benefits of MLOps?
Productivity; repeatability; reliability; auditability; data and model quality
85
What service lets you build and manage model pipelines, defined with the Python SDK or JSON, that automate data processing, training jobs, model creation, and model registration?
Amazon SageMaker Model Building Pipelines
86
Name four repository options
CodeCommit; SageMaker Model Registry; SageMaker Feature Store; third party
87
Name four options for orchestration
SageMaker Pipelines; Amazon Managed Workflows for Apache Airflow; AWS Step Functions; third party
88
What is a confusion matrix?
A confusion matrix is a table, typically with the actual values across the top and the predicted values down the left side. It is used to summarize the performance of a classification model when it is evaluated against test data.
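
For a binary classifier, the four cells of the confusion matrix can be counted directly from the actual and predicted labels. The sketch below uses invented fish / not-fish labels (1 = fish, 0 = not fish) purely for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical test data: 4 fish images and 6 non-fish images.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
matrix = confusion_counts(actual, predicted)
```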
89
What is accuracy?
The percentage of predictions that are correct
90
What is precision?
Precision measures how well an algorithm predicts true positives out of all the positives that it identifies. The formula is the number of true positives divided by the number of true positives, plus the number of false positives.
91
What is Recall (TPR)?
If we want to minimize the false negatives, then we can use a metric known as recall. For example, we want to make sure that we don't miss if someone has a disease and we say they don't. The formula is the number of true positives divided by the number of true positives plus the number of false negatives.
92
Can you optimize a model for both precision and recall?
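Not directly, because improving one usually comes at the expense of the other, but you can use the F1 score to balance them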
93
What is F1?
Combines recall and precision into one figure (their harmonic mean), so you can balance both with a single metric
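
The formulas for accuracy, precision, recall, and F1 can be written out as a short worked example. The counts below are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
    precision = tp / (tp + fp)                  # of predicted positives, how many are real
    recall = tp / (tp + fn)                     # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 3 true positives, 1 false positive, 1 false negative, 5 true negatives.
metrics = classification_metrics(tp=3, fp=1, fn=1, tn=5)
```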
94
What is the False Positive Rate?
The false positive rate is the number of false positives divided by the sum of the false positives and true negatives. In our example, this metric shows how the model handles the images that are not fish: it measures how many of the images that were not fish were predicted to be fish.
95
What is the True Negative Rate?
Closely related to the false positive rate is the true negative rate, which is the ratio of the true negatives to the sum of the false positives and true negatives. It is a measure of how many of the predictions were of not fish out of the images that were not fish.
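
Because both rates share the same denominator (all images that are actually not fish), they always add up to 1. A small sketch with hypothetical counts, assuming there is at least one actual negative:

```python
def negative_class_rates(fp, tn):
    """False positive rate and true negative rate over the actual negatives."""
    actual_negatives = fp + tn
    false_positive_rate = fp / actual_negatives  # non-fish images predicted as fish
    true_negative_rate = tn / actual_negatives   # non-fish images predicted as not fish
    return false_positive_rate, true_negative_rate


# Hypothetical counts: of 6 non-fish images, 1 was predicted as fish.
fpr, tnr = negative_class_rates(fp=1, tn=5)
```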
96
What is the receiver operating characteristic (ROC)?
97
98