Things I got wrong - MLS Flashcards

(30 cards)

1
Q

Can Kinesis Data Firehose (KDF) perform data format transformations?

A

Yes, e.g. JSON to Parquet

2
Q

What type of data is well suited to imputation by deep learning?

A

Categorical

3
Q

What can Object2Vec turn into embeddings?

A

Full sentences

4
Q

What is incremental training?

A

Continuing training from an existing model's artifacts on new or expanded data, instead of retraining from scratch

5
Q

How does KDF perform data transformations?

A

By invoking AWS Lambda functions (AWS provides Lambda blueprints for common transformations)

6
Q

What type of algorithm are support vector machines and what can they be used for?

A

A supervised ML algorithm that can be employed for both classification and regression

7
Q

Can K-Means be used for classification?

A

No, it is for clustering

8
Q

What type of learning is classification?

A

Supervised learning with labelled data

9
Q

If you are overfitting, should you use more or less training data?

A

More, to introduce a greater diversity in the training data and force the model to generalise more

10
Q

When should you use PCA in randomised mode?

A

For datasets with a large number of observations and features, as it then uses an approximation algorithm to run faster

11
Q

If a model has high specificity, it means that nearly all ____ ______ have been weeded out.

A

If a model has high specificity, it means that nearly all false positives have been weeded out
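As a reminder of where that comes from: specificity is the true negative rate, TN / (TN + FP), so it approaches 1 as false positives approach 0. A minimal sketch; the function name and the example counts are made up for illustration.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Specificity (true negative rate) = TN / (TN + FP)."""
    total_negatives = true_negatives + false_positives
    if total_negatives == 0:
        raise ValueError("no actual negatives, so specificity is undefined")
    return true_negatives / total_negatives


# Hypothetical counts: 95 actual negatives correctly rejected, 5 wrongly flagged.
example = specificity(true_negatives=95, false_positives=5)
```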

12
Q

What is the equation for tf-idf?

A

TF x IDF, where:
TF is the number of times a word appears in a document / the total number of words in the document.
IDF is the log of (the total number of documents / the number of documents containing the specified word).
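The definition above can be sketched in a few lines of plain Python. Assumptions that are not on the card: documents are already tokenised into lists of words, the natural log is used (the card does not name a base), and the function names and toy corpus are invented for illustration.

```python
import math


def tf(term, document):
    """Term frequency: occurrences of term / total words in the document."""
    return document.count(term) / len(document)


def idf(term, corpus):
    """Inverse document frequency: log(total documents / documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not appear in any document")
    return math.log(len(corpus) / containing)  # natural log is an assumption


def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)


# Toy corpus, made up for illustration.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]
```

A word that appears in every document ("the") scores 0 because log(3 / 3) = 0, while a word unique to one document ("dog") scores highest there.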

13
Q

Can Glue transform data into RecordIO-Protobuf?

A

No

14
Q

If you are underfitting, what 2 actions can you take?

A

Use more features
Remove regularisation

15
Q

What is the fastest command to move data from S3 to Redshift?

A

COPY, this is much faster than INSERT

16
Q

What algorithm should you use to forecast sales for a product that has not been seen before specifically, but for which there are similar existing products?

17
Q

What type of algorithm is DeepAR?

A

Supervised RNN

18
Q

Can you integrate your own scripts with XGBoost?

19
Q

What are two scaling techniques that can be used to reduce the effect of outliers in your data?

A

Robust standardisation and logarithm transformation
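A rough sketch of both techniques using only the standard library. It assumes robust standardisation means (x - median) / IQR and that the log transform is log(1 + x) on non-negative values; the function names and sample data are illustrative, not from any particular library.

```python
import math
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, so the data cannot be robust-scaled")
    return [(x - median) / iqr for x in values]


def log_transform(values):
    """Compress large values with log(1 + x); assumes every x >= 0."""
    return [math.log1p(x) for x in values]


data = [1, 2, 3, 4, 100]  # 100 is the outlier
scaled = robust_scale(data)
logged = log_transform(data)
```

Robust scaling leaves the outlier far from the rest, but the median and IQR it scales by are barely affected by it, unlike the mean and standard deviation. The log transform instead compresses the outlier itself.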

20
Q

Does SageMaker support resource based policies and service linked roles?

21
Q

Does training with a SageMaker built-in algorithm require an IAM role?

22
Q

What is the read/write of each KDS shard?

A

Read max is 2 MB per second
Write max is 1 MB per second (or 1,000 records per second)
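These limits are what drive shard-count estimates. A back-of-the-envelope sketch, assuming the limits apply per second per shard and ignoring the records-per-second limit; the function name and the example throughputs are hypothetical.

```python
import math

WRITE_MB_PER_SEC_PER_SHARD = 1
READ_MB_PER_SEC_PER_SHARD = 2


def shards_needed(write_mb_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies both throughput limits."""
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    return max(1, by_write, by_read)  # a stream always has at least one shard


# Hypothetical workload: 3.5 MB/s written, 7 MB/s read in total.
estimate = shards_needed(write_mb_per_sec=3.5, read_mb_per_sec=7)
```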

23
Q

What two conditions must CSVs going into SageMaker supervised learning algorithms meet?

A

They should not have a header record and the target variable should be in the first column
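A small sketch of getting a CSV into that shape with the standard library: drop the header and move the target to the first column. The function name, column names, and sample rows are made up for illustration.

```python
import csv
import io


def to_headerless_target_first(raw_csv, target_column):
    """Return CSV text with no header row and the target in the first column."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    if target_column not in reader.fieldnames:
        raise ValueError(f"column {target_column!r} not found")
    feature_columns = [name for name in reader.fieldnames if name != target_column]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Target value first, then the features in their original order.
        writer.writerow([row[target_column]] + [row[name] for name in feature_columns])
    return out.getvalue()


# Hypothetical input with a header and the label in the last column.
raw = "age,income,label\n30,50000,1\n45,82000,0\n"
converted = to_headerless_target_first(raw, target_column="label")
```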

24
Q

Are InvokeEndpoint calls to SageMaker monitored by CloudTrail?

25
What is a Poisson distribution?
A probability distribution that is used to show how many times an event is likely to occur over a specified period
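For reference, the probability of seeing exactly k events when the average rate over the period is λ is P(k) = λ^k e^(-λ) / k!. A minimal sketch of that formula; the function name and the example rate are illustrative.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the expected count is rate."""
    if k < 0 or rate < 0:
        raise ValueError("k and rate must be non-negative")
    return rate ** k * math.exp(-rate) / math.factorial(k)


# Hypothetical example: an average of 3 events per period.
p_exactly_two = poisson_pmf(2, rate=3)
```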
26
What part of a dataset is standardised or normalised?
Specific numerical features, NOT the whole thing
27
What is the difference between standardisation and normalisation?
Standardisation rescales a feature to a mean of 0 and a standard deviation of 1. Normalisation (min-max scaling) rescales the values to lie between 0 and 1.
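A quick side-by-side sketch on one numerical feature, assuming standardisation means the z-score (x - mean) / std with the population standard deviation, and normalisation means min-max scaling. Function names and data are illustrative.

```python
import statistics


def standardise(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("all values are identical, so they cannot be standardised")
    return [(x - mean) / std for x in values]


def normalise(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical, so they cannot be normalised")
    return [(x - low) / (high - low) for x in values]


feature = [2, 4, 6, 8]
z_scores = standardise(feature)
min_max = normalise(feature)
```

After standardising, the feature has mean 0 and standard deviation 1 but no fixed range; after normalising, it has a fixed range of 0 to 1 but no fixed mean or standard deviation.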
28
What is elastic inference?
A way to speed up the throughput and decrease the latency of getting real-time inferences from SageMaker deep learning models
29
What is AWS Panorama?
Enables you to do computer vision at the edge through AWS devices. Works with existing camera networks
30
K-fold validation splits data into k equal parts, trains the model k times, each time using a different fold as validation data and the remaining k-1 folds as training data. The model's performance is averaged across all k iterations. Common values for k are 5 and 10.
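The procedure can be sketched without any ML library. Assumptions: folds are contiguous index ranges with no shuffling, and `train_and_score` is a hypothetical callback that trains on the training indices and returns a score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_samples=10, k=5))
```

In practice the data is usually shuffled before splitting, which this sketch leaves out to keep the fold logic visible.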