Things I got wrong - MLA Flashcards

(36 cards)

1
Q

How many variables’ distributions are shown in a histogram?

A

1

2
Q

What is JSONL?

A

JSON Lines. A text format that stores one JSON object per line, so each line can be parsed independently.
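A minimal sketch of the idea, using only the standard library. The sample records are made up for illustration.

```python
import json

# Hypothetical sample records (illustration only)
records = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]

# Serialise: one JSON object per line
jsonl_text = "\n".join(json.dumps(record) for record in records)

# Parse: every non-empty line is a complete JSON document on its own
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Because each line stands alone, a large file can be streamed or split across workers without loading the whole thing.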

3
Q

What is Parquet?

A

A columnar data format which is well suited to batch data processing

4
Q

What is a scatter plot useful for?

A

Visualising the relationship (e.g. correlation) between two continuous variables

5
Q

What is the main use of Managed Service for Apache Kafka?

A

Streaming data in real time between different applications and systems

6
Q

Can EFS be used outside of EC2 instances?

A

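Yes. As well as EC2, it can be mounted from Lambda, ECS and EKS, and from on-premises servers over Direct Connect or VPN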

7
Q

What is FSx for Lustre good for with regard to SageMaker?

A

Distributed training. Its high-throughput, low-latency file access keeps many training instances supplied with data

8
Q

Does Ground Truth use humans or machines to do the labelling?

A

Both!

9
Q

Do you change data’s format first or filter it first?

A

You filter it first, then change the format

10
Q

What does the “difference in proportions of labels” metric tell you?

A

The difference in the proportion of positive labels between two groups (facets) in the dataset. For example, compare how the observed churn rate differs between demographics to be aware of potential biases before training
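A rough sketch of the calculation, assuming binary labels where 1 is the positive outcome. The function name and the two sample groups are hypothetical.

```python
def difference_in_proportions_of_labels(labels_a, labels_d):
    """Proportion of positive labels in group a minus that in group d."""
    q_a = sum(labels_a) / len(labels_a)
    q_d = sum(labels_d) / len(labels_d)
    return q_a - q_d

# Made-up labels: 1 = churned, 0 = stayed
group_a = [1, 1, 0, 0]  # 50% positive
group_d = [1, 0, 0, 0]  # 25% positive
dpl = difference_in_proportions_of_labels(group_a, group_d)
```

A value near 0 means both groups have similar label rates; values further from 0 point to a larger imbalance.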

11
Q

What does Cramer’s V tell you?

A

The association between two categorical variables
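A small sketch of how it can be computed from a contingency table, using the uncorrected formula V = sqrt(chi² / (n · min(r − 1, c − 1))). The function name is hypothetical, and the code assumes every row and column total is non-zero.

```python
import math

def cramers_v(table):
    """Uncorrected Cramer's V for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

perfect = cramers_v([[10, 0], [0, 10]])  # categories fully determine each other
none = cramers_v([[5, 5], [5, 5]])       # categories are independent
```

The result ranges from 0 (no association) to 1 (perfect association).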

12
Q

What type of splitting is used when there is a date or time series?

A

Ordered splitting. This means that the model only has access to historical data, otherwise it wouldn’t make sense to ask it to predict for dates it already knows
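A minimal sketch, assuming each row carries an ISO-format `date` string (so plain string sorting matches chronological order). The function name, field name and the 80/20 fraction are illustrative choices.

```python
def ordered_split(rows, train_fraction=0.8):
    """Sort rows by date, then cut once: earlier rows train, later rows test."""
    ordered = sorted(rows, key=lambda row: row["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up rows, deliberately out of order
rows = [
    {"date": "2024-01-03", "value": 3},
    {"date": "2024-01-01", "value": 1},
    {"date": "2024-01-05", "value": 5},
    {"date": "2024-01-02", "value": 2},
    {"date": "2024-01-04", "value": 4},
]
train, test = ordered_split(rows)
```

Unlike a random split, no test row is older than any training row, so nothing from the future leaks into training.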

13
Q

What does Spearman’s rank correlation tell you and when should it be used?

A

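The strength and direction of a monotonic relationship between two variables. It works on ranks, so it doesn’t assume a linear relationship or a particular distribution; use it for ordinal data or when Pearson’s assumptions don’t hold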
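A sketch of the idea in plain Python: rank both variables (ties get the average rank), then take the Pearson correlation of the ranks. The helper names are hypothetical, and the code assumes neither input is constant.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
cubes = [v ** 3 for v in x]       # non-linear but strictly increasing
rho_up = spearman(x, cubes)
rho_down = spearman(x, cubes[::-1])
```

The cubic relationship is not linear, yet the rank correlation is still at its maximum because the ordering is perfectly preserved.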

14
Q

Why might SageMaker processing jobs be advantageous over Glue for ML workloads?

A

SageMaker Processing jobs integrate with ML libraries such as TensorFlow

15
Q

What is the purpose of SageMaker batch transform?

A

Allows you to perform inference on large datasets without having to maintain a persistent endpoint

16
Q

What is the maximum processing time of SageMaker serverless?

17
Q

What is random cut forest used for?

A

Anomaly detection in high-dimensional data, such as time series or multivariate data

18
Q

Do VPC gateway endpoints allow data transfer across regions? What about interface endpoints?

A

Gateway: no, Interface: yes

19
Q

What is the purpose of SageMaker IP Insights?

A

Identify IPs that deviate from normal patterns through the analysis of historical logs

20
Q

Do S3 access points have authentication or access control?

21
Q

What is the SageMaker DeepAR algorithm used for?

A

Time series forecasting only

22
Q

What type of instances are often good for deep learning inference?

A

Accelerated computing instances with GPUs or FPGAs

23
Q

What is the decorator that you have to use to integrate a custom step into a SageMaker Pipeline?

24
Q

Can Comprehend be used to redact/anonymise PII? When would you use this over Macie?

A

Yes, and you’d use it for real-time or near-real-time use cases e.g. customer communications

25
What is the difference between SageMaker Clarify and SageMaker Model Monitor?
SageMaker Clarify handles explainability and bias detection. SageMaker Model Monitor watches deployed models for data drift and degrading model performance
26
Is using a notebook for production deployments efficient?
Not really, notebooks are better for getting a feel for how the system works and how you can interact with it
27
What is model bias?
When a model's output systematically favours a specific outcome for a specific group, e.g. old people always get bigger loans
28
What is feature attribution?
The idea that you can say which feature of the input data caused a certain outcome in the model's results, e.g. you can see that the 'income' field contributes most to the model's inferences
29
When is Pipe mode better than Fast File mode?
Pipe mode is better if the data is read sequentially as a stream, but otherwise Fast File mode is probably more suitable
30
Is Amazon MWAA serverless?
No
31
What ensemble method should you use if you want to assemble multiple different algorithms/model types?
Stacking. Bagging and boosting typically work with one model type at a time.
32
What is feature splitting?
Breaking a complex input feature into 2 smaller and simpler sub features.
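One possible illustration, assuming the complex feature is a timestamp string in `YYYY-MM-DD HH:MM:SS` form. The function name and the choice of sub-features are hypothetical.

```python
from datetime import datetime

def split_timestamp(timestamp):
    """Split one timestamp feature into two simpler sub-features."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "day_of_week": parsed.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": parsed.hour,
    }

features = split_timestamp("2024-03-15 14:30:00")
```

A model can then learn weekly and daily patterns separately instead of having to decode a raw timestamp.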
33
What is one hot encoding?
A technique to convert a categorical variable into numerical ones by creating one binary (0/1) column per category, so no order between categories is implied
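A minimal sketch in plain Python. The function name is hypothetical, and sorting the categories alphabetically is just one way to fix the column order.

```python
def one_hot(values):
    """Return the category order and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, encoded = one_hot(["red", "green", "red"])
```

Each row contains exactly one 1, in the column of that value's category.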
34
What is the purpose of AWS X-Ray?
Finding bottlenecks and tracking the flow of execution in microservices architectures
35
The ROC is the plot of the XXX XXX vs XXX XXX rates at various thresholds.
The ROC is the plot of the true positive vs false positive rates at various thresholds.
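A small sketch of how the points on the curve come about, assuming binary labels (1 = positive) and that a score at or above the threshold counts as a positive prediction. The function name and sample data are made up.

```python
def roc_points(y_true, scores, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.0, 0.5, 1.0])
```

Lowering the threshold moves the point towards (1, 1); raising it moves the point towards (0, 0).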
36
What is ordinal encoding?
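A technique used to encode categorical data into numerical data by mapping each category to an integer that preserves the categories' natural order, e.g. low < medium < high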
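A minimal sketch, assuming the caller supplies the categories in their natural order. The function name and the example categories are hypothetical.

```python
def ordinal_encode(values, ordered_categories):
    """Map each category to its position in the given order."""
    mapping = {category: index for index, category in enumerate(ordered_categories)}
    return [mapping[value] for value in values]

encoded_sizes = ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"])
```

Unlike one hot encoding, this keeps a single column and the integers carry the ordering.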