Things I got wrong Flashcards

(21 cards)

1
Q

How many variables’ distributions are shown in a histogram?

A

1

2
Q

What is JSONL?

A

JSON Lines. A text format in which each line is a separate, complete JSON value (typically an object).
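A minimal sketch of the idea, using only Python's standard `json` module. The sample records and the helper names `to_jsonl` and `from_jsonl` are made up for illustration.

```python
import json

# Hypothetical sample records, not taken from any real dataset.
records = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]


def to_jsonl(rows):
    """Serialize each record as one JSON document per line."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"


def from_jsonl(text):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
round_trip = from_jsonl(jsonl_text)
```

Because every line is independent, a reader can process the file one line at a time instead of loading the whole thing.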

3
Q

What is Parquet?

A

A columnar file format that is well suited to batch data processing, since a query can read only the columns it needs

4
Q

What is a scatter plot useful for?

A

Visualizing the relationship (for example, the correlation) between two continuous variables

5
Q

What is the main use of Amazon Managed Streaming for Apache Kafka (MSK)?

A

Streaming data in real time between different applications and systems

6
Q

Can EFS be used outside of EC2 instances?

A

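Yes. Besides EC2, EFS can be mounted from ECS, EKS, Fargate and Lambda, and from on-premises servers over Direct Connect or VPN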

7
Q

What is FSx for Lustre good for with regard to SageMaker?

A

Distributed training. It works well because of its high-throughput, low-latency file access

8
Q

Does Ground Truth use humans or machines to do the labelling?

A

Both!

9
Q

Do you change data’s format first or filter it first?

A

You filter it first, then change the format, so the conversion only has to process the data you keep
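A small sketch of the ordering, assuming a CSV input that is converted to JSON Lines. The data, the column names and the helper `filter_then_convert` are invented for illustration.

```python
import csv
import io
import json

# Hypothetical raw data; the columns are made up for this example.
raw_csv = "user,country,spend\nana,US,10\nbo,DE,7\ncy,US,3\n"


def filter_then_convert(csv_text, country):
    """Drop unwanted rows first, then convert only the survivors."""
    rows = csv.DictReader(io.StringIO(csv_text))
    kept = (row for row in rows if row["country"] == country)  # 1. filter
    return [json.dumps(row) for row in kept]                   # 2. convert


us_lines = filter_then_convert(raw_csv, "US")
```

Only the rows that pass the filter are ever serialized, which is the point of filtering first.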

10
Q

What does the “difference in proportions of labels” metric tell you?

A

The difference in the proportion of positive labels (observed outcomes in the dataset) between two groups. For example, it shows how much the churn rate in the training data differs between demographics, so that you are aware of potential biases before training
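As a rough sketch, the metric can be computed as the positive-label rate of one group minus that of the other. The two groups and their labels below are invented, and the function names are illustrative rather than part of any AWS API.

```python
def positive_proportion(labels):
    """Fraction of labels equal to 1 (the positive outcome)."""
    return sum(labels) / len(labels)


def difference_in_proportions(labels_a, labels_d):
    """Positive-label rate of group a minus that of group d."""
    return positive_proportion(labels_a) - positive_proportion(labels_d)


# Hypothetical churn labels (1 = churned) for two demographic groups.
group_a = [1, 1, 0, 1]
group_d = [1, 0, 0, 0]

dpl = difference_in_proportions(group_a, group_d)
```

A value near 0 means both groups have similar positive-label rates; values toward +1 or -1 indicate a larger imbalance.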

11
Q

What does Cramér’s V tell you?

A

The strength of the association between two categorical variables, from 0 (none) to 1 (perfect)
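A minimal sketch of the standard (uncorrected) formula, V = sqrt(chi² / (n · min(r − 1, c − 1))), computed from a contingency table of counts. The function name and the example tables are illustrative, and the sketch assumes every row and column total is non-zero.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


perfect = cramers_v([[10, 0], [0, 10]])     # categories fully determine each other
independent = cramers_v([[5, 5], [5, 5]])   # no association at all
```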

12
Q

What type of splitting is used when there is a date or time series?

A

Ordered splitting. The model is trained only on historical data and evaluated on later data; otherwise you would be asking it to predict dates it has already seen
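A minimal sketch of an ordered split: sort by date, then cut once so that every training row is earlier than every test row. The rows, the `day` key and the 80/20 fraction are assumptions for illustration.

```python
from datetime import date

# Hypothetical daily sales rows, deliberately out of order.
rows = [
    {"day": date(2024, 1, 3), "sales": 12},
    {"day": date(2024, 1, 1), "sales": 10},
    {"day": date(2024, 1, 4), "sales": 15},
    {"day": date(2024, 1, 2), "sales": 11},
    {"day": date(2024, 1, 5), "sales": 14},
]


def ordered_split(records, train_fraction=0.8, key="day"):
    """Sort chronologically and split without shuffling."""
    ordered = sorted(records, key=lambda r: r[key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


train, test = ordered_split(rows)
```

A random split would mix future rows into the training set, which leaks information the model would not have at prediction time.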

13
Q

What does Spearman’s rank correlation tell you and when should it be used?

A

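It measures the strength and direction of a monotonic relationship between two variables. It works on ranks, so it doesn’t assume a particular distribution or a linear relationship; use it for ordinal data or when the relationship is monotonic but not linear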
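One way to see what the coefficient does is to compute it by hand as the Pearson correlation of the ranks. The sketch below uses only the standard library; the helper names and sample data are illustrative, and it assumes neither input is constant.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


xs = [1, 2, 3, 4, 5]
ys = [v ** 3 for v in xs]      # monotonic but clearly non-linear
rho = spearman(xs, ys)
```

Here the relationship is not linear, yet the ranks agree exactly, so the rank correlation is at its maximum.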

14
Q

Why might SageMaker processing jobs be advantageous over Glue for ML workloads?

A

SageMaker processing jobs integrate with ML libraries like TensorFlow

15
Q

What is the purpose of SageMaker batch transform?

A

Allows you to perform inference on large datasets without having to maintain a persistent endpoint

16
Q

What is the maximum processing time of SageMaker serverless?

17
Q

What is random cut forest used for?

A

Anomaly detection in high-dimensional data, such as time series or multivariate data

18
Q

Do VPC gateway endpoints allow data transfer across regions? What about interface endpoints?

A

Gateway: no, Interface: yes

19
Q

What is the purpose of SageMaker IP Insights?

A

Identify IPs that deviate from normal patterns through the analysis of historical logs

20
Q

Do S3 access points have authentication or access control?

21
Q

What is the SageMaker DeepAR algorithm used for?

A

Time series forecasting only