Things I got wrong Flashcards
(21 cards)
How many variables’ distributions are shown in a histogram?
1
What is JSONL?
JSON Lines. A text format that stores one JSON object per line.
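A minimal sketch of the format in Python, using only the standard library. The records and variable names are made up for illustration.

```python
import json

# Hypothetical records, for illustration only.
records = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]

# Encode: one JSON object per line, separated by newlines.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Decode: parse each non-empty line independently of the others.
decoded = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Because every line is a complete JSON document, a file can be streamed or split line by line without parsing it as a whole.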
What is Parquet?
A columnar data format which is well suited to batch data processing
What is a scatter plot useful for?
Visualizing the relationship (for example, correlation) between two continuous variables
What is the main use of Amazon Managed Streaming for Apache Kafka (MSK)?
Streaming data in real time between different applications and systems
Can EFS be used outside of EC2 instances?
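Yes. It can also be mounted from other compute such as ECS, EKS, Lambda, and on-premises servers (over Direct Connect or VPN)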
What is FSx for Lustre good for with regard to SageMaker?
Distributed training. Its high throughput and low latency let many training instances read the same dataset quickly
Is Ground Truth using humans or machines to do the labelling?
Both!
Do you change data’s format first or filter it first?
You filter it first, then change the format
What does the “difference in proportions of labels” metric tell you?
The difference in the proportion of positive labels between 2 groups. For example, it shows how much the churn rate differs between demographics, so you can be aware of potential biases
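A rough sketch of the calculation, assuming binary labels coded as 0/1 and the usual definition DPL = q_a - q_d, where q is the share of positive labels in each group. The function name and sample data are hypothetical.

```python
def difference_in_proportions_of_labels(labels_a, labels_d):
    """Return q_a - q_d, the gap in positive-label share between two groups."""
    q_a = sum(labels_a) / len(labels_a)  # share of positives in group a
    q_d = sum(labels_d) / len(labels_d)  # share of positives in group d
    return q_a - q_d

# Hypothetical churn labels (1 = churned) for two demographic groups.
group_a = [1, 1, 1, 0]  # 3 of 4 positive
group_d = [1, 0, 0, 0]  # 1 of 4 positive
dpl = difference_in_proportions_of_labels(group_a, group_d)
```

A value near 0 means both groups have a similar share of positive labels. Values near +1 or -1 indicate a large imbalance.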
What does Cramer’s V tell you?
The association between two categorical variables
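One way to compute it, sketched in plain Python from the standard formula V = sqrt(chi2 / (n * min(rows - 1, cols - 1))). The function name is hypothetical, and the sketch assumes a contingency table of counts with no empty row or column.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic: compare observed and expected counts.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

perfect = cramers_v([[10, 0], [0, 10]])  # categories fully determine each other
none = cramers_v([[5, 5], [5, 5]])       # categories are independent
```

The result ranges from 0 (no association) to 1 (perfect association). Unlike a correlation coefficient, it has no sign.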
What type of splitting is used when there is a date or time series?
Ordered splitting. The model is trained only on historical data and evaluated on later data; otherwise it would be asked to predict for dates it has already seen
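A small sketch of an ordered split, assuming each row carries a `date` field. The function name, the 80/20 fraction, and the sample rows are illustrative choices, not a prescribed setup.

```python
from datetime import date

def ordered_split(rows, train_fraction=0.8):
    """Sort rows by date, then cut once: earlier rows train, later rows test."""
    ordered = sorted(rows, key=lambda r: r["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical daily observations, deliberately out of order.
rows = [{"date": date(2024, 1, d), "value": d} for d in (5, 1, 4, 2, 3)]
train_rows, test_rows = ordered_split(rows)
```

There is no shuffling step, so every training row is earlier than every test row.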
What does Spearman’s rank correlation tell you and when should it be used?
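The strength and direction of the monotonic relationship between two variables. It works on ranks and doesn’t assume a distribution, so use it for ordinal data, non-normal data, or relationships that are monotonic but not linear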
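As an illustration, Spearman's coefficient can be computed as the Pearson correlation of the ranks. The sketch below uses only the standard library, and the helper names are hypothetical. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math

def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic in x, but not linear
rho = spearman(x, y)
```

Here `y` always increases with `x`, so the rank correlation is at its maximum even though the relationship is cubic, not linear.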
Why might SageMaker processing jobs be advantageous over Glue for ML workloads?
SageMaker Processing jobs integrate with ML libraries such as TensorFlow
What is the purpose of SageMaker Batch Transform?
Allows you to perform inference on large datasets without having to maintain a persistent endpoint
What is the maximum processing time of SageMaker Serverless Inference?
60 seconds
What is random cut forest used for?
Anomaly detection in high-dimensional data, such as time series or multivariate data
Do VPC gateway endpoints allow data transfer across regions? What about interface endpoints?
Gateway: no, Interface: yes
What is the purpose of SageMaker IP Insights?
Identify IP addresses whose usage deviates from normal patterns by learning from historical logs which IPs are associated with which entities (such as user IDs)
Do S3 access points have authentication or access control?
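Yes. Each access point has its own access point policy for access control and can be restricted to a VPC; requests are still authenticated through IAM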
What is the SageMaker DeepAR algorithm used for?
Time series forecasting only