Algorithms Flashcards

1
Q

You are working as a lead data scientist for a retail company. Your team is building a regression model and using the linear learner built-in algorithm to predict the optimal price of a particular product. The model is clearly overfitting to the training data and you suspect that this is due to the excessive number of variables being used. Which of the following approaches would best suit a solution that addresses your suspicion?

A

Apply L2 regularization by tuning the wd (weight decay) hyperparameter of the linear learner algorithm.
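To see why a larger L2 penalty reins in a model with too many variables, here is a minimal sketch in plain NumPy. It is not the SageMaker linear learner itself: `ridge_fit` is a hypothetical helper that solves the closed-form ridge regression problem, and its `wd` argument only plays the same role as the algorithm's wd hyperparameter. The data is synthetic.

```python
import numpy as np

def ridge_fit(X, y, wd):
    """Closed-form L2-regularized least squares: w = (X^T X + wd * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + wd * np.eye(n_features), X.T @ y)

# Synthetic data (assumption): few rows, many variables, only one is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

w_weak = ridge_fit(X, y, wd=1e-6)    # almost no regularization
w_strong = ridge_fit(X, y, wd=10.0)  # strong regularization

# The L2 penalty shrinks the weight vector as wd grows.
weak_norm = float(np.linalg.norm(w_weak))
strong_norm = float(np.linalg.norm(w_strong))
print(weak_norm, strong_norm)
```

The stronger penalty yields a smaller weight norm, which limits how much the model can lean on noisy variables.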

2
Q

RecordIO-protobuf is an optimized data format used to train AWS built-in algorithms, where SageMaker converts each observation in the dataset into a binary representation as a set of 4-byte floats. Training data in this format can be read in two modes: pipe mode and file mode. What is the difference between them?

A

In pipe mode, the data is streamed directly from S3 to the training instance, which reduces the storage needed on the instance. In file mode, the data is first copied from S3 to the training instance's storage volume.

3
Q

You are the cloud administrator of your company. You have done great work creating and managing user access and you have fine-grained control of daily activities in the cloud. However, you want to add an extra layer of security by identifying accounts that are attempting to create cloud resources from an unusual IP address. What would be the fastest solution to address this use case?

A

Create an IP Insights model to identify anomalous accesses.

Integrate your IP Insights model with existing rules from Amazon GuardDuty.

4
Q

You are working as a data scientist for a large company. One of your internal clients has requested that you improve a regression model that they have implemented in production. You have added a few features to the model and now you want to know if the model’s performance has improved due to this change. Which of the following options best describes the evaluation metrics that you should use to evaluate your change?

A

Check whether the adjusted R squared and the RMSE of the new model are better than those of the current model in production.

In this case, you have been exposed to a particular behavior of regression model evaluation: adding new features never decreases R squared, and almost always increases it, even when the features carry no real signal. You should use adjusted R squared, which penalizes the number of features, to understand whether the new features are adding value to the model. Additionally, RMSE expresses the error in the units of the target, which gives you a business perspective of the model's performance. Although option b is correct, option d best describes the optimal decision you should make.
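The comparison described above can be sketched in a few lines of Python. The function names and the numbers in the example (50 samples, 5 versus 15 features, R squared of 0.80 versus 0.805) are illustrative assumptions, not values from the question.

```python
import math

def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical models: the new one has ten more features and a slightly higher R^2.
current_adj = adjusted_r_squared(0.80, n_samples=50, n_features=5)
new_adj = adjusted_r_squared(0.805, n_samples=50, n_features=15)

# Plain R^2 went up, but adjusted R^2 went down: the extra features did not pay off.
print(current_adj, new_adj)
```

In this made-up case the plain R squared improves slightly while the adjusted R squared drops, which is the situation the card warns about.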

5
Q

Which AWS built-in algorithm is optimized to work with sparse data?

A

Factorization machines is a general-purpose algorithm that is optimized for sparse data.

6
Q

Which AWS built-in algorithm uses an ensemble method based on decision trees during the training process?

A

XGBoost is a very popular algorithm that uses an ensemble of decision trees to train the model. XGBoost uses a boosting approach, where decision trees try to correct the error of the prior model.
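The boosting idea, where each new tree fits the error left by the previous ones, can be illustrated with a toy gradient-boosting loop over one-split trees (stumps). This is a simplified sketch for squared error on synthetic data, not XGBoost itself, which adds regularization and other refinements. The helper names `fit_stump` and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold].mean()
        right = residual[x > threshold].mean()
        pred = np.where(x <= threshold, left, right)
        err = np.sum((residual - pred) ** 2)
        if best is None or err < best[0]:
            best = (err, threshold, left, right)
    return best[1:]

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residual and adds it to the ensemble."""
    pred = np.full(len(y), y.mean(), dtype=float)
    losses = [float(np.mean((y - pred) ** 2))]
    for _ in range(n_rounds):
        threshold, left, right = fit_stump(x, y - pred)
        pred += learning_rate * np.where(x <= threshold, left, right)
        losses.append(float(np.mean((y - pred) ** 2)))
    return pred, losses

x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
pred, losses = boost(x, y)
print(losses[0], losses[-1])  # training error before and after boosting
```

The training error falls as stumps are added, because every stump is trained on what the ensemble so far still gets wrong.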

7
Q

Which AWS built-in algorithm is considered an index-based algorithm?

A

KNN is an index-based algorithm because it has to compute distances between points, assign indexes for these points, and then store the sorted distances and their indexes. With that type of data structure, KNN can easily select the top K closest points to make the final prediction.
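The distance, sort, and top-K steps described above can be sketched as a brute-force classifier in NumPy. This is only an illustration of the idea: the `knn_predict` helper and the toy points are assumptions, and the SageMaker k-NN algorithm builds its own index rather than scanning every point like this.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_X - query, axis=1)  # distance to every point
    nearest = np.argsort(distances)[:k]                  # indexes of the k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (assumption): two well-separated groups.
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(train_X, train_y, np.array([0.2, 0.1]), k=3)
near_far_group = knn_predict(train_X, train_y, np.array([5.5, 5.5]), k=3)
print(near_origin, near_far_group)
```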

8
Q

You are a data scientist in a big retail company that wants to predict its sales per region on a monthly basis. You have done some exploratory work and discovered that the sales pattern differs from region to region. Your team has decided to approach this project as a time series model, and now you have to select the best approach to create a solution. Which of the following options would potentially give you a good solution with the least effort?

A

Develop a DeepAR model and set the region, associated with each time series, as a vector of static categorical features. You can use the cat field to set up this option.
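As a rough sketch of what that training input could look like, the snippet below builds JSON Lines records with the start, target, and cat fields that DeepAR reads. The region names, the integer encoding in `region_to_cat`, and the sales figures are made up for illustration.

```python
import json

# Hypothetical mapping from region name to an integer category index.
region_to_cat = {"north": 0, "south": 1}

# Made-up monthly sales per region.
monthly_sales = {
    "north": [120.0, 135.5, 128.0, 150.2],
    "south": [80.0, 82.5, 79.0, 91.3],
}

def to_deepar_record(start, target, region):
    """One time series per record; the region goes into the static `cat` vector."""
    return {"start": start, "target": target, "cat": [region_to_cat[region]]}

lines = [
    json.dumps(to_deepar_record("2021-01-01 00:00:00", sales, region))
    for region, sales in monthly_sales.items()
]
jsonl = "\n".join(lines)  # one JSON object per line
print(jsonl)
```

Training a single model on all regions, with the region as a categorical feature, lets the model learn region-specific patterns without building one model per region.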

9
Q

You are working on a dataset that contains nine numerical variables. You want to create a scatter plot to see whether the observations could potentially be grouped into clusters of high similarity. How could you achieve this goal?

A

Compute the first two principal components (PCs) using PCA. Then, plot PC1 against PC2 in the scatter plot.
Using K-means or KNN will not solve this question.

You have to apply PCA to reduce the number of features and then plot the results in a scatter plot. Since a two-dimensional scatter plot shows only two variables, option b is the right one.
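A minimal sketch of that reduction, using NumPy's SVD on mean-centered data, is shown below. The random nine-column matrix stands in for the real dataset, `first_two_pcs` is a hypothetical helper, and in practice you may also want to scale the variables first.

```python
import numpy as np

def first_two_pcs(X):
    """Project the rows of X onto the first two principal components."""
    centered = X - X.mean(axis=0)                       # PCA needs mean-centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                          # columns are PC1 and PC2 scores

# Stand-in data (assumption): 100 observations of nine numerical variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))

scores = first_two_pcs(X)
print(scores.shape)
```

The two columns of `scores` can then be passed to any plotting library as the x and y coordinates of the scatter plot.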

10
Q

How should you preprocess your data in order to train a BlazingText model on top of 100 text files?

A

You should create a text file with space-separated tokens. Each line of the file must contain a single sentence. If you have multiple files for training, you should concatenate all of them into a single one.

You should provide just a single file to BlazingText with space-separated tokens, where each line of the file must contain a single sentence.
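The concatenation step can be sketched as follows. This assumes each input line already holds one sentence; sentence splitting and proper tokenization are out of scope here, and `build_corpus` plus the two sample files are hypothetical.

```python
import tempfile
from pathlib import Path

def build_corpus(input_paths, output_path):
    """Merge files into one: one sentence per line, tokens separated by single spaces."""
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    tokens = line.split()      # splits on any whitespace
                    if tokens:                 # drop empty lines
                        out.write(" ".join(tokens) + "\n")
                        written += 1
    return written

# Small demo with two throwaway files instead of the 100 real ones.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_text("the  quick brown fox\n\n", encoding="utf-8")
    second.write_text("jumps over\tthe lazy dog\n", encoding="utf-8")
    merged = Path(tmp) / "corpus.txt"
    line_count = build_corpus([first, second], merged)
    corpus_text = merged.read_text(encoding="utf-8")

print(corpus_text)
```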
