Data Preparation and Transformation Flashcards

Question 1

Q

A model feature is the number of times a particular drug has been prescribed to the patient on a claim over a period of 2 years. What type of feature is this?

Answer

A

Discrete. The feature is counting the number of times that a particular drug has been prescribed. Individual and countable items are classified as discrete data.

Question 2

Q

You are building an ML model for an educational group that owns schools and universities across the globe. Your model aims to predict how likely a particular student is to leave their studies. Many factors may contribute to school dropout, but one of your features is the current academic stage of each student: preschool, elementary school, middle school, or high school. Which type of feature is this?

Answer

A

Ordinal. The academic stages have an implicit order (preschool comes before elementary school, and so on), so the feature should be treated as a categorical/ordinal variable.
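
A minimal sketch of ordinal encoding for this feature, assuming the stages are mapped to integers that only need to preserve their order. The names `STAGE_ORDER`, `stage_to_code`, and the sample `students` list are illustrative, not part of the original card.

```python
# Hypothetical ordering of the academic stages, lowest to highest.
STAGE_ORDER = ["preschool", "elementary school", "middle school", "high school"]

# Map each stage to its rank; the exact integers are arbitrary as long as
# they preserve the order.
stage_to_code = {stage: rank for rank, stage in enumerate(STAGE_ORDER)}

# Illustrative sample values.
students = ["middle school", "preschool", "high school"]
encoded = [stage_to_code[s] for s in students]
print(encoded)  # [2, 0, 3]
```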

Question 3

Q

You are building a machine learning model for a car insurance company. The company wants to create a binary classification model that predicts how likely its insured vehicles are to be stolen. You have considered many features for this model, including the type of vehicle (economy, compact, premium, luxury, van, sport, and convertible). How would you transform the type of vehicle in order to use it in your model?

Answer

A

Applying one-hot encoding. In this case, we have a categorical/nominal variable (there is no order among the categories). Additionally, the number of unique categories is manageable, so one-hot encoding fits this type of data well.
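
A minimal sketch of one-hot encoding for the vehicle types listed in the question. The helper `one_hot` is a hypothetical name used for illustration. In practice a library encoder would usually do this step.

```python
VEHICLE_TYPES = ["economy", "compact", "premium", "luxury", "van", "sport", "convertible"]

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

row = one_hot("van", VEHICLE_TYPES)
print(row)  # [0, 0, 0, 0, 1, 0, 0]
```

Each vehicle type becomes its own binary column, so the model never sees an artificial ordering among the types.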

Question 4

Q

You are working as a data scientist for a financial company. The company wants to create a model that aims to classify improper payments. You have decided to use “type of transaction” as one of your features (local, international, pre-approved, and so on). After applying one-hot encoding to this variable, you realize that your dataset has many more variables, and your model is taking a lot of time to train. How could you potentially solve this problem?

Answer

A

By analyzing which types of transactions have the most impact on improper/proper payments, and then applying one-hot encoding only to that reduced set of types. Your transformation resulted in more features because the original variable has an excessive number of categories. One-hot encoding is the right kind of approach, since the variable is a nominal feature, but the number of levels (unique values) for that feature is probably too high.

In this case, you could do exploratory data analysis to understand the most important types of transactions for your problem. Once you know that, you can restrict the transformation to just those specific types (reducing the number of dummy variables and the sparsity of your data). It is worth adding that you lose some information in this process, because your dummy variables now cover only a subset of categories, but it is a valid approach.
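
A minimal sketch of one-hot encoding restricted to a subset of categories. It assumes the subset was already chosen through exploratory analysis. `SELECTED_TYPES`, `one_hot_selected`, and the sample transactions (including the "wire" type) are hypothetical.

```python
# Hypothetical subset picked after exploratory analysis.
SELECTED_TYPES = ["international", "pre-approved"]

def one_hot_selected(value, selected):
    """One dummy per selected category; any other value maps to all zeros."""
    return [1 if value == c else 0 for c in selected]

# Illustrative transaction types; "local" and "wire" are not in the subset.
transactions = ["local", "international", "pre-approved", "wire"]
encoded = [one_hot_selected(t, SELECTED_TYPES) for t in transactions]
print(encoded)  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

Note that "local" and "wire" end up with the same all-zero row, which is the information loss described above.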

Question 5

Q

Consider a dataset that stores the salaries of employees in a particular column. The mean salary in this column is $2,000, and the standard deviation is $300. What is the standard scaled value of someone who earns $3,000?
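
A minimal sketch of the computation, assuming "standard scaled" refers to the usual z-score, z = (x - mean) / std. The function name `standard_scale` is illustrative.

```python
def standard_scale(x, mean, std):
    """Z-score: how many standard deviations x lies from the mean."""
    return (x - mean) / std

z = standard_scale(3000, mean=2000, std=300)
print(round(z, 2))  # 3.33
```

Under that assumption, a salary of $3,000 sits about 3.33 standard deviations above the mean.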

Question 6

Q

Which types of data transformation can we apply to convert a continuous variable into a binary variable?

Answer

A

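Binning and one-hot encoding. Binning groups the continuous values into categories, and one-hot encoding turns each category into a binary column.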
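
A minimal sketch of binning followed by one-hot encoding. The bin edges, labels, and sample salaries are hypothetical values chosen for illustration.

```python
import bisect

# Hypothetical bin edges: values below 1500 are "low", values from 1500 up to
# (but not including) 2500 are "mid", and values of 2500 or more are "high".
EDGES = [1500, 2500]
LABELS = ["low", "mid", "high"]

def to_bin(x):
    """Map a continuous value to its bin label."""
    return LABELS[bisect.bisect_right(EDGES, x)]

def bin_then_one_hot(x):
    """Bin the value, then emit one binary column per bin."""
    label = to_bin(x)
    return [1 if label == name else 0 for name in LABELS]

salaries = [1200.0, 2000.0, 3000.0]
encoded = [bin_then_one_hot(s) for s in salaries]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```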

Question 7

Q

You are a data scientist for a financial company and you have been assigned the task of creating a binary classification model to predict whether a customer will leave the company or not (also known as churn). During your exploratory work, you realize that there is a particular feature (credit utilization amount) with some missing values. This variable is expressed in real numbers; for example, $1,000. What would be the fastest approach to dealing with those missing values, assuming you don’t want to lose information?

Answer

A

Replacing the missing data with the mean or median value of the variable.
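
A minimal sketch of mean or median imputation, assuming missing values are represented as `None`. The sample `utilization` values and the helper `impute` are illustrative.

```python
from statistics import mean, median

# Illustrative credit utilization amounts; None marks a missing value.
utilization = [1000.0, None, 1500.0, 250.0, None, 5000.0]

def impute(values, strategy=median):
    """Fill missing entries with a statistic computed from the observed ones."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

filled_median = impute(utilization)        # fills with 1250.0
filled_mean = impute(utilization, mean)    # fills with 1937.5
print(filled_median)
print(filled_mean)
```

The median is less sensitive to extreme values such as the 5000.0 entry, which is why it is often preferred for skewed amounts.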

Question 8

Q

You have to create a machine learning model for a particular client, but you realize that most of the features have more than 50% of their data missing. What is your best option in this critical case?

Answer

A

Check with the dataset owner whether you can retrieve the missing data from somewhere else.

Question 9

Q

You are working as a senior data scientist for a human resources company. You are creating a machine learning model with an algorithm that does not perform well on skewed features. Which transformations could you apply to a skewed feature to reduce its skewness?

Answer

A

Log or Box-Cox transformation. Power transformations are well suited to reducing skewness. In particular, you could apply the log transformation or the Box-Cox transformation (both require strictly positive values) to make the distribution more similar to a Gaussian one.
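
A minimal sketch of how a log transformation reduces skewness. It uses the standard Box-Cox definition, where lambda = 0 reduces to the natural log. The sample salaries are illustrative and deliberately geometric, so the log makes them symmetric. Real data would rarely be this clean.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def box_cox(x, lam):
    """Box-Cox transform for x > 0; lam == 0 is the natural log."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

# Illustrative right-skewed values (each one doubles the previous).
salaries = [1000, 2000, 4000, 8000, 16000]
logged = [box_cox(s, 0) for s in salaries]

skew_raw = skewness(salaries)   # clearly positive (right-skewed)
skew_log = skewness(logged)     # close to zero after the log transform
print(skew_raw, skew_log)
```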

Question 10

Q

You are working on a fraud identification problem where most of your labeled data belongs to a single class (not fraud). Only 0.1% of the data refers to fraudulent cases. Which modeling techniques would you propose for this use case?

Answer

A

Applying random oversampling to create copies of the fraudulent cases.

Applying random undersampling to remove observations from the non-fraudulent cases.
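
A minimal sketch of both resampling techniques using only the standard library. The toy rows, the class sizes, and the helper names are hypothetical. Each technique is shown independently, and in practice resampling is normally applied to the training split only.

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate minority rows (sampling with replacement) up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, rng):
    """Keep a random subset of majority rows (sampling without replacement)."""
    return rng.sample(majority, target_size)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Toy rows of the form (id, label): 5 fraud cases and 995 non-fraud cases.
fraud = [(i, 1) for i in range(5)]
not_fraud = [(i, 0) for i in range(5, 1000)]

oversampled_fraud = random_oversample(fraud, len(not_fraud), rng)
undersampled_not_fraud = random_undersample(not_fraud, len(fraud), rng)
print(len(oversampled_fraud), len(undersampled_not_fraud))  # 995 5
```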

Question 11

Q

You are preparing text data for machine learning. This time, you want to create a bigram bag-of-words (BoW) matrix on top of the following texts:

“I will master this certification exam”

“I will pass this certification exam”

How many rows and columns would you have on your BoW matrix representation?

Answer

A

2 rows and 7 columns. Each text is one row, and the two texts together contain 7 unique bigrams.
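
A minimal sketch that counts the bigrams, assuming lowercase whitespace tokenization that keeps every word. Some library tokenizers (for example, scikit-learn's `CountVectorizer` with its default token pattern) ignore one-character tokens such as "I", which would change the column count.

```python
texts = [
    "I will master this certification exam",
    "I will pass this certification exam",
]

def bigrams(text):
    """Return the adjacent word pairs of a text, lowercased."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Columns: every unique bigram across both texts, in sorted order.
vocabulary = sorted({bg for t in texts for bg in bigrams(t)})

# Rows: one per text, holding the count of each bigram.
matrix = [[bigrams(t).count(bg) for bg in vocabulary] for t in texts]
print(len(matrix), len(vocabulary))  # 2 7
```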

(11 cards)