3 - Data Preparation Flashcards

1
Q

What data sets are used in the bank marketing analysis?

A

bank_marketing_training and bank_marketing_test data sets

These data sets are adapted from the bank-additional-full.txt data set from the UCI Machine Learning Repository.

2
Q

What are the four predictors used in the analysis?

A
  • age
  • education
  • previous_outcome
  • days_since_previous

The target response is whether contacts subscribe to a term deposit account.

3
Q

How many records are in the bank_marketing_training data set?

A

26,874 records

4
Q

How many records are in the bank_marketing_test data set?

A

10,255 records

5
Q

What is the first phase of the Data Science Methodology?

A

Problem Understanding Phase

6
Q

What is one objective of the bank marketing analysis?

A

Learn about potential customers’ characteristics

7
Q

What is another objective of the bank marketing analysis?

A

Develop a profitable method of identifying likely positive responders

8
Q

What is a method to learn about potential customers?

A

Use Exploratory Data Analysis

9
Q

Which classification models can be developed for the analysis?

A
  • Decision Trees
  • Random Forests
  • Naïve Bayes Classification
  • Neural Networks
  • Logistic Regression
10
Q

What is the purpose of adding an index field?

A

Acts as an ID field and tracks the sort order of records

11
Q

What is the command to read a CSV file in Python?

A

pd.read_csv()

12
Q

How do you create an index field in Python?

A

bank_train['index'] = pd.Series(range(0,26874))
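
As a minimal sketch of the same idea, the snippet below builds the index from the number of rows instead of hard-coding 26,874. The four-row DataFrame is a hypothetical stand-in for the real training file.

```python
import pandas as pd

# Hypothetical stand-in for the training data; the real set has 26,874 records.
bank_train = pd.DataFrame({"age": [56, 57, 37, 40]})

# One index value per record, starting at 0 and sized from the data itself.
bank_train["index"] = pd.Series(range(len(bank_train)))

print(bank_train)
```

Sizing the range with len() keeps the command valid if the number of records changes.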

13
Q

What function in R provides the number of records in a data set?

A

dim()

14
Q

What is the misleading value in the days_since_previous field?

A

999

15
Q

What value should replace the misleading field value of 999 in Python?

A

np.nan (the np.NaN alias was removed in NumPy 2.0)
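
A minimal sketch of the replacement, assuming a small hypothetical DataFrame in place of the real training data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 999 is the misleading code described on the card.
bank_train = pd.DataFrame({"days_since_previous": [999, 6, 999, 3]})

# Swap the 999 code for a true missing value.
bank_train["days_since_previous"] = (
    bank_train["days_since_previous"].replace(999, np.nan)
)

# Missing values are skipped by default in summaries such as the mean.
print(bank_train["days_since_previous"].mean())
```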

16
Q

What command is used to create a histogram in Python?

A

plot(kind='hist')

17
Q

How do you change misleading field values in R?

A

bank_train$days_since_previous <- ifelse(test = bank_train$days_since_previous == 999, yes = NA, no = bank_train$days_since_previous)

18
Q

What is the purpose of re-expressing categorical data as numeric?

A

To provide information on the relative differences among categories

19
Q

What issue arises if categorical data is left unchanged?

A

Data science algorithms would not recognize the ordering of categories

20
Q

What is the command to view the first six records in R?

A

head()

21
Q

Fill in the blank: The bank marketing data sets are used for a _______ campaign.

A

phone-based direct marketing

22
Q

What is the goal of transforming data values into numeric values?

A

To ensure that one value is larger than another while preserving relative differences among various categories.

23
Q

What is the numeric value assigned to ‘illiterate’ in the education variable?

A

0

24
Q

What is the numeric value assigned to ‘high.school’ in the education variable?

A

12

25
What Python command is used to replicate the education variable?
bank_train['education_numeric'] = bank_train['education']
26
In Python, how do you replace categorical values with numeric ones in a DataFrame?
bank_train.replace(dict_edu, inplace=True)
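
The sketch below illustrates the re-expression on a hypothetical four-row sample. It covers only the three education values given in these cards (illiterate, high.school, unknown); the remaining categories would need their own entries. It also swaps in Series.map for DataFrame.replace, an assumption made here so the new column comes out with a numeric dtype directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the education field.
bank_train = pd.DataFrame(
    {"education": ["high.school", "illiterate", "unknown", "high.school"]}
)

# Partial mapping: only the values stated on the cards.
edu_map = {"illiterate": 0, "high.school": 12, "unknown": np.nan}

# Keep the original field and store the numeric version alongside it.
bank_train["education_numeric"] = bank_train["education"].map(edu_map)

print(bank_train)
```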
27
What R function is used to replace values in a variable according to specified rules?
revalue() (from the plyr package)
28
Fill in the blank: The command used in Python to calculate the z-score is _______.
stats.zscore()
29
What is the purpose of standardizing numeric fields?
To ensure the field mean equals 0 and the field standard deviation equals 1.
30
What is considered an outlier in the context of z-values?
A data value with a z-value greater than 3 or less than -3.
31
How do you identify outliers using Python?
bank_train.query('age_z > 3 | age_z < -3')
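
As a hedged illustration, the sketch below standardizes a hypothetical age column by hand and then applies the same query. The population standard deviation (ddof=0) is used because that is the default of scipy's stats.zscore; the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical ages: twenty typical values and one extreme value.
bank_train = pd.DataFrame({"age": [30] * 20 + [95]})

# z = (value - mean) / standard deviation, using the population SD (ddof=0).
age = bank_train["age"]
bank_train["age_z"] = (age - age.mean()) / age.std(ddof=0)

# Records more than 3 standard deviations from the mean.
outliers = bank_train.query("age_z > 3 | age_z < -3")

# Sort so the most extreme high-end value appears first.
print(outliers.sort_values(["age_z"], ascending=False))
```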
32
What command is used in R to sort a data set by a specific variable?
order()
33
What is the default behavior of the scale() function in R?
It centers and scales the variable to calculate the z-score.
34
What does the command bank_train$education_numeric <- as.numeric(levels(edu.num))[edu.num] do in R?
Converts factor levels of edu.num to numeric and assigns them to education_numeric.
35
What is the numeric value assigned to 'unknown' in the education variable?
Missing (np.NaN in Python, NA in R)
36
What is the mean number of contacts per customer in the example?
2.6
37
What does the replace() function do in Python?
Replaces values in a DataFrame according to a specified dictionary.
38
True or False: Outliers should always be removed from the dataset.
False
39
What does the command bank_train.sort_values(['age_z'], ascending=False) do in Python?
Sorts the DataFrame by the age_z variable in descending order.
40
What is the first step to reexpress categorical field values using Python?
Create a dictionary for converting categorical values to numeric values.
41
Fill in the blank: In R, the function _______ is used to center a variable by subtracting its mean.
scale()
42
How can you view the first 10 records of a sorted dataset in R?
bank_train_sort[1:10, ]
43
What is the purpose of the z-score?
To measure how many standard deviations a data value is from the mean.
44
What are the two main objectives of the bank_marketing analysis?
  1. Understanding potential customers
  2. Developing profitable models
45
What are the three ways to learn about potential customers?
  • Analyze existing data
  • Conduct surveys
  • Use focus groups
46
How can we accomplish the objective of developing profitable models for identifying likely positive responders?
By using statistical techniques and machine learning algorithms
47
Why might it be a good idea to add an index field to the data set?
  • To uniquely identify each record
  • To facilitate data manipulation
48
Why is the field days_since_previous essentially useless until we handle the 999 code?
Because 999 is a placeholder code rather than an actual number of days; treated as a real value, it distorts the field's distribution and summary statistics
49
Why was it important to reexpress education as a numeric field?
To enable quantitative analysis and modeling
50
If a data value has a z-value of 1, how may we interpret this value?
It is one standard deviation above the mean
51
What is the rough rule of thumb for identifying outliers using z-values?
Values with z-scores greater than 3 or less than -3 are considered outliers
52
Should outliers be automatically removed or changed? Why or why not?
No, because outliers may contain valuable information
53
What should we do with outliers we have identified?
Investigate their cause and decide whether to keep or modify them
54
What is the first step to work with the bank_marketing_training data set?
Derive an index field and add it to the data set
55
What should be done for the days_since_previous field regarding the value 999?
Change it to the appropriate code for missing values
56
What should be done to the education field?
Reexpress the field values as numeric values
57
What is the task for the age field?
Standardize the field age and print the first 10 records
58
What should be done to identify outliers in the age_z field?
Obtain a listing of all records that are outliers
59
How should jobs with less than 5% of records be handled?
Combine them into a single category called 'other'
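
One way this could be done in Python is sketched below. The job labels and counts are hypothetical, chosen so that two categories fall under the 5% threshold.

```python
import pandas as pd

# Hypothetical job field: 100 records, two categories below 5%.
jobs = ["admin."] * 50 + ["technician"] * 46 + ["student"] * 3 + ["unknown"] * 1
bank_train = pd.DataFrame({"job": jobs})

# Proportion of records in each category.
share = bank_train["job"].value_counts(normalize=True)

# Categories holding less than 5% of the records.
rare = share[share < 0.05].index

# Keep common categories as they are; relabel the rare ones as 'other'.
bank_train["job"] = bank_train["job"].where(~bank_train["job"].isin(rare), "other")

print(bank_train["job"].value_counts())
```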
60
What should the default predictor be renamed to?
credit_default
61
How should the month variable be modified?
Change values to 1–12 but keep it as categorical
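
A minimal sketch of this recoding follows. It assumes the month field holds lowercase three-letter abbreviations such as 'may'; that format is an assumption, so check the actual values before reusing the mapping.

```python
import pandas as pd

# Assumed label format: lowercase three-letter month abbreviations.
month_map = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Hypothetical sample of the month field.
bank_train = pd.DataFrame({"month": ["may", "jul", "nov", "may"]})

# Recode to 1-12, then mark the result as categorical rather than numeric.
bank_train["month"] = bank_train["month"].map(month_map).astype("category")

print(bank_train["month"])
```

Storing the result with the category dtype keeps later steps from treating the month numbers as a continuous quantity.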
62
For the duration field, what are the tasks to be completed?
  • Standardize the variable
  • Identify outliers and the most extreme outlier
63
What should be done for the campaign field?
  • Standardize the variable
  • Identify outliers and the most extreme outlier
64
What does the Nutrition_subset data set contain?
Weight in grams, amount of saturated fat, and cholesterol for 961 foods
65
What should be done with the saturated fat data?
  • Sort by saturated fat
  • List the five food items highest in saturated fat
66
What is the importance of comparing food items of different sizes?
The comparison may not be valid, because a larger portion naturally contains more fat
67
How can saturated_fat_per_gram be derived?
By dividing the amount of saturated fat by the weight in grams
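
The division can be sketched as follows. The column names and the three rows are hypothetical placeholders, not values from the Nutrition_subset data; they are chosen only to show how the ranking can change once size is accounted for.

```python
import pandas as pd

# Made-up illustrative rows; column names are assumptions.
nutrition = pd.DataFrame({
    "food_item": ["food A", "food B", "food C"],
    "weight_in_grams": [100.0, 10.0, 200.0],
    "saturated_fat": [5.0, 2.0, 8.0],
})

# Saturated fat per gram = saturated fat / weight in grams.
nutrition["saturated_fat_per_gram"] = (
    nutrition["saturated_fat"] / nutrition["weight_in_grams"]
)

# Rank by the raw amount and by the per-gram amount.
by_total = nutrition.sort_values("saturated_fat", ascending=False)
by_gram = nutrition.sort_values("saturated_fat_per_gram", ascending=False)

print(by_total["food_item"].tolist())
print(by_gram["food_item"].tolist())
```

In this toy data the largest item leads the raw ranking, while the smallest item leads the per-gram ranking.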
68
What should be done after deriving saturated_fat_per_gram?
  • Sort by saturated_fat_per_gram
  • List the five food items highest in saturated fat per gram
69
What is the task for cholesterol_per_gram?
  • Derive the variable
  • Sort and list the five food items highest in cholesterol per gram
70
What should be done for saturated_fat_per_gram regarding outliers?
  • Standardize the field
  • List high-end outliers and count low-end outliers
71
What should be done for cholesterol_per_gram regarding outliers?
Standardize the field and list high-end outliers
72
What is the first step for the adult_ch3_training data set?
Add a record index field
73
What should be checked for the education field?
Determine if any outliers exist
74
What are the tasks for the age field?
  • Standardize the variable
  • Identify outliers and the most extreme outlier
75
What is the flag for capital-gain?
capital-gain-flag equals 0 when capital gain is zero, and 1 otherwise
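
A minimal sketch of deriving this flag, assuming a DataFrame named adult (a hypothetical name) with a capital-gain column and made-up sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the capital-gain field.
adult = pd.DataFrame({"capital-gain": [0, 2174, 0, 14084]})

# Flag is 0 where capital gain is zero, and 1 otherwise.
adult["capital-gain-flag"] = np.where(adult["capital-gain"] == 0, 0, 1)

print(adult)
```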
76
What should be done for records with age at least 80?
Construct a histogram of age and analyze the results