3 - Data Preparation Flashcards
What data sets are used in the bank marketing analysis?
bank_marketing_training and bank_marketing_test data sets
These data sets are adapted from the bank-additional-full.txt data set from the UCI Machine Learning Repository.
What are the four predictors used in the analysis?
- age
- education
- previous_outcome
- days_since_previous
The target response is whether contacts subscribe to a term deposit account.
How many records are in the bank_marketing_training data set?
26,874 records
How many records are in the bank_marketing_test data set?
10,255 records
What is the first phase of the Data Science Methodology?
Problem Understanding Phase
What is one objective of the bank marketing analysis?
Learn about potential customers’ characteristics
What is another objective of the bank marketing analysis?
Develop a profitable method of identifying likely positive responders
What is a method to learn about potential customers?
Use Exploratory Data Analysis
Which classification models can be developed for the analysis?
- Decision Trees
- Random Forests
- Naïve Bayes Classification
- Neural Networks
- Logistic Regression
What is the purpose of adding an index field?
Acts as an ID field and tracks the sort order of records
What is the command to read a CSV file in Python?
pd.read_csv()
How do you create an index field in Python?
bank_train['index'] = pd.Series(range(0, 26874))
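A minimal sketch of the two Python cards above, assuming pandas is installed. The three-row CSV text is made up for illustration and is read from memory so the snippet runs without the real file; with the actual data you would pass the file path to pd.read_csv() instead. Using len(bank_train) in place of the hard-coded 26874 is a small variation that keeps the index correct for any record count.

```python
import io

import pandas as pd

# Made-up stand-in for the training file (illustrative values only).
csv_text = (
    "age,education,days_since_previous\n"
    "56,high.school,999\n"
    "41,illiterate,6\n"
    "37,high.school,999\n"
)

# pd.read_csv() accepts a file path or any file-like object.
bank_train = pd.read_csv(io.StringIO(csv_text))

# Index field: 0, 1, ..., n-1, recording the original sort order.
bank_train['index'] = pd.Series(range(0, len(bank_train)))

print(bank_train[['index', 'age']])
```

Because the index is stored as an ordinary column, the original order can be restored after any later sort with bank_train.sort_values('index').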
What function in R provides the number of records in a data set?
dim() (returns the number of records and the number of fields)
What is the misleading value in the days_since_previous field?
999
What value should replace the misleading field value of 999 in Python?
np.nan (older code writes np.NaN, an alias that was removed in NumPy 2.0)
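One way the replacement could look in pandas, shown on a made-up four-value series rather than the real days_since_previous column. The name days_clean is only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; 999 is the code that marks "no previous contact".
days = pd.Series([999, 6, 999, 3], name='days_since_previous')

# Swap the misleading code for a true missing value.
days_clean = days.replace({999: np.nan})

# Count how many values are now missing.
print(days_clean.isna().sum())
```

Once 999 is stored as a missing value, summaries such as the mean and plots such as the histogram skip it instead of treating it as a count of 999 days.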
What command is used to create a histogram in Python?
plot(kind = 'hist')
How do you change misleading field values in R?
bank_train$days_since_previous <- ifelse(test = bank_train$days_since_previous == 999, yes = NA, no = bank_train$days_since_previous)
What is the purpose of re-expressing categorical data as numeric?
To provide information on the relative differences among categories
What issue arises if categorical data is left unchanged?
Data science algorithms would not recognize the ordering of categories
What is the command to view the first six records in R?
head()
Fill in the blank: The bank marketing data sets are used for a _______ campaign.
phone-based direct marketing
What is the goal of transforming data values into numeric values?
To ensure that one value is larger than another while preserving relative differences among various categories.
What is the numeric value assigned to ‘illiterate’ in the education variable?
0
What is the numeric value assigned to ‘high.school’ in the education variable?
12
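A rough sketch of the re-expression described in the last few cards, assuming pandas. Only the two codes stated on the cards (illiterate = 0, high.school = 12) are included; the sample values and the names education_years and education_numeric are made up for illustration, and the remaining education levels would need their own entries in the dictionary.

```python
import pandas as pd

# Made-up sample of the education field.
education = pd.Series(['illiterate', 'high.school', 'high.school'], name='education')

# Category-to-number lookup; only the two values given on the cards.
education_years = {'illiterate': 0, 'high.school': 12}

# map() looks each category up in the dictionary.
# Any category missing from the dictionary would become NaN.
education_numeric = education.map(education_years)

print(education_numeric)
```

With numeric codes, an algorithm can see both that high.school ranks above illiterate and how far apart the two categories are.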