4 - Exploratory Data Analysis Flashcards
What is the primary purpose of exploratory data analysis (EDA)?
To explore data without a priori hypotheses and uncover relationships.
What is a hypothesis test (HT)?
A method to test specific hypotheses about data using statistical methods.
What does EDA allow the user to do?
Explore relationships, derive new variables, and use binning to increase predictive value.
What is the relationship explored in the bar graphs discussed?
The relationship between a categorical predictor and the target variable.
What does ‘previous_outcome’ refer to?
The result of a previous marketing campaign with the same customer.
What is the advantage of normalized bar graphs?
They allow easier comparison of response proportions between categories.
What are two best practices when working with bar graphs?
- Supplement unclear bar graphs with normalized versions
- Provide non-normalized graphs to indicate original distribution.
How do you create a contingency table in Python?
Using the crosstab() command.
What should the response variable represent in a contingency table?
The rows.
How do you calculate column percentages in Python?
Using the sum() and div() commands.
What is a histogram?
A graphical representation of a frequency distribution for a numerical variable.
What is the benefit of using a normalized histogram?
It helps distinguish response patterns more clearly.
What is the best practice for histograms?
- Use non-normalized histograms for original distributions
- Use normalized histograms for response patterns.
What command is used in R to create a contingency table?
The table() command.
What is the purpose of the addmargins() command in R?
To add row and column totals to a contingency table.
What is the significance of the geom_bar() function in ggplot2?
It specifies that a bar chart should be created.
Fill in the blank: EDA is often preferred when clients have _______ about the data.
no salient a priori notions
True or False: A normalized bar graph shows the original distribution of data.
False
What is the main drawback of a normalized histogram?
It does not indicate the original distribution of the data.
What is the purpose of using a non-normalized histogram?
To obtain the original distribution of the data values.
What is the benefit of using a normalized histogram?
To help better distinguish the response patterns.
What Python package is used for constructing histograms?
matplotlib
What command in Python creates a stacked histogram?
plt.hist()
In the plt.hist() command, what does the parameter ‘stacked = True’ do?
It stacks the two variables in the histogram.