Exploratory Data Analysis Flashcards
(38 cards)
What does EDA stand for?
exploratory data analysis
What is the purpose of EDA?
to convert the available data from their raw form into an informative one in which the main features of the data are illuminated
What are the three things we should always do when performing EDA?
- use visual displays plus numerical summaries
- describe the overall pattern and mention any striking deviations from that pattern
- interpret the results we got in context
What are the two categories of variable when examining the distribution of a single variable?
categorical and quantitative
What are the four types of methods used to summarize the distribution of a categorical variable?
- pie chart
- bar chart
- pictogram (can be misleading)
- category (group) percentages
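As a rough illustration of category (group) percentages, the sketch below counts each category and divides by the total number of observations. The function name `category_percentages` and the sample data are made up for this example and are not part of the flashcards.

```python
from collections import Counter


def category_percentages(values):
    """Return {category: percentage of all observations} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical sample data, invented for illustration only.
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]
percentages = category_percentages(blood_types)
```

The same counts could feed a bar chart or pie chart; the percentages are the numerical summary that accompanies the picture.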
What are the three types of methods used to summarize the distribution of a quantitative variable?
- histogram
- stemplot
- descriptive statistics
When describing the distribution displayed by a histogram or stemplot, what are the four factors that should be described?
Overall pattern:
- shape
- center
- spread
Deviations from the pattern
- outliers
What do descriptive statistics generally cover?
measure of center plus measures of spread
What descriptive statistics should be included in the numerical summary of a quantitative variable when the distribution is symmetric with no outliers?
- mean
- standard deviation
What descriptive statistics should be included in the numerical summary of a quantitative variable when the distribution is skewed?
- five-number summary, with the median as the center and the IQR as the spread
What makes up the five number summary?
- Min (minimum value)
- Q1 (quartile 1)
- M (median)
- Q3 (quartile 3)
- Max (maximum value)
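A minimal sketch of the five-number summary follows. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the number of observations is odd); other conventions exist and can give slightly different quartiles. The function name is hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = xs[:half]
    # For odd n, skip the middle observation so it belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


# Hypothetical data, invented for illustration only.
summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])
```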
What rule is used for identifying outliers?
the 1.5(IQR) criterion (IQR = interquartile range)
What are the two cutoffs under the 1.5(IQR) criterion for outliers?
below Q1 - 1.5(IQR)
above Q3 + 1.5(IQR)
How do you find the 1.5(IQR) cutoffs?
- IQR = Q3 - Q1
- lower cutoff: Q1 - 1.5(IQR)
- upper cutoff: Q3 + 1.5(IQR)
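The 1.5(IQR) criterion can be sketched as below: an observation is flagged when it falls below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). The quartiles are passed in as arguments so the example does not depend on any particular quartile convention; the function names and numbers are hypothetical.

```python
def iqr_cutoffs(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5(IQR) criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Return the observations that fall outside the 1.5(IQR) cutoffs."""
    lower, upper = iqr_cutoffs(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical quartiles and data, invented for illustration only.
cutoffs = iqr_cutoffs(10, 20)
suspects = find_outliers([1, 12, 15, 18, 40, -6], q1=10, q3=20)
```

Flagging a value this way only marks it as a suspected outlier; whether to keep it is a separate judgment about how the value was produced.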
What are three factors to consider when deciding whether to include outliers in your data?
- Even if it is an extreme value, an outlier that was produced by the same physical/biological process as the rest of the data, and is expected to eventually occur again, should be included in the data
- If the outlier was produced under fundamentally different conditions or by a different process from the rest of the data, it can be removed if the goal is to investigate only the process that produced the rest of the data
- The outlier may indicate a mistake in the data (like a typo or measuring error), and should be corrected if possible or else removed from the data
Which type of relationship is examined using boxplots?
C > Q (categorical explanatory variable, quantitative response variable)
In which distribution shape does the Standard Deviation Rule apply?
normal distribution
What is the Standard Deviation Rule?
tells us what percentage of observations fall within 1, 2, or 3 standard deviations of the mean
What are the percentage ranges under the Standard Deviation Rule?
- 68% fall within 1 standard deviation of the mean
- 95% fall within 2 standard deviations of the mean
- 99.7% fall within 3 standard deviations of the mean
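As a quick check of these percentages, the sketch below uses the standard normal distribution from Python's `statistics` module to compute the share of observations within k standard deviations of the mean. The helper name `share_within` is made up for this example; the rule's 68%, 95%, and 99.7% are rounded versions of the values it returns.

```python
from statistics import NormalDist


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


# Rounded to one decimal place, as percentages.
rule = {k: round(100 * share_within(k), 1) for k in (1, 2, 3)}
```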
What is the symbol for the mean?
x̄ (x with a line over it)
What is the symbol for the median?
M (sometimes written as x with a tilde over it)
How is standard deviation calculated? (5 steps)
- find the mean of the data
- find the deviations from the mean (subtract the mean from each observation)
- square each of the deviations
- find the variance of the data: average the squared deviations by adding them up and dividing by (n-1), where n = the sample size
- find the SD by finding the square root of the variance
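The five steps above can be sketched directly in code. The function name and the sample data are hypothetical; Python's built-in `statistics.stdev` performs the same sample calculation and could be used instead.

```python
from math import sqrt


def sample_standard_deviation(data):
    """Sample standard deviation, following the five steps one at a time."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n                    # step 1: mean of the data
    deviations = [x - mean for x in data]   # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]  # step 3: square each deviation
    variance = sum(squared) / (n - 1)       # step 4: variance, dividing by n - 1
    return sqrt(variance)                   # step 5: square root of the variance


# Hypothetical data: mean 3, squared deviations sum to 10, variance 10 / 4 = 2.5.
sd = sample_standard_deviation([1, 2, 3, 4, 5])
```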
What type(s) of graphical display and/or numerical summaries are used for C > Q? (2)
- boxplots
- numerical summaries (descriptive statistics of the quantitative variable for each category)
What types of graphical display and/or numerical summaries are used for C > C? (2)
- two-way table
- conditional percentages
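A minimal sketch of a two-way table with conditional percentages follows. It assumes the percentages are computed within each category of the explanatory variable (row percentages), so that the groups can be compared even when they differ in size. The function name and the data are made up for illustration.

```python
from collections import Counter, defaultdict


def conditional_percentages(pairs):
    """pairs: iterable of (explanatory, response) categories.

    Returns {explanatory: {response: percentage within that explanatory group}}.
    """
    table = defaultdict(Counter)  # the two-way table of counts
    for explanatory, response in pairs:
        table[explanatory][response] += 1
    result = {}
    for explanatory, counts in table.items():
        row_total = sum(counts.values())
        result[explanatory] = {
            response: 100 * count / row_total for response, count in counts.items()
        }
    return result


# Hypothetical data: (studied?, exam result), invented for illustration only.
observations = (
    [("yes", "pass")] * 3 + [("yes", "fail")] + [("no", "pass")] + [("no", "fail")]
)
by_group = conditional_percentages(observations)
```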