C1 Intro to Probability and Data Analysis with R M4 Exploratory Data Analysis | Inference Intro Flashcards
What is the arithmetic average called?
Mean
The mean is calculated by adding all values and dividing by the number of values.
What term refers to the midpoint of a data set?
Median
The median divides the data into two equal halves.
What is the most frequent observation in a data set called?
Mode
A data set can have more than one mode or none at all.
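The three measures of center can be checked on a small example. This is a minimal sketch using Python's standard `statistics` module; the data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

values = [2, 3, 3, 5, 7, 10]  # hypothetical sample

center_mean = mean(values)        # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 30 / 6 = 5
center_median = median(values)    # even count: average of the middle values 3 and 5
center_modes = multimode(values)  # list of the most frequent value(s); here only 3

# A data set with two equally frequent values has two modes.
two_modes = multimode([1, 1, 2, 2, 3])

print(center_mean, center_median, center_modes, two_modes)
```

`multimode` returns a list, which makes the "more than one mode" case explicit.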
What measure indicates variability around the mean?
Standard deviation
Standard deviation quantifies the amount of variation or dispersion in a set of values.
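As a worked example, the standard deviation can be computed step by step. The sketch below assumes the sample standard deviation (dividing by n - 1), which the card does not specify, and uses hypothetical data.

```python
from math import sqrt
from statistics import mean, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

m = mean(values)                                 # 40 / 8 = 5
squared_devs = [(x - m) ** 2 for x in values]    # squared distances from the mean
sample_variance = sum(squared_devs) / (len(values) - 1)  # assumption: n - 1 denominator
sample_sd = sqrt(sample_variance)

# The library function uses the same n - 1 convention.
library_sd = stdev(values)

print(sample_sd, library_sd)
```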
What is the formula for calculating the range of a data set?
Max - Min
Maximum value - Minimum value found in the data set
The range provides a measure of how spread out the values are.
What does the interquartile range represent?
The interquartile range represents the range of the middle 50% of the distribution.
It is the distance between the first quartile (25th percentile) and the third quartile (75th percentile).
IQR = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles.
The IQR is the length of the box in a box plot.
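The range and the IQR can be compared on the same data. This is a sketch with hypothetical values, using `statistics.quantiles` with its default method; quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 for the same data.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]  # hypothetical sample with one extreme value

q1, q2, q3 = quantiles(values, n=4)    # three cut points: Q1, median, Q3
iqr = q3 - q1                          # spread of the middle 50%
data_range = max(values) - min(values) # maximum value - minimum value

print(q1, q3, iqr, data_range)
```

The single extreme value (50) stretches the range to 49, while the IQR stays at 6 because it only looks at the middle half of the data.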
Fill in the blank: The three commonly used measures of center are mean, median, and _______.
Mode
The mode identifies the value that occurs most often in the data.
True or False: The range is defined as the maximum value minus the minimum value.
True
This measure provides a quick sense of the spread of the data.
What are the three commonly used measures of center?
- mean (the arithmetic average)
- median (the midpoint)
- mode (the most frequent observation)
What are the three commonly used measures of spread?
- standard deviation (variability around the mean)
- range (max-min)
- interquartile range (middle 50% of the distribution)
Which of the following cannot be determined from a boxplot?
Modality. Box plots do not display modality; histograms do.
Modality: whether the distribution is unimodal, bimodal, uniform, etc.
What are box plots and dot plots used for?
To highlight outliers and display the median and interquartile range
What is a common tool for visualizing the relationship between two numerical variables?
Scatter plot
The primary purpose of a scatter plot is to visualize the relationship between two numerical variables.
A scatter plot provides a case-by-case view of data for two numerical variables.
In a scatter plot, the explanatory variable is placed on the x-axis and the response variable on the y-axis.
What can we only talk about when using observational data?
Correlation, not causation.
What does a strong relationship in a scatter plot indicate?
Little scatter around the curve.
What is a naive approach to handling outliers in data analysis?
Immediately excluding them.
What is the purpose of histograms in data visualization?
To visualize the distribution of a single numerical variable
A histogram is a good way to visualize the distribution of a single numerical variable.
In a histogram, the height of each bar represents the number of cases that fall into that interval.
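The counting behind a histogram can be sketched without a plotting library. The data and the bin width of 5 are assumptions made for illustration.

```python
from collections import Counter

values = [1, 2, 2, 3, 5, 6, 6, 7, 8, 12]  # hypothetical sample
bin_width = 5                              # assumed interval width

# Map each value to the left edge of its interval [edge, edge + bin_width).
counts = Counter((v // bin_width) * bin_width for v in values)

# Text version of the histogram: one '#' per case in the interval.
for left in sorted(counts):
    print(f"[{left}, {left + bin_width}): {'#' * counts[left]}")
```

Each row's length plays the role of a bar's height: it is the number of cases in that interval.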
What is a dot plot used for?
Visualizing individual values.
What does a box plot display?
A box plot summarizes a data set using five statistics while also plotting unusual observations.
In a box plot, the median and the interquartile range of the data are prominently displayed.
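The numbers behind a box plot can be sketched as follows. The data are hypothetical, and the 1.5 x IQR rule used here to flag unusual observations is a common convention assumed for this example; the card itself does not state which rule is used.

```python
from statistics import quantiles

values = [12, 15, 17, 18, 20, 21, 23, 24, 26, 29, 60]  # hypothetical sample

q1, med, q3 = quantiles(values, n=4)  # default quartile method of the statistics module
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are flagged as unusual.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

five_number = (min(values), q1, med, q3, max(values))

print(five_number, outliers)
```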
What does an intensity map reveal?
Spatial distribution trends in the data.
When we encounter geographic data, we should create an intensity map, where colors are used to show higher and lower values of a variable.
Intensity maps are generally not very helpful for getting precise values in any given area, but they are very helpful for seeing geographic trends and generating interesting research questions or hypotheses.
What is the meaning of skewness of a distribution?
Skewness refers to the direction of the longer tail of a distribution (left or right).
- left skewed
- symmetric
- right skewed
Distributions are said to be skewed toward the side of the long tail.
What is modality?
Modality is an important aspect of shape to describe a distribution.
Modality refers to the number of peaks in a distribution.
A distribution might be unimodal with one prominent peak, bimodal with two prominent peaks, or uniform with no prominent peaks. With more than two prominent peaks a distribution is usually said to be multimodal.
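One rough way to see modality in numbers is to count local peaks in histogram bar heights. The helper `count_peaks` below is hypothetical and deliberately crude: it treats every bar taller than both neighbors as a peak, so on real data it would also count small bumps that a person would not call prominent.

```python
def count_peaks(counts):
    """Count bars that are strictly taller than both neighbors (crude heuristic)."""
    padded = [0] + list(counts) + [0]  # treat the area outside the histogram as empty
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Hypothetical bar heights for three shapes.
unimodal = count_peaks([1, 3, 6, 3, 1])    # one peak
bimodal = count_peaks([1, 5, 2, 1, 6, 2])  # two peaks
uniform = count_peaks([4, 4, 4, 4])        # flat: no bar stands out

print(unimodal, bimodal, uniform)
```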
What are sample statistics?
Sample statistics are point estimates for the unknown population parameters.
They are measurements calculated from a sample that is representative of the total population.
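The idea of a point estimate can be illustrated by drawing a random sample from a known population. Everything here is hypothetical: the population of the integers 1 to 1000, the sample size of 50, and the fixed seed used to make the sketch repeatable.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so repeated runs draw the same sample

population = list(range(1, 1001))    # hypothetical population
population_mean = mean(population)   # parameter (usually unknown in practice): 500.5

sample = random.sample(population, 50)  # simple random sample without replacement
sample_mean = mean(sample)              # statistic: point estimate of the parameter

print(population_mean, sample_mean)
```

The sample mean will generally not equal the population mean exactly, but a representative sample tends to land near it.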
Compare the mean and median according to the skewness of a distribution.
left skewed: mean < median
symmetric: mean ≈ median
right skewed: mean > median
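These relationships can be checked on small hypothetical data sets. The long tail pulls the mean toward it, while the median stays near the bulk of the data.

```python
from statistics import mean, median

# Hypothetical samples, one per shape.
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]  # long tail on the left
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail on the right

for name, data in [("left", left_skewed), ("symmetric", symmetric), ("right", right_skewed)]:
    print(name, mean(data), median(data))
```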