1.2 Summarizing Data Using Frequency Distributions Flashcards
A frequency distribution
summarizes the values of a numerical variable into a few intervals
helpful when working with large data sets
the types of frequency
Absolute frequency
Relative frequency
Cumulative relative frequency
Absolute frequency
The actual number of observations in each interval
Relative frequency
The absolute frequency divided by the total number of observations
Cumulative relative frequency
The relative frequencies added up from the first interval to the current interval
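The three frequency measures can be sketched in a few lines of Python. The interval labels and counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical absolute frequencies for four intervals (illustrative numbers only).
absolute = {"0-10": 2, "10-20": 4, "20-30": 3, "30-40": 1}

total = sum(absolute.values())  # total number of observations

# Relative frequency = absolute frequency / total number of observations.
relative = {interval: count / total for interval, count in absolute.items()}

# Cumulative relative frequency = relative frequencies summed from the
# first interval up to and including the current one.
cumulative = {}
running = 0
for interval, count in absolute.items():
    running += count
    cumulative[interval] = running / total

for interval in absolute:
    print(interval, absolute[interval], relative[interval], cumulative[interval])
```

The cumulative relative frequency of the last interval is always 1, which is a quick check that every observation was counted.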
the steps to construct a frequency distribution
- Sort the data in ascending order.
- Calculate the range of the data.
Range = Maximum value − Minimum value
- Determine the desired number of intervals, k
- Determine the interval width.
Interval width = Range/k
- Construct a table based on the minimum value, maximum value, the desired number of intervals (k), and the interval width.
- Assign the observations into the respective intervals. Each observation will only fall in one interval since the intervals will not overlap.
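The steps above can be sketched as a small Python function. This is a minimal illustration, not a definitive implementation: the function name `frequency_distribution`, the sample data, and the choice to place the maximum value in the last interval are all assumptions made for the example.

```python
def frequency_distribution(data, k):
    """Return (lower, upper, absolute_frequency) for k equal-width intervals."""
    values = sorted(data)                    # sort the data in ascending order
    minimum, maximum = values[0], values[-1]
    data_range = maximum - minimum           # range = maximum - minimum
    if data_range == 0:
        raise ValueError("all observations are equal; no intervals to build")
    width = data_range / k                   # interval width = range / k
    counts = [0] * k
    for value in values:
        # Assumption: the maximum value is placed in the last interval,
        # so every observation falls in exactly one interval.
        index = min(int((value - minimum) / width), k - 1)
        counts[index] += 1
    return [(minimum + i * width, minimum + (i + 1) * width, counts[i])
            for i in range(k)]


# Hypothetical sample of ten observations, split into k = 4 intervals.
sample = [2, 5, 7, 8, 10, 12, 13, 15, 18, 22]
for lower, upper, count in frequency_distribution(sample, 4):
    print(f"{lower:g} to {upper:g}: {count}")
```

With this sample the range is 20 and the interval width is 5, so the intervals start at 2, 7, 12 and 17.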
a contingency table
used to present the frequency distributions for multiple categorical variables simultaneously
the data in a contingency table can display absolute frequencies or relative frequencies
joint frequencies
the individual cells in a contingency table, each at the intersection of one row category and one column category (one category from each of the two variables)
marginal frequencies
the totals in the last row and last column of the contingency table
show the total for each category of one variable
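Joint and marginal frequencies can be illustrated with a small Python sketch. The two variables (sector and size) and the eight records are hypothetical.

```python
from collections import Counter

# Hypothetical paired observations: (sector, size) for eight companies.
records = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Small"), ("Tech", "Large"),
    ("Energy", "Large"), ("Energy", "Large"), ("Energy", "Small"), ("Energy", "Large"),
]

# Joint frequencies: one count per (row category, column category) cell.
joint = Counter(records)

# Marginal frequencies: totals per category of each variable.
row_totals = Counter(sector for sector, _ in records)
col_totals = Counter(size for _, size in records)
grand_total = len(records)

print(joint[("Energy", "Large")], row_totals["Energy"], col_totals["Large"], grand_total)
```

Dividing each joint frequency by `grand_total` would turn the table from absolute into relative frequencies.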
confusion matrix
a contingency table used to evaluate the performance of a classification (prediction) model
compares the classes predicted by the model with the actual classes
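A confusion matrix can be sketched as a contingency table of actual versus predicted classes. The labels below are hypothetical (1 = positive, 0 = negative), and the accuracy figure is one common summary added for illustration rather than something stated in the cards.

```python
from collections import Counter

# Hypothetical actual and predicted classes for ten observations.
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Each cell is a joint frequency of (actual class, predicted class).
matrix = Counter(zip(actual, predicted))

true_positives  = matrix[(1, 1)]
false_negatives = matrix[(1, 0)]
false_positives = matrix[(0, 1)]
true_negatives  = matrix[(0, 0)]

# Share of observations the model classified correctly.
accuracy = (true_positives + true_negatives) / len(actual)

print(true_positives, false_negatives, false_positives, true_negatives, accuracy)
```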
chi-square test of independence
tests whether the categorical variables in a contingency table are independent of each other
how to perform chi-square test of independence
- Use marginal frequencies in the contingency table to construct another table with expected values of the observations
- Compare the expected values to the actual values to derive the chi-square test statistic
- Compare the chi-square test statistic to the chi-square critical value for a given level of significance
chi-square test of independence
Test statistic > Critical value:
conclusion and implication
conclusion: Reject the claim of independence
implication: There is a significant association between the categorical variables
chi-square test of independence
Test statistic < Critical value:
conclusion and implication
conclusion: Do not reject the claim of independence
implication: There is no significant association between the categorical variables
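The steps and the decision rule of the chi-square test of independence can be sketched in plain Python. The observed counts are hypothetical. The sketch assumes the standard formulas: expected value = (row total × column total) / grand total, and test statistic = sum of (observed − expected)² / expected over all cells. The critical value 3.841 is the chi-square table value for a 5% significance level with 1 degree of freedom.

```python
# Hypothetical 2x2 contingency table of observed (actual) joint frequencies.
observed = [
    [30, 20],
    [20, 30],
]

# Step 1: use the marginal frequencies to build the table of expected values.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 2: compare expected and actual values to get the test statistic.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

# Step 3: compare the test statistic with the critical value.
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
critical_value = 3.841  # 5% significance, 1 degree of freedom (table value)
reject_independence = chi_square > critical_value

print(chi_square, degrees_of_freedom, reject_independence)
```

Here the test statistic exceeds the critical value, so the claim of independence is rejected for this made-up table.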
in a frequency distribution, the absolute frequency measure most likely:
A: represents the percentages of each unique value of the variable.
B: represents the actual number of observations counted for each unique value of the variable.
C: allows for comparisons between datasets with different numbers of total observations.
B: represents the actual number of observations counted for each unique value of the variable.
A histogram
presents the frequency distribution of numerical data as a chart of bars
each bar covers one of the non-overlapping intervals
the intervals and bar heights come from the frequency distribution table
A frequency polygon
similar to a histogram
rather than bars, it plots each interval’s midpoint on the x-axis against the interval’s absolute frequency on the y-axis
points are connected through line segments
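The points of a frequency polygon can be derived directly from a frequency distribution table. In this sketch the intervals and counts are hypothetical, and plotting is left out so the example runs without any charting library.

```python
# Hypothetical frequency distribution: (lower bound, upper bound, absolute frequency).
intervals = [(0, 10, 2), (10, 20, 4), (20, 30, 3), (30, 40, 1)]

# Each point is (interval midpoint, absolute frequency); a frequency polygon
# connects consecutive points with line segments.
points = [((lower + upper) / 2, frequency) for lower, upper, frequency in intervals]

print(points)
```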
A cumulative frequency distribution chart
a chart that illustrates the cumulative frequency distribution
shows how many observations lie below a certain value
most observations fall in the range where the curve is steepest
bar chart
more appropriate when handling the frequency distribution of categorical data
types of bar chart
Pareto chart
A grouped bar chart (or clustered bar chart)
stacked bar chart
A Pareto chart
sorts the categories by frequency in descending order along with a cumulative relative frequency line
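The data behind a Pareto chart can be sketched as follows: the bars come from the sorted counts and the line from the running cumulative relative frequency. The categories and counts are hypothetical.

```python
# Hypothetical absolute frequencies for a categorical variable.
absolute = {"Late delivery": 12, "Damaged item": 25, "Wrong item": 8, "Other": 5}

# Sort the categories by frequency in descending order (the bars).
ordered = sorted(absolute.items(), key=lambda item: item[1], reverse=True)

# Build the cumulative relative frequency for each category (the line).
total = sum(absolute.values())
running = 0
pareto = []
for category, count in ordered:
    running += count
    pareto.append((category, count, running / total))

for row in pareto:
    print(row)
```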
A grouped bar chart (or clustered bar chart)
may also be used to show joint frequencies when there are multiple categorical variables.
stacked bar chart
instead of placing the bars side by side as in a grouped (clustered) bar chart, the bars for the sub-categories are stacked on top of one another
tree-map
the frequency distribution of categorical data can be displayed using a tree-map
made up of differently colored rectangles whose areas represent the frequency of each category
provides a clear picture of the category that has the highest frequency.
however, it may become challenging to read when there are too many sub-categories