1.2 Summarizing Data Using Frequency Distributions Flashcards

1
Q

A frequency distribution

A

summarizes the values of a numerical variable into a few intervals

helpful when working with large data sets

2
Q

the types of frequency

A

Absolute frequency

Relative frequency

Cumulative relative frequency

3
Q

Absolute frequency

A

The actual number of observations in each interval

4
Q

Relative frequency

A

The absolute frequency divided by the total number of observations

5
Q

Cumulative relative frequency

A

The relative frequencies added up from the first interval to the current interval
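As a rough illustration of how the three frequency types relate, the sketch below starts from made-up absolute frequencies for four intervals (the numbers are assumptions, not data from these cards) and derives the relative and cumulative relative frequencies.

```python
# Hypothetical absolute frequencies for four intervals (illustrative data only).
absolute = [2, 5, 8, 5]
total = sum(absolute)  # total number of observations

# Relative frequency: absolute frequency divided by the total.
relative = [count / total for count in absolute]

# Cumulative relative frequency: running sum of the relative frequencies
# from the first interval up to the current one.
cumulative = []
running = 0.0
for r in relative:
    running += r
    cumulative.append(running)

for a, r, c in zip(absolute, relative, cumulative):
    print(f"{a:>3}  {r:.2f}  {c:.2f}")
```

The last cumulative relative frequency should come out at 1 (100%), which is a quick sanity check on any frequency table.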

6
Q

the steps to construct a frequency distribution

A
  1. Sort the data in ascending order.
  2. Calculate the range of the data.

Range = Maximum value − Minimum value

  3. Determine the desired number of intervals, k.
  4. Determine the interval width.

Interval width = Range/k

  5. Construct a table based on the minimum value, maximum value, the desired number of intervals (k), and the interval width.
  6. Assign the observations into the respective intervals. Each observation will only fall in one interval since the intervals will not overlap.

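The construction steps can be sketched in a few lines of Python. The data set and the choice of k = 4 are assumptions made up for illustration. The sketch also assumes each interval includes its lower bound and excludes its upper bound, except the last interval, which includes the maximum.

```python
# Hypothetical sample of 12 observations (illustrative data only).
data = [12, 7, 3, 18, 9, 15, 4, 11, 20, 6, 14, 10]
k = 4  # desired number of intervals (an assumption)

data_sorted = sorted(data)                    # sort in ascending order
low, high = data_sorted[0], data_sorted[-1]
data_range = high - low                       # range = max - min = 17
width = data_range / k                        # interval width = range / k = 4.25

# Assign each observation to exactly one interval.
counts = [0] * k
for x in data_sorted:
    # Interval index for x; the maximum value is placed in the last interval.
    i = min(int((x - low) / width), k - 1)
    counts[i] += 1

for i, count in enumerate(counts):
    lower = low + i * width
    upper = lower + width
    print(f"{lower:6.2f} to {upper:6.2f}: {count}")
```

In practice the interval width is often rounded up to a convenient number; the unrounded width is kept here so the arithmetic matches the formula exactly.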
7
Q

a contingency table

A

used to present the frequency distributions for multiple categorical variables simultaneously

the data in a contingency table can display absolute frequencies or relative frequencies

8
Q

joint frequencies

A

the individual cells in a contingency table, each resulting from the crossing point of two variables (one from the columns and one from the rows)

9
Q

marginal frequencies

A

last row and last column of the contingency table

shows the total per variable
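A small sketch of joint versus marginal frequencies, using a made-up 2x3 contingency table (the categories and counts are assumptions for illustration): the cells are the joint frequencies, and the row and column totals are the marginal frequencies.

```python
# Hypothetical contingency table of joint (absolute) frequencies.
# Rows: fund style (growth, value); columns: size (small, mid, large).
table = [
    [10, 20, 30],
    [20, 15, 5],
]

# Marginal frequencies: totals per row and per column.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Joint relative frequencies: each cell divided by the grand total.
joint_relative = [[cell / grand_total for cell in row] for row in table]

print("row totals:", row_totals)
print("column totals:", col_totals)
print("grand total:", grand_total)
```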

10
Q

confusion matrix

A

a contingency table used to evaluate the performance of a classification (prediction) model

compares the model's predicted classes with the actual classes

11
Q

chi-square test of independence

A

test relationships between different variables in a contingency table

12
Q

how to perform chi-square test of independence

A
  1. Use marginal frequencies in the contingency table to construct another table with expected values of the observations
  2. Compare the expected values to the actual values to derive the chi-square test statistic
  3. Compare the chi-square test statistic to the chi-square critical value for a given level of significance
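The three steps might look like this in plain Python. The observed counts are invented for illustration, and the 5% critical value for 2 degrees of freedom (5.991) is taken from a standard chi-square table rather than computed.

```python
# Hypothetical observed joint frequencies (2 rows x 3 columns), illustrative only.
observed = [
    [10, 20, 30],
    [20, 15, 5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 1: expected frequency = (row total * column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 2: chi-square statistic = sum of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 3: compare with the critical value.
# Degrees of freedom = (rows - 1) * (columns - 1) = 2; 5.991 is the
# chi-square critical value for 2 degrees of freedom at the 5% level.
df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991
reject_independence = chi_square > critical_value

print(f"chi-square = {chi_square:.3f}, df = {df}, reject = {reject_independence}")
```

With these made-up counts the statistic is well above the critical value, so the claim of independence would be rejected.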
13
Q

chi-square test of independence

Test statistic > Critical value:

conclusion and implication

A

conclusion: Reject the claim of independence

implication: There is a significant association between the categorical variables

14
Q

chi-square test of independence

Test statistic < Critical value:

conclusion and implication

A

conclusion: Do not reject the claim of independence

implication: There is no significant association between the categorical variables

15
Q

in a frequency distribution, the absolute frequency measure most likely:

A: represents the percentages of each unique value of the variable.

B: represents the actual number of observations counted for each unique value of the variable.

C: allows for comparisons between datasets with different numbers of total observations.

A

B: represents the actual number of observations counted for each unique value of the variable.

16
Q

A histogram

A

uses a chart to present the distribution of numerical data

non-overlapping intervals

from frequency distribution table

useful in presenting the frequency distribution of numerical data

17
Q

A frequency polygon

A

similar to a histogram

However, rather than using bars, a frequency polygon plots each interval’s midpoint on the x-axis and the absolute frequency on the y-axis

points are connected through line segments

18
Q

A cumulative frequency distribution chart

A

can also be used to illustrate the cumulative frequency distribution.

This shows how many observations lie below a certain value.

Most observations will lie on the steep slope

19
Q

bar chart

A

more appropriate when handling the frequency distribution of categorical data

20
Q

types of bar chart

A

pareto chart

A grouped bar chart (or clustered bar chart)

stacked bar chart

21
Q

A Pareto chart

A

sorts the categories by frequency in descending order along with a cumulative relative frequency line
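The data behind a Pareto chart can be prepared as in the sketch below: sort the categories by frequency in descending order, then accumulate the relative frequencies. The sector names and counts are hypothetical, and the plotting itself is left out.

```python
# Hypothetical category counts (illustrative data only).
counts = {"Tech": 40, "Energy": 10, "Financials": 30, "Utilities": 20}
total = sum(counts.values())

# Sort categories by frequency, highest first (the bars of the Pareto chart).
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Cumulative relative frequency for each category (the line of the chart).
pareto = []
running = 0
for name, count in ordered:
    running += count
    pareto.append((name, count, running / total))

for name, count, cum_rel in pareto:
    print(f"{name:<12}{count:>4}{cum_rel:>8.0%}")
```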

22
Q

A grouped bar chart (or clustered bar chart)

A

may also be used to show joint frequencies when there are multiple categorical variables.

23
Q

stacked bar chart

A

instead of placing the bars side by side as in a grouped (clustered) bar chart, the bars for each sub-category are stacked on top of one another

24
Q

tree-map

A

the frequency distribution of categorical data can be displayed using a tree-map

made up of different colored rectangles that have areas that represent the frequency of each category

provides a clear picture of the category that has the highest frequency.

–> However, it may become challenging to read when there are too many sub-categories

25
Q

A word cloud (a.k.a. tag cloud)

A

used to illustrate the frequency of textual data, which is a type of unstructured data.

It allows analysts to quickly spot the most frequent terms in a report/article

Words which appear more frequently have bigger sizes in the word cloud, and different colors may indicate different sentiments

26
Q

A line chart

A

useful for visualizing ordered observations

typically used to present the change in data over time

One of the most common applications of a line chart in the finance industry is showing stock price trend over time

A line chart may accommodate more than one set of data

27
Q

A bubble line chart

A

can be used to add a third variable into a two-dimensional line chart

28
Q

A scatter plot

A

describes the joint variation in two numerical variables

shows the correlation between the variables at a particular point in time, which can be none, linear, or non-linear

The degree of association is shown by the distance between the data points and the line of best fit.

29
Q

A scatter plot matrix

A

can be used to visualize pairwise associations for more than two variables.

30
Q

A heat map

A

used to visualize the frequency distribution of categorical data

It enhances the presentation of a contingency table by introducing a color spectrum based on the frequency distribution

31
Q

which type of chart should we use to explore or present a relationship between variables?

A

Scatter Plot

Scatter Plot Matrix

Heat Map

32
Q

which type of chart should we use to explore or present a comparison among categories?

A

Bar Chart

Tree-map

Heat map

33
Q

which type of chart should we use to explore or present a comparison over time?

A

Line Chart (two variables)

Bubble Line chart (three variables)

34
Q

which type of chart should we use to explore or present a distribution with numerical data?

A

Histogram

Frequency polygon

cumulative distribution chart

35
Q

which type of chart should we use to explore or present a distribution with categorical data?

A

Bar Chart

tree-map

Heat Map

36
Q

which type of chart should we use to explore or present a distribution with unstructured data?

A

Word cloud

37
Q

A bar chart is similar to a histogram except it offers an alternative presentation of the same data

is this true or false?

A

false

38
Q

the most common statistical measures

A

Measures of central tendency

39
Q

Measures of central tendency

A

indicate where the data are centered.

40
Q

The most common central tendency measures

A

arithmetic mean

median

mode

weighted mean

geometric mean

41
Q

population

A

all possible observations

extremely difficult to collect data on an entire population

42
Q

sample statistic

A

a measure computed from a sample, which is a subset of the population

can then be used to draw inferences about the corresponding population parameter

43
Q

the most common measure of where the data are centered

A

The arithmetic mean

44
Q

The arithmetic mean

A

the sum of the observations divided by the number of observations

basically, the easiest type of average

45
Q

The sample mean

A

the arithmetic mean for a sample

46
Q

The value of the mean is extremely sensitive to extreme values or outliers

true or false

A

true

47
Q

three ways to deal with outliers

A

No adjustments: This is appropriate if all values are equally important and meaningful.

Remove all outliers: A trimmed mean

Replace outliers with another value: A winsorized mean

48
Q

a trimmed mean

A

removing all outliers

calculated by discarding a certain percentage of the highest and lowest values

For example, with a sample of 100 observations, a 2% trimmed mean would be the arithmetic mean without the highest value (top 1%) and the lowest value (bottom 1%).
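A minimal sketch of a trimmed mean, assuming the stated percentage is split evenly between the lowest and highest observations (as in the 2% example above). The function name and the sample returns are made up for illustration.

```python
def trimmed_mean(values, trim_pct):
    """Arithmetic mean after discarding trim_pct percent of the observations,
    half from the bottom and half from the top."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim_pct / 100 / 2)  # observations dropped per tail
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

# Hypothetical returns (in %) with one extreme value on each side.
returns = [-30, 2, 3, 4, 5, 5, 6, 7, 8, 60]

plain_mean = sum(returns) / len(returns)   # pulled up by the 60 outlier
trimmed = trimmed_mean(returns, 20)        # drops the lowest and highest value

print(plain_mean, trimmed)
```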

49
Q

a winsorized mean

A

adjusts any outliers’ values to either an upper or lower limit.

No observations are excluded from the calculation

Replaces outliers with another value
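A winsorized mean can be sketched the same way. Here the assumption is that `pct` is the total share of observations to adjust, split evenly across the two tails, and that the limits are the nearest values that are not adjusted. Naming conventions for winsorized means vary, so treat the parameter as hypothetical.

```python
def winsorized_mean(values, pct):
    """Mean after replacing the lowest and highest pct/2 percent of the
    observations with the nearest remaining value. Nothing is dropped."""
    ordered = sorted(values)
    n = len(ordered)
    cut = int(n * pct / 100 / 2)                    # observations adjusted per tail
    lower, upper = ordered[cut], ordered[n - cut - 1]
    clipped = [min(max(v, lower), upper) for v in ordered]
    return sum(clipped) / n                         # still divides by all n

# Hypothetical returns (in %) with one extreme value on each side.
returns = [-30, 2, 3, 4, 5, 5, 6, 7, 8, 60]
winsorized = winsorized_mean(returns, 20)  # -30 becomes 2, 60 becomes 8

print(winsorized)
```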

50
Q

The median

A

the middle item of a sorted list

does not use all information about the observations because it only focuses on their relative position

more complicated to calculate (less mathematically tractable) than the mean

51
Q

if n is odd, what is the median

A

the ((n+1)/2)th item in the sorted list

52
Q

if n is even, what is the median

A

the mean of the (n/2)th and ((n+2)/2)th items in the sorted list
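Both median cases (odd and even n) can be checked with a short function. Positions in the rules are 1-based, so the sketch subtracts 1 to get Python's 0-based indexes. The sample values are made up.

```python
def median(values):
    """Middle item of the sorted list (odd n), or the mean of the two
    middle items (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th item, converted to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the (n / 2)-th and ((n + 2) / 2)-th items.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([5, 1, 9]))      # odd number of observations
print(median([5, 1, 9, 3]))   # even number of observations
```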

53
Q

The mode

A

the most frequently occurring value in a distribution

55
Q

Some distributions have more than one mode, while others have none

true or false

A

true

56
Q

A distribution with just one mode

A

unimodal

57
Q

A distribution with two modes

A

bimodal

58
Q

modal intervals

A

mode for data grouped in intervals

it would be the interval with the highest bar

59
Q

The weighted mean formula

A

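Weighted mean = w1*X1 + w2*X2 + ... + wn*Xn

where wi is the weight on observation Xi and the weights sum to 1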
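As a sketch, a weighted mean multiplies each observation by its weight and sums the results. The portfolio weights and asset returns below are hypothetical, and the function assumes the weights sum to 1.

```python
def weighted_mean(values, weights):
    """Sum of each value times its weight; weights are assumed to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical three-asset portfolio: returns in % and portfolio weights.
asset_returns = [10.0, 4.0, -2.0]
portfolio_weights = [0.5, 0.3, 0.2]

portfolio_return = weighted_mean(asset_returns, portfolio_weights)
print(f"{portfolio_return:.2f}")  # 0.5*10 + 0.3*4 + 0.2*(-2)
```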

60
Q

uses of weighted average in finance

A

used to calculate past portfolio or index returns

They can also be used to calculate future expected returns by weighting various scenarios

61
Q

The geometric mean

A

used to average rates over time or compute growth rates

It is often used to average portfolio returns from different time periods

The geometric mean is always less than or equal to the arithmetic mean

62
Q

formula for geometric mean

A

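Geometric mean return = [(1 + R1) * (1 + R2) * ... * (1 + Rn)]^(1/n) - 1

i.e., the nth root of the product of (1 + return) over all n periods, minus 1 (the 1 is subtracted after taking the root)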
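A minimal sketch of the geometric mean return: multiply (1 + return) across the periods, take the nth root, then subtract 1. The three period returns are invented, and the function assumes every return is greater than -100% so the root is defined.

```python
import math

def geometric_mean_return(returns):
    """Compound the period returns, take the nth root, then subtract 1."""
    growth = math.prod(1 + r for r in returns)   # total growth factor
    return growth ** (1 / len(returns)) - 1

# Hypothetical returns for three periods, as decimals.
period_returns = [0.10, -0.05, 0.20]

geometric = geometric_mean_return(period_returns)
arithmetic = sum(period_returns) / len(period_returns)

print(f"geometric = {geometric:.4f}, arithmetic = {arithmetic:.4f}")
```

Compounding the geometric mean over the three periods reproduces the total growth factor, which is why it is the appropriate average for multi-period returns.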

63
Q

The harmonic mean

A

not as commonly used
The observation’s weight is inversely proportional to its magnitude

Smaller weights are assigned to larger observations

This property reduces the sensitivity of the harmonic mean to extremely large outliers

64
Q

harmonic mean formula

A

Harmonic mean = n / (1/X1 + 1/X2 + ... + 1/Xn)

65
Q

when is the harmonic mean useful

A

when the data consists of ratios (e.g., P/Es)

It would also be appropriate if the analyst wants the average price paid for a security when investing the same dollar amount for several time periods (also known as the cost averaging technique)
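The cost averaging case can be checked numerically. The sketch below assumes the same dollar amount is invested at two made-up prices and compares the average cost per share with the harmonic mean of the prices.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the observations."""
    return len(values) / sum(1 / x for x in values)

# Hypothetical purchase prices in two periods, same amount invested each time.
prices = [10.0, 20.0]
amount_per_period = 1000.0

shares_bought = sum(amount_per_period / p for p in prices)      # 100 + 50
average_cost = amount_per_period * len(prices) / shares_bought  # total paid / shares

print(f"{average_cost:.4f} vs {harmonic_mean(prices):.4f}")
```

The two figures agree, and both sit below the arithmetic mean of the prices (15), because more shares are bought when the price is low.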

66
Q

is the harmonic mean always more or less than the geometric mean?

what about the arithmetic mean?

A

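always less than or equal to the geometric mean

always less than or equal to the arithmetic mean

harmonic mean ≤ geometric mean ≤ arithmetic mean (for positive observations; equal only when all observations are the same)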

67
Q

which Mean to use if we have a sample and we want to include all values including outliers?

A

arithmetic mean

68
Q

which Mean to use if we have a sample and we want to do compounding?

A

geometric mean

69
Q

the quantile

A

a value at or below which a stated fraction of the data is found, once the observations are arranged in ascending order

70
Q

most common quantiles

A

four quartiles

five quintiles

ten deciles

one hundred percentiles

71
Q

The y th percentile

A

the value at or below which y percent of the observations lie

For example, the 90th percentile score (P90) on an exam is the number that separates the top 10% scores from the bottom 90%

72
Q

The interquartile range (IQR)

A

the difference between the third quartile and the first quartile

IQR = Q3 - Q1

73
Q

how is the location of the y th percentile (Ly) calculated in a list of n observations ranked in ascending order (lowest to highest)?

A

Ly = (n + 1) * y/100

74
Q

how do you find the percentile itself (Py)?

A

take a weighted average of the two observations on either side of Ly, using the decimal part of Ly as the weight on the higher observation (Xi is the ith observation in ascending order)

if Ly = 14.07

Py = X14 * (1 - 0.07) + X15 * 0.07

if Ly = 16.63

Py = X16 * (1 - 0.63) + X17 * 0.63
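The location rule and the interpolation step can be combined into one function. This is a sketch under stated assumptions: the data are made up, and clamping to the smallest or largest observation when the location falls outside 1..n is my own choice, since the cards do not cover that case.

```python
def percentile(values, y):
    """Return the y-th percentile using the (n + 1) * y / 100 location rule."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * y / 100      # Ly, a 1-based position in the sorted list
    position = int(location)          # whole part of Ly
    fraction = location - position    # decimal part of Ly
    # Assumption: clamp to the smallest/largest observation when Ly falls
    # outside the range 1..n.
    if position < 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = ordered[position - 1]     # observation at that position (1-based)
    upper = ordered[position]         # next observation up
    return lower * (1 - fraction) + upper * fraction

# Hypothetical exam scores (illustrative data only).
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]
q1 = percentile(scores, 25)   # Ly = 2.5, halfway between 20 and 30
q3 = percentile(scores, 75)   # Ly = 7.5, halfway between 70 and 80
iqr = q3 - q1                 # interquartile range = Q3 - Q1

print(q1, q3, iqr)
```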

75
Q

The dispersion of data across quartiles can be visualized how?

A

using a box and whisker plot

76
Q

box and whisker plot

A

The “box” has a height equal to the interquartile range and is connected by two “whiskers.”

The two whiskers are bounded by the “fences,”

the fences are the highest and the lowest values of the observations

77
Q

dispersion (or variability) around the mean

A

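the variability of the observations around the central tendency; if the mean return addresses reward, dispersion addresses risk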

78
Q

The most common measures of absolute dispersion

A

range

mean absolute deviation

variance

standard deviation

79
Q

The range

A

the difference between the maximum and minimum values

limited as a dispersion measure because it uses only the highest and lowest values

80
Q

The mean absolute deviation (MAD)

A

uses all the observations in the sample, which makes it better than the range

81
Q

MAD formula

A

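MAD = (|X1 - mean| + |X2 - mean| + ... + |Xn - mean|) / n

i.e., the sum of the absolute deviations of each observation from the sample mean, divided by the number of observations n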
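As a quick check of the calculation, the sketch below computes the mean absolute deviation as the average of the absolute distances between each observation and the sample mean. The four observations are made up.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical observations: mean is 5, absolute deviations are 3, 1, 1, 3.
sample = [2, 4, 6, 8]
mad = mean_absolute_deviation(sample)

print(mad)
```

Taking absolute values matters: without them the deviations around the mean would sum to zero.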