Part 2. Organising, Visualising and Describing Data Flashcards

1
Q

Data

A

A collection of numbers, characters, words and text, as well as audio and video, in a raw or organised format to represent facts or information.

2
Q

Data Classifications

A
  1. Numerical vs Categorical Data
  2. Cross-sectional vs Time Series vs Panel Data
  3. Structured vs Unstructured Data
3
Q

Numerical/Quantitative Data

A

Values that represent measured or counted quantities as a number.

These data can be split into two types:

  • Continuous Data
  • Discrete Data
4
Q

Continuous Data

A

Data that can be measured and can take on any numerical value in a specified range of values.

Examples:

  1. The future value of a lump-sum investment, which measures the amount of money to be received after a certain period of time at a given interest rate.
  2. The price return of a stock, which measures the price change over a given period in percentage terms.
5
Q

Discrete Data

A

Numerical values that result from a counting process, in which the data are limited to a finite number of values.

Example:

  1. The frequency of discrete compounding, m, counts the number of times that interest is accrued and paid out in a given year.
    e.g. m = 12 means a monthly frequency
6
Q

Categorical/Qualitative Data

A

Values that describe a quality or characteristic of a group of observations, used as labels to divide a data set into groups to summarise and visualise.

Example:

  1. Bankrupt vs. Not Bankrupt
  2. Dividends increased vs. No Dividend Action
7
Q

Nominal Data

A

Categorical values that are not amenable to being organised in a logical order/rank.

Example:

  1. Classification of publicly listed stocks into 11 sectors, defined by the Global Industry Classification Standard (GICS).
  2. Text labels, e.g. Sector
  3. Numerical labels, e.g. GICS Code
8
Q

Ordinal Data

A

Categorical values that can be logically ordered or ranked. Numbers may be used to identify the categories.

Example:

  1. S&P star ratings for investment funds, in which 1 star represents a group of funds judged to have the worst performance/quality, and 2, 3, 4 and 5 stars represent progressively better groups.
  2. Ranking growth-oriented investment funds based on their 5-year cumulative returns, e.g. assigning 1 to the top-performing 10% of funds.
9
Q

Data classification based on data collection:

A
  1. Cross-sectional
  2. Time series
  3. Panel
10
Q

Variable (field, attribute, feature)

A

A characteristic/quantity that can be measured, counted or categorised, and is subject to change.

Example:

  1. Stock price
  2. Market capitalisation
  3. Dividend and dividend yield
  4. Earnings per share (EPS)
  5. Price-to-earnings ratio (P/E)
11
Q

Observation

A

The value of a specific variable collected at a point in time or over a specified period of time.

Example:

  1. DEF Inc. recorded EPS of $7.50; this value represented a 15% annual increase.
12
Q

Cross-sectional Data

A

A list of the observations of a specific variable from multiple observational units at a given point in time.

These observational units can be individuals, groups, companies, trading markets, regions, etc.

Example:

  • January inflation rates (i.e. the variable) for each of the euro area countries (i.e. the observational units) in the EU for a given year constitute cross-sectional data.
13
Q

Time-series Data

A

A sequence of observations of a specific variable for a single observational unit, collected over time at discrete and typically equally spaced intervals,

e.g. daily, weekly, monthly, quarterly or annually.

Example:

  • The daily closing prices (i.e. the variable) of a particular stock recorded for a given month constitute time-series data.
14
Q

Panel Data

A

A mix of time-series and cross-sectional data that is frequently used in financial analysis and modeling.

Consists of observations through time on one or more variables for multiple observational units.

Example:

  1. Quarterly earnings per share in euros of three eurozone companies in a given year.

The data for each company from Q1 to Q4 form a time series.
The data across Companies A to C in a given quarter form a cross-section.
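
A minimal Python sketch of the two views inside a panel. The company names and EPS figures are invented for illustration and do not come from the source.

```python
# Hypothetical quarterly EPS (in euros) for three companies: illustrative values only.
panel = {
    "Company A": {"Q1": 1.0, "Q2": 1.1, "Q3": 1.2, "Q4": 1.3},
    "Company B": {"Q1": 0.5, "Q2": 0.4, "Q3": 0.6, "Q4": 0.7},
    "Company C": {"Q1": 2.0, "Q2": 2.1, "Q3": 1.9, "Q4": 2.2},
}
quarters = ("Q1", "Q2", "Q3", "Q4")

# Time series: one observational unit followed through time.
time_series_a = [panel["Company A"][q] for q in quarters]

# Cross-section: all observational units at a single point in time.
cross_section_q1 = {company: eps["Q1"] for company, eps in panel.items()}

print(time_series_a)
print(cross_section_q1)
```

Reading along one company gives the time series; reading across companies for one quarter gives the cross-section.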

15
Q

Structured Data

A

Highly organised in a pre-defined manner, usually with repeating patterns.

Typical form:

  • One-dimensional arrays (time series of a single variable)
  • Two-dimensional data tables (each column represents a variable or observation unit, and each row contains a set of values for the same columns).
16
Q

Structured Data Types

A
  1. Market data - data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes.
  2. Fundamental data - data contained in financial statements, such as earnings per share, price to earnings ratio, dividend yield and return on equity.
  3. Analytical data - data derived from analytics such as cash flow projections, or forecasted earnings growth.
17
Q

Unstructured Data

A

Data that does not follow any conventionally organised forms.

Usually, financial models are able to take only structured data as inputs, so unstructured data must be transformed into a structured format that models can process.

Examples:

  1. Text - financial news, social media posts
  2. Audio/Video - management earnings calls, presentations to analysts
18
Q

Unstructured Data Types

A
  1. Produced by individuals (e.g. via social media posts, web searches, etc.)
  2. Generated by business processes (e.g. credit card transactions, corporate regulatory filings, etc.)
  3. Generated by sensors (e.g. satellite imagery, foot traffic by mobile devices, etc.)
19
Q

Raw Data

A

Data available in their original form as collected. Such data typically cannot be used by humans or computers to directly extract information and insights.

Data can be organized for quantitative analysis using:

  1. One-dimensional arrays
  2. Two-dimensional rectangular arrays
20
Q

One-dimensional array

A

Simplest form for representing a collection of data of the same data type, suitable for representing a single variable.

Example:

  1. Daily closing price of ABC Inc. stock, after the company went public.
    - Closing prices are time-series data collected at daily intervals.
    - Plotting the data against time shows whether they demonstrate an increasing or decreasing trend and whether the time series repeats certain patterns in a systematic way over time. The array also lends itself to a summary of the central tendency and spread (variation) of the data distribution.
21
Q

Two-dimensional rectangular arrays (data table)

A

Comprised of columns and rows to hold multiple variables and multiple observations, respectively.

When a data table is used to organize the data of a single observational unit, each column represents a different variable of that unit, and each row holds an observation of the different variables; successive rows represent the observations for successive time periods.

In that case, the observations of each variable form a time-series sequence that is sorted in either ascending or descending time order.

22
Q

Frequency Distribution

A

A tabular display of data constructed either by counting the observations of a variable by distinct values or groups, or by tallying the values of a numerical variable into a set of numerically ordered bins.

Helps the analysis of large amounts of numerical data, as it requires creating non-overlapping bins (intervals or buckets), and counts observations falling into each bin.

23
Q

Constructing a frequency distribution:

A
  1. Count the number of observations for each unique value of the variable.
  2. Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
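
The two steps above can be sketched in Python with `collections.Counter`. The sector labels are a made-up sample, not data from the source.

```python
from collections import Counter

# Hypothetical sector labels for ten stocks (illustrative data only).
sectors = ["Energy", "Financials", "Energy", "Utilities", "Financials",
           "Energy", "Health Care", "Financials", "Energy", "Utilities"]

# Step 1: count the observations for each unique value of the variable.
counts = Counter(sectors)

# Step 2: list each unique value with its count, sorted by count (descending).
table = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for sector, count in table:
    print(f"{sector:<12} {count}")
```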
24
Q

Absolute Frequency

A

The actual number of observations counted for each unique value of the variable (e.g. each sector).

25
Q

Relative Frequency

A

Calculated as the absolute frequency of each unique value of the variable divided by the total number of observations.

This provides a normalised measure of the distribution of the data, allowing comparisons between datasets with different numbers of total observations.

26
Q

Frequency Distribution for Numerical Data

A
  1. Sort the data in ascending order.
  2. Calculate the range of data, defined as Range = Maximum Value - Minimum Value.
  3. Decide on the number of bins (k) in the frequency distribution.
  4. Determine bin width as Range/k.
  5. Determine the first bin by adding the bin width to the minimum value. Then determine the remaining bins by successively adding the bin width to the prior bin's end point, stopping after reaching a bin that includes the maximum value.
  6. Determine the number of observations falling into each bin by counting the observations whose values are equal to or exceed the bin's minimum value yet are less than the bin's maximum value. The exception is the last bin, whose maximum value equals the maximum value of the data, so the observation with the maximum value is included in that bin's count.
  7. Construct a table of the bins, listed from smallest to largest, that shows the number of observations falling into each bin.
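
The steps above can be sketched in Python as follows. The function name `frequency_distribution` and the sample values are illustrative assumptions; the sketch assumes equal-width bins and data with at least two distinct values (otherwise the bin width would be zero).

```python
def frequency_distribution(values, k):
    """Tally numerical data into k equal-width bins."""
    data = sorted(values)                       # step 1: sort ascending
    minimum, maximum = data[0], data[-1]
    width = (maximum - minimum) / k             # steps 2-4: range / k
    counts = [0] * k
    for x in data:
        # Bins are [lower, upper), except the last bin, which also
        # includes the maximum value (step 6).
        index = min(int((x - minimum) / width), k - 1)
        counts[index] += 1
    # Step 5: bin end points obtained by repeatedly adding the bin width.
    bins = [(minimum + i * width, minimum + (i + 1) * width) for i in range(k)]
    return list(zip(bins, counts))              # step 7: table of bins


# Hypothetical observations.
result = frequency_distribution([1, 2, 2, 3, 5, 6, 8, 9, 9, 11], k=5)
for (lower, upper), count in result:
    print(f"{lower:>4.1f} to {upper:>4.1f}: {count}")
```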
27
Q

Cumulative absolute frequency

A

Adds up the absolute frequencies as we move from the first bin to the last bin.

28
Q

Cumulative relative frequency

A

A sequence of partial sums of the relative frequencies.

For the last bin, the cumulative absolute frequency will equal the number of observations in the dataset (1,258 in the example dataset), and the cumulative relative frequency will equal 100%.
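
A short sketch of both cumulative measures as running sums, using `itertools.accumulate`. The bin counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical absolute frequencies for five bins, smallest bin first.
absolute = [3, 1, 2, 1, 3]
total = sum(absolute)

# Running sum of absolute frequencies from the first bin to the last.
cumulative_absolute = list(accumulate(absolute))

# Partial sums of the relative frequencies.
cumulative_relative = [c / total for c in cumulative_absolute]

print(cumulative_absolute)
print(cumulative_relative)
```

The last entries equal the number of observations and 1.0 (100%) respectively.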

29
Q

Contingency table

A

A tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables.

A table with R levels of one variable in rows and C levels of the other variable in columns is referred to as an R x C table.

30
Q

Joint Frequencies

A

The counts obtained in a contingency table by joining one variable from the rows (e.g. sector) with the other variable from the columns (e.g. market cap).

31
Q

Marginal Frequencies

A

The sums obtained when the joint frequencies are added across rows and across columns.
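
A small sketch of joint and marginal frequencies for two categorical variables. The (sector, market cap) pairs are invented for illustration.

```python
from collections import Counter

# Hypothetical (sector, market cap) labels, one pair per stock.
stocks = [("Energy", "Small"), ("Energy", "Large"), ("Financials", "Large"),
          ("Energy", "Small"), ("Financials", "Mid"), ("Financials", "Large")]

# Joint frequencies: one count per (row level, column level) combination.
joint = Counter(stocks)

# Marginal frequencies: joint frequencies summed across each row and each column.
row_marginals = Counter(sector for sector, _ in stocks)
column_marginals = Counter(cap for _, cap in stocks)

print(joint[("Energy", "Small")], row_marginals["Energy"], column_marginals["Large"])
```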

32
Q

Applications of contingency tables

A
  1. Confusion Matrix
  2. Chi-square test for independence

33
Q

Confusion Matrix

A

Evaluates the performance of a classification model.

e.g. a model classifying companies into two groups: those that default on their bond payments and those that do not default.

The matrix for displaying the model's results will be a 2 x 2 table, showing the frequency of actual defaults versus the model's predicted frequency of defaults.
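
A sketch of the 2 x 2 table for the default example. The actual and predicted labels are made up (1 = default, 0 = no default), and accuracy is shown only as one common summary of the matrix, not as something the card prescribes.

```python
from collections import Counter

# Hypothetical labels: 1 = default, 0 = no default.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) combination.
matrix = Counter(zip(actual, predicted))

true_positives  = matrix[(1, 1)]   # defaults correctly predicted
false_negatives = matrix[(1, 0)]   # defaults the model missed
false_positives = matrix[(0, 1)]   # non-defaults flagged as defaults
true_negatives  = matrix[(0, 0)]   # non-defaults correctly predicted

accuracy = (true_positives + true_negatives) / len(actual)
print(true_positives, false_negatives, false_positives, true_negatives, accuracy)
```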

34
Q

Chi-square test for independence

A

Tests for a potential association between categorical variables.

The procedure uses the marginal frequencies in the contingency table to construct a table of expected values of the observations.

The actual and expected values are used to derive the chi-square test statistic.

The test statistic is then compared with a value from the chi-square distribution for a given level of significance.

If the test statistic is greater than the chi-square distribution value, there is evidence to reject the claim of independence, implying a significant association between the categorical variables.
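
The procedure can be sketched for a 2 x 2 table as follows. The observed counts are hypothetical. The expected-value rule (row total x column total / n), the degrees of freedom (R - 1)(C - 1) and the 5% critical value of 3.841 for 1 degree of freedom are standard results that the card does not spell out.

```python
# Hypothetical observed joint frequencies (rows x columns).
observed = [[10, 20],
            [30, 40]]

row_totals = [sum(row) for row in observed]            # marginal frequencies
column_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[r * c / n for c in column_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

critical_value = 3.841   # chi-square, 1 degree of freedom, 5% significance
reject_independence = chi_square > critical_value
print(round(chi_square, 4), reject_independence)
```

Here the statistic is well below the critical value, so independence is not rejected for this sample.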

35
Q

Visualization

A

The presentation of data in a pictorial/graphical format for the purpose of increasing understanding and gaining insights into the data.

36
Q

Histogram

A

A chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency of each bin/interval in the distribution.

y axis - the absolute frequency, or the relative frequency in percentage terms.

x axis - the bins of the variable.

Absolute frequency histogram = answers the question of how many items are in each bin.

Relative frequency histogram = gives the proportion or percentage of the total observations in each bin.

37
Q

Frequency Polygon

A

A chart that plots the mid-point of each bin (e.g. each return bin) on the x-axis and the absolute frequency for that bin on the y-axis, with neighbouring points connected by straight lines.

This can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve.

38
Q

Cumulative Frequency Chart

A

A chart that can plot either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval.

This allows us to see the number or percentage of the observations that lie below a certain value.

Curve flattens = the frequencies of observations in those bins are small.
Curve steepens = those bins contain a large share of the observations.

39
Q

Bar Chart

A

A frequency distribution of categorical data is plotted where each bar represents a distinct category, with the bar’s height proportional to the frequency of the corresponding category.

Vertical bar chart:

  • the y axis represents the absolute frequency/relative frequency
  • the x axis represents the mutually exclusive categories to be compared, rather than bins that group numerical data.
40
Q

Pareto Chart

A

A bar chart in which the categories are ordered by frequency in descending order and which includes a line displaying cumulative relative frequency.

The chart is used to highlight dominant categories or the most important groups.

41
Q

Grouped Bar Chart (Clustered Bar Chart)

A

Presents the frequency distribution of 2 categorical variables to show their joint frequencies.

The bars within each cluster should be colored differently to distinguish between them, but color schemes for subgroups must be identical across sector clusters.

The bars in each sector cluster must always be placed in the same order throughout the chart.

42
Q

Stacked Bar Chart

A

An alternative form for presenting the joint frequency distribution of two categorical variables.

Each subsection of the bar is shown in a different color to represent the contribution of each subgroup.

The overall height of the stacked bar represents the marginal frequency for the category.

43
Q

Tree-Map

A

A graphical tool for presenting categorical data, consisting of a set of colored rectangles that represent distinct groups, where the area of each rectangle is proportional to the value of the corresponding group.

This can represent data with additional dimensions by displaying a nest of rectangles. To display joint frequencies of sub-groups, we split the rectangle into sub-sections where the area of each nested rectangle would be proportional to the number of stocks in each market capitalization sub-group.

44
Q

Word Cloud (Tag cloud)

A

A visual device for representing textual data, consisting of words extracted from a source of textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text.

This format allows us to quickly perceive the most frequent terms in the given text, which gives information about the nature/sentiment of the text.

Words conveying a different sentiment may be presented in different colors, e.g. positive words in green and negative words in red.

45
Q

Line chart

A

A type of chart used to visualise ordered observations, often used to display the change of data series over time.

This shows changes in the data and underlying trends in a clear and concise way, helping to understand the current data and to forecast the data series.

It is especially helpful for making comparisons, with each distinct colour/pattern line representing each group of data.

46
Q

Bubble Line Chart

A

Shows multidimensional data in one chart when an observational unit has more than 2 features of interest.

Data points are replaced with varying-sized bubbles to represent the 3rd dimension of the data, and the bubbles can be color-coded to represent more information.

e.g. each marker representing a revenue data point is replaced by a circular bubble with a size proportional to the magnitude of EPS in the corresponding quarter.

The bubbles are colored in a binary scheme with green representing profits and red representing losses.

3 elements:

  • changes for revenue
  • changes for EPS
  • EPS represents profit or loss
47
Q

Scatter Plot

A

A type of graph for visualising the joint variation in two numerical variables, useful for understanding potential relationships between variables.

y axis = one variable
x axis = other variable

If the data points seem to align along a straight line, then a significant relationship may exist between the variables (positive or negative association).

The strength of the association depends on how closely the data points are clustered around the line, with a tight cluster signaling a stronger relationship.

When a relationship between the variables is apparent, the scatter plot can help spot extreme values (i.e. outliers).

48
Q

Scatter Plot Matrix

A

A useful tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual.

This contains each combination of bivariate scatter plots (i.e. S&P 500 vs each sector, IT vs utilities, IT vs financials, financial vs utilities), and univariate frequency distribution histograms for each variable plotted along the diagonal.

Despite their usefulness, these should not be considered substitutes for robust statistical tests; they work best alongside such tests.

49
Q

Heat Map

A

A type of graphic that organises and summarizes data in a tabular format, and represents them using a colour spectrum.

Cells in the chart are colour-coded to differentiate high values from low values, as defined by the colour spectrum shown beside the chart.

Also used for visualising the degree of correlation among different variables.

50
Q

4 pitfalls that lead to misleading graphs

A
  1. An improper chart type is selected to present the data, which hinders accurate interpretation.
  2. Data are selectively plotted in favor of the conclusion an analyst intends to draw, e.g. presenting data over a short time frame can mistakenly point to a non-existent trend.
  3. Data are improperly plotted in a truncated graph whose y-axis does not start at zero, which creates a false impression of significant differences when the differences are actually small.
  4. Improper scaling of the axes, e.g. a line chart with a higher than necessary maximum on the y-axis compresses the graph into an area close to the x-axis, so the line appears less steep and less volatile than if properly plotted.
51
Q

Measure of central tendency

A

Specifies where the data are centered. These measures are widely used because they can be computed and applied relatively easily.

Most common:

  • the arithmetic mean
  • the median
  • the mode
  • the weighted mean
  • the geometric mean
  • the harmonic mean
52
Q

Measures of location

A

Includes measures of central tendency and other measures that illustrate the location or distribution of data.

Most common:

  • quartiles
  • quintiles
  • deciles
  • percentiles
53
Q

Statistic

A

A summary measure of a set of observations. Descriptive statistics summarise the central tendency and the spread (variation) in the distribution of data.

54
Q

Parameter

A

A statistic that summarises the set of all possible observations of a population.

55
Q

Sample statistic

A

A statistic that summarises a set of observations that is a subset of the population.

56
Q

Arithmetic Mean

A

The sum of the values of the observations divided by the number of observations.

57
Q

Sample Mean

A

The arithmetic mean/average computed for a sample. Because we often cannot observe every member of the population, we instead observe a subset, or sample, of the population.

58
Q

Deviations from arithmetic mean

A

The distance between each outcome and the mean.

This indicates risk, forming the foundation for complex concepts of variance, skewness, and kurtosis.

59
Q

Outlier

A

An extreme value that may represent a rare but meaningful value in the population, or may reflect an error in recording the value of an observation, or an observation generated from a different population from the one producing the other observations in the sample.

60
Q

How to identify an outlier

A
  1. Examine the data, either by inspecting the sample observations if the sample is not too large or by using visualisation approaches.
  2. Once we are comfortable that we have identified and eliminated errors, we can address what to do with the extreme values in the sample.
  3. Consider the possibility of transforming the variable or of selecting another variable that achieves the same purpose,
    e.g. alternative model specifications, variable transformation
61
Q

How to deal with extreme values/outliers:

A
  1. Do nothing; use data without any adjustment - value could be legitimate and present meaningful information.
  2. Delete all the outliers - this leads to the trimmed mean.
  3. Replace the outliers with another value - this leads to the winsorised mean.
62
Q

Trimmed Mean

A

A measure of central tendency computed by excluding a stated small percentage of the lowest and highest values, and then computing an arithmetic mean of the remaining values.

e.g. a 5% trimmed mean discards the lowest 2.5% and highest 2.5% of values and computes the mean of the remaining 95% of values.
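
A minimal sketch of a trimmed mean. The function name `trimmed_mean`, the rounding-down of the number of trimmed observations and the sample data are assumptions made for illustration.

```python
def trimmed_mean(values, trim_fraction):
    """Drop trim_fraction / 2 of the observations from each tail, then average."""
    data = sorted(values)
    cut = int(len(data) * trim_fraction / 2)   # observations removed per tail
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)


# Hypothetical sample: 1 to 19 plus one extreme value.
sample = list(range(1, 20)) + [1000]

plain_mean = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.10)   # 10% trimmed: 5% removed from each tail
print(plain_mean, trimmed)
```

The single extreme value pulls the plain mean far above the bulk of the data, while the trimmed mean stays near the centre.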

63
Q

Winsorized Mean

A

A measure of central tendency calculated by assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value, and then computing a mean from the restated data.

e.g. a 95% winsorized mean sets the bottom 2.5% of values equal to the value at or below which 2.5% of all values lie (the 2.5th percentile), and the top 2.5% of values equal to the value at or below which 97.5% of all values lie (the 97.5th percentile).
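
A simplified sketch of a winsorized mean. Instead of interpolated percentiles, it replaces the lowest and highest observations with the nearest remaining observation, which is an assumption made to keep the example short. The function name and data are hypothetical.

```python
def winsorized_mean(values, tail_fraction):
    """Replace tail_fraction of the observations in each tail, then average."""
    data = sorted(values)
    cut = int(len(data) * tail_fraction)        # observations replaced per tail
    if cut:
        low, high = data[cut], data[-cut - 1]   # nearest values kept unchanged
        data = [low] * cut + data[cut:len(data) - cut] + [high] * cut
    return sum(data) / len(data)


# Hypothetical sample: 1 to 19 plus one extreme value.
sample = list(range(1, 20)) + [1000]

winsorized = winsorized_mean(sample, 0.05)   # 5% in each tail: a 90% winsorized mean
print(winsorized)
```

Unlike trimming, winsorizing keeps the sample size unchanged: the extreme values are restated rather than deleted.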

64
Q

Median

A

The value of the middle item of a set of items that has been sorted into ascending or descending order.

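For an odd-sized sample of n items: the value at position (n + 1)/2.
For an even-sized sample of n items: the mean of the values at positions n/2 and (n + 2)/2.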
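
A short sketch of the odd and even cases. The function is written out by hand for clarity; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    data = sorted(values)
    n = len(data)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the item at position (n + 1) / 2, counting from 1.
        return data[middle]
    # Even n: the mean of the items at positions n / 2 and (n + 2) / 2.
    return (data[middle - 1] + data[middle]) / 2


print(median([3, 1, 2]))       # odd-sized sample
print(median([4, 1, 3, 2]))    # even-sized sample
```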

65
Q

Pros and Cons of Median

A
  • Pro: it is affected less by outliers than the mean, so it is useful in describing data that follow a distribution that is not symmetric, e.g. revenue.
  • Con: it does not use all the information about the size of the observations; it focuses only on the relative position of the ranked observations.
  • Con: it is less mathematically tractable than the mean, because the observations must be ranked from smallest to largest and then one of two calculations applied, depending on whether the sample size is odd or even.
66
Q

Mode

A

The most frequently occurring value in a distribution.

A distribution can have more than 1 mode, or no mode at all.

The only measure of central tendency that can be used with nominal data.

e.g. if we categorise investment funds into different styles and assign a number to each style, the mode of these categorised data is the most frequent investment fund style.

67
Q

Unimodal

A

When a distribution has a single value that is most frequently occurring.

68
Q

Bimodal

A

If a distribution has two most frequently occurring values, then it has 2 modes.

69
Q

Trimodal

A

If the distribution has three most frequently occurring values, then it has 3 modes.

70
Q

No mode

A

When no value occurs more frequently than any other, e.g. stock return data and other data from continuous distributions.

71
Q

Modal interval

A

When continuous data are grouped into bins, we often find an interval (possibly more than one) with the highest frequency.

72
Q

Weighted Mean

A

Investment Manager: $100m
Allocation: $70m equities, $30m bonds

Portfolio weight: 0.7 stocks and 0.3 bonds

What is the calculated portfolio return?

  • This requires a weighted average of the returns on the stock and bond investments: multiply the return on the stock investment by 0.7 and the return on the bond investment by 0.3, then sum the two results.
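
A sketch of the calculation with the 0.7 and 0.3 weights from the card. The asset returns are hypothetical, since the card does not give them.

```python
# Portfolio weights from the example: $70m equities and $30m bonds out of $100m.
weights = {"equities": 0.7, "bonds": 0.3}

# Hypothetical single-period returns (not from the source).
returns = {"equities": 0.10, "bonds": 0.05}

# Weighted mean: each return multiplied by its weight, then summed.
portfolio_return = sum(weights[asset] * returns[asset] for asset in weights)
print(f"{portfolio_return:.3%}")
```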
73
Q

Portfolio Return

A
  • Weighted average of the returns on the assets in the portfolio; the weight applied to each asset’s return is the fraction of the portfolio invested in the asset.
74
Q

Market Indexes

A

e.g. the S&P 500 in the USA
- each included stock receives a weight corresponding to its market value divided by the total market value of all stocks in the index.

Expected value at year-end = (probability of expansion x forecast year-end level of S&P 500 assuming expansion) + (probability of contraction x forecast year-end level of S&P 500 assuming contraction)

Expected return - a weighted average of the possible future returns on the S&P 500, where the weights are probabilities that sum to 1.

75
Q

Geometric Mean

A

Used to average rates of change over time or to compute the growth rate of a variable.

Use:

  • average a time series of rates of return on an asset or portfolio.
  • compute the growth rate of a financial variable such as earnings or sales.
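
A sketch of the geometric mean return, using the standard formula: the T-th root of the product of (1 + R_t), minus 1. The two returns are hypothetical and chosen to show how far the geometric and arithmetic means can differ.

```python
import math


def geometric_mean_return(returns):
    """Compound the periodic returns, then take the T-th root and subtract 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns: +50% followed by -50%.
returns = [0.50, -0.50]

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)
print(arithmetic, round(geometric, 4))
```

The arithmetic mean is 0%, yet the investment has lost 25% of its value over the two periods, which the negative geometric mean reflects.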
76
Q

Analysing geometric/arithmetic mean return

A
  • Geometric mean return - represents the growth rate or compound rate of return on an investment; focuses on the profitability of an investment over a multi-period horizon.
  • Arithmetic mean return - focuses on average single-period performance.
77
Q

Harmonic Mean

A

Another measure of central tendency, appropriate in cases in which the variable is a rate or ratio.

The value obtained by summing the reciprocals of the observations (1/Xi), averaging that sum by dividing it by the number of observations n, and then taking the reciprocal of the average.

Useful measure of central tendency in the presence of outliers.

The harmonic mean is appropriate for averaging ratios (amount per unit) when the ratios are repeatedly applied to a fixed quantity to yield a variable number of units.
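
The calculation described above can be sketched as follows. The function is written out by hand for clarity (the standard library also provides `statistics.harmonic_mean`), and the observations are hypothetical.

```python
def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals (values must be non-zero)."""
    average_reciprocal = sum(1 / x for x in values) / len(values)
    return 1 / average_reciprocal


# Hypothetical ratios.
observations = [2, 4, 4]

arithmetic = sum(observations) / len(observations)
harmonic = harmonic_mean(observations)
print(round(arithmetic, 4), harmonic)
```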

78
Q

Harmonic vs Geometric vs Arithmetic Mean

A

Unless all the observations have the same value: arithmetic mean > geometric mean > harmonic mean

79
Q

Cost averaging

A

Involves the periodic investment of a fixed amount of money.

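The average price paid per share is the harmonic mean of the purchase prices.

arithmetic mean x harmonic mean = geometric mean^2 (exact for two observations)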
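
A sketch of cost averaging with two purchases. The amount and prices are hypothetical. It checks that the average cost per share equals the harmonic mean of the prices, and that the product relation between the three means holds in this two-observation case.

```python
# Hypothetical plan: the same amount invested at two different prices.
amount_per_period = 1000.0
prices = [10.0, 15.0]

# A fixed amount buys more shares when the price is low.
shares_bought = [amount_per_period / p for p in prices]
average_cost = amount_per_period * len(prices) / sum(shares_bought)

arithmetic = sum(prices) / len(prices)
harmonic = len(prices) / sum(1 / p for p in prices)
geometric_squared = prices[0] * prices[1]   # geometric mean squared (two observations)

print(round(average_cost, 4), arithmetic, round(harmonic, 4))
```

The average cost per share (12) is below the arithmetic mean price (12.5) because more shares are bought at the lower price.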

80
Q

Quantile (fractile)

A

A value at or below which a stated fraction of the data lies.

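Quartiles = divide the distribution into quarters (1/4)
Quintiles = divide the distribution into fifths (1/5)
Deciles = divide the distribution into tenths (1/10)
Percentiles = divide the distribution into hundredths (1/100)

The yth percentile is the value at or below which y% of observations lie.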

81
Q

Interquartile Range (IQR)

A

The difference between the 3rd and 1st quartile, or IQR = Q3 - Q1.

82
Q

Linear Interpolation

A

Estimating an unknown value on the basis of two known values that surround it (i.e. lie above and below it). "Linear" refers to the fact that the estimate lies on a straight line between them.
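
A sketch of how linear interpolation is used to locate a percentile. It assumes the common location convention L = (n + 1) x y / 100, counting positions from 1 in the sorted data; this convention is not stated on the card, and other conventions exist. The data are hypothetical.

```python
def percentile(values, y):
    """Value at the y-th percentile, interpolating between neighbouring items."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * y / 100      # 1-based position, possibly fractional
    lower = int(location)             # position of the value just below
    fraction = location - lower       # how far towards the next value
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    below, above = data[lower - 1], data[lower]
    return below + fraction * (above - below)


# Hypothetical observations.
data = [2, 4, 6, 8, 10, 12, 14]

q1, q3 = percentile(data, 25), percentile(data, 75)
print(q1, q3, q3 - q1)                 # first quartile, third quartile, IQR
print(percentile(data, 40))            # falls between the 3rd and 4th items
```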

83
Q

Box and Whisker plot

A

Box - runs from the first quartile (Q1, the lower bound of the second quarter of the data) to the third quartile (Q3, the upper bound of the third quarter), with the median or arithmetic average marked as the measure of central tendency.

Whiskers - the lines that run from the box and are bounded by fences, which represent the lowest and highest values of the distribution.

84
Q

Quantiles in Investment Practice

A
  1. Rank performance, e.g. of portfolios - Morningstar investment fund star rankings associate the number of stars with percentiles of performance relative to similar-style investment funds.
  2. Investment research - e.g. the set of companies with returns falling below the 10th percentile cut-off forms the bottom return decile. Dividing data into quantiles based on a characteristic allows analysts to evaluate the impact of that characteristic on a quantity of interest,
    e.g. ranking companies by decile to compare the performance of small companies with larger ones.
85
Q

Dispersion

A

The variability around the central tendency, addressing risk.

86
Q

Absolute Dispersion

A

The amount of variability present without comparison to any reference point or benchmark.

87
Q

Most common measures of absolute dispersion:

A
  1. Analyses data
    • range
    • mean absolute deviation
  2. Measures risk
    • variance
    • standard deviation
88
Q

Range

A

Maximum value - minimum value

Pros:

  • Ease of computation

Cons:

  • Only uses 2 pieces of information from the distribution, so it may not be representative.
89
Q

Mean absolute deviation

A

The average of the absolute values of the deviations from the mean. Taking absolute values prevents negative deviations from cancelling out positive ones, which would otherwise make the mean of the deviations always equal zero.

Pros:

Uses all observations in sample, thus superior to range as measure of dispersion.

Cons:

It is difficult to manipulate mathematically compared with the next measure, sample variance.
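
A short sketch showing why absolute values are needed: the raw deviations sum to zero, while the absolute deviations do not. The observations are hypothetical.

```python
def mean_absolute_deviation(values):
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


# Hypothetical observations.
observations = [2, 4, 6, 8]
mean = sum(observations) / len(observations)

# Raw deviations cancel out.
sum_of_deviations = sum(x - mean for x in observations)

mad = mean_absolute_deviation(observations)
print(sum_of_deviations, mad)
```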

90
Q

Variance

A

The average of squared deviations around the mean.

91
Q

Standard deviation

A

The positive square root of variance.

More easily interpreted than variance because, by taking the square root, it is expressed in the same unit of measurement as the observations.
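
A sketch of sample variance and sample standard deviation. It assumes the usual sample convention of dividing by n - 1 rather than n, which the cards do not state explicitly. The observations are hypothetical.

```python
import math


def sample_variance(values):
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (len(values) - 1)   # n - 1 for a sample


def sample_standard_deviation(values):
    return math.sqrt(sample_variance(values))       # positive square root


# Hypothetical observations (e.g. returns in %).
observations = [2, 4, 6, 8]
variance = sample_variance(observations)
standard_deviation = sample_standard_deviation(observations)
print(round(variance, 4), round(standard_deviation, 4))
```

The variance is in squared units (%² here), while the standard deviation is back in the units of the observations.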

92
Q

Downside risk

A

The risk that returns to an investor fall below the mean or below some specified minimum target return.

93
Q

Target semideviation

A

Measure of dispersion of observations below the target.

  1. Specify the target.
  2. After identifying the observations below the target, find the sum of their squared deviations from the target.
  3. Divide the sum by the total number of observations in the sample minus 1.
  4. Take the square root.
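
The four steps can be sketched as follows. The function name, target and data are hypothetical. Observations equal to the target are included in the sum, which changes nothing because their deviation is zero.

```python
import math


def target_semideviation(values, target):
    # Step 2: squared deviations of the observations at or below the target.
    downside = sum((x - target) ** 2 for x in values if x <= target)
    # Step 3: divide by the total number of observations minus 1.
    # Step 4: take the square root.
    return math.sqrt(downside / (len(values) - 1))


# Hypothetical returns (in %) and a target of 5% (step 1).
returns = [1, 3, 5, 7, 9]
semideviation = target_semideviation(returns, target=5)
print(round(semideviation, 4))
```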
94
Q

Relative dispersion

A

The amount of dispersion relative to a reference value or benchmark.

e.g. coefficient of variation

95
Q

Coefficient of variation (CV)

A

The ratio of the standard deviation of a set of observations to their mean value.

uses:

  • when observations are returns, CV measures the amount of risk (standard deviation) per unit of reward (mean return).
  • An issue when dealing with returns is that if the mean return (X̄) is negative, the statistic is meaningless.
  • CV may be stated as a multiple or a percentage. Because it expresses the magnitude of variation among observations relative to their average size, CV permits direct comparisons of dispersion across different datasets.
  • A scale free measure.
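
A sketch comparing the CV of two return series with the same mean. The data and names are invented for illustration; `statistics.stdev` computes the sample standard deviation.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be positive)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical returns (in %) for two assets with the same mean return.
returns_a = [2, 4, 6, 8]
returns_b = [4, 5, 5, 6]

cv_a = coefficient_of_variation(returns_a)
cv_b = coefficient_of_variation(returns_b)
print(round(cv_a, 4), round(cv_b, 4))
```

Asset A carries more risk per unit of mean return than asset B.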
96
Q

Normal distribution

A

A symmetrical bell-shaped distribution that plays a central role in the mean-variance model of portfolio selection and is extensively used in financial risk management.

Characteristics:

  • mean, median, and mode are equal
  • completely described by two parameters - its mean and variance (or standard deviation).
97
Q

Skewness

A

The average cubed deviation from the mean standardised by dividing by the standard deviation cubed to make the measure free of scale.

  • Cubing preserves the sign of the deviations from the mean.

Positive skew with mean greater than median:

  • Means more than half of the deviations from the mean are negative and less than half are positive.
  • For the sum of cubed deviations to be positive, the losses must be small and likely, and gains less likely but more extreme.
  • A positive skew means the average magnitude of positive deviations is larger than the average magnitude of negative deviations.
98
Q

Kurtosis

A

A measure of the combined weight of the tails of a distribution relative to the rest of the distribution,

i.e. the proportion of the total probability that lies outside of, say, 2.5 standard deviations of the mean.

Normal distribution: kurtosis = 3
Fat-tailed distribution: kurtosis > 3
Thin-tailed distribution: kurtosis < 3
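
Skewness and kurtosis can both be computed from powers of the deviations from the mean. The sketch below uses simple moment-based versions that divide by n, which is an assumption; sample formulas with small-sample adjustments differ slightly. The data are hypothetical.

```python
def skewness_and_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    # Average squared, cubed and fourth-power deviations from the mean.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5      # standardised by the standard deviation cubed
    kurtosis = m4 / m2 ** 2        # a normal distribution has kurtosis 3
    return skewness, kurtosis


# Hypothetical sample with one large positive value.
sample = [1, 2, 3, 4, 10]
skewness, kurtosis = skewness_and_kurtosis(sample)
excess_kurtosis = kurtosis - 3
print(round(skewness, 4), round(kurtosis, 4), round(excess_kurtosis, 4))
```

The sample is positively skewed: its mean (4) is greater than its median (3).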

99
Q

Leptokurtic/Fat-tailed

A

A distribution that has fatter tails than the normal distribution.

This tends to generate more frequent extremely large deviations from the mean than the normal distribution.

100
Q

Platykurtic/Thin tailed

A

A distribution that has thinner tails than the normal distribution.

101
Q

Mesokurtic

A

A distribution similar to the normal distribution as concerns relative weight in the tails.

102
Q

Skewness and Kurtosis of EAA Equity Index Daily Returns

A
  • The distribution is negatively skewed, as indicated by the skewness of -0.4260 and the influence of observations below the mean of 0.0347%.
  • The highest frequency of returns occurs within -0.5 to 0.0 standard deviations from the mean (negative skew).
  • The distribution is fat-tailed, as indicated by the positive excess kurtosis of 3.7962. The fat tails come with a concentration of returns around the mean and fewer observations in the regions between the central region and the two tail regions.
103
Q

Correlation

A

A measure of the linear relationship between two random variables. The first step in considering it is to measure how the 2 variables vary together, i.e. their covariance.

104
Q

Sample Covariance (sXY)

A

The measure of how two variables in a sample move together.

The average value of the product of the deviations of observations on two random variables (Xi and Yi) from their sample means.

e.g. if X tends to be above its mean when Y is above its mean, then the covariance is positive.

104
Q

Sample correlation coefficient

A

A standardised measure of how two variables in a sample move together.

The ratio of the sample covariance to the product of the two variables' standard deviations.
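
A sketch of the sample covariance and the sample correlation coefficient. It assumes the usual n - 1 divisor for sample statistics, and the paired observations are hypothetical.

```python
import math


def sample_covariance(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return products / (len(x) - 1)


def sample_correlation(x, y):
    # Covariance divided by the product of the two sample standard deviations.
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

covariance = sample_covariance(x, y)
correlation = sample_correlation(x, y)
print(covariance, round(correlation, 4))
```

The covariance depends on the units of X and Y, while the correlation is unit-free and bounded by -1 and +1.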

105
Q

Properties of correlation

A
  1. Correlation ranges from -1 to +1 for 2 random variables X and Y: -1 ≤ rXY ≤ +1.
  2. A correlation of 0 indicates no linear relationship between the variables.
  3. A correlation close to +1 indicates a strong positive linear relationship, with +1 being a perfect linear relationship.
  4. A correlation close to -1 indicates a strong negative linear relationship, with -1 being a perfect inverse linear relationship.
106
Q

Limitations of Correlation Analysis

A
  • Not a reliable measure when the relationship is non-linear: two variables can have a strong non-linear relation and still have a very low correlation.
  • Unreliable measure when outliers are present in one or both variables.
  • Correlation does not imply causality.
  • Correlations can be spurious.
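
A sketch of the first limitation: Y is fully determined by X (Y = X squared), yet the correlation is zero because the relationship is not linear. The data are a constructed example.

```python
import math


def correlation(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Constructed example: an exact but non-linear relationship.
x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]

r = correlation(x, y)
print(r)
```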
107
Q

Spurious correlation

A

Refers to:

  1. Correlation between two variables that reflects chance relationships in a particular dataset.
  2. Correlation induced by a calculation that mixes each of two variables with a third variable.
  3. Correlation between 2 variables arising not from a direct relation between them, but from their relation to a third variable.
108
Q

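Anscombe's Quartet

A

A set of four datasets with nearly identical summary statistics that look very different when plotted. It shows that knowing the means and standard deviations of two variables, as well as the correlation between them, does not tell the entire story.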