Data Flashcards

Terms, statistics, properties, management (55 cards)

1
Q

What is a data dictionary, and what is its purpose?

A

1) A map of data assets in which each data item is specified, including the required metadata
2) It is valuable for keeping track of data, and the effort of making one is minimized if data editing is organized from the beginning

2
Q

Data element

A

Also called data field. An aspect of an individual or object that can take on varying values among individuals. Every piece of info in a database is a measurement of a data element.

3
Q

Heteroscedasticity

A

Describes data that does not have constant variance. (The variability of a variable is not consistent across the values of another variable, like time.) In a residual plot, this looks like a fan or cone shape
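A rough sketch of how the fan shape might be checked numerically. The residuals are simulated (hypothetical data whose spread grows with x), and comparing the variance of the two halves is only a crude stand-in for a formal test such as Goldfeld-Quandt.

```python
import random
from statistics import variance

rng = random.Random(42)  # fixed seed so the illustration is repeatable
xs = list(range(1, 401))

# Hypothetical residuals whose standard deviation grows in proportion to x.
residuals = [rng.gauss(0.0, 0.05 * x) for x in xs]

# Split the residuals by the size of x and compare their spread.
low_half = residuals[:200]
high_half = residuals[200:]
variance_ratio = variance(high_half) / variance(low_half)

# A ratio far above 1 suggests the variance is not constant across x.
print(f"variance ratio (high x / low x): {variance_ratio:.2f}")
```

With constant variance the ratio would hover around 1.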

4
Q

Metadata

A

Descriptions of the fields in the database and their permissible values, as well as how they are created and limitations on their use, if known.

5
Q

Redundancy

A

A technique for obtaining high quality data. Ask for the same or similar information at least twice to reduce risk of errors and inaccuracy.
Ex: ask for email twice to make sure it is typed correctly
Ex: ask for age and date of birth
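The two examples could be checked along these lines. This is a minimal sketch; the function names and the choice to ignore surrounding whitespace in emails are assumptions made for illustration.

```python
from datetime import date

def emails_match(first_entry: str, second_entry: str) -> bool:
    """Redundant entry: the email typed twice must agree (ignoring outer spaces)."""
    return first_entry.strip() == second_entry.strip()

def age_on(birth: date, as_of: date) -> int:
    """Age in whole years on the date `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)

def age_matches_birth_date(reported_age: int, birth: date, as_of: date) -> bool:
    """Redundant fields: the reported age must agree with the date of birth."""
    return reported_age == age_on(birth, as_of)

print(emails_match("a@example.com", "a@exmaple.com"))
print(age_matches_birth_date(34, date(1990, 6, 15), date(2024, 6, 15)))
```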

6
Q

When to implement redundancy

A

Only when a virtually error free result is required. Otherwise, the cost might outweigh the benefit

7
Q

Skewness. Positive vs negative

A

Describes a distribution’s departure from symmetry. Negative skew means the left-hand tail is longer. Positive skew means the right-hand tail is longer.
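As a small illustration, the sign of the moment-based skewness coefficient (third central moment divided by the standard deviation cubed) matches the tail descriptions above. The samples are made up, and this population-style estimator is one of several in use.

```python
def skewness(xs):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 3, 4, 10]      # one large value stretches the right tail
left_tailed = [-10, -4, -3, -2, -1]  # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
print(skewness(symmetric))     # zero
```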

8
Q

Why is skewness important?

A
  • Variance tends to understate the likelihood of loss if the distribution is skewed but assumed to be symmetric.
  • We must consider skew to avoid taking higher-than-anticipated levels of risk or rejecting projects whose likelihood of profit is understated.
9
Q

Stationary

A

A process (or time series) is stationary if the parameters of its distribution, such as the mean and variance, are stable over time

10
Q

TVaR

A

Tail value at risk. The expected loss given that the loss falls in the worst (1-alpha) part of the distribution
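A minimal empirical sketch, assuming a small made-up loss sample and the convention that the tail is the largest (1-alpha) share of observations. Other quantile conventions exist and give slightly different numbers.

```python
import math

def tail_losses(losses, alpha):
    """Return the worst (1 - alpha) share of the losses (the largest values)."""
    ordered = sorted(losses)
    # Number of observations outside the tail; round() guards against
    # floating-point noise in alpha * n before taking the ceiling.
    cut = math.ceil(round(alpha * len(ordered), 9))
    return ordered[cut:]

def empirical_tvar(losses, alpha):
    """Average loss, given that the loss falls in the tail."""
    tail = tail_losses(losses, alpha)
    if not tail:
        raise ValueError("no observations in the tail; lower alpha or add data")
    return sum(tail) / len(tail)

losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tvar_80 = empirical_tvar(losses, 0.80)  # mean of the two largest losses
print(tvar_80)
```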

11
Q

Pros and cons of TVaR

A
  • Pro: coherent risk measure
  • Pro: describes the full tail of the distribution
  • Con: difficult to calculate
12
Q

VaR

A
  • Value at Risk
  • The maximum loss that could occur with a specified probability over a given time horizon for a given distribution
13
Q

Portfolio VaR

A

The VaR of the entire portfolio

14
Q

Individual VaR

A

The VaR of one asset in the portfolio in isolation

15
Q

Diversified VaR

A

The portfolio VaR, taking into account diversification benefits

16
Q

Undiversified VaR

A

The sum of the individual VaRs in the portfolio when there is no short position and all correlations are unity
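The difference between diversified and undiversified VaR can be sketched as follows. This assumes VaR scales with standard deviation (for example jointly normal, zero-mean profit and loss) and no short positions, so the portfolio VaR is sqrt(v' R v) for individual VaRs v and correlation matrix R. The figures are invented.

```python
import math

def diversified_var(individual_vars, corr):
    """Portfolio VaR from individual VaRs and a correlation matrix,
    assuming VaR is proportional to standard deviation."""
    n = len(individual_vars)
    total = sum(
        individual_vars[i] * individual_vars[j] * corr[i][j]
        for i in range(n)
        for j in range(n)
    )
    return math.sqrt(total)

def undiversified_var(individual_vars):
    """Simple sum of the individual VaRs (all correlations equal to 1)."""
    return sum(individual_vars)

individual = [3.0, 4.0]
uncorrelated = [[1.0, 0.0], [0.0, 1.0]]
perfectly_correlated = [[1.0, 1.0], [1.0, 1.0]]

print(diversified_var(individual, uncorrelated))          # below the simple sum
print(diversified_var(individual, perfectly_correlated))  # equals the simple sum
print(undiversified_var(individual))
```

The gap between the two figures is the diversification benefit.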

17
Q

Marginal VaR

A

The VaR that would be added for a unit increase in the investment in a particular asset

18
Q

Incremental VaR

A

The VaR that would be added to the portfolio VaR if the given investment adjustments were made to the portfolio

19
Q

Component VaR

A

A partition of the portfolio VaR that indicates how much the portfolio VaR would change (approximately) if the given asset was deleted from the portfolio
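Marginal, incremental, and component VaR can be illustrated together under a delta-normal assumption (zero-mean, jointly normal returns), which is not stated on the cards and is used here only to make the sketch concrete. The positions and covariance matrix are hypothetical.

```python
import math
from statistics import NormalDist

def var_decomposition(weights, cov, alpha=0.95):
    """Return (portfolio VaR, marginal VaRs, component VaRs) under a
    delta-normal assumption. `weights` are amounts invested per asset."""
    z = NormalDist().inv_cdf(alpha)
    n = len(weights)
    cov_w = [sum(cov[i][j] * weights[j] for j in range(n)) for i in range(n)]
    sigma_p = math.sqrt(sum(weights[i] * cov_w[i] for i in range(n)))
    portfolio_var = z * sigma_p
    # Marginal VaR: change in portfolio VaR per unit added to each asset.
    marginal = [z * c / sigma_p for c in cov_w]
    # Component VaR: position size times marginal VaR; these sum to the total.
    component = [w * m for w, m in zip(weights, marginal)]
    return portfolio_var, marginal, component

def incremental_var(weights, new_weights, cov, alpha=0.95):
    """Change in portfolio VaR caused by moving to `new_weights`."""
    before, _, _ = var_decomposition(weights, cov, alpha)
    after, _, _ = var_decomposition(new_weights, cov, alpha)
    return after - before

weights = [100.0, 200.0]
cov = [[0.04, 0.01], [0.01, 0.09]]
portfolio_var, marginal, component = var_decomposition(weights, cov)
extra = incremental_var(weights, [110.0, 200.0], cov)
print(portfolio_var, marginal, component, extra)
```

Because the components add up to the portfolio VaR, each one can be read as that asset's share of total risk.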

20
Q

Purpose of VaR

A
  1. Identify the component (asset or BU or risk) that contributes most to the total risk
  2. Pick the best hedges
  3. Rank trades
  4. Select the asset/project/BU that provides the best risk-return tradeoff
21
Q

How to calculate VaR

A
  1. Empirical: the loss at the cutoff of the worst (1-alpha) share of results in the sample data
  2. Parametric: assume that the data follows a statistical distribution and use that distribution to calculate VaR
  3. Stochastic: apply the empirical method to thousands of simulations
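The three approaches above can be sketched as follows. The quantile convention, the normal distribution in the parametric case, and the standard normal loss simulator are all assumptions chosen for illustration, and the sample is made up.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def empirical_var(losses, alpha):
    """Smallest observed loss with at least an alpha share of losses at or below it."""
    ordered = sorted(losses)
    # round() guards against floating-point noise before taking the ceiling.
    k = math.ceil(round(alpha * len(ordered), 9)) - 1
    return ordered[max(k, 0)]

def parametric_var(losses, alpha):
    """Fit a normal distribution (an assumption) and read off its alpha-quantile."""
    return NormalDist(mean(losses), stdev(losses)).inv_cdf(alpha)

def stochastic_var(simulate_loss, alpha, n_sims=10_000, seed=0):
    """Apply the empirical method to simulated losses."""
    rng = random.Random(seed)
    return empirical_var([simulate_loss(rng) for _ in range(n_sims)], alpha)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
var_emp = empirical_var(sample, 0.80)
var_par = parametric_var(sample, 0.80)
var_sim = stochastic_var(lambda rng: rng.gauss(0.0, 1.0), 0.95)
print(var_emp, var_par, var_sim)
```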
22
Q

Pros and cons of VaR

A
  • Pro: easy to understand
  • Con: not a coherent risk measure
  • Con: doesn’t describe the tail of the distribution
23
Q

Data quality

A

Refers to data’s “fitness for use.” The ability to fulfill the requirements of intended usage of data in a specific situation

24
Q

Why is data quality important?

A

High data quality can be a competitive advantage. Poor data quality can:
1. reduce customer satisfaction
2. reduce employee satisfaction (causing high turnover)
3. breed organizational mistrust
4. make it difficult or impossible to accurately determine the financial position of the business
5. make it difficult or impossible to calculate premium income and reserve required
6. waste time and resources investigating and fixing data issues

25
Properties of high quality data
1. Relevance
2. Accuracy
3. Timeliness
4. Accessibility and clarity of results
5. Comparability
6. Coherence
7. Completeness
26
How to obtain high quality data
1. Prevention: keep bad data out (most desirable)
2. Detection: look for bad data already entered
3. Repair: let the bad data find you and fix things (least desirable)
27
How to prevent bad data from entering the dataset
- Edit data before it enters the database to prevent issues instead of fixing them
- Encourage staff to improve the data management process
- Improve the data collection instrument
- Improve the data collection method
- Build in redundancy
28
How to look for bad data already in the dataset
- Deterministic tests
- Probabilistic tests to detect outliers
- Exploratory data analysis
- Frequency counts
- Two-way tabulations
- Record linkage
29
Deterministic tests
- Range Test
- If-Then Test
- Ratio Control Test
- Zero Control Test
- Internal Consistency Test
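Four of these tests might look like the sketch below. The record layout, field names, and thresholds are hypothetical, and the reading of each test (for example, zero control as "components must net to the recorded total") is an assumption, since the card only names the tests.

```python
def range_test(record):
    """Range test: the value must lie inside its permitted range."""
    return 0 <= record["age"] <= 120

def if_then_test(record):
    """If-then test: if a spouse is recorded, marital status must be 'married'."""
    return record["spouse_name"] is None or record["marital_status"] == "married"

def ratio_control_test(record, low=5.0, high=500.0):
    """Ratio control test: wages per hour must fall in a plausible band."""
    if record["hours"] <= 0:
        return False
    return low <= record["wages"] / record["hours"] <= high

def zero_control_test(record, tolerance=0.005):
    """Zero control test: total minus the sum of its parts must be zero."""
    return abs(record["premium_total"] - sum(record["premium_parts"])) <= tolerance

TESTS = {
    "range": range_test,
    "if_then": if_then_test,
    "ratio_control": ratio_control_test,
    "zero_control": zero_control_test,
}

def failed_tests(record):
    """Names of the deterministic tests that the record fails."""
    return [name for name, test in TESTS.items() if not test(record)]

good = {"age": 42, "spouse_name": "Sam", "marital_status": "married",
        "wages": 2000.0, "hours": 80.0,
        "premium_total": 300.0, "premium_parts": [100.0, 200.0]}
bad = {"age": 142, "spouse_name": "Sam", "marital_status": "single",
       "wages": 2000.0, "hours": 1.0,
       "premium_total": 300.0, "premium_parts": [100.0, 150.0]}

print(failed_tests(good))
print(failed_tests(bad))
```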
30
How to maintain high quality data
Data quality should:
1) be a regular item on the Management Board’s agenda
2) receive ongoing priority attention within the organization
3) be a structural component of operational management
4) be applied to the reporting and operational processes
31
Pros and cons of standard deviation as a risk measure
Pro: easy to understand
Con: not a coherent risk measure
Con: doesn't describe the tail of the distribution well
Con: considers upside and downside risk equal
Con: underestimates risk if the underlying distribution is leptokurtic (thicker tails than the normal distribution)
32
Correlation
The relationship between variables (like risks)
33
Types of correlation measures
1) Pearson's Rho
2) Spearman's Rho
3) Kendall's Tau
4) Tail Correlation
34
Pearson's Rho
The linear correlation coefficient. A value of 0 does not mean the variables are unrelated, only that there is no linear relationship between them.
35
Spearman's Rho
- Equal to Pearson's Rho applied to the ranks of the observations (equivalently, to data whose marginal distributions are uniform).
- Only the order of observations matters. Spearman's Rho is independent of the marginal statistical distributions.
36
Kendall's Tau
Calculated by comparing pairs of data points. If X and Y both increase or both decrease from one data point to another, the pair is concordant. If not, it is discordant.
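A pure-Python sketch of Pearson's Rho, Spearman's Rho, and Kendall's Tau, assuming no tied values (tie handling is left out for brevity). Kendall's Tau is computed here as (concordant pairs - discordant pairs) / total pairs. The data are a made-up monotone but nonlinear relationship, which the rank-based measures score as perfect while Pearson's does not.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Pearson's Rho applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """(concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotone but not linear
print(pearson(xs, ys), spearman(xs, ys), kendall_tau(xs, ys))
```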
37
Tail correlation
Correlation between 2 variables may not be constant for all values. We can measure the correlation of the tails.
38
How can you choose which correlation measure to use?
- If the variables are numeric and you care about the linear relationship specifically, choose Pearson's.
- If any variables are ordinal, choose Spearman's or Kendall's.
- If you care about the degree of deviation between data points, choose Spearman's. If not, Kendall's.
39
Kurtosis
- Describes the degree of flatness of a distribution
- An indication of the likelihood of extreme observations relative to those that would be expected with the normal distribution
40
Types of kurtosis
- Mesokurtic distribution: normal distribution with kurtosis of 3
- Platykurtic distribution: thinner tails than the normal distribution, so kurtosis less than 3
- Leptokurtic distribution: thicker tails than the normal distribution, so kurtosis greater than 3
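A short sketch of the moment-based kurtosis (fourth central moment divided by the squared variance), compared against the normal benchmark of 3. The two samples are invented to show a thin-tailed and a thick-tailed case.

```python
def kurtosis(xs):
    """Moment coefficient of kurtosis: m4 / m2**2 (normal distribution = 3)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

def classify(k, benchmark=3.0):
    """Label a kurtosis value relative to the normal benchmark."""
    if k > benchmark:
        return "leptokurtic"
    if k < benchmark:
        return "platykurtic"
    return "mesokurtic"

evenly_spread = [1, 2, 3, 4, 5]                        # no extreme values
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]      # rare, large outliers

print(kurtosis(evenly_spread), classify(kurtosis(evenly_spread)))
print(kurtosis(heavy_tailed), classify(kurtosis(heavy_tailed)))
```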
41
Why is leptokurtosis important to look out for when quantifying risk?
It means the distribution has thicker tails than the normal distribution. If it is present and not properly accounted for, then the probability of extreme events will be underestimated.
42
Spread
Usually refers to the difference between the returns on 2 assets
43
Comonotonic
- 2 risks are comonotonic if one can be expressed as an increasing deterministic function of the other
- If X lies at its q-quantile, Y will also be at its q-quantile
- More than 2 risks can be comonotonic
44
Countermonotonic
- 2 risks are countermonotonic if one can be expressed as a strictly decreasing function of the other
- If X lies at its q-quantile, Y will be at its (1-q)-quantile
- More than 2 risks cannot be countermonotonic
45
Credibility
For full credibility, there needs to be a sufficient number of observations. Technically, true full credibility can never be achieved, but we say full credibility exists at a given confidence level for a given distance from the expected value. There are 3 types of credibility: classical, Bühlmann, and Bayesian.
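To make "a given confidence level for a given distance from the expected value" concrete, here is a hedged sketch of the classical (limited fluctuation) standard. It assumes Poisson claim counts and a normal approximation, neither of which is stated on the card; the square-root rule for partial credibility is the usual companion formula.

```python
from statistics import NormalDist

def full_credibility_claims(confidence: float, tolerance: float) -> float:
    """Expected claim count needed so that observed frequency is within
    `tolerance` of its mean with probability `confidence`
    (assumes Poisson counts and a normal approximation)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return (z / tolerance) ** 2

def partial_credibility(n_claims: float, n_full: float) -> float:
    """Square-root rule: credibility factor capped at 1."""
    return min(1.0, (n_claims / n_full) ** 0.5)

n_full = full_credibility_claims(confidence=0.90, tolerance=0.05)
print(round(n_full))                           # about 1,082 claims
print(partial_credibility(n_full / 4, n_full)) # a quarter of the standard
```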
46
Expected Shortfall
Sometimes considered the same as TVaR. Where the two are distinguished, TVaR is the average loss in the tail, whereas expected shortfall is the probability that the loss falls in the tail multiplied by the expected loss given that it does.
47
Pros and cons of expected shortfall
It has many of the same benefits as TVaR, but it has little intuitive meaning.
48
How to calculate expected shortfall
1. Empirical: sum the losses in the tail and divide by the total number of observations
2. Parametric: (1 - alpha) × TVaR
3. Stochastic: apply the empirical approach to the output from the stochastic model
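The empirical and parametric recipes above agree with each other, as this small sketch shows. It follows this deck's definition of expected shortfall (tail losses divided by all observations), uses a made-up sample, and assumes the tail is the largest (1 - alpha) share of losses.

```python
import math

def tail_losses(losses, alpha):
    """Worst (1 - alpha) share of the losses (the largest values)."""
    ordered = sorted(losses)
    # round() guards against floating-point noise before taking the ceiling.
    cut = math.ceil(round(alpha * len(ordered), 9))
    return ordered[cut:]

def empirical_tvar(losses, alpha):
    """Average of the tail losses."""
    tail = tail_losses(losses, alpha)
    return sum(tail) / len(tail)

def expected_shortfall(losses, alpha):
    """Sum of the tail losses divided by the total number of observations."""
    return sum(tail_losses(losses, alpha)) / len(losses)

losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
alpha = 0.80
es = expected_shortfall(losses, alpha)
tvar = empirical_tvar(losses, alpha)

print(es)                  # tail losses 9 + 10, divided by 10 observations
print((1 - alpha) * tvar)  # same value up to floating-point rounding
```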
49
Probability of Ruin
It is the probability that a given extreme loss will occur. It is the converse of VaR: VaR fixes the level of confidence and gives the max loss, whereas probability of ruin fixes the loss level and gives the probability of exceeding it.
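A minimal empirical sketch of the relationship with VaR: VaR takes a confidence level and returns a loss, while probability of ruin takes a loss level (here a hypothetical capital amount) and returns a probability. The sample and the quantile convention are assumptions.

```python
import math

def empirical_var(losses, alpha):
    """Smallest observed loss with at least an alpha share of losses at or below it."""
    ordered = sorted(losses)
    k = math.ceil(round(alpha * len(ordered), 9)) - 1
    return ordered[max(k, 0)]

def probability_of_ruin(losses, capital):
    """Share of observed losses that exceed the available capital."""
    return sum(1 for loss in losses if loss > capital) / len(losses)

losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

capital = empirical_var(losses, 0.80)         # confidence level in, loss out
ruin = probability_of_ruin(losses, capital)   # loss level in, probability out
print(capital, ruin)
```

Holding capital equal to the 80% VaR leaves a 20% probability of ruin in this sample.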
50
What's wrong with probability of ruin?
It generally has the same limitations as VaR, and the assessment of loss if it occurs is not usually a priority. We use VaR and TVaR to determine the loss for which capital is required.
51
Seasonality
Values are generally above an underlying trend at some points in the period and below it at others. To handle this, we can use an ARIMA model (with seasonal terms) because seasonality is essentially an autoregressive process.
52
Types of dependency
1) Immediate: direct immediate causal relationship
2) Time-lagged: delayed causal relationship
3) Feedback: variables interact with each other over time
4) Phase-shift: one variable affects another only after a change has reached a threshold
53
Why is dependency important
Experience has shown that in many situations, dependencies under stress differ from those under normal conditions.
Ex: Low interest rates, high credit risk, and widespread panic among investors (a financial crisis) can happen in many countries at the same time and quickly deplete a firm's capital
54
Ways to quantify dependency
1) Correlation matrix approach
2) Copula approach
3) Structured scenario approach
4) Multivariate distribution approach
55
Correlation matrix approach to dependency
Assume linear correlation. Use a correlation matrix or a collection of matrices. Different matrices can be used for different percentiles to reflect higher correlation in tail events.