Data Flashcards
Terms, statistics, properties, management (55 cards)
What is a data dictionary and what is its purpose?
1) A map of data assets that specifies each data item together with its required metadata
2) Valuable for keeping track of data; the effort of building one is minimized if data editing is organized from the beginning
Data element
Also called a data field. An aspect of an individual or object that can take on varying values among individuals. Every piece of information in a database is a measurement of a data element.
Heteroscedasticity
Describes data that does not have constant variance. (The variability of a variable is not consistent across the values of another variable, such as time.) In a residual plot, this looks like a fan or cone shape.
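A minimal sketch of the fan shape, using made-up data in which the noise standard deviation is assumed to grow in proportion to x. The variable names and the factor 0.5 are illustrative only. Comparing the residual spread in the lower and upper halves of x is a crude check, not a formal test.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: the noise standard deviation grows with x,
# so the residuals fan out as x increases.
xs = [i / 10 for i in range(1, 1001)]            # x from 0.1 to 100.0
residuals = [random.gauss(0, 0.5 * x) for x in xs]

# Crude check: compare residual spread in the lower and upper halves of x.
half = len(xs) // 2
spread_low = statistics.stdev(residuals[:half])
spread_high = statistics.stdev(residuals[half:])

print(f"spread for small x: {spread_low:.2f}")
print(f"spread for large x: {spread_high:.2f}")
```

With constant variance the two spreads would be roughly equal; here the second is clearly larger.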
Metadata
Descriptions of the fields in the database and their permissible values, as well as how they are created and limitations on their use, if known.
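As an illustration, metadata can be held in a small data dictionary and used to check records against the permissible values. This is only a sketch: the field names, the allowed codes, and the `validate` helper are hypothetical and not taken from any particular system.

```python
# Hypothetical data dictionary: one entry of metadata per field.
data_dictionary = {
    "policy_id": {"type": str, "description": "Unique policy identifier"},
    "gender": {"type": str, "description": "Gender code", "allowed": {"F", "M", "X"}},
    "issue_age": {"type": int, "description": "Age at issue in years", "min": 0, "max": 120},
}

def validate(record):
    """Return a list of problems found when checking a record against the metadata."""
    problems = []
    for field, meta in data_dictionary.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, meta["type"]):
            problems.append(f"{field}: expected {meta['type'].__name__}")
            continue
        if "allowed" in meta and value not in meta["allowed"]:
            problems.append(f"{field}: value not permitted")
        if "min" in meta and value < meta["min"]:
            problems.append(f"{field}: below minimum")
        if "max" in meta and value > meta["max"]:
            problems.append(f"{field}: above maximum")
    return problems

good = {"policy_id": "P001", "gender": "F", "issue_age": 42}
bad = {"policy_id": "P002", "gender": "Q", "issue_age": 150}
print(validate(good))
print(validate(bad))
```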
Redundancy
A technique for obtaining high-quality data. Ask for the same or similar information at least twice to reduce the risk of errors and inaccuracy.
Ex: ask for email twice to make sure it is typed correctly
Ex: ask for age and date of birth
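The age and date-of-birth example can be sketched as a cross-check between the two redundant fields. The function names and the sample records are hypothetical, and the check assumes age is recorded in completed years as of a known collection date.

```python
from datetime import date

def age_on(birth_date, as_of):
    """Age in completed years on the date `as_of`."""
    years = as_of.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def is_consistent(reported_age, birth_date, as_of):
    """True when the redundant age field agrees with the date of birth."""
    return reported_age == age_on(birth_date, as_of)

# Hypothetical records collected on 2024-06-30.
as_of = date(2024, 6, 30)
print(is_consistent(34, date(1990, 3, 15), as_of))  # fields agree
print(is_consistent(43, date(1990, 3, 15), as_of))  # digits likely transposed
```

A mismatch does not say which field is wrong, only that the record needs review.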
When to implement redundancy
Only when a virtually error-free result is required. Otherwise, the cost might outweigh the benefit.
Skewness. Positive vs negative
Describes a distribution’s departure from symmetry. Negative skew means the left-hand tail is longer. Positive skew means the right-hand tail is longer.
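A small sketch of the sign convention, using one common definition of skewness (the mean of cubed standardized deviations, with the population standard deviation). The sample values are made up for illustration.

```python
import statistics

def sample_skewness(values):
    """Skewness as the mean of cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

right_tailed = [1, 1, 2, 2, 3, 3, 4, 20]   # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]   # mirror image stretches the left tail

print(sample_skewness(right_tailed))  # positive skew
print(sample_skewness(left_tailed))   # negative skew
```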
Why is skewness important?
- Variance tends to understate the likelihood of loss if the distribution is skewed but assumed to be symmetric.
- We must consider skew to avoid taking higher-than-anticipated levels of risk and rejecting projects with understated likelihood of profits.
Stationary
A process (or time series) is stationary if the parameters of its distribution, such as the mean and variance, are stable over time
TVaR
Tail value at risk. The expected loss given that the loss falls in the worst (1 - alpha) share of the distribution, where alpha is the confidence level
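A minimal empirical sketch of this definition. It assumes one simple convention among several in use: the tail is the worst ceil((1 - alpha) * n) sample losses, VaR is the smallest loss in that tail, and TVaR is the tail average. The function name and the loss sample are hypothetical.

```python
import math

def empirical_var_tvar(losses, alpha):
    """Return (VaR, TVaR) at confidence level alpha from sample losses.

    Assumed convention: the tail is the worst ceil((1 - alpha) * n) losses;
    VaR is the smallest loss in that tail and TVaR is the tail mean.
    """
    ordered = sorted(losses, reverse=True)  # largest losses first
    # round() guards against floating-point noise such as 1.9999999999999996
    tail_size = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return tail[-1], sum(tail) / tail_size

# Hypothetical sample of 20 losses (positive numbers are losses).
losses = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 15, 20, 40]
var_90, tvar_90 = empirical_var_tvar(losses, 0.90)
print(var_90, tvar_90)  # 20 30.0
```

The worst 10% of this sample is the two losses 40 and 20, so TVaR averages them while VaR reports only the boundary value.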
Pros and cons of TVaR
- Pro: coherent risk measure
- Pro: describes the full tail of the distribution
- Con: difficult to calculate
VaR
- Value at Risk
- The maximum loss that will not be exceeded with a specified probability (the confidence level) over a given time horizon, for a given loss distribution
Portfolio VaR
The VaR of the entire portfolio
Individual VaR
The VaR of one asset in the portfolio in isolation
Diversified VaR
The portfolio VaR, taking into account diversification benefits
Undiversified VaR
The sum of the individual VaRs in the portfolio. It equals the portfolio VaR when there are no short positions and all correlations are unity
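The relationship between individual, diversified, and undiversified VaR can be sketched for two assets. Everything here is an assumption made for illustration: zero-mean normally distributed returns, a 95% confidence level, and the exposures, volatilities, and correlation shown.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.95)  # one-sided 95% normal quantile, about 1.645

# Hypothetical long positions: dollar exposure and return volatility per horizon.
exposure = [2_000_000, 1_000_000]
vol = [0.05, 0.12]
rho = 0.3  # assumed correlation between the two assets

# Individual VaR: each asset in isolation.
individual = [z * v * x for x, v in zip(exposure, vol)]

def portfolio_var(correlation):
    """Two-asset portfolio VaR for a given correlation."""
    s1, s2 = exposure[0] * vol[0], exposure[1] * vol[1]
    variance = s1**2 + s2**2 + 2 * correlation * s1 * s2
    return z * math.sqrt(variance)

diversified = portfolio_var(rho)   # reflects the diversification benefit
undiversified = sum(individual)    # no diversification benefit

print(f"diversified:   {diversified:,.0f}")
print(f"undiversified: {undiversified:,.0f}")
print(f"with rho = 1:  {portfolio_var(1.0):,.0f}")
```

Setting the correlation to one with both positions long reproduces the undiversified figure.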
Marginal VaR
The VaR that would be added for a unit increase in the investment in a particular asset
Incremental VaR
The VaR that would be added to the portfolio VaR if the given investment adjustments were made to the portfolio
Component VaR
A partition of the portfolio VaR that indicates how much the portfolio VaR would change (approximately) if the given asset were deleted from the portfolio
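Marginal, incremental, and component VaR can be compared in a two-asset sketch. It assumes zero-mean normal returns and hypothetical exposures, volatilities, and correlation. Marginal VaR is taken as the derivative of portfolio VaR with respect to a dollar of exposure, component VaR as marginal VaR times exposure, and incremental VaR as a full recalculation after a position change.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.95)

exposure = [2_000_000.0, 1_000_000.0]  # hypothetical dollar positions
vol = [0.05, 0.12]
rho = 0.3
cov = [[vol[0] ** 2, rho * vol[0] * vol[1]],
       [rho * vol[0] * vol[1], vol[1] ** 2]]

def portfolio_var(x):
    """Portfolio VaR for dollar exposures x under the assumed covariance."""
    variance = sum(x[i] * cov[i][j] * x[j] for i in range(2) for j in range(2))
    return z * math.sqrt(variance)

def marginal_var(x):
    """Change in portfolio VaR per extra dollar of exposure to each asset."""
    sigma_p = portfolio_var(x) / z
    return [z * sum(cov[i][j] * x[j] for j in range(2)) / sigma_p for i in range(2)]

total = portfolio_var(exposure)
marginal = marginal_var(exposure)

# Component VaR: marginal VaR times exposure; the components add up to the total.
component = [m * x for m, x in zip(marginal, exposure)]

# Incremental VaR: recompute after adding 100,000 to the second asset.
new_exposure = [exposure[0], exposure[1] + 100_000.0]
incremental = portfolio_var(new_exposure) - total

print(f"total VaR:       {total:,.0f}")
print(f"component VaRs:  {component[0]:,.0f} + {component[1]:,.0f}")
print(f"incremental VaR: {incremental:,.0f}")
print(f"marginal approx: {marginal[1] * 100_000.0:,.0f}")
```

For a small trade, marginal VaR times the trade size approximates the incremental VaR without a full recalculation.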
Purpose of VaR
- Identify the component (asset or BU or risk) that contributes most to the total risk
- Pick the best hedges
- Rank trades
- Select the asset/project/BU that provides the best risk-return tradeoff
How to calculate VaR
- Empirical: rank the results in the sample data and take the loss at the boundary of the worst (1-alpha) share
- Parametric: assume that the data follows a statistical distribution and use that distribution to calculate VaR
- Stochastic: apply the empirical method to the results of thousands of simulations
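The three approaches can be sketched side by side. The loss history is generated artificially, the parametric step assumes a normal distribution, and the empirical step uses one simple quantile convention; all of these are illustrative assumptions rather than a prescribed method.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

alpha = 0.95

def empirical_var(losses, alpha):
    """Smallest loss within the worst (1 - alpha) share of the sample."""
    ordered = sorted(losses, reverse=True)
    # round() guards against floating-point noise before taking the ceiling
    tail_size = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return ordered[tail_size - 1]

# Hypothetical loss history (positive numbers are losses).
random.seed(1)
history = [random.gauss(0.0, 10.0) for _ in range(500)]

# Empirical: read VaR directly from the ranked sample.
var_empirical = empirical_var(history, alpha)

# Parametric: fit a normal distribution and use its alpha-quantile.
mu, sigma = fmean(history), stdev(history)
var_parametric = NormalDist(mu, sigma).inv_cdf(alpha)

# Stochastic: simulate many losses from the fitted model, then apply
# the empirical method to the simulated results.
simulated = [random.gauss(mu, sigma) for _ in range(20_000)]
var_stochastic = empirical_var(simulated, alpha)

print(f"empirical:  {var_empirical:.2f}")
print(f"parametric: {var_parametric:.2f}")
print(f"stochastic: {var_stochastic:.2f}")
```

Simulation earns its cost when the model has no closed-form quantile; here the model is normal, so the stochastic figure simply converges toward the parametric one.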
Pros and cons of VaR
- Pro: easy to understand
- Con: not a coherent risk measure
- Con: doesn’t describe the tail of the distribution
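The lack of coherence can be illustrated with a standard textbook-style example of a subadditivity failure. The two bonds, their 4% default probability, and the loss of 100 are hypothetical, and the bonds are assumed independent. VaR is taken as the smallest loss l with P(L <= l) >= alpha.

```python
from itertools import product

def discrete_var(outcomes, alpha):
    """Smallest loss l with P(L <= l) >= alpha, for (loss, probability) pairs."""
    cumulative = 0.0
    for loss, prob in sorted(outcomes):
        cumulative += prob
        if cumulative >= alpha:
            return loss
    return max(loss for loss, _ in outcomes)

# Hypothetical bond: loses 100 with probability 4%, otherwise loses nothing.
bond = [(0, 0.96), (100, 0.04)]

# Loss distribution of a portfolio holding two independent such bonds.
combined = {}
for (l1, p1), (l2, p2) in product(bond, bond):
    combined[l1 + l2] = combined.get(l1 + l2, 0.0) + p1 * p2
portfolio = list(combined.items())

var_single = discrete_var(bond, 0.95)          # 0: default is rarer than 5%
var_portfolio = discrete_var(portfolio, 0.95)  # 100: at least one default is likelier than 5%

print(var_single, var_portfolio)
```

The portfolio VaR exceeds the sum of the two stand-alone VaRs, so diversification appears to add risk under this measure.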
Data quality
Refers to data’s “fitness for use.” The ability to fulfill the requirements of intended usage of data in a specific situation
Why is data quality important?
High data quality can be a competitive advantage. Poor data quality can:
1. reduce customer satisfaction
2. reduce employee satisfaction (causing high turnover)
3. breed organizational mistrust
4. make it difficult or impossible to accurately determine the financial position of the business
5. make it difficult or impossible to calculate premium income and reserve required
6. waste time and resources investigating and fixing data issues