Quantitative Methods Flashcards

(73 cards)

1
Q

Accuracy

A

Proportion of all predictions that were correct:
(TP+TN)/(TP+FP+TN+FN)

2
Q

Bias error

A

Degree to which a model fits the training data. An algorithm's erroneous assumptions produce high bias and a poor approximation of the data.
Causes underfitting and high in-sample error.

3
Q

Breusch-Godfrey

A

Tests for serial correlation in regression residuals, including at lags beyond the first. More general than Durbin-Watson (DW).

4
Q

Breusch-Pagan

A

One-tailed test.
Used for conditional heteroskedasticity. Null hypothesis of homoskedasticity. Tested by regressing the squared residuals on the original independent variables.
Test statistic is n × R^2, where R^2 comes from the regression of the squared residuals.
Under the null hypothesis it follows a chi-squared distribution with k degrees of freedom (k = number of independent variables).
Correction: Use robust standard errors (e.g., White-corrected standard errors) or Generalized Least Squares (GLS).
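
The test statistic can be sketched with NumPy alone. This is an illustrative sketch, not a library routine: the function name `breusch_pagan_stat` and the simulated data are assumptions made for the example.

```python
import numpy as np

def breusch_pagan_stat(X, y):
    """Return n * R^2 from regressing squared OLS residuals on X."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])                   # add an intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)           # original regression
    resid_sq = (y - Z @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(Z, resid_sq, rcond=None)   # auxiliary regression
    fitted = Z @ gamma
    r2 = 1 - np.sum((resid_sq - fitted) ** 2) / np.sum((resid_sq - resid_sq.mean()) ** 2)
    return n * r2

# Simulated data (hypothetical): the error spread grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=200)
y = 2 + 3 * x + rng.normal(scale=x, size=200)
bp = breusch_pagan_stat(x, y)
```

With one independent variable, `bp` would be compared with the chi-squared critical value for 1 degree of freedom (3.84 at the 5% level).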

5
Q

Covariance Stationary Conditions

A

Constant Mean: The expected value (mean) of the series is constant and finite for all periods: E(Y_t) = μ.
Constant Variance: The variance of the series is constant and finite for all periods: Var(Y_t) = E[(Y_t − μ)^2] = σ^2.
Constant Autocovariance: The covariance between values at any two time periods depends only on the distance (lag) between the periods, not on the specific point in time: Cov(Y_t, Y_{t−k}) = γ_k for any time t and lag k.

6
Q

Data Wrangling - Aggregation

A

Two or more variables combined into one

7
Q

Data Wrangling - Conversion

A

Converting variables into appropriate type for analysis

8
Q

Data Wrangling - Extraction

A

New variable created from an existing one, e.g. date of birth -> age

9
Q

Data Wrangling - Filtration

A

Data rows not needed must be identified and filtered (e.g. row with entry for non-US state)

10
Q

Data Wrangling - Selection

A

Data columns not needed must be removed (e.g. Name)

11
Q

Test for unit root to test whether data is covariance non-stationary

A

A series is covariance stationary if: 1) its expected value is constant and finite, 2) its variance is constant and finite, and 3) its covariance with itself at a fixed lag is constant and finite.
The Dickey-Fuller test for unit roots can be used to test whether the data are covariance non-stationary. The Durbin-Watson test detects serial correlation in the residuals of trend models but cannot be used in AR models. A t-test is used to test for residual autocorrelation in AR models.

12
Q

Document Frequency

A

Number of sentences containing the word / Total number of sentences
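
A minimal sketch of the ratio in plain Python. The helper name `document_frequency` and the sample sentences are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
def document_frequency(word, sentences):
    """Share of sentences that contain the word (case-insensitive)."""
    word = word.lower()
    hits = sum(1 for s in sentences if word in s.lower().split())
    return hits / len(sentences)

# Hypothetical sentences, for illustration only
sentences = ["Rates rose today", "Stocks fell", "Rates may fall"]
df_rates = document_frequency("rates", sentences)   # 2 of 3 sentences
```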

13
Q

Dummy variable misspecification

A

Using more than n-1 dummy variables to distinguish n classes causes perfect multicollinearity (the dummy variable trap)

14
Q

Duplication error

A

When there are duplicates of the data

15
Q

Durbin-Watson

A

Tests for serial correlation in the residuals of trend models but cannot be used in AR models. Tests whether ρ = 0 in
ϵ_t = ρϵ_{t−1} + u_t.
DW ≈ 2(1 − ρ), where ρ is the sample correlation between residuals from one period to the next
DW = 2: no serial correlation
DW < 2: positive serial correlation (DW approaches 0 as ρ approaches 1)
DW > 2: negative serial correlation (DW approaches 4 as ρ approaches −1)
When testing for positive serial correlation:
Between the lower and upper critical values: inconclusive
Above the upper critical value: fail to reject the null of no positive serial correlation
Below the lower critical value: reject the null
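
The statistic and its link to ρ can be checked numerically. A rough sketch, assuming simulated residuals with ρ = 0.7; the function name `durbin_watson` is chosen for this example.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Simulated residuals with positive serial correlation (hypothetical data)
rng = np.random.default_rng(1)
u = rng.normal(size=500)
resid = np.empty(500)
resid[0] = u[0]
for t in range(1, 500):
    resid[t] = 0.7 * resid[t - 1] + u[t]

dw = durbin_watson(resid)
rho = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # sample lag-1 correlation
# dw should sit close to 2 * (1 - rho), well below 2
```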

16
Q

F-statistic

A

MSR/MSE
Null hypothesis: all slope coefficients are simultaneously 0
Rejection if F>F(critical)

17
Q

F-statistic joint hypothesis

A

((SSE_R - SSE_U)/q) / (SSE_U/(n-k-1))
q is number of excluded variables in restricted model
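
A small worked example of the formula. The inputs are hypothetical, and `k` is taken to be the number of independent variables in the unrestricted model.

```python
def joint_f_stat(sse_restricted, sse_unrestricted, q, n, k):
    """F-statistic for jointly testing the q variables excluded from the restricted model."""
    numerator = (sse_restricted - sse_unrestricted) / q
    denominator = sse_unrestricted / (n - k - 1)
    return numerator / denominator

# Hypothetical inputs: (120 - 100) / 2 = 10 and 100 / (65 - 4 - 1) = 1.667
f_stat = joint_f_stat(sse_restricted=120.0, sse_unrestricted=100.0, q=2, n=65, k=4)
# f_stat is 6.0; compare with the critical F value for (q, n - k - 1) degrees of freedom
```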

18
Q

F1 score

A

(2 × P × R) / (P + R)
Harmonic mean of precision and recall

19
Q

FPR

A

FP/(TN+FP)

20
Q

How are qualitative independent variables incorporated into regression models?

A

Using dummy variables (binary 0 or 1 variables). To distinguish between n classes, use n-1 dummy variables.

21
Q

How can a time series with a unit root be transformed for analysis?

A

By first differencing: modeling the change in the variable (Y_t − Y_{t−1}) instead of the level.
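
As an illustration, first-differencing a simulated random walk recovers the underlying shocks. The data below are simulated purely for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
shocks = rng.normal(size=1000)
levels = np.cumsum(shocks)    # random walk: Y_t = Y_{t-1} + e_t
changes = np.diff(levels)     # first difference: Y_t - Y_{t-1}
# changes equals the shocks from the second observation onward,
# so the differenced series no longer has a unit root
```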

22
Q

How can overfitting be addressed?

A

Use validation samples, cross-validation (like K-fold), penalized regression (like LASSO), reduced model complexity, or ensemble methods (like Random Forests).

23
Q

How does Hierarchical clustering work?

A

Builds a hierarchy of clusters. Agglomerative (bottom-up) starts with individual points and merges clusters; Divisive (top-down) starts with one cluster and splits them.

24
Q

How does K-Means clustering work?

A

Partitions data into ‘k’ clusters by iteratively assigning observations to the nearest centroid and recalculating centroids until assignments stabilize.
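
The assign-and-recalculate loop can be sketched in a few lines of NumPy. This is a simplified illustration: seeding the centroids with the first k points is an assumption made to keep the example deterministic (real implementations usually randomize the start), and the data points are made up.

```python
import numpy as np

def k_means(points, k, max_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()   # assumption: start from the first k points
    labels = None
    for _ in range(max_iter):
        # Distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                   # assignments have stabilized
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:    # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two obvious groups (hypothetical data)
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
labels, centroids = k_means(points, k=2)
```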

25
How is model performance evaluated in classification?
Using metrics like Accuracy, Precision, Recall, F1-score, ROC curve, and AUC derived from a confusion matrix (TP, FP, TN, FN)
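
The confusion-matrix metrics in this deck can be computed together. A minimal sketch; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the standard confusion-matrix metrics as a dict."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # also the true positive rate (TPR)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (tn + fp),
    }

# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# accuracy = 85/100 = 0.85, precision = 40/50 = 0.8
```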
26
How is seasonality detected and corrected in time-series models?
Detected by testing for significant autocorrelation at seasonal lags. Corrected by adding a seasonal lag (another independent variable) to the AR model.
27
How is the Regression Sum of Squares (RSS) calculated?
1. Subtract the mean of the dependent variable from each predicted value. 2. Square each result. 3. Sum the squared results. Degrees of freedom = k.
28
How is the Sum of Squared Errors (SSE) calculated?
1. Subtract the predicted value from each observed value. 2. Square each result. 3. Sum the squared results. Degrees of freedom = n-k-1.
29
How is the Total Sum of Squares (SST) calculated?
1. Subtract the mean from each individual observation. 2. Square each result. 3. Sum the squared results. Degrees of freedom = n-1.
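
The three sums of squares fit together as SST = RSS + SSE. A short numerical sketch on made-up data with one independent variable (k = 1), which also derives MSR, MSE, and the F-statistic:

```python
import numpy as np

# Hypothetical observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n, k = len(y), 1

slope, intercept = np.polyfit(x, y, deg=1)   # simple OLS fit
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)       # total variation, df = n - 1
rss = np.sum((y_hat - y.mean()) ** 2)   # explained variation, df = k
sse = np.sum((y - y_hat) ** 2)          # unexplained variation, df = n - k - 1

msr = rss / k
mse = sse / (n - k - 1)
f_stat = msr / mse
r_squared = rss / sst
```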
30
Impact of conditional heteroskedasticity (underestimate)
No effect on coefficient estimates. Standard errors of the coefficients are underestimated, so t-statistics are inflated and Type I errors become more likely.
31
Impact of serial correlation
No impact on coefficient estimates. With positive serial correlation, standard errors are underestimated, leading to more Type I errors.
32
Inaccuracy Error
Where the data is not a measure of true value
33
Incompleteness Error
Data are not present, resulting in missing values. Missing values must be deleted or replaced with imputed data using the mean, median, or mode, or by assuming 0
34
Inconsistency Error
Data conflicts with corresponding data points or reality
35
Invalidity Error
Data outside of meaningful range, causing data to be invalid (e.g. date of birth in the future)
36
K-fold Cross Validation
The data (excluding the test sample and fresh data) are shuffled randomly and divided into k equal subsamples. k-1 subsamples form the training set and the kth is used as the validation sample. k is typically 5 or 10. The process is repeated k times, which helps minimize both bias and variance by ensuring each data point is used in the training set k-1 times and in the validation set once.
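
The shuffling and splitting can be sketched with index arrays. The helper `k_fold_indices` is a hypothetical name used for illustration; libraries such as scikit-learn provide their own splitters.

```python
import numpy as np

def k_fold_indices(n_obs, k, seed=0):
    """Yield (training, validation) index arrays for each of the k folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_obs)    # shuffle the observations
    folds = np.array_split(shuffled, k)  # k (near-)equal subsamples
    for i in range(k):
        validation = folds[i]
        training = np.concatenate([folds[j] for j in range(k) if j != i])
        yield training, validation

splits = list(k_fold_indices(n_obs=20, k=5))
# Each observation lands in the validation set once and in the training set k-1 times
```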
37
LASSO
Least absolute shrinkage and selection operator. A type of penalized regression in which the penalty term has the form lambda × sum(|regression coefficients|). In addition to minimizing the sum of squared residuals, LASSO minimizes the sum of the absolute values of the regression coefficients. It helps build more parsimonious models and is a regularization method applied in asset management.
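
The quantity LASSO minimizes can be written out directly. This sketch only evaluates the objective for given coefficients (it does not fit the model), and the function name and numbers are made up; leaving the intercept out of the penalty is a common convention assumed here.

```python
import numpy as np

def lasso_objective(X, y, intercept, coefs, lam):
    """Sum of squared residuals plus lam times the sum of absolute slope coefficients."""
    residuals = y - (intercept + X @ coefs)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(coefs))

# Hypothetical data and coefficients
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
coefs = np.array([1.0, -2.0])

unpenalized = lasso_objective(X, y, 0.0, coefs, lam=0.0)   # SSE only: 0 + 16 + 16 = 32
penalized = lasso_objective(X, y, 0.0, coefs, lam=0.5)     # 32 + 0.5 * (1 + 2) = 33.5
```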
38
MSE Formula
SSE / (n-k-1)
39
Name some supervised learning algorithms and their uses.
Penalized Regression (Regression, reduces overfitting), Support Vector Machine (SVM) (Classification), K-Nearest Neighbor (KNN) (Classification), CART (Classification/Regression), Random Forest (Classification/Regression, ensemble).
40
Name some unsupervised learning algorithms and their uses.
Principal Components Analysis (PCA) (Dimension Reduction), K-Means Clustering (Clustering), Hierarchical Clustering (Clustering).
41
Newey-West Standard Errors
Used to correct standard errors for serial correlation (they are also robust to conditional heteroskedasticity)
42
Non-uniformity error
Data not present in identical format
43
Normalised variable X
(X_i − X_min) / (X_max − X_min)
44
Precision
Share of predicted positives that are actually positive: TP/(TP+FP)
45
Recall
Share of actual positives that the model correctly identifies: TP/(TP + FN)
46
Standard error of autocorrelation
1/sqrt(T), where T is just the number of observations
47
Standardised variable X
(X_i − μ) / σ
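
Both scaling methods (normalisation and standardisation) can be sketched side by side. The function names are chosen for this example, and using the population standard deviation (NumPy's default) is an assumption; some sources use the sample standard deviation instead.

```python
import numpy as np

def normalize(x):
    """Rescale to the range [0, 1]: (X_i - X_min) / (X_max - X_min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Center and scale: (X_i - mean) / standard deviation (population, ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = [2.0, 4.0, 6.0, 8.0]       # hypothetical data
normalized = normalize(values)      # 0, 1/3, 2/3, 1
standardized = standardize(values)  # mean 0, standard deviation 1
```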
48
Steps in structured data-based ML models
1) conceptualization of the modeling task, 2) data collection, 3) data preparation and wrangling, 4) data exploration, 5) model training.
49
Steps in text-based ML models
1) text problem formulation 2) data (text) curation 3) text preparation and wrangling 4) text exploration 5) model training.
50
Steps to determine appropriate autoregressive model
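Estimate an AR model, compute the autocorrelations of the residuals, and test each for significance with a t-test (autocorrelation divided by its standard error, 1/sqrt(T)). If any are significant, add the corresponding lag and repeat until no significant residual autocorrelation remains.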
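
The residual autocorrelation check can be sketched as follows, using a standard error of 1/sqrt(T) for each autocorrelation. The function name and the toy residuals are assumptions for illustration.

```python
import numpy as np

def autocorrelation_t_stats(resid, max_lag):
    """Return (lag, autocorrelation, t-statistic) for lags 1..max_lag."""
    resid = np.asarray(resid, dtype=float)
    resid = resid - resid.mean()
    T = len(resid)
    denom = np.sum(resid ** 2)
    results = []
    for lag in range(1, max_lag + 1):
        rho = np.sum(resid[lag:] * resid[:-lag]) / denom
        results.append((lag, rho, rho / (1 / np.sqrt(T))))
    return results

# Toy residuals that alternate in sign (strong negative lag-1 autocorrelation)
toy_resid = [1.0, -1.0] * 50
stats = autocorrelation_t_stats(toy_resid, max_lag=2)
# A |t| above roughly 2 suggests the model is missing that lag
```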
51
Text cleansing
Remove HTML tags. Remove punctuation, but preserve some context (e.g. replace a period with an "end sentence" marker). Remove numbers or replace them with words (1 -> "one") or a generic "number" marker to help reduce noise. Remove excess whitespace.
52
TPR
TP/(TP+FN)
53
Trimming
Extreme values are removed (e.g. the top 5% and bottom 5%)
54
Variance error
Degree to which a model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.
55
What are common text wrangling techniques?
Tokenization, removing stop words, lowercasing, stemming, lemmatization, creating Bag-of-Words or N-grams.
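
Several of these steps can be sketched with the standard library alone. The stop-word list is a tiny made-up sample, and stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}   # tiny illustrative list

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def n_grams(tokens, n):
    """Join each run of n adjacent tokens into a single token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("The market is up and the market is volatile"))
bag_of_words = Counter(tokens)   # token counts, ignoring word order
bigrams = n_grams(tokens, 2)
```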
56
What are the assumptions underlying a multiple linear regression model?
1. Linearity between dependent and independent variables. 2. No significant multicollinearity. 3. Expected error is 0. 4. Homoscedasticity (constant error variance). 5. No serial correlation (errors are independent). 6. Errors are normally distributed.
57
What are the key steps in a data analysis project?
Conceptualize model, Collect data, Prepare & Wrangle data (cleanse, transform, scale), Explore data (EDA, feature selection/engineering), Train model, Evaluate model.
58
What are the main types of machine learning?
Supervised learning (labeled data), Unsupervised learning (unlabeled data), Deep learning (neural networks with many layers), Reinforcement learning (agent learns through rewards/penalties).
59
What are the principles of good model specification?
Grounded in economic reasoning, appropriate functional form, essential variables only, no violation of assumptions, tested out of sample.
60
What do AIC and BIC represent, and how are they used?
AIC = n×ln(SSE/n) + 2(k+1); BIC = n×ln(SSE/n) + ln(n)(k+1). A lower AIC or BIC indicates a better-fitting model. Both penalize adding variables, BIC more heavily. AIC is better for prediction, BIC for goodness of fit. The formulas differ only in the penalty multiplier: 2 for AIC, ln(n) for BIC. Mnemonic: "acho" -> A = 2; "B lyin" -> B = ln(n).
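
The two criteria translate directly into code. A minimal sketch with hypothetical inputs, chosen so that ln(SSE/n) = 0 and only the penalty terms remain:

```python
import math

def aic(sse, n, k):
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(sse, n, k):
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Hypothetical model: n = 100 observations, k = 3 independent variables, SSE = 100
aic_value = aic(sse=100.0, n=100, k=3)   # 0 + 2 * 4 = 8
bic_value = bic(sse=100.0, n=100, k=3)   # 0 + ln(100) * 4
```

Because ln(n) exceeds 2 once n is at least 8, BIC applies the heavier penalty in most practical samples.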
61
What happens to the coefficients of correlated independent variables when a new correlated variable is added to the model?
Adding the new variable will change the coefficient for the other correlated variables.
62
What is a random walk process?
A time series where the predicted value in one period is equal to the value in the previous period (b1 = 1). Not covariance stationary. If b0 ≠ 0, it's a random walk with drift.
63
What is a unit root, and how is it tested?
Occurs in an AR model when b1 = 1, leading to a non-stationary (random walk) process. Tested using the Dickey-Fuller test, which tests if (b1 −1)=0.
64
What is Autoregressive Conditional Heteroskedasticity (ARCH)?
When the variance of the error term in one period depends on the variance of the error term in a previous period. Tested by regressing squared residuals on lagged squared residuals: ϵ_t^2 = a_0 + a_1 ϵ_{t−1}^2 + μ_t. ARCH is present if a_1 is statistically significant.
65
What is Conditional Heteroskedasticity and how does it affect statistical inference?
Error variance changes systematically with the independent variable. It can lead to underestimated standard errors (Type I errors) or overestimated standard errors (Type II errors). Detected using residual plots or Breusch-Pagan test. Corrected using White-corrected standard errors.
66
What is Logistic Regression used for?
Used for qualitative dependent variables, modeling the probability of an event happening (between 0 and 1). The dependent variable is the log odds: ln(P/(1−P)).
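
The log odds, ln(P/(1−P)), and the reverse mapping back to a probability can be sketched as a pair of functions. The names are chosen for this example.

```python
import math

def log_odds(p):
    """Logit: natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def probability(z):
    """Inverse of the logit: maps any real number to a value between 0 and 1."""
    return 1 / (1 + math.exp(-z))

even_odds = log_odds(0.5)                 # ln(1) = 0
round_trip = probability(log_odds(0.8))   # recovers 0.8
```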
67
What is mean reversion in a time series?
A tendency for the series to move back towards its long-term average. Occurs in AR(1) models when |b1| < 1.
68
What is Multicollinearity and how does it affect regression analysis?
Significant correlation exists between two or more independent variables. Causes unreliable coefficient estimates and inflated standard errors, leading to insignificant t-statistics (Type II errors). Detected by high pairwise correlations, insignificant t-tests with a significant F-test, or VIF > 10. VIF = 1 / (1 − R^2), where R^2 comes from regressing one independent variable on all the others. Corrected by excluding problematic variables.
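
The VIF calculation can be sketched with NumPy by running the auxiliary regression for one column at a time. The function name `vif` and the simulated data are assumptions for illustration.

```python
import numpy as np

def vif(X, j):
    """VIF for column j: regress it on the other columns and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# Simulated data: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

vif_x1 = vif(X, 0)   # large, because x1 and x2 are highly correlated
vif_x3 = vif(X, 2)   # close to 1
```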
69
What is overfitting in machine learning?
When a model learns the training data too well, including noise, resulting in poor performance on new, unseen data (high variance error). It often occurs with complex models or insufficient data.
70
What is Serial Correlation and how does it affect statistical inference?
Error terms are not independently distributed. Positive serial correlation leads to underestimated standard errors and overestimated t- and F-statistics (Type I errors). Detected using residual plots, the Durbin-Watson test, or the Breusch-Godfrey test. Corrected using Newey-West standard errors.
71
What is the purpose of Principal Component Analysis (PCA)?
To reduce dimensionality by summarizing correlated features into a smaller set of uncorrelated factors called principal components (eigenvectors).
72
White-corrected standard errors
Used to correct for conditional heteroskedasticity
73
Winsorisation
Extreme values beyond a cutoff are replaced with the highest (or lowest) non-outlier value
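
Winsorisation and trimming can be contrasted in a short sketch. Using the 5th and 95th percentiles as cutoffs is an assumption for this example, and clipping to the percentile values is a common approximation of replacing outliers with the nearest non-outlier value.

```python
import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    """Replace values beyond the percentile cutoffs with the cutoff values."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def trim(x, lower_pct=5, upper_pct=95):
    """Drop values beyond the percentile cutoffs."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return x[(x >= lo) & (x <= hi)]

data = np.arange(1, 101)       # hypothetical data: 1, 2, ..., 100
winsorized = winsorize(data)   # same length, extremes pulled in
trimmed = trim(data)           # shorter, extremes removed
```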