Quantitative Methods Flashcards

Question

How is model performance evaluated in classification?

Answer 1

Using metrics like Accuracy, Precision, Recall, F1-score, ROC curve, and AUC derived from a confusion matrix (TP, FP, TN, FN)
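
As a minimal sketch of how these metrics follow from the four confusion-matrix counts (the function name and the example counts are hypothetical, chosen only for illustration):

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not taken from any real model.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

ROC and AUC are left out here because they need predicted probabilities at several thresholds, not a single confusion matrix.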

Answer 2

Detected by testing for significant autocorrelation at seasonal lags. Corrected by adding a seasonal lag (another independent variable) to the AR model.

Answer 3

1. Subtract the mean from each predicted observation. 2. Square each result. 3. Sum the squared results. Degrees of freedom = k.

Answer 4

1. Subtract the predicted value from each observed value. 2. Square each result. 3. Sum the squared results. Degrees of freedom = n-k-1.

Answer 5

1. Subtract the mean from each individual observation. 2. Square each result. 3. Sum the squared results. Degrees of freedom = n-1.
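
Assuming Answers 3 to 5 describe the regression, error, and total sums of squares, the three recipes can be sketched together for a single-regressor model (k = 1). The names `rss`, `sse`, `sst` and the data are illustrative assumptions:

```python
def anova_sums(x, y):
    """Fit y = b0 + b1*x by OLS and return (rss, sse, sst)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    rss = sum((yh - y_bar) ** 2 for yh in y_hat)            # predicted minus mean, df = k
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # observed minus predicted, df = n-k-1
    sst = sum((yi - y_bar) ** 2 for yi in y)                # observed minus mean, df = n-1
    return rss, sse, sst

# Made-up data for illustration.
rss, sse, sst = anova_sums([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
mse = sse / (5 - 1 - 1)  # SSE / (n - k - 1)
```

With an intercept in the model, the regression and error sums of squares add up to the total, which is a quick sanity check on any implementation.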

Answer 6

No effect on the coefficient estimate. Standard error of the coefficient is overestimated. More Type II errors.

Answer 7

No impact on the coefficient estimate. Standard error is underestimated. More Type I errors.

Answer 8

Where the data is not a measure of true value

Answer 9

Where data is not present, resulting in missing values. Missing values must be deleted or replaced with imputed data, using the mean, median, or mode, or assuming 0.

Answer 10

Data conflicts with corresponding data points or reality

Answer 11

Data outside of meaningful range, causing data to be invalid (e.g. date of birth in the future)

Answer 12

The data (excluding the test sample and fresh data) is shuffled randomly and divided into k equal subsamples. k-1 subsamples form the training set and the kth is used as a validation sample. k is typically 5 or 10. The process is repeated k times, which helps minimize both bias and variance by ensuring each data point is used in the training set k-1 times and in the validation set once.
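
The procedure above can be sketched in plain Python. This is only an illustration: the function name, the `seed` parameter, and the choice of 10 observations are assumptions, and it presumes the number of observations is divisible by k so the subsamples are equal.

```python
import random

def k_fold_splits(n_obs, k, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)        # shuffle randomly
    folds = [indices[i::k] for i in range(k)]   # k subsamples
    for i in range(k):
        validation = folds[i]                   # the held-out subsample
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_splits(n_obs=10, k=5))
```

Each index lands in exactly one validation fold and in the training set of the other k-1 folds.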

Answer 13

Least absolute shrinkage and selection operator. A type of penalized regression in which the penalty term has the form lambda x sum(|regression coefficients|). In addition to minimizing the sum of squared residuals, LASSO minimizes the sum of the absolute values of the regression coefficients. It helps build more parsimonious models and is a regularization method applied in asset management.
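
A minimal sketch of the quantity LASSO minimizes, assuming the penalty is applied to the slope coefficients only (the function name and inputs are hypothetical):

```python
def lasso_objective(residuals, coefficients, lam):
    """Sum of squared residuals plus lambda times the sum of |coefficients|."""
    sse = sum(e ** 2 for e in residuals)
    penalty = lam * sum(abs(b) for b in coefficients)
    return sse + penalty

# Illustrative numbers only.
value = lasso_objective(residuals=[1.0, -2.0], coefficients=[0.5, -1.5], lam=0.1)
```

Because the penalty grows with every nonzero coefficient, a variable stays in the model only if it reduces the squared residuals by more than it adds to the penalty.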

Answer 14

SSE / (n-k-1)

Answer 15

Penalized Regression (Regression, reduces overfitting), Support Vector Machine (SVM) (Classification), K-Nearest Neighbor (KNN) (Classification), CART (Classification/Regression), Random Forest (Classification/Regression, ensemble).

Answer 16

Principal Components Analysis (PCA) (Dimension Reduction), K-Means Clustering (Clustering), Hierarchical Clustering (Clustering).

Answer 17

Used to correct positive serial correlation

Answer 18

Data not present in identical format

Answer 19

(Xi - Xmin) / (Xmax - Xmin)
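
As a short sketch, assuming this card is the min-max normalization formula, which rescales every value into the range 0 to 1 (the function name is an assumption):

```python
def normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = normalize([10, 20, 30])
```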

Answer 20

How precise the positive predictions are: TP/(TP+FP)

Answer 21

The share of actual positives the model "remembers" (correctly identifies): TP/(TP + FN)

Answer 22

1/sqrt(T), where T is just the number of observations
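
Assuming this card gives the standard error of a residual autocorrelation, a t-statistic for one autocorrelation can be sketched as follows (the function name and inputs are hypothetical):

```python
import math

def autocorr_t_stat(rho, n_obs):
    """t-statistic of an autocorrelation using standard error 1/sqrt(T)."""
    std_err = 1 / math.sqrt(n_obs)
    return rho / std_err

t_stat = autocorr_t_stat(rho=0.2, n_obs=100)
```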

Answer 23

(Xi - mu) / sd
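
A sketch of this standardization, assuming the sample standard deviation is intended (the card does not say which; swap in `statistics.pstdev` for the population version):

```python
import statistics

def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mu) / sd for v in values]

z_scores = standardize([2, 4, 6])
```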

Answer 24

1) conceptualization of the modeling task, 2) data collection, 3) data preparation and wrangling, 4) data exploration, 5) model training.

Answer 25

1) text problem formulation 2) data (text) curation 3) text preparation and wrangling 4) text exploration 5) model training.

Answer 26

Add lags one at a time until the serial correlation is removed, testing the autocorrelations of the residuals after each addition.

Answer 27

Remove HTML tags. Remove punctuation, but preserve some context, e.g. "end sentence" instead of a period. Remove numbers or replace them with words (1 -> "one" or "number") to help reduce noise. Remove excess whitespace.
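
These cleansing steps can be sketched with regular expressions. The token spellings `endSentence` and `number` and the function name are assumptions made for this illustration:

```python
import re

def cleanse(text):
    """Apply the four cleansing steps to a raw text string."""
    text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
    text = re.sub(r"\b\d+(\.\d+)?\b", " number ", text)   # replace numbers with a token
    text = text.replace(".", " endSentence ")             # keep sentence boundaries
    text = re.sub(r"[^\w\s]", " ", text)                  # remove remaining punctuation
    return re.sub(r"\s+", " ", text).strip()              # remove excess whitespace

cleaned = cleanse("<p>Sales rose 5% this year.</p>")
```

Numbers are replaced before periods so that a decimal point is not mistaken for the end of a sentence.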

Answer 28

TP/(TP+FN)

Answer 29

When extreme values are removed (e.g. the top 5% and bottom 5%)
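
A minimal sketch of this kind of trimming, assuming the same percentage is cut from each tail (the function name and `percent` parameter are hypothetical):

```python
def trim(values, percent=5):
    """Drop the lowest and highest `percent` percent of observations."""
    ordered = sorted(values)
    cut = len(ordered) * percent // 100   # observations to drop per tail
    return ordered[cut:len(ordered) - cut]

trimmed = trim(list(range(1, 21)), percent=5)
```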

Answer 30

Degree to which model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance causing overfitting and high out of sample error

Answer 31

Tokenization, removing stop words, lowercasing, stemming, lemmatization, creating Bag-of-Words or N-grams.
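
Some of these steps can be sketched in plain Python. The stop-word list is a tiny made-up sample, and stemming and lemmatization are omitted because they need a language resource beyond the standard library:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}  # illustrative, far from complete

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def bag_of_words(text):
    """Count how often each remaining token occurs, ignoring word order."""
    return Counter(tokenize(text))

def n_grams(tokens, n=2):
    """Join each run of n adjacent tokens into a single token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bow = bag_of_words("The market is up and the market is volatile")
bigrams = n_grams(["market", "is", "up"])
```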

Answer 32

1. Linearity between dependent and independent variables. 2. No significant multicollinearity. 3. Expected error is 0. 4. Homoscedasticity (constant error variance). 5. No serial correlation (errors are independent). 6. Errors are normally distributed.

Answer 33

Conceptualize model, Collect data, Prepare & Wrangle data (cleanse, transform, scale), Explore data (EDA, feature selection/engineering), Train model, Evaluate model.

Answer 34

Supervised learning (labeled data), Unsupervised learning (unlabeled data), Deep learning (neural networks with many layers), Reinforcement learning (agent learns through rewards/penalties).

Answer 35

Grounded in economic reasoning, appropriate functional form, essential variables only, no violation of assumptions, tested out of sample.

Answer 36

A lower AIC (n×ln(SSE/n) + 2(k+1)) or BIC (n×ln(SSE/n) + ln(n)(k+1)) indicates a better-fitting model. Both penalize adding variables, BIC more heavily. AIC is better for prediction, BIC for goodness of fit. The two differ only in the multiplier of the second term: 2 for AIC, ln(n) for BIC. Mnemonic: acho -> A = 2; B lyin -> B = ln(n).
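
The two formulas can be sketched directly; the function names and the example inputs are illustrative assumptions:

```python
import math

def aic(n, k, sse):
    """Akaike information criterion: n*ln(SSE/n) + 2(k+1)."""
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(n, k, sse):
    """Bayesian information criterion: n*ln(SSE/n) + ln(n)(k+1)."""
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Made-up regression summary: 100 observations, 3 regressors, SSE of 50.
aic_value = aic(n=100, k=3, sse=50.0)
bic_value = bic(n=100, k=3, sse=50.0)
```

Since ln(n) exceeds 2 once n is 8 or more, BIC comes out larger than AIC for the same model in most practical samples.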

Answer 37

Adding the new variable will change the coefficient for the other correlated variables.

Answer 38

A time series where the predicted value in one period is equal to the value in the previous period (b1 = 1). Not covariance stationary. If b0 ≠ 0, it's a random walk with drift.

Answer 39

Occurs in an AR model when b1 = 1, leading to a non-stationary (random walk) process. Tested using the Dickey-Fuller test, which tests if (b1 −1)=0.

Answer 40

When the variance of the error term in one period depends on the variance in a previous period. Tested by regressing squared residuals on lagged squared residuals: $\epsilon_t^2 = a_0 + a_1 \epsilon_{t-1}^2 + \mu_t$.
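
A rough sketch of that test regression, estimating the intercept and slope by ordinary least squares. The function name is hypothetical, and the significance test on the slope is left out:

```python
def arch1_regression(residuals):
    """Regress squared residuals on their first lag; return (a0, a1)."""
    squared = [e ** 2 for e in residuals]
    y, x = squared[1:], squared[:-1]      # current and lagged squared residuals
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    a1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    a0 = y_bar - a1 * x_bar
    return a0, a1

# Constructed so the squared residuals are 4, 3, 2.5, 2.25 (illustration only).
a0, a1 = arch1_regression([2.0, 3 ** 0.5, 2.5 ** 0.5, 1.5])
```

A slope that is statistically different from zero is the evidence of ARCH effects.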

Answer 41

Error variance changes systematically with the independent variable. It can lead to underestimated standard errors (Type I errors) or overestimated standard errors (Type II errors). Detected using residual plots or Breusch-Pagan test. Corrected using White-corrected standard errors.

Answer 42

Used for qualitative dependent variables, modeling the probability of an event happening (between 0 and 1). The dependent variable is the log odds: ln(P/(1−P)).
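
As a small sketch using the standard definition of log odds, ln(P/(1−P)), the mapping between a probability and the fitted value of such a model runs in both directions (function names are assumptions):

```python
import math

def log_odds(p):
    """Convert a probability in (0, 1) to log odds."""
    return math.log(p / (1 - p))

def probability(log_odds_value):
    """Invert the log odds back to a probability."""
    return 1 / (1 + math.exp(-log_odds_value))

round_trip = probability(log_odds(0.8))
```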

Answer 43

A tendency for the series to move back towards its long-term average. Occurs in AR models when $|b_1| < 1$.

Answer 44

Significant correlation exists between two or more independent variables. Causes unreliable coefficient estimates and inflated standard errors, leading to insignificant t-statistics (Type II errors). Detected by high pairwise correlations, insignificant t-tests with a significant F-test, or VIF > 10, where VIF = 1 / (1 − R^2) and R^2 comes from regressing one independent variable on all the others. Corrected by excluding problematic variables.
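
The VIF formula is a one-liner once the R^2 of the auxiliary regression is known. The function name and the example value are illustrative:

```python
def vif(r_squared):
    """Variance inflation factor from the R^2 of one regressor on the others."""
    return 1 / (1 - r_squared)

# An auxiliary R^2 of 0.9 sits right at the VIF = 10 threshold.
vif_value = vif(0.9)
```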

Answer 45

When a model learns the training data too well, including noise, resulting in poor performance on new, unseen data (high variance error). It often occurs with complex models or insufficient data.

Answer 46

Error terms are not independently distributed. Positive serial correlation leads to underestimated standard errors and overestimated t- and F-statistics (Type I errors). Detected using residual plots, the Durbin-Watson test, or the Breusch-Godfrey test. Corrected using Newey-West standard errors.
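
As a sketch of the Durbin-Watson statistic mentioned above, using its usual definition (sum of squared changes in residuals over sum of squared residuals); the function name and the residual series are made up:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 means no first-order serial correlation."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # residuals flip sign each period
dw_persistent = durbin_watson([1.0, 1.0, 1.0, 1.0])     # residuals never change
```

Values well below 2 point to positive serial correlation and values well above 2 to negative serial correlation.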

Answer 47

To reduce dimensionality by summarizing correlated features into a smaller set of uncorrelated factors called principal components (eigenvectors).

Answer 48

Used to correct for conditional heteroskedasticity

Answer 49

Values beyond a cutoff are replaced with the highest/lowest non-outlier value
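
A minimal sketch of this replacement rule, assuming explicit lower and upper cutoffs are supplied (the function name, parameters, and data are hypothetical):

```python
def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest non-outlier value."""
    inside = [v for v in values if lower <= v <= upper]
    lo, hi = min(inside), max(inside)     # lowest and highest non-outlier values
    return [min(max(v, lo), hi) for v in values]

winsorized = winsorize([1, 5, 6, 7, 100], lower=2, upper=10)
```

Unlike trimming, the sample size is unchanged because outliers are replaced, not dropped.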

(73 cards)