Quantitative Methods Flashcards
(73 cards)
Accuracy
Percentage of predictions, across all classes, that were correct
(TP+TN)/(TP+FP+TN+FN)
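The formula above as a tiny Python helper (the confusion-matrix counts are hypothetical):

```python
# Accuracy: share of all predictions (positive and negative) that were correct.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts: 45 + 40 = 85 correct out of 100 predictions.
print(accuracy(45, 40, 5, 10))
```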
Bias error
Degree to which a model fits the training data. An algorithm's erroneous assumptions produce high bias and a poor approximation of the underlying relationship
Causes underfitting and high in-sample error
Breusch-Godfrey
Test for serial correlation in regression residuals. More general than Durbin-Watson: it can detect higher-order serial correlation, not just first-order
Breusch-Pagan
One tailed test
Used for conditional heteroskedasticity. Null hypothesis of homoskedasticity. Tested by regressing the squared residuals against the original independent variables.
Test statistic is n x R_squared_residuals.
Under null hypothesis follows a chi-squared distribution
Correction: Use robust standard errors (e.g., White-corrected standard errors) or Generalized Least Squares (GLS).
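The steps on this card can be sketched in NumPy (a minimal illustration, not statsmodels; the data layout is assumed: X holds the original independent variables without a constant column):

```python
import numpy as np

def breusch_pagan_stat(X, resid):
    """Breusch-Pagan statistic: n * R^2 from regressing squared residuals on X.

    Under the null of homoskedasticity, the statistic follows a chi-squared
    distribution with k degrees of freedom (k = number of regressors in X).
    """
    n = len(resid)
    Z = np.column_stack([np.ones(n), X])          # add an intercept
    u2 = resid ** 2                               # squared residuals
    beta, *_ = np.linalg.lstsq(Z, u2, rcond=None)  # auxiliary regression
    fitted = Z @ beta
    ss_res = np.sum((u2 - fitted) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return n * r2
```

Since R² lies in [0, 1], the statistic lies in [0, n]; large values reject homoskedasticity.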
Covariance Stationary Conditions
Constant Mean: The expected value (mean) of the series is constant and finite for all periods: E(Y_t) = μ.
Constant Variance: The variance of the series is constant and finite for all periods: Var(Y_t) = E[(Y_t − μ)²] = σ².
Constant Autocovariance: The covariance between values at any two time periods depends only on the distance (lag) between the periods, not on the specific point in time: Cov(Y_t, Y_{t−k}) = γ_k for any time t and lag k.
Data Wrangling - Aggregation
Two or more variables combined into one
Data Wrangling - Conversion
Converting variables into appropriate type for analysis
Data Wrangling - Extraction
New variable created from an existing one, e.g. date of birth -> age
Data Wrangling - Filtration
Data rows not needed must be identified and filtered (e.g. row with entry for non-US state)
Data Wrangling - Selection
Data columns not needed must be removed (e.g. Name)
Test for a unit root to determine whether data is covariance non-stationary
Check if: 1) Expected value is constant and finite. 2) Variance is constant and finite. 3) Covariance with itself for a fixed lag is constant and finite. Use the Dickey-Fuller test.
The Dickey-Fuller test for unit roots can be used to test whether the data is covariance non-stationary. The Durbin-Watson test is used for detecting serial correlation in the residuals of trend models but cannot be used in AR models. A t-test is used to test for residual autocorrelation in AR models.
Document Frequency
Number of sentences containing the word / Total number of sentences
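A quick Python sketch of this ratio, treating each sentence as a document (the sample sentences are made up):

```python
def document_frequency(word, sentences):
    """Fraction of sentences (documents) that contain the word."""
    containing = sum(1 for s in sentences if word in s.lower().split())
    return containing / len(sentences)

sents = ["The cat sat", "the dog ran", "a cat slept"]
print(document_frequency("cat", sents))  # appears in 2 of 3 sentences
```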
Dummy variable misspecification
If we use too many dummy variables (e.g. n dummies for n classes instead of n−1), perfect multicollinearity results (the dummy variable trap)
Duplication error
When there are duplicates of the data
Durbin-Watson
Test for serial correlation in residuals of trend models but cannot be used in AR models. Tests whether
ϵ_t = ρϵ_{t−1} + u_t.
DW ≈ 2(1 − ρ), where ρ is the sample correlation between residuals from one period to the next
DW = 2: no serial correlation
DW < 2: positive serial correlation (approaches 0 as ρ → 1)
DW > 2: negative serial correlation (approaches 4 as ρ → −1)
If the statistic falls between the lower and upper critical limits, the test is inconclusive
Above the upper limit: fail to reject the null of no serial correlation
Below the lower limit: reject the null
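The statistic on this card can be computed directly from a residual series (a minimal NumPy sketch; the residuals in the examples are hypothetical):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), approximately 2(1 - rho)."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Perfectly alternating residuals (rho near -1) push DW toward 4;
# perfectly persistent residuals (rho near +1) push DW toward 0.
print(durbin_watson([1.0, -1.0] * 50))
print(durbin_watson([1.0] * 10))
```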
F-statistic
MSR/MSE
Null hypothesis: all slope coefficients are simultaneously equal to 0
Rejection if F>F(critical)
F-statistic joint hypothesis
((SSE_R - SSE_U)/q) / (SSE_U/(n-k-1))
q is the number of restrictions, i.e. the number of variables excluded in the restricted model
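The joint-hypothesis F-statistic as a one-line helper (the SSE values in the example are made-up numbers, not from a real regression):

```python
def partial_f(sse_r, sse_u, q, n, k):
    """F = ((SSE_R - SSE_U) / q) / (SSE_U / (n - k - 1)).

    sse_r: SSE of the restricted model; sse_u: SSE of the unrestricted model;
    q: number of restrictions; k: slope coefficients in the unrestricted model.
    """
    return ((sse_r - sse_u) / q) / (sse_u / (n - k - 1))

# Hypothetical values: SSE_R=100, SSE_U=80, q=2, n=30, k=4.
print(partial_f(100.0, 80.0, 2, 30, 4))
```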
F1 score
(2 × P × R) / (P + R)
Harmonic mean of precision and recall
FPR
FP/(TN+FP)
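The two formulas above side by side, from hypothetical confusion-matrix counts:

```python
def f1_and_fpr(tp, tn, fp, fn):
    p = tp / (tp + fp)            # precision
    r = tp / (tp + fn)            # recall (true positive rate)
    f1 = 2 * p * r / (p + r)      # harmonic mean of precision and recall
    fpr = fp / (tn + fp)          # false positive rate
    return f1, fpr

# Hypothetical counts: precision = recall = 0.8, so F1 = 0.8; FPR = 2/10 = 0.2.
print(f1_and_fpr(8, 8, 2, 2))
```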
How are qualitative independent variables incorporated into regression models?
Using dummy variables (binary 0 or 1 variables). To distinguish between n classes, use n-1 dummy variables.
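A dependency-free sketch of the n−1 encoding (the category labels are invented; the first category, in sorted order, serves as the omitted base class):

```python
def make_dummies(values):
    """Encode n classes as n-1 dummy columns, dropping the first as the base."""
    cats = sorted(set(values))
    base, rest = cats[0], cats[1:]        # base class gets all-zero dummies
    return [[1 if v == c else 0 for c in rest] for v in values]

# Categories sort to ["large", "medium", "small"]; "large" is the base class.
print(make_dummies(["small", "large", "medium"]))
```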
How can a time series with a unit root be transformed for analysis?
By first differencing: modeling the change in the variable (Y_t − Y_{t−1}) instead of the level.
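First differencing in NumPy (the series below is a made-up random walk, which has a unit root; its differences are the stationary shocks):

```python
import numpy as np

# A random walk y_t = y_{t-1} + u_t built from hypothetical shocks u_t.
shocks = np.array([1.0, 0.5, -0.2, 0.3, 0.1])
y = np.cumsum(shocks)          # the level series (unit root)
dy = np.diff(y)                # first differences y_t - y_{t-1}: just the shocks
print(dy)
```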
How can overfitting be addressed?
Use validation samples, cross-validation (like K-fold), penalized regression (like LASSO), reducing model complexity, or using ensemble methods (like Random Forests).
How does Hierarchical clustering work?
Builds a hierarchy of clusters. Agglomerative (bottom-up) starts with individual points and merges clusters; Divisive (top-down) starts with one cluster and splits them.
How does K-Means clustering work?
Partitions data into ‘k’ clusters by iteratively assigning observations to the nearest centroid and recalculating centroids until assignments stabilize.
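The assign-then-recalculate loop described above, as a minimal NumPy sketch (random initialization from the data points; not a production implementation):

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Partition rows of X into k clusters by iterating assignment/update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each observation to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # assignments have stabilized
            break
        centroids = new
    return labels, centroids

# Two well-separated hypothetical clusters.
X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])
print(k_means(X, 2)[0])
```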