Quantitative Analysis Flashcards

Question 1

Q

Token
tokenization

Answer

A

Token: a word
Tokenization: splitting a sentence (or any text) into tokens

Question 2

Q

Document term matrix

Answer

A

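A matrix with one row per document and one column per token, where each cell holds the token count. It converts unstructured text into structured data.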
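
As a rough illustration of how tokenization feeds a document term matrix, here is a minimal sketch using only the standard library. The helper names (`tokenize`, `document_term_matrix`) and the sample sentences are made up for this example, and the whitespace split is a simplifying assumption; real pipelines also strip punctuation, stop words, and so on.

```python
from collections import Counter

def tokenize(sentence):
    """Split a sentence into lowercase word tokens (whitespace split only)."""
    return sentence.lower().split()

def document_term_matrix(documents):
    """Return (vocabulary, rows): one row of token counts per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing tokens count as 0
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows

docs = ["rates rise", "rates fall and rates rise"]  # hypothetical documents
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['and', 'fall', 'rates', 'rise']
print(dtm)    # [[0, 0, 1, 1], [1, 1, 2, 1]]
```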

Question 3

Q

5 steps of data analysis

Answer

A

Conceptualization of modeling task
Data collection
Data preparation and wrangling
Data exploration
Model training

Question 4

Q

Errors reduced by data cleansing

Answer

A

Missing, invalid, non-uniform and inaccurate

Question 5

Q

Data Normalization and Standardization
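
A minimal sketch of the two usual scaling formulas, assuming the common definitions: normalization rescales to the range 0 to 1 using (x - min) / (max - min), and standardization centers and scales using (x - mean) / standard deviation. The function names and data are illustrative, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling: (x - min) / (max - min), result lies in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Z-score scaling: (x - mean) / std (population std assumed here)."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

data = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
print(normalize(data))
print(standardize(data))
```

Normalization is sensitive to outliers because it depends on the minimum and maximum; standardization is less so but is most natural when the variable is roughly normally distributed.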

Question 6

Q

Parsimonious model

Answer

A

Parsimonious models are simple models with great explanatory and predictive power. They explain the data with a minimum number of parameters (predictor variables).

Question 7

Q

Techniques of feature engineering

Answer

A

Numbers - numbers are replaced by a token reflecting their length; e.g., four-digit numbers, usually associated with years, are assigned the token number4

N-grams - multiword patterns, e.g., expansionary_monetary_policy

Named entity recognition (NER) - e.g., Microsoft > ORG

Parts of speech (POS) - e.g., Microsoft > proper noun, 1969 > cardinal number
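
The first two techniques can be sketched in a few lines. This is only an illustration: the helper names and the sample tokens are hypothetical, and the placeholder format `number4` simply follows the card above.

```python
def replace_numbers(tokens):
    """Replace purely numeric tokens with a length placeholder ('1969' -> 'number4')."""
    return [f"number{len(t)}" if t.isdigit() else t for t in tokens]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["expansionary", "monetary", "policy", "in", "1969"]
print(replace_numbers(tokens))  # last token becomes 'number4'
print(ngrams(tokens[:3], 3))    # ['expansionary_monetary_policy']
```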

Question 8

Q

Feature selection methods

Answer

A

Frequency - number of documents with that token divided by total number of documents (document frequency DF)

Chi-square - rank tokens by usefulness to a class

Mutual information (MI) - if a token appears in all classes, it is not a useful discriminant and its MI equals 0.
Tokens associated with only one or a few classes have an MI approaching 1.
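
The document frequency measure above is easy to compute directly. A minimal sketch, assuming whitespace tokenization and a hypothetical four-document corpus:

```python
def document_frequency(token, documents):
    """Share of documents that contain the token at least once."""
    containing = sum(1 for doc in documents if token in doc.lower().split())
    return containing / len(documents)

docs = ["rates rise", "rates fall", "earnings rise", "rates steady"]
print(document_frequency("rates", docs))  # 0.75
print(document_frequency("fall", docs))   # 0.25
```

Tokens with very high DF tend to behave like stop words, and tokens with very low DF are often noise, so both ends are candidates for removal.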

Question 9

Q

Steps of data exploration

Answer

A

1 exploratory data analysis

2 feature selection

3 feature engineering
One-hot-encoding (OHE) - transform categorical feature into a binary variable for machine processing
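
A minimal sketch of one-hot encoding without any library, to show what the transformation produces. The function name and the sector labels are made up for the example.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sectors = ["tech", "energy", "tech", "health"]  # hypothetical categorical feature
cats, encoded = one_hot_encode(sectors)
print(cats)     # ['energy', 'health', 'tech']
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```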

Question 10

Q

What is overfitting?

Answer

A

An issue in supervised ML that arises when a large number of features (independent variables) are included in the data sample. The model fits the training data too closely, which decreases the accuracy of its forecasts on out-of-sample data (it does not generalize well to new data: low out-of-sample R2).

Question 11

Q

What are the 3 tasks of model training?

Answer

A

1 Method selection
Supervised learning - support vector machines (SVMs) and neural networks (NNs)
Unsupervised learning - clustering, dimension reduction, anomaly detection
Type of data
Numerical data - classification and regression trees (CART)
Text data - generalized linear models (GLMs) and SVMs
Image data - NNs and deep learning methods
Size of data - SVMs suit large data sets with many observations and many features; NNs work better with a large number of observations and few features

2 Performance evaluation

3 Tuning - implement changes to improve performance

Question 12

Q

How should the data set be divided for supervised learning in the model training process?

Answer

A

60% for model training
20% for model validation and tuning
20% for testing out-of-sample performance
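
A minimal sketch of a 60/20/20 split. The function name, the seed, and the choice to shuffle are assumptions for illustration; for time-series data a chronological split without shuffling is usually more appropriate.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder is the test set
    )

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```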

Question 13

Q

Model fitting errors can be caused by:

Answer

A

Size of training sample (small data sets)
Number of features (small > underfitting, large > overfitting)

Question 14

Q

The three factors that govern method selection (the first task of model training) are:

Answer

A

1 Supervised or unsupervised learning
Supervised (training data contains the ground truth, i.e., the known outcome) or unsupervised (no target available)
2 Type of data
Numerical data (CART methods)
Text data (GLMs)
Image data (neural networks and deep learning)
3 Size of data
Large data sets with many observations and many features (SVMs)
Large number of observations and few features (NNs)

Question 15

Q

What are Type I and Type II errors?

Answer

A

Type I errors are false positives
Type II errors are false negatives

Question 16

Q

Formulas for model accuracy and F1 score

Question 17

Q

Formulas for precision and recall
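
The standard confusion-matrix definitions of precision and recall, together with accuracy and the F1 score, can be sketched as follows. TP, FP, FN, and TN are true positives, false positives (Type I errors), false negatives (Type II errors), and true negatives; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # share of predicted positives that are correct
    recall = tp / (tp + fn)                     # share of actual positives that are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 0.80, recall is about 0.67, accuracy is 0.70, and F1 is about 0.73.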

Question 18

Q

Quote: AUD/GBP 1.5060 - 1.5067
Convert 1 mm GBP to AUD and 1 mm AUD to GBP
Rule: up the bid and multiply,
down the ask and divide

Answer

A

1 mm GBP x 1.5060 = 1,506,000 AUD
1 mm AUD / 1.5067 ≈ 663,702 GBP
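
A minimal sketch of the "up the bid and multiply, down the ask and divide" rule, assuming the quote AUD/GBP means AUD (price currency) per 1 GBP (base currency). The function names are made up for this example.

```python
def base_to_price(amount_base, bid):
    """Sell base currency for price currency: go 'up' the quote, multiply by the bid."""
    return amount_base * bid

def price_to_base(amount_price, ask):
    """Sell price currency for base currency: go 'down' the quote, divide by the ask."""
    return amount_price / ask

bid, ask = 1.5060, 1.5067                # AUD/GBP quote from the card
aud = base_to_price(1_000_000, bid)      # 1 mm GBP converted to AUD
gbp = price_to_base(1_000_000, ask)      # 1 mm AUD converted to GBP
print(f"{aud:,.2f} AUD")
print(f"{gbp:,.2f} GBP")
```

The dealer always gets the better side of the spread: converting GBP uses the lower rate (bid), and converting AUD back uses the higher rate (ask).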

Question 19

Q

Z-statistic for 68%, 90%, 95%, 99% confidence

The t-statistic at 90%, 95%, 99% is more or less the z-statistic
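
The z-values for these confidence levels can be recovered from the standard normal distribution. A minimal sketch using the standard library, assuming two-tailed intervals; the function name is illustrative.

```python
from statistics import NormalDist

def two_tailed_z(confidence):
    """Critical z such that the central area under the standard normal equals `confidence`."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.68, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {two_tailed_z(level):.3f}")
```

These are roughly 1.0, 1.645, 1.96, and 2.58. The t-distribution critical values are somewhat larger for small samples and converge to these z-values as the degrees of freedom grow.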

Question 20

Q

Is R2 or adjusted R2 (R2adj) better? Why?

Answer

A

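R2 never decreases when variables are added, even irrelevant ones, which can lead to overfitting.
R2adj is better because it penalizes the addition of variables that do not improve the model.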
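
A minimal sketch of the adjusted R2 formula, R2adj = 1 - [(n - 1) / (n - k - 1)] x (1 - R2), where n is the number of observations and k the number of independent variables. The numbers are hypothetical and only show that R2adj can fall when a weak variable is added.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

three_vars = adjusted_r2(r2=0.600, n=30, k=3)
four_vars = adjusted_r2(r2=0.605, n=30, k=4)  # R2 rose slightly after adding a variable
print(round(three_vars, 4), round(four_vars, 4))
```

Here R2 rises from 0.600 to 0.605, yet R2adj drops, which signals that the fourth variable does not earn its place.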

Question 21

Q

Effect of model misspecification

Question 22

Q

Assumptions of multiple regression

Question 23

Q

What are the two types of heteroskedasticity?

Question 24

Q

What is serial correlation? What are the implications?

Question 26

Q

How to detect serial correlation?

Question 27

Q

What are the implications of multicollinearity?

Question 28

Q

How to detect multicollinearity?

Answer

A

Test F or

Question 29

Q

For each of conditional heteroskedasticity, serial correlation, and multicollinearity:
what is it, what is its effect, how is it detected, and how is it corrected?

Question 30

Q

What is an outlier and what is a high-leverage point?

Question 31

Q

What is the RMSE criterion?
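
A minimal sketch of the root mean squared error (RMSE) criterion: compute the RMSE of each model's out-of-sample forecasts and prefer the model with the lower value. The data and model names are hypothetical.

```python
from math import sqrt

def rmse(actual, forecast):
    """Square root of the average squared forecast error."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [1.0, 2.0, 3.0, 4.0]   # out-of-sample observations
model_a = [1.5, 2.5, 2.5, 3.5]  # forecasts from model A
model_b = [1.0, 2.0, 3.0, 6.0]  # forecasts from model B
print(rmse(actual, model_a))    # 0.5
print(rmse(actual, model_b))    # 1.0
```

Model A would be preferred here, even though model B is exact on three of the four points, because squaring penalizes large errors heavily.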

Question 32

Q

How to calculate mean reverting level?
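
For an AR(1) model x_t = b0 + b1 * x_(t-1), the mean-reverting level is b0 / (1 - b1). A minimal sketch with hypothetical coefficients:

```python
def mean_reverting_level(b0, b1):
    """Mean-reverting level of an AR(1) model x_t = b0 + b1 * x_(t-1)."""
    if b1 == 1:
        raise ValueError("b1 = 1 is a unit root: no finite mean-reverting level")
    return b0 / (1 - b1)

level = mean_reverting_level(b0=1.2, b1=0.4)
print(level)  # approximately 2.0
```

If the current value is above this level the model predicts a decline, and if it is below, the model predicts an increase.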

Question 33

Q

ARCH
What is ARCH, what is its effect, and how is it corrected?

Answer

A

Autoregressive conditional heteroskedasticity (ARCH) exists when the variance of the residuals in one period depends on the variance of the residuals in a previous period.

Question 34

Q

How to test for serial correlation in an AR model? And how to fix it?

Answer

A

The Durbin-Watson (DW) statistic cannot be used.
Use a t-test on the residual autocorrelations. To fix it, add a lag or a seasonal lag.

Question 35

Q

ML - relation between complexity and bias / variance

Question 36

Q

ML - What is generalization

Answer

A

An ML model's capacity to make accurate out-of-sample predictions

Question 37

Q

What is bagging? Why is it important?

Question 38

Q

How to calculate the accruals ratio and aggregate accruals

Answer

A

Aggregate accruals = NI - (CFO + CFI)
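
The cash-flow-based accruals ratio divides these aggregate accruals by average net operating assets (NOA). A minimal sketch with hypothetical figures; the function and parameter names are made up for the example.

```python
def aggregate_accruals(ni, cfo, cfi):
    """Cash-flow approach: net income minus operating and investing cash flows."""
    return ni - (cfo + cfi)

def accruals_ratio(ni, cfo, cfi, noa_begin, noa_end):
    """Aggregate accruals scaled by average net operating assets."""
    average_noa = (noa_begin + noa_end) / 2
    return aggregate_accruals(ni, cfo, cfi) / average_noa

accruals = aggregate_accruals(ni=120, cfo=150, cfi=-80)
ratio = accruals_ratio(ni=120, cfo=150, cfi=-80, noa_begin=450, noa_end=550)
print(accruals, ratio)  # 50 0.1
```

A lower accruals ratio generally indicates higher earnings quality, since more of net income is backed by cash flow.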
