Final Stuff Flashcards
(27 cards)
MSE and MAE formulas
MSE = (1/n) Σ (true − predicted)²; MAE = (1/n) Σ |true − predicted| — take true minus predicted, square it (MSE) or take the absolute value (MAE), sum over all examples, divide by n
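A quick worked sketch in NumPy (hypothetical arrays, just to show the arithmetic):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # hypothetical true values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_true - y_pred                  # true - predicted
mse = np.mean(errors ** 2)                # square, sum, divide by n
mae = np.mean(np.abs(errors))             # absolute value, sum, divide by n
print(mse, mae)                           # 1.3125, 0.875
```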
advantages of MSE and MAE
MSE - differentiable, good for learning
MAE - result is interpretable (same units as the target), simple, less sensitive to outliers
accuracy
correct predictions / total predictions
sum of the confusion matrix diagonal / total count
recall/true positive rate
also called true positive rate
TP / (TP + FN)
out of all the positives, how many was the model able to get correct (recall)
down the first column of the confusion matrix (actual positives)
false positive rate
FP / (FP + TN)
down the second column (actual negatives)
precision
TP / (TP + FP)
how many predicted positives were truly positive?
across the first row (predicted positives)
how to set up confusion matrix
actual on top, predicted on the side
true positive (actual positive, predicted positive) in the top left
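A minimal sketch tying the metric cards above to this matrix layout (hypothetical counts):

```python
# Confusion matrix with actual on top (columns) and predicted on the side (rows),
# true positives in the top left, as on the card above.
TP, FP = 40, 10   # first row: predicted positive
FN, TN = 5, 45    # second row: predicted negative (hypothetical counts)

accuracy  = (TP + TN) / (TP + FP + FN + TN)   # diagonal / total
recall    = TP / (TP + FN)                    # down the first column (actual positives)
fpr       = FP / (FP + TN)                    # down the second column (actual negatives)
precision = TP / (TP + FP)                    # across the first row (predicted positives)
print(accuracy, recall, fpr, precision)       # 0.85, ~0.889, ~0.182, 0.8
```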
false negative
we predict negative, but that is false (actually positive)
false positive
we predict positive, but that is false (actually negative)
types of transformers
encoder-only, decoder-only, encoder-decoder
high level of how to train an LLM
pretraining - predict next token
supervised fine-tuning - train on prompts paired with good responses
reinforcement learning from human feedback (RLHF) - humans rate responses and the model is optimized toward the higher-rated ones
normalization vs regularization
normalization - rescaling inputs/features (or layer activations) so they are on the same scale
regularization - constrains the model so it doesn't overfit
types of normalization
batch norm, L2 norm
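A minimal scikit-learn sketch contrasting the two cards above (hypothetical X and y):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X = np.random.rand(100, 3) * [1, 100, 1000]   # hypothetical features on very different scales
y = np.random.rand(100)

# Normalization: rescale the features themselves onto a common scale.
X_scaled = StandardScaler().fit_transform(X)

# Regularization: penalize large weights (an L2 penalty here) to reduce overfitting.
model = Ridge(alpha=1.0).fit(X_scaled, y)
print(model.coef_)
```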
what are support vectors in SVM
the training points closest to the hyperplane; they lie on the margin and are the only points that determine the decision boundary
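A small sketch with scikit-learn's SVC on made-up toy data, showing how to inspect the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])  # hypothetical toy points
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the points closest to the hyperplane, which define the margin
```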
what is out of bag evaluation
evaluate each tree in a bagged ensemble on the training examples left out of its bootstrap sample
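A minimal sketch, assuming a random forest in scikit-learn (hypothetical dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)  # hypothetical data

# Each tree trains on a bootstrap sample; oob_score_ scores each example
# using only the trees that never saw it during training.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)
```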
what is calibration?
making sure the model's output probabilities reflect its true confidence (e.g., predictions made with 70% confidence are correct about 70% of the time)
2 examples of non parametric models
knn, decision trees
don’t make assumptions about the underlying distribution
what is LIME?
Local Interpretable Model-agnostic Explanations - fits a simple, interpretable model locally around a single data point to explain the complex model's prediction there
what are proxy models?
simpler models that behave similarly to a complex model and can stand in for it
why does the vanishing/exploding gradient problem occur?
gradients are multiplied layer by layer during backpropagation; multiplying many numbers smaller than 1 drives the gradient toward zero (vanishing), and multiplying many numbers larger than 1 blows it up (exploding)
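A tiny numeric illustration of why the product shrinks or blows up (made-up per-layer gradient magnitudes):

```python
small, large = 0.5, 1.5   # hypothetical per-layer gradient factors
layers = 50

print(small ** layers)    # ~8.9e-16 -> vanishing gradient
print(large ** layers)    # ~6.4e+08 -> exploding gradient
```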
types of autoencoders
deep (multiple layers), sparse, variational, etc.
what is the loss function for logistic regression?
binary cross-entropy (log loss): −(1/n) Σ [y log(p) + (1 − y) log(1 − p)]; it is strictly convex
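A minimal sketch of binary cross-entropy in NumPy (hypothetical labels and predicted probabilities):

```python
import numpy as np

y = np.array([1, 0, 1, 1])            # true labels (hypothetical)
p = np.array([0.9, 0.2, 0.7, 0.6])    # predicted probabilities (hypothetical)

bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)   # ~0.299
```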
what are scaling laws?
take in model size, dataset size, and compute and try to predict the loss
how much will throwing resources at the model improve it?
three svm kernels
linear, polynomial, RBF