Data Science Flashcards
Refresh knowledge of data science algorithms & models (20 cards)
1
Q
Decision Tree
- Why is it used?
- How does the algorithm work?
A
- Classification with categorical or numerical features and a categorical target (a regression tree can be used for a quantitative target). Easily interpretable and simple algorithm.
- Algo splits on the feature giving the greatest information gain (the reduction in entropy after splitting on that feature). Entropy is H(X) = -Σ p(x) log p(x) for a probability distribution p. The Gini impurity can be used in place of entropy (see the entropy sketch below).
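A minimal sketch of entropy and information gain in NumPy (an illustrative helper, not the full tree-building algorithm; the toy labels and split are made up):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # Entropy before the split minus the weighted entropy of the two children.
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
split_mask = np.array([True, True, True, False, False, False, False, True])
print(information_gain(labels, split_mask))  # 1.0: this split separates the classes perfectly
```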
2
Q
Random Forest
- Why is it used?
- How does algorithm work?
A
- A go-to first algorithm for many problems. Reduces the variance of a single decision tree through bagging of the dataset and random feature subsets
- Each tree is built on a bootstrap sample of the dataset, and each split considers only a random subset of the features (see the sketch below)
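A hedged usage sketch with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged trees; max_features controls the random
# feature subset considered at each split.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```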
3
Q
SVM
- Why is it used?
- How does algorithm work?
- Any considerations?
- What is the kernel trick?
A
- Simple, easily explainable classifier
- Computationally inexpensive at prediction time (only the support vectors matter), though training can be costly on large datasets
- Finds a hyperplane separating the two classes, maximising the margin: the perpendicular distance from the closest points (the support vectors) to the hyperplane
- Typically data is not perfectly separable, which is why a hinge loss (soft margin) is minimised instead, together with a regularisation term
- SVM uses the kernel trick to map data into a higher-dimensional space where it can be separable. The trick enables computation of the inner product <φ(a), φ(b)> in the higher dimension without explicitly computing the mapping φ (the kernel function K(a, b) is designed to equal exactly that inner product)
- We learn dual coefficients c_i for the support vectors, giving the decision function f(z) = Σ c_i y_i K(x_i, z) + b; a test point z is classified by the sign of f(z) (see the sketch below)
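A hedged sketch of a kernelised SVM using scikit-learn's SVC; the RBF kernel and C value are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# kernel="rbf" applies the kernel trick; C trades margin width against hinge-loss violations.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.n_support_)                # number of support vectors per class
print(clf.decision_function(X[:5]))  # the sign of these values gives the predicted class
```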
4
Q
Linear Regression
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Assumes a linear relation y = a0 + a1x1 + a2x2 + … + anxn
- Often fitted using least squares: we minimise ||Ax - b||² over all x, so the loss function is the squared error. This has a closed-form solution via the normal equations (see the sketch below)
- Variants include ridge regression (L2 regularisation) or lasso regression (L1 regularisation)
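A minimal least-squares sketch in NumPy; the synthetic data and coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Add a column of ones for the intercept, then solve min ||A x - y||^2.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [3.0, 2.0, -1.0]
```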
5
Q
Bias Variance Tradeoff
- What is it?
- What impact does it have?
A
- In general models tend to have more bias with less variance and vice versa
- Bias is the average error of the predictions, E[f̂(x)] - f(x), for an estimator f̂ of the true function f
- Variance describes the spread of the predictions, E[(f̂(x) - E[f̂(x)])²]
- The expected error E[(Y - f̂(x))²] decomposes into bias² + variance + irreducible noise (can show mathematically; see the decomposition below)
- Typically high bias and low variance means underfitting (probably low complexity model)
- Typically low bias and high variance means overfitting (probably high complexity model)
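For reference, a sketch of the squared-error decomposition in LaTeX, assuming Y = f(x) + ε with zero-mean noise of variance σ² and writing f̂ for the fitted model:

```latex
% Squared-error decomposition, assuming Y = f(x) + \varepsilon with
% E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2; \hat{f} is the fitted model.
E\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```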
6
Q
Loss Functions
- Where are they used?
- Common Loss Functions
A
- Primarily used during training: the training objective is to minimise the loss function. Loss functions can also be used during evaluation to measure the performance of the model
- Cross-entropy loss for classification (note a DT is trained using conditional entropy, not cross entropy). Logistic regression uses cross-entropy loss (see the Logistic Regression card)
- SVM uses the hinge loss L = max(0, 1 - y·f(x)) where f(x) is the predicted score and y ∈ {-1, +1} is the actual label
- MSE or L2 loss. MSE = (1/n) * Σ(yᵢ - ŷᵢ)², where ŷᵢ is the predicted value. Squared error is the usual loss function for regression
- MAE or L1 loss. MAE = (1/n) * Σ|yᵢ - ŷᵢ| (see the sketch below)
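Minimal NumPy sketches of the losses above; the example arrays are illustrative:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def hinge(y_true, scores):
    # y_true in {-1, +1}, scores are raw (signed) model outputs.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross entropy; y_true in {0, 1}, p_pred are predicted probabilities.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
print(mse(y, p), mae(y, p), cross_entropy(y, p))
```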
7
Q
Logistic Regression
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Logistic function is used to compute a probability between 0 and 1 of an event given quantitative input
- Logistic function p(x) = 1/(1 + e^(-(β0+β1x)))
- Cross entropy is used as the loss function
- We can minimise the loss using gradient descent (the update is very similar to linear regression's; see the sketch below)
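A minimal logistic-regression-by-gradient-descent sketch in NumPy; the learning rate, iteration count and synthetic data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = np.column_stack([np.ones(len(X)), X])  # intercept column plus features
beta = np.zeros(A.shape[1])
lr = 0.1
for _ in range(2000):
    p = sigmoid(A @ beta)
    grad = A.T @ (p - y) / len(y)   # gradient of the mean cross-entropy loss
    beta -= lr * grad

print(beta)  # intercept near 0, coefficients roughly proportional to (1, 2)
```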
8
Q
Clustering - KMeans & DBSCAN
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Clustering is used to group similar items together based on distance of features in some space
- KMeans - initialise k centroids (various initialisation schemes exist, e.g. picking points that are far apart, as in k-means++). Then iterate between 1) assigning each point to its nearest centroid and 2) moving each centroid to the mean of its assigned points (see the sketch below)
- DBSCAN uses minPts, ε params. A point is defined as a core point if at least minPts points lie within its ε-radius. Any point reachable from a core point is assigned to that core point's cluster; unreachable points are labelled noise
- K-means can be computationally intensive and requires choosing k in advance. DBSCAN is a density-based algo
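A minimal sketch of the k-means loop (Lloyd's algorithm) in NumPy; k, the toy data and the iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]   # naive random init

for _ in range(20):
    # 1) assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 2) move each centroid to the mean of its assigned points
    # (assumes no cluster becomes empty, which holds for this toy data)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # roughly the two cluster centres (0, 0) and (5, 5)
```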
9
Q
Naive Bayes
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Classification algorithm
- Naive assumption - features are independent
- Based on Bayes theorem
- P(Ck|x) = P(x|Ck)P(Ck)/P(x)
- Assume P(x) is constant across classes, so predict ŷ = argmax_k P(C_k) ∏ P(x_i|C_k), i.e. choose the class that maximises the prior times the product of per-feature likelihoods (see the sketch below)
- Relies on the assumption of feature independence
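A hedged sketch of naive Bayes with binary features (a Bernoulli-style variant) in NumPy; the tiny dataset and Laplace smoothing constant are illustrative:

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])          # binary features
y = np.array([0, 0, 1, 1])         # two classes

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])   # P(C_k)
# Per-class, per-feature likelihoods P(x_i = 1 | C_k) with Laplace smoothing.
likelihood = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict(x):
    # log P(C_k) + sum_i log P(x_i | C_k); argmax over classes.
    log_post = np.log(priors) + (x * np.log(likelihood) + (1 - x) * np.log(1 - likelihood)).sum(axis=1)
    return classes[np.argmax(log_post)]

print(predict(np.array([1, 1, 0])))  # favours class 0 for this toy input
```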
10
Q
PCA
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Reduce the dimensionality of data whilst retaining as much variation as possible
- Mean-centre the data, then compute the eigenvalues and eigenvectors of its covariance matrix (the eigenvectors are also called principal components)
- Rank the eigenvalues and project the data onto the eigenvectors of the largest d, giving a d-dimensional representation (see the sketch below)
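A minimal PCA sketch in NumPy via the eigendecomposition of the covariance matrix; the data and d = 2 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)                     # mean-centre
cov = np.cov(Xc, rowvar=False)              # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric matrix, ascending eigenvalues

d = 2
components = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # top-d principal components
X_reduced = Xc @ components                 # project to d dimensions
print(X_reduced.shape)                      # (200, 2)
```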
11
Q
Collaborative Filtering
A
- We take a user-item matrix A (mxn) of implicit or explicit ratings
- We learn latent factor matrices U (m×d) and V (n×d), representing users and items in a shared d-dimensional space (the dimensions could correspond to genres in the film case, for example), such that UVᵀ ≈ A
- To score we take dot product of U_i and V_j for user i and item j
- Use SGD or WALS (weighted alternating least squares) to minimise the objective Σ over observed (i,j) of (A_ij - <U_i, V_j>)², usually with regularisation (see the sketch below)
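A minimal matrix-factorisation-by-SGD sketch in NumPy; the toy ratings matrix, latent dimension d and hyperparameters are illustrative (WALS would alternate closed-form least-squares solves instead):

```python
import numpy as np

A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # 0 = unobserved rating
observed = np.argwhere(A > 0)

m, n, d = A.shape[0], A.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, d))
V = rng.normal(scale=0.1, size=(n, d))

lr, reg = 0.01, 0.05
for _ in range(5000):
    i, j = observed[rng.integers(len(observed))]
    err = A[i, j] - U[i] @ V[j]                 # residual on one observed cell
    U[i] += lr * (err * V[j] - reg * U[i])      # SGD step with L2 regularisation
    V[j] += lr * (err * U[i] - reg * V[j])

print(np.round(U @ V.T, 1))  # reconstructed ratings; compare against the observed cells
```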
12
Q
Content-based Recommendation
- How does it work?
A
- Take the film example: say we have a matrix of user ratings for films and a matrix of film genres
- The dot product of these two effectively gives, for each user, a score for how much they like each genre. We normalise these user-genre profiles and use them to predict how much a user will like a new film, by scoring the new film's genre vector against their profile (see the sketch below)
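A hedged sketch of the genre-profile idea in NumPy; the ratings, genre matrix and the new film's genre vector are toy examples:

```python
import numpy as np

ratings = np.array([[5, 0, 3],
                    [1, 4, 0]], dtype=float)      # users x films (0 = unrated)
genres = np.array([[1, 0],
                   [0, 1],
                   [1, 1]], dtype=float)          # films x genres

# User-genre affinity: how much each user's ratings load onto each genre.
profile = ratings @ genres
profile = profile / profile.sum(axis=1, keepdims=True)   # normalise per user

new_film = np.array([1, 0], dtype=float)          # genre vector of an unseen film
print(profile @ new_film)                          # predicted preference per user
```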
13
Q
- Unsupervised, supervised?
- Semi-sup. , weakly sup. and active learning?
A
- Supervised learning learns from labelled data; unsupervised learning finds structure in data without labels
- Semi-supervised learning combines a small labelled set with a larger unlabelled set; weakly supervised learning instead uses noisy, partial or heuristically generated labels. A common semi-supervised example is self-training, where a model is trained on the small labelled set and a second model is then trained on those labels plus pseudo-labels taken from the first model's most confident predictions
- Another approach is co-training, where two models are trained on different ‘views’ of the data, e.g. the content of a webpage and the links to it. We make predictions and update the labels with the most confident predictions
- Active learning is a system where we pick the least certain classifications to be labelled by a human user and pass these labels back to the model
14
Q
- What does empirical risk minimisation mean?
- What is empirical here? What is risk here?
A
- This describes the standard framework of learning: fit a model to the training data by minimising its average loss on that data
- Empirical means ‘based on observation’, i.e. measured on the training data: the empirical risk is the average loss over the training sample, (1/n) Σ L(f(xᵢ), yᵢ)
- Risk is the expected loss under the true (unknown) data distribution, E[L(f(x), y)]; since we cannot compute it directly, we minimise the empirical risk as a proxy
15
Q
- Wide vs deep NN’s
A
- Wide networks can learn more features in parallel (each neuron can pick up a different feature)
- Deep networks can learn more complex features, composing simple features from earlier layers into more abstract ones
16
Q
- What is universal approximation theorem?
A
- It states that neural networks are universal approximators: for any continuous function f on a compact subset of Euclidean space and any ε > 0, there is a feedforward network ϕ (with enough hidden units and a suitable non-linear activation) such that |ϕ(x) - f(x)| < ε on that subset; equivalently, the family of such networks is dense in the space of continuous functions
- This only says that such a network exists, not that it can be found: whether training recovers it depends on the data, the architecture and gradient descent
17
Q
- Parametric vs nonparametric methods
A
- Parametric methods rely on assumptions about the distribution of the data, described by a finite set of parameters (e.g. mean and variance); we estimate these parameters from the data, and the estimates are called statistics
- Nonparametric methods make fewer assumptions about the underlying distribution, and their effective complexity can grow with the amount of data (e.g. k-nearest neighbours, decision trees)
18
Q
- Hyperparameters
- What are they?
- Explain algo for hyperparameter tuning
A
- Configuration settings of the model chosen before training (parameters are what the model learns during training)
- Grid search: exhaustively evaluate every combination in a predefined grid of hyperparameter values, usually with cross-validation (see the sketch below)
- Bayesian optimisation (see the next card)
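A hedged grid-search sketch with scikit-learn's GridSearchCV; the SVC parameter grid is an arbitrary example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```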
19
Q
- What is Bayesian optimisation, how does it work?
A
- A black-box optimisation method: find a minimum of f(x) where f is an unknown, expensive-to-evaluate function (for hyperparameter tuning, f maps a hyperparameter setting to the validation loss of the trained network)
- We place a prior over f, typically a Gaussian process (whose finite-dimensional marginals are multivariate Gaussian distributions)
- Using the current posterior, an acquisition function picks the sampling point we think is most likely to help us progress towards a minimum
- We observe the real value at this point x_i by computing f(x_i)
- We update our belief about f with this observation using Bayes' rule: P(f|D) ∝ P(D|f)P(f) for function f and observed data D (see the sketch below)
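A hedged sketch of the loop using scikit-learn's GaussianProcessRegressor as the surrogate; the 1-D objective, the candidate grid and the lower-confidence-bound acquisition are illustrative choices (libraries typically use expected improvement):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # Hypothetical expensive black-box objective (1-D for clarity).
    return np.sin(3 * x) + 0.1 * x ** 2

candidates = np.linspace(-3, 3, 500).reshape(-1, 1)   # search space
X = np.array([[-2.0], [0.5], [2.5]])                  # initial sample points
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):
    gp.fit(X, y)                                      # posterior over f given observations
    mu, sigma = gp.predict(candidates, return_std=True)
    # Acquisition: lower confidence bound (explore where uncertain, exploit low mean).
    lcb = mu - 2.0 * sigma
    x_next = candidates[np.argmin(lcb)].reshape(1, -1)
    X = np.vstack([X, x_next])                        # observe f at the chosen point
    y = np.append(y, f(x_next).ravel())

print("best x:", X[np.argmin(y)], "best f:", y.min())
```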
20
Q
A