Data Science Flashcards

Refresh knowledge of data science algorithms & models (20 cards)

1
Q

Decision Tree

  • Why is it used?
  • How does the algorithm work?
A
  • Classification with categorical features and a categorical target (a regression tree can be used for a quantitative target). Easily interpretable, simple algorithm.
  • The algorithm chooses splits by information gain, measured by the reduction in entropy from splitting on a feature. Entropy is -Σ p(x) log p(x) for class distribution p. The Gini impurity can be used in place of entropy. Entropy and information gain are sketched below.
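A minimal sketch of entropy and information gain in Python (toy data and function names are illustrative, not from the card):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    # Entropy before the split minus the weighted entropy after
    # splitting on each value of `feature`
    after = sum(
        (feature == v).mean() * entropy(labels[feature == v])
        for v in np.unique(feature)
    )
    return entropy(labels) - after

# A feature that perfectly separates the classes gains the full 1 bit
y = np.array([0, 0, 1, 1])
x = np.array(["a", "a", "b", "b"])
print(information_gain(y, x))  # 1.0
```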
2
Q

Random Forest

  • Why is it used?
  • How does algorithm work?
A
  • Go-to first algorithm for many problems. Reduces the variance of the decision-tree algorithm through bagging over both the dataset and the features.
  • Each tree is built on a bootstrap sample of the dataset, and each split considers only a random subset of the features (see the sketch below).
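A minimal sketch of the bagging mechanics (illustrative names; integer class labels assumed; in practice you would simply use sklearn's RandomForestClassifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=50, rng=np.random.default_rng(0)):
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (bagging)
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Majority vote across the ensemble is what reduces the variance
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```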
3
Q

SVM

  • Why is it used?
  • How does algorithm work?
  • Any considerations?
  • What is the kernel trick?
A
  • A simple classifier which is easily explainable
  • Computationally inexpensive to evaluate once trained (prediction depends only on the support vectors)
  • Creates a hyperplane separating the two classes, chosen to maximise the margin: the perpendicular distance from the hyperplane to the nearest points on each side (the support vectors)
  • Typically the data is not perfectly separable, which is why a hinge loss function is minimised instead, allowing some margin violations
  • SVM uses the kernel trick to map data into a higher-dimensional space where it can be separable. The trick is that the kernel function computes the inner product ⟨φ(a), φ(b)⟩ in the higher-dimensional space without explicitly computing φ(a) or φ(b) (a numerical check is sketched below)
  • We learn coefficients c_i for the training points, which define the hyperplane in the transformed space; a test point z is then classified by the sign of Σ c_i K(x_i, z) + b
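A small numerical check of the kernel trick: the polynomial kernel (a·b)² returns the same value as an explicit quadratic feature map φ, without ever constructing φ(a) or φ(b) (toy vectors, illustrative names):

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-D input
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(a, b):
    # Polynomial kernel: computes <phi(a), phi(b)> without building phi
    return a.dot(b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(phi(a).dot(phi(b)))  # 121.0
print(k(a, b))             # 121.0 — same value, no explicit mapping
```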
4
Q

Linear Regression

  • Why is it used?
  • How does algorithm work?
  • Any considerations?
A
  • Assumes a linear relation y = a₁x₁ + a₂x₂ + … + aₙxₙ (plus an intercept term)
  • Often fitted using least squares: we minimise ‖Ax − b‖² over all x, so the loss function is the squared error. This can be solved directly with matrix algebra via the normal equations x = (AᵀA)⁻¹Aᵀb (sketched below)
  • Variants include ridge regression (L2 regularisation) and lasso regression (L1 regularisation)
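A minimal least-squares sketch in numpy (synthetic data; the normal-equations solve is shown alongside the numerically safer library call):

```python
import numpy as np

# Toy data: y ≈ 2*x1 + 3*x2 plus noise
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
b = A @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Closed form via the normal equations: x = (A^T A)^-1 A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Or the numerically safer library call
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_normal, x_lstsq)  # both close to [2, 3]
```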
5
Q

Bias Variance Tradeoff

  • What is it?
  • What impact does it have?
A
  • In general, models with more bias have less variance and vice versa
  • Bias is the average error of the predictions, E[f̂(x)] − f(x), for estimator f̂ and true function f
  • Variance describes the spread of the predictions, E[(f̂(x) − E[f̂(x)])²]
  • The expected error E[(Y − f̂(x))²] decomposes into bias² plus variance plus irreducible noise (this can be shown mathematically)
  • Typically high bias and low variance means underfitting (probably a low-complexity model)
  • Typically low bias and high variance means overfitting (probably a high-complexity model); a resampling estimate of both quantities is sketched below
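A resampling sketch that estimates bias and variance of one model at one test point (toy setup: fitting a straight line to noisy sin data; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)  # true function
x_test = 1.0

preds = []
for _ in range(500):
    x = rng.uniform(0, np.pi, 20)
    y = f(x) + rng.normal(scale=0.3, size=20)  # fresh noisy training sample
    coeffs = np.polyfit(x, y, deg=1)           # low-complexity model
    preds.append(np.polyval(coeffs, x_test))

preds = np.array(preds)
bias = preds.mean() - f(x_test)  # E[f̂(x)] - f(x)
variance = preds.var()           # E[(f̂(x) - E[f̂(x)])^2]
print(bias, variance)  # a degree-1 fit to sin: noticeable bias, low variance
```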

6
Q

Loss Functions

  • Where are they used?
  • Common Loss Functions
A
  • Primarily used during training: the training objective is to minimise the loss function. Losses can also be computed during evaluation to measure the performance of the model
  • Cross-entropy loss for classification (note that decision trees are trained using conditional entropy, not cross entropy). Logistic regression uses cross-entropy loss (see the Logistic Regression card)
  • SVM uses the hinge loss L = max(0, 1 − y·f(x)), where f(x) is the predicted score and y ∈ {−1, +1} is the actual label
  • MSE or L2 loss: MSE = (1/n) Σ(yᵢ − ŷᵢ)², where ŷᵢ is the prediction. Squared error is the usual loss for regression
  • MAE or L1 loss: MAE = (1/n) Σ|yᵢ − ŷᵢ| (all four are sketched below)
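The four losses above, sketched in numpy (binary cross entropy shown; names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):       # L2 loss for regression
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):       # L1 loss for regression
    return np.mean(np.abs(y_true - y_pred))

def hinge(y_true, scores):     # SVM; labels in {-1, +1}, scores = f(x)
    return np.mean(np.maximum(0, 1 - y_true * scores))

def cross_entropy(y_true, p):  # binary classification; p = P(y = 1)
    eps = 1e-12                # clip to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```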
7
Q

Logistic Regression

  • Why is it used?
  • How does algorithm work?
  • Any considerations?
A
  • The logistic function is used to compute a probability between 0 and 1 of an event given quantitative input
  • Logistic function: p(x) = 1 / (1 + e^−(β₀ + β₁x))
  • Cross entropy is used as the loss function
  • We can optimise the loss using gradient descent (an algorithm similar to linear regression's can be used; sketched below)
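A minimal gradient-descent sketch (assumes X already includes a column of ones for the intercept; names are illustrative):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=1000):
    # Gradient of the cross entropy w.r.t. the weights is X^T (p - y) / n
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # logistic function
        w -= lr * X.T @ (p - y) / len(y)
    return w
```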
8
Q

Clustering - KMeans & DBSCAN

  • Why is it used?
  • How does algorithm work?
  • Any considerations?
A
  • Clustering is used to group similar items together based on the distance between their features in some space
  • KMeans: initialise k centroids (various algorithms exist for this, e.g. starting with well-separated points or k-means++). Then iterate between 1) assigning points to their nearest centroid and 2) adjusting each centroid to the mean of its members (sketched below)
  • DBSCAN uses minPts and ε parameters. A point is a core point if at least minPts other points lie within its ε-radius; any point reachable from a core point is also assigned to that cluster, and points reachable from no core point are labelled noise
  • K-means can be computationally intensive and requires choosing k up front. DBSCAN is a density-based algorithm, so it can find arbitrarily shaped clusters and determines the number of clusters itself
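A minimal k-means sketch in numpy (naive random initialisation; assumes no cluster empties out during the iterations):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    centroids = X[rng.choice(len(X), k, replace=False)]  # naive init
    for _ in range(iters):
        # 1) assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # 2) move each centroid to the mean of its members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```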
9
Q

Naive Bayes

  • Why is it used?
  • How does algorithm work?
  • Any considerations?
A
  • Classification algorithm
  • Naive assumption: features are independent given the class
  • Based on Bayes' theorem:
  • P(Ck|x) = P(x|Ck)P(Ck) / P(x)
  • Since P(x) is constant across classes, we use ŷ = argmax_k P(Ck) ∏ᵢ P(xᵢ|Ck), i.e. choose the class that maximises the product of the prior and the per-feature likelihoods (sketched below)
  • Relies on the assumption of feature independence, which rarely holds exactly but often works well in practice
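A counting-based sketch for categorical features with Laplace smoothing (illustrative names; unseen feature values fall back to a small default probability):

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    # Estimate P(Ck) and P(x_i | Ck) by counting, with Laplace smoothing
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        likelihoods = []
        for i in range(X.shape[1]):
            vals, counts = np.unique(Xc[:, i], return_counts=True)
            n_vals = len(np.unique(X[:, i]))
            likelihoods.append({v: (n + alpha) / (len(Xc) + alpha * n_vals)
                                for v, n in zip(vals, counts)})
        model[c] = (np.mean(y == c), likelihoods)
    return model

def predict_nb(model, x, default=1e-6):
    # y_hat = argmax_k log P(Ck) + sum_i log P(x_i | Ck); logs avoid underflow
    def log_post(c):
        prior, likelihoods = model[c]
        return np.log(prior) + sum(np.log(l.get(v, default))
                                   for l, v in zip(likelihoods, x))
    return max(model, key=log_post)
```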
10
Q

PCA

  • Why is it used?
  • How does algorithm work?
  • Any considerations?
A
  • Reduce the dimensionality of the data whilst retaining as much of the variation as possible
  • Centre (and usually standardise) the data, then compute the eigenvalues and eigenvectors of its covariance matrix (the eigenvectors are also called the principal components)
  • Rank the eigenvalues and use the eigenvectors of the largest d to project the data onto d new dimensions (sketched below)
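A minimal PCA sketch via eigendecomposition of the covariance matrix (numpy; eigh is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, d):
    Xc = X - X.mean(axis=0)           # centre the data first
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # eigendecomposition of covariance
    order = np.argsort(vals)[::-1][:d]  # indices of the d largest eigenvalues
    components = vecs[:, order]         # principal components
    return Xc @ components              # project onto the new dimensions
```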
11
Q

Collaborative Filtering

A
  • We take a user-item matrix A (m×n) of implicit or explicit ratings
  • We learn latent matrices U (m×d) and V (n×d) representing users and items in some embedding space of size d (the dimensions could correspond to genres in the film case, for example), such that UVᵀ ≈ A
  • To score user i against item j we take the dot product of U_i and V_j
  • Use SGD or WALS (weighted alternating least squares) to minimise the squared error Σ(A_ij − ⟨U_i, V_j⟩)² over the observed entries (an SGD sketch follows)
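A minimal SGD sketch for the factorisation (NaN marks an unobserved rating; hyperparameter values are illustrative):

```python
import numpy as np

def factorise(A, d=10, lr=0.01, reg=0.1, epochs=50,
              rng=np.random.default_rng(0)):
    # Learn U (m x d) and V (n x d) so that U V^T approximates the observed
    # entries of A, taking SGD steps on each known rating's squared error
    m, n = A.shape
    U = rng.normal(scale=0.1, size=(m, d))
    V = rng.normal(scale=0.1, size=(n, d))
    rows, cols = np.nonzero(~np.isnan(A))  # observed ratings only
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = A[i, j] - U[i] @ V[j]
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])
    return U, V
```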
12
Q

Content-based Recommendation

  • How does it work?
A
  • Take the film example: say we have a matrix of user ratings for films and a matrix of film genres
  • The dot product of the two effectively gives a score for how much each user likes each genre. We normalise this profile and use these learned preferences to predict how much a user will like a new film, given the genre content of the new film (sketched below)
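A toy numpy sketch of the genre-profile idea (two users, two films, two genres; all values illustrative):

```python
import numpy as np

ratings = np.array([[5.0, 1.0],  # users x films
                    [1.0, 5.0]])
genres = np.array([[1.0, 0.0],   # films x genres: film 0 is action, film 1 comedy
                   [0.0, 1.0]])

profiles = ratings @ genres                      # per-user genre scores
profiles /= profiles.sum(axis=1, keepdims=True)  # normalise each user's profile

new_film = np.array([1.0, 0.0])                  # a pure action film
print(profiles @ new_film)  # [0.833 0.167] -> user 0 will like it far more
```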
13
Q
  • Unsupervised, supervised?
  • Semi-sup., weakly sup. and active learning?
A
  • Supervised learning learns from labelled data; unsupervised learning finds structure without labels
  • Semi-supervised learning is a combination of the two: a small labelled set plus a large unlabelled set. A common example is self-training, where a model is trained on the small labelled set and a second model is then trained on those labels plus pseudo-labels taken from the first model's most confident predictions (sketched below)
  • Another approach is co-training, where two models are trained on data with different 'views', e.g. the content of a webpage and the links to it. Each makes predictions, and the labels are updated with the most confident ones
  • Weak supervision is closely related: learning from noisy, indirect, or programmatically generated labels rather than hand-curated ones
  • Active learning is a system where we pick the least certain classifications to be labelled by a human user and pass these labels back to the model
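A minimal self-training sketch (illustrative: logistic regression as the base model, a 0.95 confidence threshold):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
    X, y = X_lab.copy(), y_lab.copy()
    model = LogisticRegression().fit(X, y)  # train on the small labelled set
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Promote the most confident predictions to pseudo-labels and refit
        X = np.vstack([X, X_unlab[confident]])
        y = np.concatenate([y, model.classes_[proba[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]
        model = LogisticRegression().fit(X, y)
    return model
```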
14
Q
  • What does empirical risk minimisation mean?
  • What is empirical here? What is risk here?
A
  • This describes the approach of learning from training data by fitting a model to it as closely as possible
  • Empirical means 'based on observation', i.e. measured on the training data
  • Risk is the expected loss over the true (unknown) data distribution; the empirical risk is the average loss on the training sample, which we minimise as a proxy for the true risk
15
Q
  • Wide vs deep NNs
A
  • Wide networks can learn more features (each neuron can correspond to a distinct feature)
  • Deep networks can learn more complex features (a deeper composition of simple functions is needed to describe such a feature)
16
Q
  • What is universal approximation theorem?
A
  • It states that for any continuous function f mapping between Euclidean spaces (on a compact domain), the family of neural networks is dense: there exist networks ϕ₁, ϕ₂, … that approximate f arbitrarily well
  • This only says that such networks exist, not that they can be found; whether training recovers one depends on the data and on gradient descent
17
Q
  • Parametric vs nonparametric methods
A
  • Parametric means that a method relies on assumptions about the distribution of the data, summarised by a fixed set of parameters (e.g. mean, variance). We estimate these values from a sample, in which case the estimates are called 'statistics'
  • Nonparametric methods make no fixed-form assumption about the distribution; model complexity can grow with the amount of data (e.g. k-nearest neighbours, decision trees)
18
Q
  • Hyperparameters
  • What are they?
  • Explain algo for hyperparameter tuning
A
  • Configuration settings of the model chosen before training (parameters, by contrast, are what the model learns during training)
  • Grid search: exhaustively evaluate every combination in a predefined grid of values, typically with cross-validation (sketched below)
  • Bayesian optimisation (see the next card)
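A grid-search sketch with scikit-learn (X_train and y_train are assumed to exist; the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Exhaustively evaluate every combination in the grid by cross-validation
grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(), grid, cv=5)
# search.fit(X_train, y_train)              # assumed training data
# search.best_params_, search.best_score_   # winning combination
```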
19
Q
  • What is Bayesian optimisation, how does it work?
A
  • A black-box optimisation method, i.e. find a minimum of f(x) where f is an unknown, expensive-to-evaluate function (for hyperparameter tuning of a NN, f maps hyperparameter values to validation loss)
  • We construct a prior over f, typically a Gaussian process (a multivariate Gaussian over function values)
  • Based on the prior, an acquisition function picks the sampling point we think is most likely to help us progress towards a minimum
  • We observe the real value at this point x_i by computing f(x_i)
  • We update the prior based on the observation using Bayes' rule:
    P(f|D) ∝ P(f)P(D|f) for function f and data D (a library-based sketch follows)
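A minimal sketch with scikit-optimize, assuming that library is available; gp_minimize places a Gaussian-process prior over f and chooses each next sample point via an acquisition function:

```python
from skopt import gp_minimize

def f(params):
    (lr,) = params
    # In practice: train with learning rate `lr` and return validation loss;
    # this toy quadratic stands in for the unknown black-box function
    return (lr - 0.1) ** 2

result = gp_minimize(f, dimensions=[(1e-4, 1.0)], n_calls=20)
print(result.x, result.fun)  # best point found and its objective value
```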