Data Science Flashcards
Refresh knowledge of data science algorithms & models (20 cards)
1
Q
Decision Tree
- Why is it used?
- How does the algorithm work?
A
- Classification with categorical or numerical features and a categorical target (a regression tree can be used for a quantitative target). Easily interpretable and simple algorithm.
- Algo splits on the feature giving the greatest information gain (the reduction in entropy after splitting on that feature). Entropy is H(X) = -Σ p(x) log p(x) for a probability distribution p. The Gini impurity can be used in place of entropy (see the entropy sketch below).
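A minimal sketch of entropy and information gain in NumPy (an illustrative helper, not the full tree-building algorithm; the toy labels and split are made up):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # Entropy before the split minus the weighted entropy of the two children.
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
split_mask = np.array([True, True, True, False, False, False, False, True])
print(information_gain(labels, split_mask))  # 1.0: this split separates the classes perfectly
```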
2
Q
Random Forest
- Why is it used?
- How does algorithm work?
A
- A go-to first algorithm for many problems. Reduces the variance of a single decision tree through bagging of the dataset and random feature subsets
- Each tree is built on a bootstrap sample of the dataset, and each split considers only a random subset of the features (see the sketch below)
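A hedged usage sketch with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged trees; max_features controls the random
# feature subset considered at each split.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```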
3
Q
SVM
- Why is it used?
- How does algorithm work?
- Any considerations?
- What is the kernel trick?
A
- Simple, easily explainable classifier
- Computationally inexpensive at prediction time (only the support vectors matter), though training can be costly on large datasets
- Finds a hyperplane separating the two classes, maximising the margin: the perpendicular distance from the closest points (the support vectors) to the hyperplane
- Typically data is not perfectly separable, which is why a hinge loss (soft margin) is minimised instead, together with a regularisation term
- SVM uses the kernel trick to map data into a higher-dimensional space where it can be separable. The trick enables computation of the inner product <φ(a), φ(b)> in the higher dimension without explicitly computing the mapping φ (the kernel function K(a, b) is designed to equal exactly that inner product)
- We learn dual coefficients c_i for the support vectors, giving the decision function f(z) = Σ c_i y_i K(x_i, z) + b; a test point z is classified by the sign of f(z) (see the sketch below)
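A hedged sketch of a kernelised SVM using scikit-learn's SVC; the RBF kernel and C value are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# kernel="rbf" applies the kernel trick; C trades margin width against hinge-loss violations.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.n_support_)                # number of support vectors per class
print(clf.decision_function(X[:5]))  # the sign of these values gives the predicted class
```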
4
Q
Linear Regression
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Assumes a linear relation y = a0 + a1x1 + a2x2 + … + anxn
- Often fitted using least squares: we minimise ||Ax - b||² over all x, so the loss function is the squared error. This has a closed-form solution via the normal equations (see the sketch below)
- Variants include ridge regression (L2 regularisation) or lasso regression (L1 regularisation)
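A minimal least-squares sketch in NumPy; the synthetic data and coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Add a column of ones for the intercept, then solve min ||A x - y||^2.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [3.0, 2.0, -1.0]
```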
5
Q
Bias Variance Tradeoff
- What is it?
- What impact does it have?
A
- In general models tend to have more bias with less variance and vice versa
- Bias is the average error of the predictions, E[f̂(x)] - f(x), for an estimator f̂ of the true function f
- Variance describes the spread of the predictions, E[(f̂(x) - E[f̂(x)])²]
- The expected error E[(Y - f̂(x))²] decomposes into bias² + variance + irreducible noise (can show mathematically; see the decomposition below)
- Typically high bias and low variance means underfitting (probably low complexity model)
- Typically low bias and high variance means overfitting (probably high complexity model)
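For reference, a sketch of the squared-error decomposition in LaTeX, assuming Y = f(x) + ε with zero-mean noise of variance σ² and writing f̂ for the fitted model:

```latex
% Squared-error decomposition, assuming Y = f(x) + \varepsilon with
% E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2; \hat{f} is the fitted model.
E\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```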
6
Q
Loss Functions
- Where are they used?
- Common Loss Functions
A
- Primarily used during training: the training objective is to minimise the loss function. Loss functions can also be used during evaluation to measure the performance of the model
- Cross-entropy loss for classification (note a DT is trained using conditional entropy, not cross entropy). Logistic regression uses cross-entropy loss (see the Logistic Regression card)
- SVM uses the hinge loss L = max(0, 1 - y·f(x)) where f(x) is the predicted score and y ∈ {-1, +1} is the actual label
- MSE or L2 loss. MSE = (1/n) * Σ(yᵢ - ŷᵢ)², where ŷᵢ is the predicted value. Squared error is the usual loss function for regression
- MAE or L1 loss. MAE = (1/n) * Σ|yᵢ - ŷᵢ| (see the sketch below)
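Minimal NumPy sketches of the losses above; the example arrays are illustrative:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def hinge(y_true, scores):
    # y_true in {-1, +1}, scores are raw (signed) model outputs.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross entropy; y_true in {0, 1}, p_pred are predicted probabilities.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
print(mse(y, p), mae(y, p), cross_entropy(y, p))
```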
7
Q
Logistic Regression
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Logistic function is used to compute a probability between 0 and 1 of an event given quantitative input
- Logistic function p(x) = 1/(1 + e^(-(β0+β1x)))
- Cross entropy is used as the loss function
- We can minimise the loss using gradient descent (the update is very similar to linear regression's; see the sketch below)
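A minimal logistic-regression-by-gradient-descent sketch in NumPy; the learning rate, iteration count and synthetic data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = np.column_stack([np.ones(len(X)), X])  # intercept column plus features
beta = np.zeros(A.shape[1])
lr = 0.1
for _ in range(2000):
    p = sigmoid(A @ beta)
    grad = A.T @ (p - y) / len(y)   # gradient of the mean cross-entropy loss
    beta -= lr * grad

print(beta)  # intercept near 0, coefficients roughly proportional to (1, 2)
```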
8
Q
Clustering - KMeans & DBSCAN
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Clustering is used to group similar items together based on distance of features in some space
- KMeans - initialise k centroids (various initialisation schemes exist, e.g. picking points that are far apart, as in k-means++). Then iterate between 1) assigning each point to its nearest centroid and 2) moving each centroid to the mean of its assigned points (see the sketch below)
- DBSCAN uses minPts, ε params. A point is defined as a core point if at least minPts points lie within its ε-radius. Any point reachable from a core point is assigned to that core point's cluster; unreachable points are labelled noise
- K-means can be computationally intensive and requires choosing k in advance. DBSCAN is a density-based algo
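A minimal sketch of the k-means loop (Lloyd's algorithm) in NumPy; k, the toy data and the iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]   # naive random init

for _ in range(20):
    # 1) assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 2) move each centroid to the mean of its assigned points
    # (assumes no cluster becomes empty, which holds for this toy data)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # roughly the two cluster centres (0, 0) and (5, 5)
```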
9
Q
Naive Bayes
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Classification algorithm
- Naive assumption - features are independent
- Based on Bayes theorem
- P(Ck|x) = P(x|Ck)P(Ck)/P(x)
- Assume P(x) is constant across classes, so predict ŷ = argmax_k P(C_k) ∏ P(x_i|C_k), i.e. choose the class that maximises the prior times the product of per-feature likelihoods (see the sketch below)
- Relies on the assumption of feature independence
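A hedged sketch of naive Bayes with binary features (a Bernoulli-style variant) in NumPy; the tiny dataset and Laplace smoothing constant are illustrative:

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])          # binary features
y = np.array([0, 0, 1, 1])         # two classes

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])   # P(C_k)
# Per-class, per-feature likelihoods P(x_i = 1 | C_k) with Laplace smoothing.
likelihood = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict(x):
    # log P(C_k) + sum_i log P(x_i | C_k); argmax over classes.
    log_post = np.log(priors) + (x * np.log(likelihood) + (1 - x) * np.log(1 - likelihood)).sum(axis=1)
    return classes[np.argmax(log_post)]

print(predict(np.array([1, 1, 0])))  # favours class 0 for this toy input
```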
10
Q
PCA
- Why is it used?
- How does algorithm work?
- Any considerations?
A
- Reduce the dimensionality of data whilst retaining as much variation as possible
- Mean-centre the data, then compute the eigenvalues and eigenvectors of its covariance matrix (the eigenvectors are also called principal components)
- Rank the eigenvalues and project the data onto the eigenvectors of the largest d, giving a d-dimensional representation (see the sketch below)
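A minimal PCA sketch in NumPy via the eigendecomposition of the covariance matrix; the data and d = 2 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)                     # mean-centre
cov = np.cov(Xc, rowvar=False)              # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric matrix, ascending eigenvalues

d = 2
components = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # top-d principal components
X_reduced = Xc @ components                 # project to d dimensions
print(X_reduced.shape)                      # (200, 2)
```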
11
Q
Collaborative Filtering
A
- We take a user-item matrix A (mxn) of implicit or explicit ratings
- We learn latent factor matrices U (m×d) and V (n×d), representing users and items in a shared d-dimensional space (the dimensions could correspond to genres in the film case, for example), such that UVᵀ ≈ A
- To score we take dot product of U_i and V_j for user i and item j
- Use SGD or WALS (weighted alternating least squares) to minimise the objective Σ over observed (i,j) of (A_ij - <U_i, V_j>)², usually with regularisation (see the sketch below)
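A minimal matrix-factorisation-by-SGD sketch in NumPy; the toy ratings matrix, latent dimension d and hyperparameters are illustrative (WALS would alternate closed-form least-squares solves instead):

```python
import numpy as np

A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # 0 = unobserved rating
observed = np.argwhere(A > 0)

m, n, d = A.shape[0], A.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, d))
V = rng.normal(scale=0.1, size=(n, d))

lr, reg = 0.01, 0.05
for _ in range(5000):
    i, j = observed[rng.integers(len(observed))]
    err = A[i, j] - U[i] @ V[j]                 # residual on one observed cell
    U[i] += lr * (err * V[j] - reg * U[i])      # SGD step with L2 regularisation
    V[j] += lr * (err * U[i] - reg * V[j])

print(np.round(U @ V.T, 1))  # reconstructed ratings; compare against the observed cells
```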
12
Q
Content-based Recommendation
- How does it work?
A
- Take the film example: say we have a matrix of user ratings for films and a matrix of film genres
- The dot product of these two effectively gives, for each user, a score for how much they like each genre. We normalise these user-genre profiles and use them to predict how much a user will like a new film, by scoring the new film's genre vector against their profile (see the sketch below)
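A hedged sketch of the genre-profile idea in NumPy; the ratings, genre matrix and the new film's genre vector are toy examples:

```python
import numpy as np

ratings = np.array([[5, 0, 3],
                    [1, 4, 0]], dtype=float)      # users x films (0 = unrated)
genres = np.array([[1, 0],
                   [0, 1],
                   [1, 1]], dtype=float)          # films x genres

# User-genre affinity: how much each user's ratings load onto each genre.
profile = ratings @ genres
profile = profile / profile.sum(axis=1, keepdims=True)   # normalise per user

new_film = np.array([1, 0], dtype=float)          # genre vector of an unseen film
print(profile @ new_film)                          # predicted preference per user
```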
13
Q
- Unsupervised, supervised?
- Semi-sup. , weakly sup. and active learning?
A
- Supervised learning learns from labelled data; unsupervised learning finds structure in data without labels
- Semi-supervised learning combines a small labelled set with a larger unlabelled set; weakly supervised learning instead uses noisy, partial or heuristically generated labels. A common semi-supervised example is self-training, where a model is trained on the small labelled set and a second model is then trained on those labels plus pseudo-labels taken from the first model's most confident predictions
- Another approach is co-training, where two models are trained on different ‘views’ of the data, e.g. the content of a webpage and the links to it. We make predictions and update the labels with the most confident predictions
- Active learning is a system where we pick the least certain classifications to be labelled by a human user and pass these labels back to the model
14
Q
- What does empirical risk minimisation mean?
- What is empirical here? What is risk here?
A
- This describes the standard framework of learning: fit a model to the training data by minimising its average loss on that data
- Empirical means ‘based on observation’, i.e. measured on the training data: the empirical risk is the average loss over the training sample, (1/n) Σ L(f(xᵢ), yᵢ)
- Risk is the expected loss under the true (unknown) data distribution, E[L(f(x), y)]; since we cannot compute it directly, we minimise the empirical risk as a proxy
15
Q
- Wide vs deep NN’s
A
- Wide networks can learn more features in parallel (each neuron can pick up a different feature)
- Deep networks can learn more complex features, composing simple features from earlier layers into more abstract ones
16
Q
- What is universal approximation theorem?
A
- It states that neural networks are universal approximators: for any continuous function f on a compact subset of Euclidean space and any ε > 0, there is a feedforward network ϕ (with enough hidden units and a suitable non-linear activation) such that |ϕ(x) - f(x)| < ε on that subset; equivalently, the family of such networks is dense in the space of continuous functions
- This only says that such a network exists, not that it can be found: whether training recovers it depends on the data, the architecture and gradient descent
17
Q
- Parametric vs nonparametric methods
A
- Parametric methods rely on assumptions about the distribution of the data, described by a finite set of parameters (e.g. mean and variance); we estimate these parameters from the data, and the estimates are called statistics
- Nonparametric methods make fewer assumptions about the underlying distribution, and their effective complexity can grow with the amount of data (e.g. k-nearest neighbours, decision trees)
18
Q
- Hyperparameters
- What are they?
- Explain algo for hyperparameter tuning
A
- Configuration settings of the model chosen before training (parameters are what the model learns during training)
- Grid search: exhaustively evaluate every combination in a predefined grid of hyperparameter values, usually with cross-validation (see the sketch below)
- Bayesian optimisation (see the next card)
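A hedged grid-search sketch with scikit-learn's GridSearchCV; the SVC parameter grid is an arbitrary example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```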
19
Q
- What is Bayesian optimisation, how does it work?
A
- A black-box optimisation method: find a minimum of f(x) where f is an unknown, expensive-to-evaluate function (for hyperparameter tuning, f maps a hyperparameter setting to the validation loss of the trained network)
- We place a prior over f, typically a Gaussian process (whose finite-dimensional marginals are multivariate Gaussian distributions)
- Using the current posterior, an acquisition function picks the sampling point we think is most likely to help us progress towards a minimum
- We observe the real value at this point x_i by computing f(x_i)
- We update our belief about f with this observation using Bayes' rule: P(f|D) ∝ P(D|f)P(f) for function f and observed data D (see the sketch below)
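A hedged sketch of the loop using scikit-learn's GaussianProcessRegressor as the surrogate; the 1-D objective, the candidate grid and the lower-confidence-bound acquisition are illustrative choices (libraries typically use expected improvement):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # Hypothetical expensive black-box objective (1-D for clarity).
    return np.sin(3 * x) + 0.1 * x ** 2

candidates = np.linspace(-3, 3, 500).reshape(-1, 1)   # search space
X = np.array([[-2.0], [0.5], [2.5]])                  # initial sample points
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):
    gp.fit(X, y)                                      # posterior over f given observations
    mu, sigma = gp.predict(candidates, return_std=True)
    # Acquisition: lower confidence bound (explore where uncertain, exploit low mean).
    lcb = mu - 2.0 * sigma
    x_next = candidates[np.argmin(lcb)].reshape(1, -1)
    X = np.vstack([X, x_next])                        # observe f at the chosen point
    y = np.append(y, f(x_next).ravel())

print("best x:", X[np.argmin(y)], "best f:", y.min())
```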
20
Q
A