Learning From Data Flashcards

(138 cards)

1
Q

Difference between Classification and Regression

A

Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity.
2
Q

multi-class classification problem

A

A problem with more than two classes

3
Q

multi-label classification problem.

A

A problem where an example is assigned multiple classes

4
Q

datum

A

The singular of “data”: a single piece of information; also, a fixed starting point of a scale or operation.

5
Q

k nearest neighbours

A

1) Find the k nearest neighbours to x in the training data.

2) Assign x to the class that is most common among those k neighbours (majority vote).
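
A minimal NumPy sketch of this procedure (the function name, the use of Euclidean distance, and the tie-breaking are illustrative assumptions, not part of the card):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    # distance from x to every training point (Euclidean distance assumed)
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among the labels of those neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```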

6
Q

Unsupervised Learning

A

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

Given: data D = {x_n}, n = 1, ..., N, and a parameterised generative model describing how the data might be generated, p(x; w), depending on parameters w.

7
Q

Supervised Learning

A

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.

8
Q

Hyperparameter

A

a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training

9
Q

Multi variate data

A

More than one variable is measured on each individual in a sample

10
Q

Centroid

A

The vector whose components are the means of each variable (the mean vector of the data).

11
Q

Properties of data that has been sphered

A

For each variable: mean = 0, variance = 1.
All the variables are mutually uncorrelated.
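
One way to sphere data is via the eigendecomposition of the sample covariance matrix; the sketch below assumes that construction (it is not necessarily the course's exact recipe):

```python
import numpy as np

def sphere(X):
    """Return a transformed copy of X with zero means, unit variances,
    and (approximately) zero correlations between variables."""
    Xc = X - X.mean(axis=0)                            # mean of each variable becomes 0
    cov = np.cov(Xc, rowvar=False)                     # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)                   # eigendecomposition of the covariance
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # inverse square root of the covariance
    return Xc @ W                                      # sphered data: covariance ~ identity
```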

12
Q

Disadvantages to Euclidean Distance

A

Euclidean distance is popular for numerical data, but:
it gives equal weight to all variables;
it disregards correlations between variables.

13
Q

Reasons for sphering

A

Sphering the data puts the variables on an equal footing and removes (linear) correlations

14
Q

What can i.i.d. stand for

A

independent and identically distributed

15
Q

Examples of Unsupervised Learning methods

A
Clustering
Gaussian distribution
Mixture model
Principal Component Analysis
Kohonen maps (SOMs)
16
Q

Deterministic Model

A

In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions.

17
Q

Main aim of classification

A

Train a machine F to map features to targets

18
Q

Main aim of regression

A

Train a machine F to map features to continuous targets

19
Q

What are the different types/formats of variables?

A

numerical: continuous or discrete
categorical: nominal or ordinal
binary: presence/absence or 2-state categorical

20
Q

In a data matrix, what does X_nd refer to?

A

X_nd is the value of the dth variable for the nth individual,

i.e. observations are rows.

21
Q

How to measure association between 2 variables?

A

Covariance, (S_12)^2

22
Q

Mean and variance of a standardised variable

A
Mean = 0 
Variance = 1
23
Q

‘Standardised measure of association’ between variables

A

Correlation coefficient, R_12

24
Q

Does the correlation coefficient lie in a given range?

A

Yes

[-1,1]

25
Interpret what the values of the correlation coefficient imply.
R_12 > 0: the variables increase and decrease together
R_12 < 0: one variable decreases as the other increases
R_12 ≈ 0: the variables are not associated (roughly circular scatter diagram)
26
How to obtain the correlation coefficient?
obtained by dividing the covariance by the product of the standard deviations
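A quick NumPy check on made-up numbers (illustrative only):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

cov12 = np.cov(x1, x2)[0, 1]                              # covariance of x1 and x2
r12 = cov12 / (np.std(x1, ddof=1) * np.std(x2, ddof=1))   # divide by product of std devs
print(np.isclose(r12, np.corrcoef(x1, x2)[0, 1]))         # True: equals the correlation coefficient
```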
27
What is the main difference between the correlation coefficient and the covariance?
Correlation values are standardised, whereas covariance values are not. Correlation is the special case of covariance obtained when the data have been standardised.
28
Negative of using the covariance matrix?
The value of covariance is affected by the change in scale of the variables
29
What are the entries of the main diagonal on the correlation matrix?
The entries on the main diagonal of the correlation matrix are all 1, since the variables have been standardised.
30
Covariance of sphered variables?
Identity matrix
31
Is correlation a linear measure?
Yes
32
What is the squared Mahalanobis distance equal to?
The squared Euclidean distance of the sphered data
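A small NumPy check of this equivalence on synthetic data (the sphering transform used here is the inverse square root of the covariance, consistent with the sphering cards above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated synthetic data
cov = np.cov(X, rowvar=False)

d = X[0] - X[1]                                            # difference of two data points
sq_mahal = d @ np.linalg.inv(cov) @ d                      # squared Mahalanobis distance

vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T           # sphering transform
sq_eucl = np.sum((W @ d) ** 2)                             # squared Euclidean distance after sphering

print(np.isclose(sq_mahal, sq_eucl))                       # True
```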
33
Does the mahalanobis distance use the Covariance matrix or the Correlation coefficient?
The covariance matrix (more precisely, its inverse).
34
What are used as the parameters for multivariate normal density?
The mean vector and the covariance matrix.
35
How would you estimate the parameters of a multivariate normal density from a sample?
The parameters µ and Σ are estimated by maximum likelihood (or by Bayesian methods).
36
2 usual components of supervised learning
Systematic: the average response.
Random: the variability of the observations.
37
Likelihood function
The probability that a datum x was generated by the model, i.e. the conditional probability p(x | w), regarded as a function of the parameters w.
38
The overall likelihood for all the data makes what assumption?
Independence of observations
39
Error function for Regression with Gaussian noise
Sum-of-squares error (equivalently, mean squared error).
40
Error function for Classification
Cross entropy/log loss
41
pseudo-inverse of X
X† satisfies X†X = I when X is square and invertible (then X† = X⁻¹); when X is rectangular or singular, X† gives the best least-squares approximation.
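
For illustration, NumPy's `np.linalg.pinv` computes X† and yields the least-squares weights of a linear model (the data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                                    # rectangular design matrix
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)  # noisy targets

w = np.linalg.pinv(X) @ t        # w = X† t, the least-squares solution
# equivalent to np.linalg.lstsq(X, t, rcond=None)[0]
print(w)
```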
42
Negatives of the Linear Regression model
The contribution to the least-squares error is largest from the targets with the largest errors, so the model is susceptible to outliers.
p(t | x) is not always Gaussian.
43
Error function for Regression with Laplacian noise
sum of absolute errors
44
Approaches to Non-linear Regressions
Transfer functions
MLPs (multi-layer perceptrons)
Basis functions
45
Negative of MLP approach
May be difficult to learn
46
Examples of choices for basis functions (In Regression)
Fourier
Radial (Gaussian radial basis functions)
Wavelets
47
General/common properties of basis functions
Local; centred on (some of) the training data.
48
General method of using basis functions in Non-linear regression
Apply the (non-linear) basis functions to the inputs, then apply linear regression to the transformed features.
49
Define generalisation error
In supervised learning, a measure of how accurately an algorithm predicts outcome values for previously unseen data.
50
Consequence of too few and too many hidden units in the MLP model?
Too few: inflexible network, poor generalisation.
Too many: over-fitting, poor generalisation.
51
How can over-fitting be combatted?
Cross-validation
52
Outline the steps in Cross-validation and N-fold Cross-validation
1. Divide the training data into two sets: training and validation (a surrogate test set).
2. Train on the training set.
3. Evaluate the “test” error on the validation set.
4. Adjust the number of parameters/hidden units for best generalisation on the validation set.
K-fold cross-validation: reshuffle the data, randomly partition it into k training/validation splits, and average the validation error over all k splits.
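
A sketch of step 4 combined with k-fold cross-validation in scikit-learn (the regression task, the hidden-unit grid, and k = 5 are arbitrary choices for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# average validation score over 5 folds for each candidate number of hidden units
for hidden in (2, 8, 32):
    model = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(hidden, scores.mean())    # keep the size that generalises best
```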
53
Downside of k-fold cross validation?
K times more expensive than ordinary cross-validation. Not as good on small data sets.
54
Define regularisation
Regularisation is the process of adding information in order to prevent over-fitting, i.e. penalising overly complicated models via regularisation terms.
55
Examples of regularisation terms
Weight decay regularisation
Minimum description length
Support Vector Machines
56
How to determine alpha in Weight Decay regularisation?
Cross validation
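
As a sketch, using ridge regression as a stand-in for weight decay, scikit-learn's RidgeCV picks alpha by cross-validation over a grid (the data and the grid are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# try 13 weight-decay strengths and keep the one with the best cross-validation score
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)    # the alpha chosen by cross-validation
```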
57
What types of problems are SVMs designed for: Regression or classification?
Classification
58
Is the SVM model nonlinear or linear?
Linear
59
Aim of linear regression
Fitting a line, plane or hyperplane through data
60
Posterior probability
The probability that a data point belongs to a certain class: p(C_k | x).
61
Prior class probability
p(C_k); often estimated as the proportion of samples in C_k.
62
Bayes error
The Bayes error rate is the minimum error that can be achieved in a classification problem, obtained by assigning x to the class with the largest posterior probability. It is achieved if the posterior class probabilities are known exactly.
63
Confusion matrix, Ĉ
Ĉ_kj is the number of instances of class k that are classified as class j (a square matrix).
64
Confusion RATE matrix, C
Normalise the confusion matrix, Ĉ, so that each row sums to 1.
65
False positive
A test result which wrongly indicates that a particular condition or attribute is present.
66
What types of classifiers can be used in ROC analysis?
Soft and Probabilistic
67
Soft Classifier
Produces a score as an output: y_n = F(x_n; w) ∈ ℝ, i.e. the output is a continuous value. Classify x_n to C1 if F(x_n; w) > λ for some threshold λ; otherwise allocate x_n to C2.
68
Probabilistic Classifier
The classifier produces a probability score: y_n = F(x_n; w) ∈ (0, 1), which can be interpreted as the posterior probability p(C1 | x_n). Maximum accuracy is obtained by assigning x_n to C1 if F(x_n; w) > λ = 1/2; otherwise allocate x_n to C2.
69
Hard Classifier
The classifier produces an allocation to a class, i.e. the output is a class/category: y_n = F(x_n; w) ∈ {C1, C2}.
70
In ROC analysis, determine the measurements on the axis of the ROC curve
y-axis: true positive rate; x-axis: false positive rate.
71
What does ROC stand for?
Receiver Operating Characteristic
72
How to construct an ROC curve
For each threshold: 1) evaluate the confusion matrix; 2) plot TPR vs FPR.
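
With scikit-learn the sweep over thresholds is a single call; this sketch uses a placeholder dataset and logistic regression as the soft/probabilistic classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # TPR and FPR at every threshold
print(roc_auc_score(y_te, scores))               # area under the ROC curve
# plotting tpr against fpr draws the curve; the diagonal is the random classifier
```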
73
What does the diagonal line in a ROC curve depict?
The diagonal line shows the performance of the classifier that allocates at random
74
Another term for discriminating function
Separating surface
75
How to distinguish which class to allocate to by a discriminant function?
By its sign: define a function y(x) so that x is classified as class C1 iff y(x) > 0; y(x) = 0 gives the decision boundary.
76
Is logistic regression a regression model or a classification model?
Classifier
77
Error function for logistic regression
Cross Entropy function
78
Define separable classes
If the class conditional density functions do not overlap then the classes are separable.
79
Give an example of when Bayes error is 0
When the classes are separable, i.e. the class conditional density functions do not overlap.
80
Linear separability
Classes that can be separated by a line or (hyper)plane
81
What are the outputs of the Logistic discriminant?
Approximate posterior probabilities
82
How to use basis functions for classification
Transform x; then use logistic regression
83
Non-linear models for classification
MLP; basis functions.
84
Cover’s theorem
A dataset mapped to a higher dimensional space is more likely to be linearly separable
85
How to design a hard classifier (from the MLP model)
Make the activation function a step function.
86
Loss matrix L_kj
L_kj quantifies the penalty of assigning x to C_k when it belongs to C_j. Costs are relative to an additive constant; often the cost of correct classification is counted as zero, so L_kk = 0 for all k.
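
A tiny worked example with made-up numbers: the expected loss of each possible assignment is the loss-weighted sum of the posteriors, and x is assigned to the class with the smallest expected loss.

```python
import numpy as np

# L[k, j] = cost of assigning x to class k when it really belongs to class j
L = np.array([[0.0, 10.0],     # assigning to C1: free if correct, costly if the truth is C2
              [1.0,  0.0]])    # assigning to C2: small cost if the truth is C1

posterior = np.array([0.7, 0.3])    # p(C1 | x), p(C2 | x)
expected_loss = L @ posterior       # expected cost of each possible assignment
print(expected_loss)                # [3.0, 0.7]
print(np.argmin(expected_loss))     # 1 -> assign to C2, despite p(C1 | x) being larger
```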
87
Risk of a loss matrix
The average cost of making a classification, averaged over all classes.
88
Activation function in logistic discriminant?
Sigmoidal: g(a) = 1 / (1 + exp(−a)).
89
Output of logistic discriminant?
0 < y(x) < 1; the output may be interpreted as a probability of class membership, and the outputs approximate the posterior probabilities.
90
Types of kernels
Linear, RBF, sigmoidal.
91
Properties of the types of basis function used in classification?
Local; radially symmetric; centred on (some of) the training data.
92
Role of w0 in Linear Discriminant analysis?
w0 is the bias; it controls the distance of the (linear) boundary from the origin.
93
Order of partition in clustering - give another term for this and define.
Order of partition = model complexity: how many clusters we are using to model the data.
94
Approaches to clustering
Sequential clustering
Hierarchical algorithms
Algorithms based on optimisation
Spectral clustering
Graph-theoretical methods
Statistical approaches
95
Properties of clusters in hard clustering
No cluster can be empty.
The union of all the clusters is the whole data set.
The intersection of two distinct clusters is the empty set.
96
Inputs of the Sequential clustering
The n input elements, in order
A dissimilarity measure d(·,·)
The cluster radius
The maximum number of clusters Q
97
Pros and cons of Hierarchical clustering
PROS: since it generates a hierarchy of partitions at different resolutions, it is flexible and allows a choice of resolution level; it can be applied to many different data types (not only vectors).
CONS: computationally heavy.
98
Name the 2 types of algorithms in Hierarchical clustering
Agglomerative and divisive.
99
Dendrogram of clustering
A tree diagram recording the sequence of clusterings
100
Goal of clustering (what do we want to optimise?)
Maximise the intra-cluster similarity while minimising the inter-cluster similarity. In general, a partition is “good” if its clusters are compact and well separated.
101
Initial starting point in agglomerative and divisive clustering approaches.
Agglomerative: start from N singleton clusters.
Divisive: start from a single cluster containing all the data.
102
3 main graph clustering methods
Topology based (minimum spanning tree)
Spectral (graph in matrix form)
Random walks
103
Negatives of graphical clustering methods
Demanding in terms of space and time (computational) complexity.
104
What are you trying to minimise in K-clustering
Either: the sum of the squared Euclidean distances between each data point and all the other data points within its cluster; or: the sum of the squared Euclidean distances between each data point and a cluster representative, within each cluster.
105
Pros and cons of K-means clustering
PROS: it is easy to implement; it can be applied to virtually any input domain (i.e. input data not defined as vectors of real numbers).
CONS: it typically finds sub-optimal solutions; it does not perform well on high-dimensional data or on datasets with clusters that cannot be modelled by covariances; it does not work well for clusters with different variances.
106
Inputs of k-means clustering
Data
Order
K (number of clusters)
Maximum number of iterations
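
A scikit-learn sketch with these inputs mapped onto KMeans arguments (the blob data and K = 3 are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # the data

km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])         # cluster assignment of the first 10 points
print(km.cluster_centers_)     # the K cluster representatives (centroids)
print(km.inertia_)             # sum of squared distances to the assigned centroids
```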
107
What are kernel functions a measure of (in clustering context)?
Similarity (not distance).
108
A method of clustering for non-spherical/ellipsoid data/nonlinear patterns.
Kernel k-means
109
Example of a kernel function commonly used in kernel k-means clustering
RBF: k(x, z) = exp(−γ ‖x − z‖²) (Gaussian, with γ = 1/(2σ²)).
110
In kernel k-means, is the mapping of ϕ implicit/explicit?
Implicit
111
Negatives of kernel k-means clustering
More hyper-parameters to define. Since the embedding is implicit, the centroids are not defined explicitly, so there is no explicit model formulation; hence it is difficult to extend to out-of-sample data.
112
Another term for a reference partition in clustering validation?
Ground-truth
113
Key difference of Spectral clustering
Uses a similarity MATRIX
114
Can spectral clustering be applied to data that isn't in vector format?
Yes, as long as you can quantify similarities between inputs.
115
Give an example of S in spectral clustering for vector data input
RBF kernel
116
How to determine the number of connected components in spectral clustering?
The number of zero eigenvalues of L, the Laplacian matrix.
117
What are the range of eigenvalues of the normalised Laplacian matrix in spectral clustering?
[0,2]
118
Is it guaranteed that there will be at least one zero eigenvalue of L in spectral clustering?
Yes, since the graph has at least one connected component.
119
What are indicator vectors in Spectral Clustering?
Binary vectors that encode which vertices belong to which (connected) component.
120
Complexity of Spectral clustering
With n data points (S is n×n), the eigendecomposition has complexity O(n³).
121
S_ij in spectral clustering
S is the (symmetric) similarity matrix; S_ij = s(x_i, x_j) denotes the similarity between x_i and x_j.
122
In spectral clustering, define: W, deg(v_i), D, Volume, U, L
W: the weighted adjacency matrix
deg(v_i): the sum of the weights of all the edges attached to v_i
D = diag(deg(v_1), deg(v_2), ..., deg(v_n))
Volume: if A is a subset of vertices, its volume is the sum of the degrees of all the vertices in A
U: the matrix of eigenvectors, ordered by increasing eigenvalue
L = D - W
123
Properties of L in spectral clustering
L is positive semi-definite, i.e. it has non-negative real eigenvalues. λ = 0 is always an eigenvalue of L, with the constant vector 1 as the corresponding eigenvector.
124
Types of similarity graphs:
Fully connected graph: all pairs connected, with weights e.g. from the RBF kernel.
k-nearest neighbour: connect v_i and v_j if v_j is among the k nearest neighbours of v_i, or vice versa.
epsilon-neighbourhood: thresholding of weights, W_ij = w_ij if d(v_i, v_j) ≤ epsilon; 0 otherwise.
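
A compact NumPy sketch of the pipeline these cards describe (fully connected RBF similarity graph, unnormalised Laplacian, eigenvectors of the smallest eigenvalues, then k-means); gamma, the dataset, and the number of clusters are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

gamma = 10.0
S = np.exp(-gamma * pairwise_distances(X) ** 2)    # RBF similarity matrix (fully connected graph)
D = np.diag(S.sum(axis=1))                         # degree matrix
L = D - S                                          # unnormalised graph Laplacian, L = D - W

vals, vecs = np.linalg.eigh(L)                     # eigenvalues in increasing order
U = vecs[:, :2]                                    # eigenvectors of the 2 smallest eigenvalues
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(U)
print(labels[:20])
```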
125
The curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.
126
An orthonormal matrix
A square, real-valued matrix U whose rows and columns are orthonormal vectors: UU^T = I, where I is the identity matrix, so U^(-1) = U^T.
127
PCA basis vectors
Principal components
128
What is the choice of U trying to minimise in PCA?
The approximation error
129
Are the PCs correlated?
No
130
When should you use PCA on the correlation matrix instead of the covariance?
When the input features are on very different scales/ranges, i.e. the standard deviations of the features are very different.
131
In PCA, what do the corresponding eigenvalues of eigendecomposition tell you?
λ_k is the k-th eigenvalue; it measures the importance (variance) of the k-th PC, i.e. it quantifies the mean squared projection (variance) onto that principal component.
132
Generally, how to determine how many PCs you need?
Check the cumulative explained variance and select a number of PCs that explains at least ~80% of it.
133
Cumulative explained variance graph
First divide each eigenvalue by the sum of all the eigenvalues; then plot the running (cumulative) sum of these ratios. This can help determine where the cut-off for the number of PCs should be. Note that each eigenvalue gives the variance covered by its PC.
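
A scikit-learn sketch of this procedure (the Iris data and the 80% threshold from the previous card are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)

ratios = pca.explained_variance_ratio_               # each eigenvalue divided by the sum of eigenvalues
cumulative = np.cumsum(ratios)                        # running sum = cumulative explained variance
n_pcs = int(np.searchsorted(cumulative, 0.80)) + 1    # smallest number of PCs covering >= 80%
print(cumulative, n_pcs)
```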
134
What is a significant unique (good) feature of PCA?
It allows back-projection into the input space from the space spanned by the PCs.
135
Aims of PCA
PCA minimises the approximation error when projecting the data onto a linear subspace of a given dimension, and provides “natural” coordinates for the data: uncorrelated and compact.
136
As well as dimensionality reduction, what else can PCA be used for?
Noise reduction
137
2 main types of classifiers
Generative and discriminative.
Generative (model the joint distribution): e.g. Gaussian mixture model, linear regression, LDA.
Discriminative (model the conditional distribution, or no distribution): e.g. logistic regression.
If the observed data are truly sampled from the generative model, then fitting the parameters of the generative model to maximise the data likelihood is a common method.
138
p(x | C_k) vs p(C_k | x)
p(x | C_k): class conditional density. p(C_k | x): posterior probability.