Learning From Data Flashcards

(138 cards)

1
Q

Difference between Classification and Regression

A

Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity.
2
Q

multi-class classification problem

A

A problem with more than two classes

3
Q

multi-label classification problem.

A

A problem where an example is assigned multiple classes

4
Q

datum

A

The singular of “data”: a single piece of information; also, a fixed starting point of a scale or operation.

5
Q

k nearest neighbours

A

1) Find the k nearest neighbours to x in the training data.

2) Assign x to the class that is most common among those k neighbours (majority vote).
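
A minimal NumPy sketch of this procedure (the function name, the use of Euclidean distance, and the tie-breaking are illustrative assumptions, not part of the card):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    # distance from x to every training point (Euclidean distance assumed)
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among the labels of those neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```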

6
Q

Unsupervised Learning

A

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

Given: data D = {x_n}, n = 1, ..., N, and a parameterised generative model describing how the data might be generated, p(x; w), depending on parameters w.

7
Q

Supervised Learning

A

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.

8
Q

Hyperparameter

A

a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training

9
Q

Multi variate data

A

More than one variable is measured on each individual in a sample

10
Q

Centroid

A

The vector whose components are the means of each variable (the mean vector of the data).

11
Q

Properties of data that has been sphered

A

For each variable: mean = 0, variance = 1.
All the variables are mutually uncorrelated.
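
One way to sphere data is via the eigendecomposition of the sample covariance matrix; the sketch below assumes that construction (it is not necessarily the course's exact recipe):

```python
import numpy as np

def sphere(X):
    """Return a transformed copy of X with zero means, unit variances,
    and (approximately) zero correlations between variables."""
    Xc = X - X.mean(axis=0)                            # mean of each variable becomes 0
    cov = np.cov(Xc, rowvar=False)                     # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)                   # eigendecomposition of the covariance
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # inverse square root of the covariance
    return Xc @ W                                      # sphered data: covariance ~ identity
```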

12
Q

Disadvantages to Euclidean Distance

A

Euclidean distance is popular for numerical data, but:
it gives equal weight to all variables;
it disregards correlations between variables.

13
Q

Reasons for sphering

A

Sphering the data puts the variables on an equal footing and removes (linear) correlations

14
Q

What can i.i.d. stand for

A

independent and identically distributed

15
Q

Examples of Unsupervised Learning methods

A
Clustering
Gaussian distribution
Mixture model
Principal Component Analysis
Kohonen maps (SOMs)
16
Q

Deterministic Model

A

In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions.

17
Q

Main aim of classification

A

Train a machine F to map features to targets

18
Q

Main aim of regression

A

Train a machine F to map features to continuous targets

19
Q

What are the different types/formats of variables?

A

numerical: continuous or discrete
categorical: nominal or ordinal
binary: presence/absence or 2-state categorical

20
Q

In a data matrix, what does X_nd refer to?

A

X_nd is the value of the dth variable for the nth individual,

i.e. observations are rows.

21
Q

How to measure association between 2 variables?

A

Covariance, (S_12)^2

22
Q

Mean and variance of a standardised variable

A
Mean = 0 
Variance = 1
23
Q

‘Standardised measure of association’ between variables

A

Correlation coefficient, R_12

24
Q

Does the correlation coefficient lie in a given range?

A

Yes

[-1,1]

25
Interpret what the values of the correlation coefficient imply.
R_12 > 0: the variables increase and decrease together
R_12 < 0: one variable decreases as the other increases
R_12 ≈ 0: the variables are not associated (roughly circular scatter diagram)
26
How to obtain the correlation coefficient?
obtained by dividing the covariance by the product of the standard deviations
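A quick NumPy check on made-up numbers (illustrative only):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

cov12 = np.cov(x1, x2)[0, 1]                              # covariance of x1 and x2
r12 = cov12 / (np.std(x1, ddof=1) * np.std(x2, ddof=1))   # divide by product of std devs
print(np.isclose(r12, np.corrcoef(x1, x2)[0, 1]))         # True: equals the correlation coefficient
```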
27
What is the main difference between the correlation coefficient and the covariance?
Correlation values are standardised, whereas covariance values are not. Correlation is the special case of covariance obtained when the data have been standardised.
28
Negative of using the covariance matrix?
The value of covariance is affected by the change in scale of the variables
29
What are the entries of the main diagonal on the correlation matrix?
The entries on the main diagonal of the correlation matrix are all 1, since the variables have been standardised.
30
Covariance of sphered variables?
Identity matrix
31
Is correlation a linear measure?
Yes
32
What is the squared Mahalanobis distance equal to?
The squared Euclidean distance of the sphered data
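A small NumPy check of this equivalence on synthetic data (the sphering transform used here is the inverse square root of the covariance, consistent with the sphering cards above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated synthetic data
cov = np.cov(X, rowvar=False)

d = X[0] - X[1]                                            # difference of two data points
sq_mahal = d @ np.linalg.inv(cov) @ d                      # squared Mahalanobis distance

vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T           # sphering transform
sq_eucl = np.sum((W @ d) ** 2)                             # squared Euclidean distance after sphering

print(np.isclose(sq_mahal, sq_eucl))                       # True
```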
33
Does the mahalanobis distance use the Covariance matrix or the Correlation coefficient?
The covariance matrix (more precisely, its inverse).
34
What are used as the parameters for multivariate normal density?
The mean vector and the covariance matrix.
35
How would you estimate the parameters of a multivariate normal density from a sample?
The parameters µ and Σ are estimated by maximum likelihood (or by Bayesian methods).
36
2 usual components of supervised learning
Systematic: the average response.
Random: the variability of the observations.
37
Likelihood function
The probability that a datum x was generated by the model, i.e. the conditional probability p(x | w), regarded as a function of the parameters w.
38
The overall likelihood for all the data makes what assumption?
Independence of observations
39
Error function for Regression with Gaussian noise
Sum-of-squares error (equivalently, mean squared error).
40
Error function for Classification
Cross entropy/log loss
41
pseudo-inverse of X
X† satisfies X†X = I when X is square and invertible (then X† = X⁻¹); when X is rectangular or singular, X† gives the best least-squares approximation.
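
For illustration, NumPy's `np.linalg.pinv` computes X† and yields the least-squares weights of a linear model (the data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                                    # rectangular design matrix
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)  # noisy targets

w = np.linalg.pinv(X) @ t        # w = X† t, the least-squares solution
# equivalent to np.linalg.lstsq(X, t, rcond=None)[0]
print(w)
```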
42
Negatives of the Linear Regression model
The contribution to the least-squares error is largest from the targets with the largest errors, so the model is susceptible to outliers.
p(t | x) is not always Gaussian.
43
Error function for Regression with Laplacian noise
sum of absolute errors
44
Approaches to Non-linear Regressions
Transfer functions
MLPs (multi-layer perceptrons)
Basis functions
45
Negative of MLP approach
May be difficult to learn
46
Examples of choices for basis functions (In Regression)
Fourier
Radial (Gaussian radial basis functions)
Wavelets
47
General/common properties of basis functions
Local; centred on (some of) the training data.
48
General method of using basis functions in Non-linear regression
Apply the (non-linear) basis functions to the inputs, then apply linear regression to the transformed features.
49
Define generalisation error
In supervised learning, a measure of how accurately an algorithm predicts outcome values for previously unseen data.
50
Consequence of too few and too many hidden units in the MLP model?
Too few: inflexible network, poor generalisation.
Too many: over-fitting, poor generalisation.
51
How can over-fitting be combatted?
Cross-validation
52
Outline the steps in Cross-validation and N-fold Cross-validation
1. Divide the training data into two sets: training and validation (a surrogate test set).
2. Train on the training set.
3. Evaluate the “test” error on the validation set.
4. Adjust the number of parameters/hidden units for best generalisation on the validation set.
K-fold cross-validation: reshuffle the data, randomly partition it into k training/validation splits, and average the validation error over all k splits.
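
A sketch of step 4 combined with k-fold cross-validation in scikit-learn (the regression task, the hidden-unit grid, and k = 5 are arbitrary choices for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# average validation score over 5 folds for each candidate number of hidden units
for hidden in (2, 8, 32):
    model = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(hidden, scores.mean())    # keep the size that generalises best
```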
53
Downside of k-fold cross validation?
K times more expensive than ordinary cross-validation. Not as good on small data sets.
54
Define regularisation
Regularisation is the process of adding information in order to prevent over-fitting, i.e. penalising overly complicated models via regularisation terms.
55
Examples of regularisation terms
Weight decay regularisation
Minimum description length
Support Vector Machines
56
How to determine alpha in Weight Decay regularisation?
Cross validation
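
As a sketch, using ridge regression as a stand-in for weight decay, scikit-learn's RidgeCV picks alpha by cross-validation over a grid (the data and the grid are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# try 13 weight-decay strengths and keep the one with the best cross-validation score
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)    # the alpha chosen by cross-validation
```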
57
What types of problems are SVMs designed for: Regression or classification?
Classification
58
Is the SVM model nonlinear or linear?
Linear
59
Aim of linear regression
Fitting a line, plane or hyperplane through data
60
Posterior probability
The probability that a data point belongs to a certain class: p(C_k | x).
61
Prior class probability
p(C_k); often estimated as the proportion of samples in C_k.
62
Bayes error
The Bayes error rate is the minimum error that can be achieved in a classification problem, obtained by assigning x to the class with the largest posterior probability. It is achieved if the posterior class probabilities are known exactly.
63
Confusion matrix, Ĉ
Ĉ_kj is the number of instances of class k that are classified as class j (a square matrix).
64
Confusion RATE matrix, C
Normalise the confusion matrix, Ĉ, so that each row sums to 1.
65
False positive
A test result which wrongly indicates that a particular condition or attribute is present.
66
What types of classifiers can be used in ROC analysis?
Soft and Probabilistic
67
Soft Classifier
Produces a score as an output: y_n = F(x_n; w) ∈ ℝ, i.e. the output is a continuous value. Classify x_n to C1 if F(x_n; w) > λ for some threshold λ; otherwise allocate x_n to C2.
68
Probabilistic Classifier
The classifier produces a probability score: y_n = F(x_n; w) ∈ (0, 1), which can be interpreted as the posterior probability p(C1 | x_n). Maximum accuracy is obtained by assigning x_n to C1 if F(x_n; w) > λ = 1/2; otherwise allocate x_n to C2.
69
Hard Classifier
The classifier produces an allocation to a class, i.e. the output is a class/category: y_n = F(x_n; w) ∈ {C1, C2}.
70
In ROC analysis, determine the measurements on the axis of the ROC curve
y-axis: true positive rate; x-axis: false positive rate.
71
What does ROC stand for?
Receiver Operating Characteristic
72
How to construct an ROC curve
For each threshold: 1) evaluate the confusion matrix; 2) plot TPR vs FPR.
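
With scikit-learn the sweep over thresholds is a single call; this sketch uses a placeholder dataset and logistic regression as the soft/probabilistic classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # TPR and FPR at every threshold
print(roc_auc_score(y_te, scores))               # area under the ROC curve
# plotting tpr against fpr draws the curve; the diagonal is the random classifier
```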
73
What does the diagonal line in a ROC curve depict?
The diagonal line shows the performance of the classifier that allocates at random
74
Another term for discriminating function
Separating surface
75
How to distinguish which class to allocate to by a discriminant function?
By its sign: define a function y(x) so that x is classified as class C1 iff y(x) > 0; y(x) = 0 gives the decision boundary.
76
Is logistic regression a regression model or a classification model?
Classifier
77
Error function for logistic regression
Cross Entropy function
78
Define separable classes
If the class conditional density functions do not overlap then the classes are separable.
79
Give an example of when Bayes error is 0
When the classes are separable, i.e. the class conditional density functions do not overlap.
80
Linear separability
Classes that can be separated by a line or (hyper)plane
81
What are the outputs of the Logistic discriminant?
Approximate posterior probabilities
82
How to use basis functions for classification
Transform x; then use logistic regression
83
Non-linear models for classification
MLP; basis functions.
84
Cover’s theorem
A dataset mapped to a higher dimensional space is more likely to be linearly separable
85
How to design a hard classifier (from the MLP model)
Make the activation function a step function.
86
Loss matrix L_kj
L_kj quantifies the penalty of assigning x to C_k when it belongs to C_j. Costs are relative to an additive constant; often the cost of correct classification is counted as zero, so L_kk = 0 for all k.
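
A tiny worked example with made-up numbers: the expected loss of each possible assignment is the loss-weighted sum of the posteriors, and x is assigned to the class with the smallest expected loss.

```python
import numpy as np

# L[k, j] = cost of assigning x to class k when it really belongs to class j
L = np.array([[0.0, 10.0],     # assigning to C1: free if correct, costly if the truth is C2
              [1.0,  0.0]])    # assigning to C2: small cost if the truth is C1

posterior = np.array([0.7, 0.3])    # p(C1 | x), p(C2 | x)
expected_loss = L @ posterior       # expected cost of each possible assignment
print(expected_loss)                # [3.0, 0.7]
print(np.argmin(expected_loss))     # 1 -> assign to C2, despite p(C1 | x) being larger
```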
87
Risk of a loss matrix
The average cost of making a classification, averaged over all classes.
88
Activation function in logistic discriminant?
Sigmoidal: g(a) = 1 / (1 + exp(−a)).
89
Output of logistic discriminant?
0 < y(x) < 1; the output may be interpreted as a probability of class membership, and the outputs approximate the posterior probabilities.
90
Types of kernels
Linear, RBF, sigmoidal.
91
Properties of the types of basis function used in classification?
Local; radially symmetric; centred on (some of) the training data.
92
Role of w0 in Linear Discriminant analysis?
w0 is the bias; it controls the distance of the (linear) boundary from the origin.
93
Order of partition in clustering - give another term for this and define.
Order of partition = model complexity: how many clusters we are using to model the data.
94
Approaches to clustering
Sequential clustering
Hierarchical algorithms
Algorithms based on optimisation
Spectral clustering
Graph-theoretical methods
Statistical approaches
95
Properties of clusters in hard clustering
No cluster can be empty.
The union of all the clusters is the whole data set.
The intersection of two distinct clusters is the empty set.
96
Inputs of the Sequential clustering
The n input elements, in order
A dissimilarity measure d(·,·)
The cluster radius
The maximum number of clusters Q
97
Pros and cons of Hierarchical clustering
PROS: since it generates a hierarchy of partitions at different resolutions, it is flexible and allows a choice of resolution level; it can be applied to many different data types (not only vectors).
CONS: computationally heavy.
98
Name the 2 types of algorithms in Hierarchical clustering
Agglomerative and divisive.
99
Dendrogram of clustering
A tree diagram recording the sequence of clusterings
100
Goal of clustering (what do we want to optimise?)
Maximise the intra-cluster similarity while minimising the inter-cluster similarity. In general, a partition is “good” if its clusters are compact and well separated.
101
Initial starting point in agglomerative and divisive clustering approaches.
Agglomerative: start from N singleton clusters.
Divisive: start from a single cluster containing all the data.
102
3 main graph clustering methods
Topology based (minimum spanning tree)
Spectral (graph in matrix form)
Random walks
103
Negatives of graphical clustering methods
Demanding in terms of space and time (computational) complexity.
104
What are you trying to minimise in K-clustering
Either: the sum of the squared Euclidean distances between each data point and all the other data points within its cluster; or: the sum of the squared Euclidean distances between each data point and a cluster representative, within each cluster.
105
Pros and cons of K-means clustering
PROS: it is easy to implement; it can be applied to virtually any input domain (i.e. input data not defined as vectors of real numbers).
CONS: it typically finds sub-optimal solutions; it does not perform well on high-dimensional data or on datasets with clusters that cannot be modelled by covariances; it does not work well for clusters with different variances.
106
Inputs of k-means clustering
Data
Order
K (number of clusters)
Maximum number of iterations
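
A scikit-learn sketch with these inputs mapped onto KMeans arguments (the blob data and K = 3 are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # the data

km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])         # cluster assignment of the first 10 points
print(km.cluster_centers_)     # the K cluster representatives (centroids)
print(km.inertia_)             # sum of squared distances to the assigned centroids
```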
107
What are kernel functions a measure of (in clustering context)?
Similarity (not distance).
108
A method of clustering for non-spherical/ellipsoid data/nonlinear patterns.
Kernel k-means
109
Example of a kernel function commonly used in kernel k-means clustering
RBF: k(x, z) = exp(−γ ‖x − z‖²) (Gaussian, with γ = 1/(2σ²)).
110
In kernel k-means, is the mapping of ϕ implicit/explicit?
Implicit
111
Negatives of kernel k-means clustering
More hyper-parameters to define. Since the embedding is implicit, the centroids are not defined explicitly, so there is no explicit model formulation; hence it is difficult to extend to out-of-sample data.
112
Another term for a reference partition in clustering validation?
Ground-truth
113
Key difference of Spectral clustering
Uses a similarity MATRIX
114
Can spectral clustering be applied to data that isn't in vector format?
Yes, as long as you can quantify similarities between inputs.
115
Give an example of S in spectral clustering for vector data input
RBF kernel
116
How to determine the number of connected components in spectral clustering?
The number of zero eigenvalues of L, the Laplacian matrix.
117
What are the range of eigenvalues of the normalised Laplacian matrix in spectral clustering?
[0,2]
118
Is it guaranteed that there will be at least one zero eigenvalue of L in spectral clustering?
Yes, since the graph has at least one connected component.
119
What are indicator vectors in Spectral Clustering?
Binary vectors that encode which vertices belong to which (connected) component.
120
Complexity of Spectral clustering
With n data points (S is n×n), the eigendecomposition has complexity O(n³).
121
S_ij in spectral clustering
S is the (symmetric) similarity matrix; S_ij = s(x_i, x_j) denotes the similarity between x_i and x_j.
122
In spectral clustering, define: W, deg(v_i), D, Volume, U, L
W: the weighted adjacency matrix
deg(v_i): the sum of the weights of all the edges attached to v_i
D = diag(deg(v_1), deg(v_2), ..., deg(v_n))
Volume: if A is a subset of vertices, its volume is the sum of the degrees of all the vertices in A
U: the matrix of eigenvectors, ordered by increasing eigenvalue
L = D - W
123
Properties of L in spectral clustering
L is positive semi-definite, i.e. it has non-negative real eigenvalues. λ = 0 is always an eigenvalue of L, with the constant vector 1 as the corresponding eigenvector.
124
Types of similarity graphs:
Fully connected graph: all pairs connected, with weights e.g. from the RBF kernel.
k-nearest neighbour: connect v_i and v_j if v_j is among the k nearest neighbours of v_i, or vice versa.
epsilon-neighbourhood: thresholding of weights, W_ij = w_ij if d(v_i, v_j) ≤ epsilon; 0 otherwise.
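
A compact NumPy sketch of the pipeline these cards describe (fully connected RBF similarity graph, unnormalised Laplacian, eigenvectors of the smallest eigenvalues, then k-means); gamma, the dataset, and the number of clusters are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

gamma = 10.0
S = np.exp(-gamma * pairwise_distances(X) ** 2)    # RBF similarity matrix (fully connected graph)
D = np.diag(S.sum(axis=1))                         # degree matrix
L = D - S                                          # unnormalised graph Laplacian, L = D - W

vals, vecs = np.linalg.eigh(L)                     # eigenvalues in increasing order
U = vecs[:, :2]                                    # eigenvectors of the 2 smallest eigenvalues
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(U)
print(labels[:20])
```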
125
The curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.
126
An orthonormal matrix
A square, real-valued matrix U whose rows and columns are orthonormal vectors: UU^T = I, where I is the identity matrix, so U^(-1) = U^T.
127
PCA basis vectors
Principal components
128
What is the choice of U trying to minimise in PCA?
The approximation error
129
Are the PCs correlated?
No
130
When should you use PCA on the correlation matrix instead of the covariance?
When the input features are on very different scales/ranges, i.e. the standard deviations of the features are very different.
131
In PCA, what do the corresponding eigenvalues of eigendecomposition tell you?
λ_k is the k-th eigenvalue; it measures the importance (variance) of the k-th PC, i.e. it quantifies the mean squared projection (variance) onto that principal component.
132
Generally, how to determine how many PCs you need?
Check the cumulative explained variance and select a number of PCs that explains at least ~80% of it.
133
Cumulative explained variance graph
First divide each eigenvalue by the sum of all the eigenvalues; then plot the running (cumulative) sum of these ratios. This can help determine where the cut-off for the number of PCs should be. Note that each eigenvalue gives the variance covered by its PC.
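
A scikit-learn sketch of this procedure (the Iris data and the 80% threshold from the previous card are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)

ratios = pca.explained_variance_ratio_               # each eigenvalue divided by the sum of eigenvalues
cumulative = np.cumsum(ratios)                        # running sum = cumulative explained variance
n_pcs = int(np.searchsorted(cumulative, 0.80)) + 1    # smallest number of PCs covering >= 80%
print(cumulative, n_pcs)
```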
134
What is a significant unique (good) feature of PCA?
It allows back-projection into the input space from the space spanned by the PCs.
135
Aims of PCA
PCA minimises the approximation error when projecting the data onto a linear subspace of a given dimension, and provides “natural” coordinates for the data: uncorrelated and compact.
136
As well as dimensionality reduction, what else can PCA be used for?
Noise reduction
137
2 main types of classifiers
Generative and discriminative.
Generative (model the joint distribution): e.g. Gaussian mixture model, linear regression, LDA.
Discriminative (model the conditional distribution, or no distribution): e.g. logistic regression.
If the observed data are truly sampled from the generative model, then fitting the parameters of the generative model to maximise the data likelihood is a common method.
138
p(x | C_k) vs p(C_k | x)
p(x | C_k): class conditional density. p(C_k | x): posterior probability.