Week 7: Feature Extraction Flashcards

1
Q

Feature Engineering

A

The process of transforming measurements to make them more useful for classification. Examples include translating dates to the day of the week.
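A minimal sketch of the day-of-week example in Python (the date value is just an illustration):

```python
from datetime import date

d = date(2024, 3, 15)           # a raw measurement: a calendar date (example value)
day_name = d.strftime("%A")     # engineered feature: "Friday"
day_index = d.weekday()         # or as an integer: Monday = 0 ... Sunday = 6
```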

2
Q

Feature Selection

A

The process of choosing the variables that are most relevant and useful for classification. This helps reduce the risk of over-fitting and reduces the training time required for classification. It’s important to weigh how difficult a measurement is to obtain against how informative it is. Domain knowledge and data analysis may be needed to find the proper subset of features.

3
Q

Feature Extraction

A

The process of finding and calculating features that are functions of raw data or selected features. This helps reduce the dimensionality of the data. An example would be calculating BMI from body height and body weight.
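A minimal sketch of the BMI example, assuming weight in kilograms and height in metres:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Extract a single feature (BMI) from two raw measurements."""
    return weight_kg / height_m ** 2

print(bmi(70.0, 1.75))  # ~22.9
```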

4
Q

Principal Component Analysis (PCA)

A

The process of taking N-D data and finding M (M \le N) orthogonal directions in which the data has the most variance.

5
Q

Karhunen-Loève Transformation (KLT)

A

A popular method for performing PCA. First calculate the mean \mu of all vectors. Then calculate the covariance matrix C. Then find the eigenvalues (E) and eigenvectors (V) of the covariance matrix, such that C = VEV^T. Form the matrix \hat{V} from the M principal components (the eigenvectors with the largest eigenvalues) and apply it to the data: y_i = \hat{V}^T(x_i - \mu)
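A minimal NumPy sketch of these steps (data stored one sample per row is an assumption of the sketch):

```python
import numpy as np

def klt(X, M):
    """KLT/PCA sketch: X is an (n_samples, N) data matrix; keep M <= N components."""
    mu = X.mean(axis=0)                   # 1. mean of all vectors
    C = np.cov(X - mu, rowvar=False)      # 2. covariance matrix
    evals, V = np.linalg.eigh(C)          # 3. eigen-decomposition, C = V E V^T
    order = np.argsort(evals)[::-1]       #    sort by decreasing variance
    V_hat = V[:, order[:M]]               # 4. keep the M principal components
    Y = (X - mu) @ V_hat                  # 5. project: y_i = V_hat^T (x_i - mu)
    return Y, V_hat, mu
```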

6
Q

Neural Networks for PCA

A

Using zero-mean data, linear neurons, and specific learning rules, PCA can be done with neural networks.

7
Q

Hebbian Learning

A

A learning method for PCA Neural Networks. \Delta w = \eta y x^t. This method aligns w with the 1st principal component.
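A minimal NumPy sketch of this rule for a single linear neuron; the per-epoch renormalisation is an addition to keep the weight bounded, since the raw Hebbian update grows without limit:

```python
import numpy as np

def hebbian_pc1(X, eta=0.01, epochs=50):
    """Plain Hebbian updates for a single linear neuron: delta_w = eta * y * x."""
    w = np.random.randn(X.shape[1])
    for _ in range(epochs):
        for x in X:                  # X is zero-mean, one sample per row
            y = w @ x                # linear neuron output
            w += eta * y * x         # Hebbian update (unbounded on its own)
        w /= np.linalg.norm(w)       # renormalise; only the direction is of interest
    return w                         # points along the 1st principal component
```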

8
Q

Oja’s Rule

A

A learning method for PCA Neural Networks. \Delta w = \eta y (x^t - yw). This method aligns w with the 1st principal component, with the weight decay term causing w to approach unit length.
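A minimal NumPy sketch, assuming zero-mean data stored one sample per row:

```python
import numpy as np

def oja_pc1(X, eta=0.01, epochs=50):
    """Oja's rule: delta_w = eta * y * (x - y * w); w tends to unit length."""
    w = np.random.randn(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:                      # X assumed zero-mean
            y = w @ x
            w += eta * y * (x - y * w)   # Hebbian term plus weight-decay term
    return w                             # unit-length 1st principal component
```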

9
Q

Sanger’s Rule

A

For the j-th neuron, subtract from x the contributions of the neurons representing the first j-1 principal components, then apply Oja’s Rule. w_j will then learn the j-th principal component.
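A minimal NumPy sketch using the standard matrix form of Sanger’s rule (the lower-triangular term performs the subtraction described above):

```python
import numpy as np

def sanger(X, M, eta=0.01, epochs=100):
    """Sanger's rule (generalised Hebbian algorithm): rows of W learn the first M PCs."""
    W = np.random.randn(M, X.shape[1]) * 0.1
    for _ in range(epochs):
        for x in X:                                          # X assumed zero-mean
            y = W @ x                                        # outputs of the M neurons
            # neuron j sees x minus the parts explained by neurons 1..j, then Oja's rule
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```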

10
Q

Oja’s Subspace Rule

A

This rule will learn the subspace spanned by the first n principal components (i.e. the individual principal components are not recovered in any particular order). The rule is identical to the rule used to update the Negative Feedback Network: \Delta W = \eta y (x - W^t y)^t
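A minimal NumPy sketch of the rule, with the feedback term written out explicitly:

```python
import numpy as np

def oja_subspace(X, M, eta=0.01, epochs=100):
    """Oja's subspace rule: delta_W = eta * outer(y, x - W^T y)."""
    W = np.random.randn(M, X.shape[1]) * 0.1
    for _ in range(epochs):
        for x in X:                          # X assumed zero-mean
            y = W @ x                        # forward pass
            e = x - W.T @ y                  # negative-feedback / reconstruction error
            W += eta * np.outer(y, e)        # rows of W come to span the top-M PC subspace
    return W
```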

11
Q

Autoencoder and PCA

A

An autoencoder with linear units can also learn PCA, using negative feedback learning (Oja’s subspace rule) and backpropagation of error.
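A minimal sketch of the backpropagation route, assuming untied encoder/decoder weights and plain stochastic gradient descent on the reconstruction error:

```python
import numpy as np

def linear_autoencoder(X, M, eta=0.01, epochs=200):
    """Linear encoder V and decoder U trained by gradient descent on ||x - U V x||^2."""
    N = X.shape[1]
    V = np.random.randn(M, N) * 0.1          # encoder weights (hidden layer)
    U = np.random.randn(N, M) * 0.1          # decoder weights (output layer)
    for _ in range(epochs):
        for x in X:                          # X assumed zero-mean
            y = V @ x                        # hidden code (linear units)
            e = x - U @ y                    # reconstruction error
            V += eta * np.outer(U.T @ e, x)  # error backpropagated to the encoder
            U += eta * np.outer(e, y)        # gradient step for the decoder
    return V, U                              # hidden units span the top-M PC subspace
```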

12
Q

Whitening Transform

A

This process makes the covariance matrix in the new space equal to the identity matrix. As such, each dimension has the same variance. This is in contrast to PCA, where the variances after projection are very often unequal, with the first principal component having the greatest variance in the data.
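A minimal NumPy sketch of one common construction (a ZCA-style whitening matrix built from the eigen-decomposition of the covariance; the eps guard for near-zero eigenvalues is an addition):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Whitening sketch: the covariance of the returned data is (approximately) the identity."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)
    evals, E = np.linalg.eigh(C)                          # C = E diag(evals) E^T
    W = E @ np.diag(1.0 / np.sqrt(evals + eps)) @ E.T     # ZCA-style whitening matrix
    return (X - mu) @ W.T
```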

13
Q

Linear Discriminant Analysis (LDA)

A

As PCA is unsupervised, it may remove discriminative dimensions. This may result in overlap between samples of different classes when projecting onto the principal components. LDA looks for maximally discriminative projections. This requires feature extraction to be informed by class labels (i.e. supervised learning). Overlap between elements from different categories is minimised.

14
Q

Fisher’s Method

A

This is used in LDA. Fisher’s Method seeks the w that maximises J(w) = s_B / s_W: the between-class scatter s_B should be maximised and the within-class scatter s_W minimised. s_B = (w^T(m_1 - m_2))^2, where m_i = (1/n_i) \sum_{x \in \omega_i} x is the mean of class i. s_W = s_1^2 + s_2^2, where s_i^2 = \sum_{x \in \omega_i}(w^T(x - m_i))^2
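For the two-class case the maximiser of J(w) has the well-known closed form w \propto S_W^{-1}(m_1 - m_2); a minimal NumPy sketch:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher LDA: w maximising between-class over within-class scatter."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)             # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)             # within-class scatter of class 2
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)         # w proportional to Sw^{-1}(m1 - m2)
    return w / np.linalg.norm(w)
```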

15
Q

Independent Component Analysis (ICA)

A

ICA finds statistically independent components, unlike PCA, which finds uncorrelated components. For Gaussian distributions, being uncorrelated is the same as being independent, but for non-Gaussian distributions the results of ICA and PCA diverge.

16
Q

Neural Networks and ICA

A

A nonlinear version of Oja’s subspace rule is used: \Delta W = \eta g(y)(x - W^t g(y))^t, where g is a nonlinear function. This method will learn independent components. Inputs must be zero-mean and whitened. Other algorithms for performing ICA exist.
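A minimal NumPy sketch, with g = tanh chosen here as one common nonlinearity:

```python
import numpy as np

def nonlinear_pca_ica(X, M, eta=0.01, epochs=200, g=np.tanh):
    """Nonlinear subspace rule: delta_W = eta * outer(g(y), x - W^T g(y))."""
    W = np.random.randn(M, X.shape[1]) * 0.1
    for _ in range(epochs):
        for x in X:                          # X must be zero-mean and whitened
            gy = g(W @ x)                    # nonlinearity applied to the outputs
            e = x - W.T @ gy                 # feedback / reconstruction error
            W += eta * np.outer(gy, e)
    return W                                 # rows approximate independent components
```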

17
Q

Random Projections

A

This method initialises V, a matrix with random values, and computes y = g(Vx); see the sketch after this list. Random Projections are often combined with simple/linear classifiers and a neural network architecture. Common architectures include:

  • Extreme Learning Machines
  • Echo State Networks
  • Liquid State Machines
  • Reservoir Computing
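A minimal NumPy sketch of the projection itself (tanh is assumed as the nonlinearity g):

```python
import numpy as np

def random_projection(X, M, seed=0):
    """Fixed random projection y = g(V x); g = tanh is assumed here."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((M, X.shape[1]))   # V is random and never trained
    return np.tanh(X @ V.T), V                 # one projected feature vector per row
```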
18
Q

Extreme Learning Machines

A

This method uses a hidden nonlinear projection with random weights (y = g(Vx)). The output uses a linear classifier: z = wy. The output weights are chosen to minimise the MSE between the network outputs and the desired targets (just like RBF networks): w = TY^{\dagger}, with T = required output for each training exemplar, Y = hidden-layer output for each training exemplar.
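A minimal NumPy sketch, written row-major so the pseudo-inverse formula appears transposed relative to w = TY^{\dagger}:

```python
import numpy as np

def train_elm(X, T, M, seed=0):
    """ELM sketch: random nonlinear hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((M, X.shape[1]))   # fixed random input weights
    Y = np.tanh(X @ V.T)                       # hidden-layer output per training exemplar
    w = np.linalg.pinv(Y) @ T                  # row-wise form of w = T Y^+ (minimises MSE)
    return V, w

def predict_elm(X, V, w):
    return np.tanh(X @ V.T) @ w                # z = w y for each sample
```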

19
Q

Sparse Coding

A

The data is projected into a high-dimensional space to make the classes more separable, but the projection isn’t random. y = g(Vx), such that y contains only a few non-zero elements, where V is the matrix of weights. Neural networks can be used to find V, for example via competition through negative feedback. Sparse coding can be framed as minimising the cost \underset{y}{\min}(||x - V^ty||_2 + \lambda ||y||_0). \lambda is positive and defines the trade-off between accuracy and sparsity.
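A minimal sketch of one common way to compute the code: the \ell_0 count is relaxed to the \ell_1 norm and minimised with iterative soft-thresholding (ISTA); this is a relaxation of the cost above, not the cost itself:

```python
import numpy as np

def sparse_code(x, Vt, lam=0.1, iters=200):
    """ISTA sketch for the L1-relaxed cost  min_y ||x - Vt y||_2^2 + lam * ||y||_1."""
    L = 2.0 * np.linalg.norm(Vt, 2) ** 2         # Lipschitz constant of the gradient
    y = np.zeros(Vt.shape[1])
    for _ in range(iters):
        grad = 2.0 * Vt.T @ (Vt @ y - x)         # gradient of the reconstruction term
        z = y - grad / L
        y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return y                                     # only a few entries remain non-zero
```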

20
Q

Dictionary-based Optimisation

A

A method to find V^t, the matrix used for Sparse Coding. x \approx V^ty, such that the columns of V^t act as basis vectors (a dictionary) that combine with the sparse coefficients y (which live in a higher-dimensional space than x) to represent the original data x efficiently.
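A minimal sketch of one possible dictionary update (a MOD-style least-squares step given a batch of data and its current sparse codes; this is one option, not necessarily the method taught):

```python
import numpy as np

def update_dictionary(X, Y, eps=1e-12):
    """MOD-style update: least-squares dictionary Vt given a batch of sparse codes.

    X: (N, n_samples) data columns; Y: (K, n_samples) sparse codes; returns Vt of shape (N, K).
    """
    Vt = X @ np.linalg.pinv(Y)                            # solves min_Vt ||X - Vt Y||_F
    Vt /= np.maximum(np.linalg.norm(Vt, axis=0), eps)     # keep each basis vector at unit length
    return Vt
```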