Week 6: Instance-Based Learning Flashcards
Instance-based Classification
To classify a new instance, search the memory for the training instance that most resembles the new instance. A distance metric is used to compare the new instance with the stored training instances.
Main components:
- Distance Function, which calculates the distance between any two instances.
- Combination function, which combines results from neighbours to arrive at an answer
Pros:
- Not restricted by data format
- Efficiently adaptable to the addition of new instances
- Requires relatively low training effort
Cons:
- Memory intensive
- Classification can be more time-consuming than with neural networks or decision trees
- Making suitable distance and combination functions can be tough
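The two components above can be sketched as a minimal 1-nearest-neighbour classifier. This is an illustrative toy, not a full implementation: the `train` format (list of feature-vector/label pairs) and the data are assumptions, the distance function is Euclidean, and the combination function simply returns the nearest neighbour's label.

```python
import math

def one_nn_classify(train, new_x):
    """Classify new_x by the label of its single nearest training instance.

    train: list of (feature_vector, label) pairs (hypothetical toy format).
    Distance function: Euclidean. Combination function: take the label
    of the closest stored instance.
    """
    best_label, best_dist = None, math.inf
    for x, label in train:
        d = math.dist(x, new_x)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(one_nn_classify(train, (1.1, 1.0)))  # -> "A"
```

Note there is no training step beyond storing the instances, which is why the training effort is low but memory use and query time grow with the data.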
Similarity/Distance
This is the means by which training instances are compared with the new instance to judge how closely they match.
Absolute Difference
\left\lvert x_{i,j} - x_{i',j} \right\rvert
Squared Difference
(x_{i,j} - x_{i',j})^2
Normalised Absolute Value
\frac{\left\lvert x_{i,j} - x_{i',j} \right\rvert}{\max_j - \min_j}
Absolute Difference of Standardised Values
\left\lvert \frac{x_{i,j} - \mu_j}{\sigma_j} - \frac{x_{i',j} - \mu_j}{\sigma_j} \right\rvert
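The range-scaled and standardised differences can be sketched in Python. The column `col` and the values are toy assumptions; `statistics.pstdev` is the population standard deviation.

```python
import statistics

def normalised_abs_diff(a, b, col):
    """|a - b| scaled by the column's range (max_j - min_j)."""
    return abs(a - b) / (max(col) - min(col))

def standardised_abs_diff(a, b, col):
    """|z(a) - z(b)|, where z standardises by the column mean and std dev."""
    mu = statistics.mean(col)
    sigma = statistics.pstdev(col)
    return abs((a - mu) / sigma - (b - mu) / sigma)  # equals |a - b| / sigma

col = [2.0, 4.0, 6.0, 8.0]
print(normalised_abs_diff(2.0, 8.0, col))   # 6 / 6 = 1.0
print(standardised_abs_diff(2.0, 8.0, col))
```

Both rescalings stop attributes with large numeric ranges from dominating the distance.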
Distance Metric for Nominal Attributes
d(x_{i,j}, x_{i',j}) = \begin{cases} 0, & \text{if } x_{i,j} = v \text{ and } x_{i',j} = v \\ 1, & \text{if } x_{i,j} = v \text{ and } x_{i',j} \ne v \end{cases}
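In code this nominal (overlap) distance is a one-liner; the example values are illustrative:

```python
def nominal_distance(a, b):
    """Overlap distance for nominal attributes: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

print(nominal_distance("red", "red"))   # 0
print(nominal_distance("red", "blue"))  # 1
```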
Euclidean Distance
d(\boldsymbol{x}_i, \boldsymbol{x}_{i'}) = \sqrt{\sum_{j=1}^n (x_{i,j} - x_{i',j})^2}
Manhattan Distance
d(\boldsymbol{x}_i, \boldsymbol{x}_{i'}) = \sum_{j=1}^n \left\lvert x_{i,j} - x_{i',j} \right\rvert
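The two vector distances translate directly into Python (toy input vectors):

```python
def euclidean(x, y):
    """Square root of the sum of squared per-attribute differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """Sum of absolute per-attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0 (the 3-4-5 triangle)
print(manhattan((0, 0), (3, 4)))  # 7
```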
Weighted Voting
In this setup, not all neighbours are treated equally: the weight of each vote is proportional to the neighbour's similarity with the new instance. Weighted voting also helps prevent ties.
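One common weighting scheme is inverse distance, sketched below. The training data, `k`, and the small constant added to avoid division by zero are all assumptions.

```python
import math
from collections import defaultdict

def weighted_knn(train, new_x, k=3):
    """k-NN where each vote is weighted by inverse distance (one common choice)."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], new_x))[:k]
    votes = defaultdict(float)
    for x, label in neighbours:
        d = math.dist(x, new_x)
        votes[label] += 1.0 / (d + 1e-9)  # closer neighbours count for more
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
         ((1.0, 1.0), "B"), ((1.1, 1.0), "B")]
print(weighted_knn(train, (0.2, 0.1), k=3))  # -> "A"
```

With unweighted voting and an even `k`, a 2-2 split would be a tie; distance weighting breaks it in favour of the closer class.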
Spectrogram
Can represent audio data in the frequency domain. Used for identifying songs from audio samples.
In the case of music identification, the vertical axis corresponds to frequency and the horizontal axis represents time. The data can be transformed into constellation plots.
Constellation Plot
A graph of the frequency peaks for a song in the frequency domain. It only has the spectrogram peaks. Each song is defined by a unique constellation of peaks in the frequency domain, with background noise minimised.
Metric for Finding Similar Songs
\frac{\text{Number of peaks that the two songs have in common}}{\text{Number of all distinct peaks}}
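Treating each song's peaks as a set, this ratio is a Jaccard-style overlap. The (time, frequency) peak tuples below are made-up toy data:

```python
def peak_similarity(peaks_a, peaks_b):
    """Shared peaks divided by all distinct peaks (Jaccard-style overlap)."""
    a, b = set(peaks_a), set(peaks_b)
    return len(a & b) / len(a | b)

song = {(10, 0.5), (20, 1.0), (30, 1.5), (40, 2.0)}
snippet = {(10, 0.5), (20, 1.0), (99, 9.9)}
print(peak_similarity(song, snippet))  # 2 shared / 5 distinct = 0.4
```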
Anchor Points
An anchor point is a peak in the constellation chart that is paired with other peaks that follow in the time axis within a specified range of frequencies.
In the context of song identification, each anchor-peak pair is associated with the time difference between the peak and the anchor, and the frequency difference between the peak and the anchor.
Song Identification Method
- Convert the new snippet into a constellation plot.
- Turn the constellation plot into anchor points with anchor-peak pairs.
- Identify matching anchor points between the listened song and songs in the database. Determine the longest consecutive sequence of overlap between the two.
- Return the song with the longest overlap, i.e. the nearest neighbour.
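A heavily simplified sketch of this pipeline is below. It hashes each anchor-peak pair as (anchor frequency, peak frequency, time difference) and scores songs by how many hashes they share with the snippet; it skips the "longest consecutive overlap" step and instead counts matching hashes, and the fan-out, peak format, and database are all toy assumptions. Because the hashes use only frequency values and time *differences*, they are invariant to where in the song the snippet starts.

```python
def anchor_pairs(peaks, fan_out=3):
    """Pair each anchor peak (time, freq) with the next few peaks in time.

    Each pair becomes a hash (f_anchor, f_peak, dt) tagged with the
    anchor's time. fan_out limits how many following peaks are paired.
    """
    peaks = sorted(peaks)
    pairs = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            pairs.append(((f1, f2, t2 - t1), t1))
    return pairs

def identify(snippet_peaks, database):
    """Return the database song sharing the most anchor-peak hashes."""
    snippet_hashes = {h for h, _ in anchor_pairs(snippet_peaks)}
    scores = {}
    for name, song_peaks in database.items():
        song_hashes = {h for h, _ in anchor_pairs(song_peaks)}
        scores[name] = len(snippet_hashes & song_hashes)
    return max(scores, key=scores.get)

database = {
    "song_a": [(0, 100), (1, 200), (2, 150), (3, 300)],
    "song_b": [(0, 50), (1, 60), (2, 70), (3, 80)],
}
# The snippet is song_a's peaks shifted 10 time units later.
snippet = [(10, 100), (11, 200), (12, 150), (13, 300)]
print(identify(snippet, database))  # -> "song_a"
```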
kD-tree
An efficient data structure for storing instances in k-dimensional space, where k is the number of attributes. It partitions the feature space into regions, similar to how the decision trees in a Random Forest split it.
Finding Nearest Neighbours in kD-trees
To find a nearest neighbour for a new instance, follow the appropriate path in the tree from the root to the relevant leaf node. This greedy descent is not always correct, but it gives a good approximation.
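A minimal sketch of both steps, assuming 2-D points and median splits on alternating axes (exact tree-building and descent details vary between implementations). The greedy search below never backtracks, which is exactly why it is only approximate:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def approx_nn(node, target, depth=0, best=None):
    """Greedy root-to-leaf descent: fast, but only approximately nearest."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    axis = depth % len(target)
    side = "left" if target[axis] < node["point"][axis] else "right"
    return approx_nn(node[side], target, depth + 1, best)

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(approx_nn(tree, (9, 2)))  # -> (8, 1)
```

An exact search would additionally backtrack into the other subtree whenever the splitting plane is closer than the best distance found so far.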
Model
An abstract representation of a real-world process or artefact.
Predictive Modelling
Making a model that predicts future data. Examples: classification, regression
Descriptive Modelling
Making a model of known data. Examples: clustering, density estimation
Prescriptive Modelling
Making a model using data for making optimal decisions. Examples: mathematical programming
Parametric Models
Assumes a functional form. Uses the training data to estimate a fixed set of parameters for a model summarising the data. The number of parameters is independent of the number of training examples.
Examples: linear regression, logistic regression, neural networks
Non-Parametric Models
A data-driven approach that makes very few assumptions about the functional form. Such models do not have a bounded set of parameters; their effective complexity can grow with the amount of training data.
Examples: histograms, nearest neighbours, kernel methods
Histograms
Counts the number of points that fall in each bin; the bin heights estimate the density. Histograms are non-parametric.
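A minimal 1-D histogram density estimate, assuming equal-width bins over the data range (the data and bin count are toy choices):

```python
def histogram_density(data, n_bins):
    """Estimate density by counting points per bin and normalising.

    The density in a bin is count / (n * bin_width), so the estimate
    integrates to 1 over the data range.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        i = min(int((x - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        counts[i] += 1
    return [c / (len(data) * width) for c in counts]

data = [0.1, 0.2, 0.25, 0.6, 0.7, 0.9]
print(histogram_density(data, 2))  # 3 points per bin -> density 1.25 in each
```

The bin count plays the role the bandwidth plays in kernel methods: too few bins oversmooths, too many overfits.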