Learning From Data Flashcards
(138 cards)
Difference between Classification and Regression
classification is about predicting a label
regression is about predicting a quantity
Classification is the task of predicting a discrete class label. Regression is the task of predicting a continuous quantity.
multi-class classification problem
A problem with more than two classes
multi-label classification problem
A problem where an example is assigned multiple classes
datum
singular of "data"
a piece of information
a fixed starting point of a scale or operation
k nearest neighbours
1) find the k nearest neighbors to x in the training data
2) assign x to the class that is most common among those k nearest neighbours
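The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points.

    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    # Step 1: rank training points by distance to x and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Step 2: return the label that occurs most often among those neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training set, made up for this example: two well-separated classes.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))
```

With an even k or more than two classes the vote can tie; here a tie goes to whichever tied label appears first among the nearest neighbours, which is just one possible convention.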
Unsupervised Learning
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
Given: data
D = {x_n}, n = 1, ..., N
and a parameterised generative model describing how the data might be generated, p(x; w), depending on parameters w.
Supervised Learning
Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs
Y = f(X)
The goal is to approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data.
Hyperparameter
a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training
Multivariate data
More than one variable is measured on each individual in a sample
Centroid
The vector formed by collecting the mean of each variable
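As a small worked example, the centroid of a data matrix can be computed by averaging each column. The sketch assumes rows are individuals and columns are variables; the function name `centroid` and the numbers are made up for illustration.

```python
def centroid(X):
    """Return the vector of per-variable means of a data matrix X.

    Assumes X is a non-empty list of rows, one row per individual.
    """
    n = len(X)
    n_vars = len(X[0])
    # Average column d over all n individuals, for each variable d.
    return [sum(row[d] for row in X) / n for d in range(n_vars)]

# Three individuals, two variables (illustrative values).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
print(centroid(X))
```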
Properties of data that has been sphered
for each variable
mean=0
variance=1
all the variables are mutually uncorrelated
Disadvantages to Euclidean Distance
Euclidean distance is popular for numerical data, but:
it gives equal weight to all variables
it disregards correlations between variables
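A quick way to see the first point: because every variable enters the distance with the same weight, the variable measured on the largest numerical scale dominates. The individuals below (height in metres, income in pounds) are hypothetical values chosen only to make the effect visible.

```python
import math

# Hypothetical individuals: (height in metres, income in pounds).
a = (1.60, 30000.0)
b = (1.95, 30000.0)  # very different height, identical income
c = (1.60, 30100.0)  # identical height, almost identical income

d_ab = math.dist(a, b)  # driven only by the height difference
d_ac = math.dist(a, c)  # driven only by the income difference

# b comes out "closer" to a than c does, purely because of the units used.
print(d_ab, d_ac, d_ab < d_ac)
```

Rescaling the variables (for example by standardising or sphering them) is one common way to avoid this effect.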
Reasons for sphering
Sphering the data puts the variables on an equal footing and removes (linear) correlations
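One way to sphere data is to centre it and multiply by the inverse square root of the sample covariance matrix. The sketch below uses NumPy and assumes the covariance matrix is non-singular; the function name `sphere`, the random seed, and the toy data are assumptions for illustration, and other sphering transforms (for example PCA-based ones) would serve equally well.

```python
import numpy as np

def sphere(X):
    """Sphere (whiten) an N x D data matrix whose rows are observations.

    Illustrative helper; assumes the sample covariance is non-singular.
    """
    Xc = X - X.mean(axis=0)                  # centre each variable at 0
    S = np.cov(Xc, rowvar=False)             # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)     # S = U diag(eigvals) U^T
    # Symmetric inverse square root of S.
    W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return Xc @ W

# Correlated toy data with non-zero means (made up for this example).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
X = Z @ np.array([[2.0, 0.0], [1.5, 0.5]]) + np.array([10.0, -3.0])

Y = sphere(X)
print(np.round(Y.mean(axis=0), 6))           # per-variable means
print(np.round(np.cov(Y, rowvar=False), 6))  # covariance matrix of sphered data
```

After sphering, the means should be zero and the covariance matrix should be the identity up to floating-point error, which corresponds to unit variances and mutually uncorrelated variables.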
What does i.i.d. stand for?
independent and identically distributed
Examples of Unsupervised Learning methods
Clustering, Gaussian distribution, Mixture model, Principal Component Analysis, Kohonen maps (SOMs)
Deterministic Model
In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions.
Main aim of classification
Train a machine F to map features to discrete targets (class labels)
Main aim of regression
Train a machine F to map features to continuous targets
What are the different types/formats of variables?
numerical: continuous or discrete
categorical: nominal or ordinal
binary: presence/absence or 2-state categorical
In a data matrix, what does X_nd refer to?
X_nd is the value of the dth variable for the nth individual
i.e. observations are rows.
How to measure association between 2 variables?
Covariance, (S_12)^2
Mean and variance of a standardised variable
Mean = 0, Variance = 1
‘Standardised measure of association’ between variables
Correlation coefficient, R_12
Does the correlation coefficient lie in a given range?
Yes
[-1,1]
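The covariance and correlation cards can be checked with a short sketch. It assumes the sample (N - 1) denominator for the covariance, which the cards do not specify; the function names and data are made up for illustration.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    """Sample covariance of two equal-length sequences (N - 1 denominator, an assumption)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly increasing with x
z = [5.0, 4.0, 3.0, 2.0, 1.0]   # perfectly decreasing with x

print(covariance(x, y))   # depends on the units of x and y
print(correlation(x, y))  # upper end of the range
print(correlation(x, z))  # lower end of the range
```

Unlike the covariance, the correlation does not change if a variable is rescaled by a positive constant, which is why it serves as the standardised measure.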