Week 5 intro to machine learning Flashcards
What is machine learning?
A set of methods to detect patterns in data and use those patterns to predict future data
supervised learning
Program trained on a given set of examples with labels. It learns how to reach an accurate conclusion when given new data
(x1, y1), (x2, y2), …, (xn, yn)
eg:
(1, True), (2, False), (3, True), (5, True), (12, False), (27, True)
algorithm comes up with a pattern (in this case, whether the number is odd) -
when given new data it predicts the label (learns to make an accurate conclusion)
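A rough sketch of this in Python (my own example, not from the lecture) - the feature x % 2 is supplied so the odd/even rule is actually learnable by a simple model:

from sklearn.tree import DecisionTreeClassifier

# labelled training pairs (x, y) from the example above
xs = [1, 2, 3, 5, 12, 27]
ys = [True, False, True, True, False, True]

# feature: x mod 2, so the odd/even pattern can be learned
X = [[x % 2] for x in xs]

model = DecisionTreeClassifier().fit(X, ys)

# given new data, predict the label
print(model.predict([[7 % 2], [10 % 2]]))  # expect [ True False]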
Unsupervised learning
Program given unlabelled data and the algorithm uses patterns and relationships to group related data
ie we may pass in a bunch of images of either dogs or cats and the algorithm groups those that are related (dog images in group 1, cat images in group 2), for example
reinforcement learning
Program learns from the consequences of its actions and selects actions by exploiting what went well previously while still exploring new choices
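A rough sketch of the exploit-vs-explore idea (an epsilon-greedy bandit, my own example with invented reward rates, not from the lecture):

import random

# hypothetical actions with hidden success rates (the agent doesn't know these)
true_rates = {"a": 0.2, "b": 0.8}
totals = {a: 0.0 for a in true_rates}
counts = {a: 0 for a in true_rates}
epsilon = 0.1  # fraction of the time we explore

for step in range(1000):
    if random.random() < epsilon:
        action = random.choice(list(true_rates))  # explore: try something new
    else:
        # exploit: pick the action with the best average reward so far
        action = max(counts, key=lambda a: totals[a] / counts[a] if counts[a] else float("inf"))
    reward = 1.0 if random.random() < true_rates[action] else 0.0  # consequence of the action
    totals[action] += reward
    counts[action] += 1

print(counts)  # "b" should end up chosen far more often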
classification
A type of supervised learning
organise data into classes and when given new data predict the class
class - possible category a data point can belong to (same thing as a label)
ie for the muffin and chihuahua problem the classes are:
["muffin", "chihuahua"]
returns a class when given new data
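A minimal sketch in Python (my own example - the two numeric features per image are invented just to show the shape of the workflow, not a real image pipeline):

from sklearn.neighbors import KNeighborsClassifier

# made-up feature vectors for a few images, with their classes
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y = ["muffin", "muffin", "chihuahua", "chihuahua"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# returns a class when given new data
print(clf.predict([[0.85, 0.15]]))  # e.g. ['muffin']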
Regression
A type of supervised learning
Fit functions to data and determine values for new data points
my definition for understanding:
. finds relationship between input features and a numeric value
. We provide the model with lots of examples of input features (ie size of house, distance from major city, no of rooms…) and a numeric value (price)
the model learns a function so when you pass in a new set of input features it predicts the corresponding numeric output
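A minimal sketch of the house-price idea in Python (numbers are invented, purely illustrative):

from sklearn.linear_model import LinearRegression

# made-up examples: [size_m2, distance_km, rooms] -> price
X = [[50, 10, 2], [80, 5, 3], [120, 2, 4], [200, 1, 6]]
y = [150_000, 250_000, 400_000, 650_000]

model = LinearRegression().fit(X, y)

# predict a numeric value for a new set of input features
print(model.predict([[100, 3, 3]]))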
Clustering
A type of unsupervised learning
separate data into groups and when given new data we determine which group it goes in
NO LABELS AKA UNSUPERVISED LEARNING
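A rough sketch using k-means (my own example, assuming two obvious groups in the data):

from sklearn.cluster import KMeans

# unlabelled data points (no labels anywhere - this is unsupervised)
X = [[1, 1], [1.2, 0.9], [0.8, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                # which group each training point was put in
print(km.predict([[0.9, 1.0]]))  # which group a new point goes in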
Dimensionality reduction
transform high dimensional (lots of features) data to lower dimensional data while preserving desired properties
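A minimal sketch with PCA (my own example on random data, just to show the shape change):

from sklearn.decomposition import PCA
import numpy as np

# 100 points with 10 features (high dimensional), reduced to 2 features
X = np.random.rand(100, 10)
X_low = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_low.shape)  # (100, 10) -> (100, 2)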
What is a training set IN SUPERVISED LEARNING
A set of pairs of data and their labels that we give to the program for learning
What is a test set in supervised learning
UNTOUCHED (unseen) portion of data that we use, once we have trained our model, to predict labels. We then compare the actual vs predicted labels
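A rough sketch of the train/test workflow in Python (dataset and labels are invented just to show the steps):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# made-up dataset: label is whether the number is at least 50
X = [[i] for i in range(100)]
y = [i >= 50 for i in range(100)]

# keep an untouched (unseen) portion of the data aside as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
predicted = model.predict(X_test)

# compare actual vs predicted labels on the unseen data
print(accuracy_score(y_test, predicted))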
Precision
TP / (TP + FP) (ie of those WE SAID have the disease, how many actually do)
Sensitivity
TP / (TP + FN) (ie of those who ACTUALLY HAVE the disease, how many did we correctly identify)
F1 score (harmonic mean between sensitivity and precision)
2 / ((1 / sensitivity) + (1 / precision))
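Quick worked check in Python (the confusion-matrix counts are invented):

# hypothetical counts for a disease test
TP, FP, FN = 40, 10, 5

precision   = TP / (TP + FP)                    # of those we said have the disease
sensitivity = TP / (TP + FN)                    # of those who actually have it
f1 = 2 / ((1 / sensitivity) + (1 / precision))  # harmonic mean of the two

print(round(precision, 3), round(sensitivity, 3), round(f1, 3))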
balanced test set
each class (in set of classes ) has equal representation
Mean absolute error
average of absolute differences between predicted and actual value
MAE = (1/n) * sum of |actual_i - predicted_i|
bad thing is (it doesn't treat outliers harshly)
ie (2 + 3 + 4) / 3 = 9 / 3 = 3 is the same as (0 + 0 + 9) / 3 = 3; in the second case the first two points are bang on (predicted = actual) but the last one has a huge error of 9 (a massive outlier), yet MAE scores it the same as the first example
Mean squared error
average of the squared differences between predicted and actual values
Penalises outliers, which is good, but now:
disadvantage - if the original values are in something like cm, the mean squared error is in cm^2 - WE NOW HAVE A DIFFERENT UNIT OF MEASURE
solution root mean squared error
root mean squared error
root of mean squared error
No longer have different measure
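Quick sketch comparing the three on the outlier example from the MAE card (my own numbers):

import numpy as np

actual    = np.array([10.0, 10.0, 10.0])
predicted = np.array([10.0, 10.0, 19.0])  # two perfect predictions, one outlier error of 9

errors = np.abs(actual - predicted)
mae  = errors.mean()         # 3.0 - same as errors of 2, 3, 4 would give
mse  = (errors ** 2).mean()  # 27.0 - the outlier is penalised, but units are squared
rmse = np.sqrt(mse)          # ~5.2 - back in the original units

print(mae, mse, rmse)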
Generalisation error
We want to minimise the error on unseen data (the generalisation error). However we only deal with samples (we obviously don't have access to unseen data to calculate the generalisation error)
We therefore use the empirical error (calculated using the available samples, ie the training set)
We hope that by minimising the empirical error we are also minimising the generalisation error
underfitting
model fails to capture the complexity of the training data
eg: model is linear when the true pattern is quadratic or degree 4
overfitting
model fits the training data too closely - IT FITS NOISE IN THE TRAINING DATA AND FAILS TO GENERALISE TO UNSEEN DATA
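A rough sketch of both ideas at once (my own example - the true pattern is quadratic, so degree 1 underfits and a very high degree overfits the noise):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 20)
y = x**2 + rng.normal(0, 0.5, size=x.shape)  # true pattern is quadratic, plus noise

for degree in (1, 2, 15):
    coeffs = np.polyfit(x, y, degree)
    fit = np.polyval(coeffs, x)
    train_error = np.mean((y - fit) ** 2)
    print(degree, round(train_error, 3))
# degree 1 underfits (large training error), degree 2 is about right,
# degree 15 drives training error near zero by fitting the noise (overfitting)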