Classification and regression Flashcards

(100 cards)

1
Q

classification

A

predicts discrete class labels

2
Q

example of classification

A

labelling emails as spam or ham

3
Q

decision tree classifier

A

flowchart-like structure in which each node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label; a tree-like model that makes decisions by splitting data into subsets based on feature values, creating branches that lead to outcomes (class labels)

4
Q

decision tree makes a sequence

A

of partitions of the training data, one attribute at a time

5
Q

probability in classification

A

probability helps determine the likelihood of each class label given a set of features
relates to confidence in predictions

6
Q

ordering in classification

A

attributes are selected and split based on a measure like information gain creating an order of importance for features

7
Q

entropy

A

entropy is a measure of uncertainty or disorder in a system

8
Q

info entropy in classification

A

entropy measures how hard it is to guess the label of a randomly drawn sample from the dataset

9
Q

choose level with ___ entropy as ___

A

lowest
as the data labels are more uniform so it is easier to guess

10
Q

how is entropy used in data splits for decision trees?

A

decision trees use information gain, based on entropy, to decide the best feature to split the data on at each node
entropy is calculated before and after the split to determine how well a feature divides the data into pure sets
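Illustrative sketch (not from the original cards): computing entropy and information gain in Python/NumPy for a hypothetical binary split, to show how a tree would score a candidate feature.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after

# Hypothetical data: 10 samples, split into two groups by one binary attribute.
y = np.array(list("yyyyynnnnn"))                            # 5 'y', 5 'n' -> entropy = 1 bit
split = [np.array(list("yyyyn")), np.array(list("ynnnn"))]
print(entropy(y))                  # 1.0
print(information_gain(y, split))  # ~0.278: this attribute reduces uncertainty a little
```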

11
Q

3 steps of entropy and data splits

A

1) partition examples recursively by choosing one attribute at a time
2) choose the attribute based on which attribute can separate the classes of the training examples best
3) choose a goodness function (info gain, gain ratio, Gini index)

12
Q

3 attribute types

A

nominal (categorical values with no order like animal, food)
ordinal (categorical values that have order like hot, warm, cold)
numerical

13
Q

how do you handle a numerical attribute in a decision tree, and what are 3 ways you can?

A

convert it to a nominal attribute
1) assign categories to the numerical values and keep trying until you find a good split
2) use the entropy value until you find the best split
3) frequency binning

14
Q

attribute resulting in ____ info gain is selected for split

A

highest

15
Q

process of splitting the decision tree by attributes is continued recursively ____

A

building tree by splitting data using features that minimise uncertainty at each step

16
Q

Th is the

A

entropy threshold

17
Q

What is the purpose of Th

A

criterion for deciding when to stop splitting the data at a node or to continue

18
Q

When entropy of a node is below Th?

A

If the entropy of a node is below a certain threshold, it means that the data at that node is sufficiently pure (i.e., it mostly contains examples of one class). As a result, the decision tree can stop splitting further at that node, and the node is labeled with the majority class

19
Q

When entropy of a node is above Th?

A

If the entropy is above the threshold, it indicates that the data at the node is still impure, meaning there’s a mix of different class labels. In this case, the decision tree continues splitting by choosing the attribute that reduces entropy the most (maximizing information gain)

20
Q

only use Th=0 when

A

the example is really simple

21
Q

Th=0, Th>0

A

=0 perfect order
>0 can tolerate some mixed labels

22
Q

avoid overfitting by using 1) and 2) and 3)

A

entropy threshold
pruning
limit depth of tree

23
Q

gain ratio formula

A

gain ratio(A) = information gain(A) / (number of values of A × entropy of A)

24
Q

want big or small gain ratio and why?

A

small, as this prevents selecting attributes that overfit the model by using many small, specific splits

25
gini index doesn't rely on
entropy, only on class proportions
26
when you would use info gain as goodness function?
imbalanced dataset
27
when you would use gain ratio as goodness function?
imbalanced dataset with high-branching attributes (attributes with many values)
28
when you would use gini index as goodness function?
binary classification
29
rank fastest to slowest for goodness function evaluation?
fastest is Gini index, middle is information gain, slowest is gain ratio
30
perceptron is an
artificial neuron; the fundamental unit in neural networks, modelled after a biological neuron
its activity is the weighted sum of its inputs + a bias term, passed through an activation function to produce the output
adjusting the weights allows the neuron to learn; the choice of activation function determines the type of computation the neuron performs
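A minimal sketch of that computation (illustrative weights, not from the cards): weighted sum of inputs plus bias, passed through a step activation.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Activity = weighted sum of inputs + bias, passed through a step activation."""
    activity = np.dot(w, x) + b
    return 1 if activity >= 0 else 0   # step activation -> class 0 or 1

# Hypothetical hand-picked weights that implement a logical AND of two inputs.
print(perceptron_output(np.array([1, 1]), np.array([0.5, 0.5]), -0.7))  # 1
print(perceptron_output(np.array([1, 0]), np.array([0.5, 0.5]), -0.7))  # 0
```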
31
single neuron vs multiple computation ability wise
single neuron can only do simple computations but many connected in a large network can deliver any function mapping
32
what is the activation function symbol
like a hook
33
activation function
determines the output of a neuron based on the weighted sum of inputs; introduces non-linearity to make the network capable of learning more complex patterns, e.g. sigmoid, ReLU, softmax
34
weights
coefficients that adjust the influence of certain input attributes on the output
35
bias
threshold value added to the sum of weighted inputs to shift the activation function's output
36
why is the bias helpful?
shifts the decision boundary away from the origin, making the model more flexible; without it, the decision boundary would always pass through the origin
37
half space
one of the two regions a hyperplane divides the space into; data points are classified based on which side (half-space) they fall on
38
one hot encoding
converts categorical data variables into a numerical format that machine learning models can use
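A quick illustrative sketch (hypothetical values, plain NumPy rather than any particular library):

```python
import numpy as np

def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        out[row, index[v]] = 1
    return categories, out

print(one_hot(["cat", "dog", "cat", "fish"]))
# (['cat', 'dog', 'fish'], [[1,0,0], [0,1,0], [1,0,0], [0,0,1]])
```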
39
binary classification vs multi class classification
binary classifies data into 1 of 2 classes; multi-class classifies data into 1 of many classes
40
how does multi class classification work?
uses K neurons and trains each one to separate one class from all others
41
how is training done on a perceptron?
by iteratively updating the weights in a way that minimises the error function (the difference between actual and predicted output)
42
how is a perceptron trained?
trained through supervised learning; the goal is to learn a set of weights that allows the perceptron to correctly classify input data
43
a perceptron model is guaranteed to converge if
1) the learning parameter (learning rate) is small enough 2) the classes are linearly separable
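A rough sketch of such a training loop (hypothetical data, the standard perceptron learning rule; not taken verbatim from the course). The toy data is linearly separable, so convergence is expected.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Iteratively update weights and bias to reduce classification error."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                       # one epoch = one pass over the data
        for xi, yi in zip(X, y):                  # online training: update per sample
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            error = yi - y_hat                    # actual minus predicted output
            w += lr * error * xi                  # move weights to reduce the error
            b += lr * error
    return w, b

# Linearly separable data (logical AND), so the perceptron converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])   # [0, 0, 0, 1]
```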
44
learning rate
controls magnitude of weight updates during training
45
a too high learning rate can mean
training can converge faster, but it can lead to potentially unstable training (overshooting the minimum)
46
epoch is a
one entire pass of the training data through the algorithm
47
online training
one update per training sample; N updates per epoch
48
batch learning
average the updates from all training samples; one update per epoch
49
mini batch learning
dividing the data into fixed-size mini-batches, taking the average update over each mini-batch, and shuffling the data assigned to mini-batches between epochs
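An illustrative sketch of forming mini-batches (hypothetical shapes; online training corresponds to batch_size=1, batch learning to batch_size=len(X)):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield fixed-size mini-batches; the assignment is reshuffled on every call (epoch)."""
    idx = rng.permutation(len(X))                 # shuffle between epochs
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X, y = np.arange(20).reshape(10, 2), np.arange(10)
for epoch in range(2):
    for Xb, yb in minibatches(X, y, batch_size=4, rng=rng):
        print(epoch, yb)    # one (averaged) weight update would be applied per mini-batch
```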
50
limit of perceptron model and how is it overcome?
linear separation: a simple single perceptron can only solve linearly separable problems, so it can't handle XOR problems; this is overcome in more complex neural networks (MLPs)
51
MLP is
feed-forward artificial neural network that generates a set of outputs from a set of inputs, with multiple hidden layers that allow it to model more complex problems and capture non-linear patterns
52
MLP structure
input layer: each neuron corresponds to one feature in the input data
hidden layers: can have 1 or more; each neuron in a hidden layer is connected to every neuron in the previous and next layer (fully connected); neurons apply non-linear activation functions to the weighted sum of their inputs and learn abstract features from the input data
output layer: provides the final prediction of the model
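A minimal sketch of a forward pass through that structure (hypothetical layer sizes and random weights, NumPy only; ReLU in the hidden layer, linear output):

```python
import numpy as np

rng = np.random.default_rng(42)

def forward(x, layers):
    """Pass the input through fully connected layers."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b                                        # weighted sum + bias
        a = np.maximum(0, z) if i < len(layers) - 1 else z   # ReLU only in hidden layers
    return a

# 3 input features -> hidden layer of 4 neurons -> 2 outputs
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
print(forward(np.array([0.5, -1.0, 2.0]), layers))
```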
53
number of input layer neurons is the number of
features
54
how does a MLP learn
by adjusting the weights between neurons to minimize error; it is able to capture complex patterns in data due to its layered architecture
55
SLP
one layer of neurons; can only solve linearly separable problems
56
3 advantages to MLP
1) can solve non-linearly separable problems
2) can model more complex decision boundaries and patterns in the data by stacking multiple layers of neurons
3) can learn hierarchical features, where each hidden layer captures a different level of abstraction from the data
57
universal function approximation
a model that can approximate any continuous function given enough neurons and layers; technically an MLP with a single hidden layer is enough
58
ufa in classification and regression
classification: can produce the desired class labelling on any data; regression: a hypothesis that fits any data with arbitrarily small MSE
59
3 non linear activation functions
ReLU, sigmoid, tanh
60
relu (what, computation expense, pros and cons)
outputs the input directly if it is positive and gives it a value of 0 if not; no exponent so computationally efficient
pros: most widely used, no vanishing/exploding gradient
cons: output is not 0-centred, incorrect mapping for negative values, dead ReLU
61
what is dead relu and what's an attempted solution?
large gradient updates can push the bias towards large negative values so the neuron's output is stuck at 0; difficult to recover as the gradient of a 0 output is 0
leaky ReLU is an attempted fix: if the value is positive it just gives its own value, otherwise it gives a × value (a small slope a instead of 0)
62
sigmoid (what, computation expense, pros and cons) and what formula and graph roughly looks like
predicts a probability, as it squashes any real value to between 0 and 1; has an exponent so computationally expensive
pros: guarantees the gradient cannot grow past a certain bound
cons: gradient is bounded so can get vanishing gradient; outputs are not 0-centred so all neurons have the same sign in training
formula: 1 / (1 + e^-v); graph is an S from bottom to top, with the bottom in line with 0
63
tanh (what, computation expense, pros and cons) and what formula roughly looks like
squashes values to between -1 and 1; has exponents so computationally expensive; gradient is steeper than sigmoid
pros: outputs are 0-centred, which means faster learning
cons: gradient is bounded so vanishing gradient
has the most e's in the formula; looks like an S from bottom to top, but with the bottom in line with -1 rather than 0
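A quick NumPy sketch of the three activation functions just described (illustrative only):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)           # v if positive, else 0

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))   # squashes to (0, 1)

def tanh(v):
    return np.tanh(v)                 # squashes to (-1, 1), zero-centred

v = np.array([-2.0, 0.0, 2.0])
print(relu(v), sigmoid(v), tanh(v))
```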
64
which activation function to use?
use ReLU in the hidden layers; for binary classification use sigmoid for the output layer; for multi-class classification use softmax for the output layer
65
3 weight initialisation techniques
1) set all = 0: the neural network acts as a linear model
2) choose randomly: can lead to vanishing/exploding gradients, but OK with ReLU
3) heuristic: multiply random weights by some value to avoid gradient problems
66
softmax
converts vector of K real numbers into a probability distribution of K possible outcomes
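An illustrative sketch (the max-subtraction is a common numerical-stability trick, an assumption rather than part of the card):

```python
import numpy as np

def softmax(v):
    """Convert a vector of K real numbers into a probability distribution over K outcomes."""
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # non-negative values that sum to 1.0
```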
67
backpropagation
an algorithm used to train MLPs by computing the gradient of the loss function with respect to each weight in the network, then systematically propagating the error backwards from the output to all preceding layers
68
forward pass back propagation
input passes through the network layer by layer and output is computed
69
loss calculation in back propagation
computes the difference between y hat (predicted) and y (actual) to give the error
70
backward pass back propagation
error flows backward through the network layer by layer, computing the gradient (partial derivative of the loss function with respect to each weight) using the chain rule; the weights are updated by moving in the opposite direction of the gradient
71
back propagation allows
network to learn from its mistakes by adjusting weights and biases based on error
72
steepest gradient descent
optimization method that uses the gradients computed by backpropagation to update the weights, aiming to minimize the loss function over time
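A minimal sketch of that update rule on a toy convex loss (illustrative names, not the course's notation):

```python
import numpy as np

def steepest_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly move the parameters in the opposite direction of the gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)          # w <- w - learning_rate * dJ/dw
    return w

# Toy convex loss J(w) = ||w||^2 has gradient 2w and its global minimum at w = 0.
print(steepest_descent(lambda w: 2 * w, w0=[3.0, -4.0]))   # close to [0, 0]
```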
73
choice of output layer act func, activation and loss function for MLP for classification
output layer: multi-class classification = softmax, binary = sigmoid (or tanh); softmax takes too long for binary
loss function: cross entropy
activation function (hidden layers): ReLU
74
choice of output layer act func, activation and loss function for MLP for regression
output layer: linear
loss function: MSE (more sensitive to outliers) or MAE
activation function: ReLU or tanh in hidden layers
75
impact of architecture on complexity and capability of MLP
more hidden layers allow the MLP to model more complex relationships, but also increase the risk of overfitting
76
5 steps of supervised learning and MLP
1) examine: how many inputs, how many outputs, what type the desired output is (classification or regression)
2) decide the number of hidden layers (usually the number of attributes)
3) decide the activation functions of the hidden layers
4) for each hidden and output layer: initialise weights (not all 0) and initialise bias matrices (all 0 is OK)
5) train the network on training data and test performance on test data
77
example of MLP and requirements
an MLP with one hidden layer is classed as a universal function approximator, given enough neurons in the hidden layer and a sensible non-linear activation function
78
filtering and expanding
filtering is finding only key features and expanding is feature representation
79
convert pixels to
greyscale, as we don't want the model to learn the colours
80
logistic regression
has 2 modifications to the normal regression model to make it suitable for a binary classification problem where y ∈ {0,1}: the output of the regression model is passed through a sigmoid function to convert it to a continuous value between 0 and 1, which is the probability of y hat being class 1
81
how does the sigmoid function work with logistic regression
the output is passed through a sigmoid function which converts the number into a probability between 0 and 1 (the probability of belonging to class 1); that number is then checked with a hard-limiting function: if P > 0.5 then class 1, if P < 0.5 then class 0
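An illustrative sketch of that prediction step (hypothetical weights for a 2-feature problem):

```python
import numpy as np

def predict(x, w, b):
    """Linear model passed through a sigmoid, then hard-limited at 0.5."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))   # probability of class 1
    return (1 if p > 0.5 else 0), p

print(predict(np.array([1.5, -0.5]), w=np.array([2.0, 1.0]), b=-1.0))   # (1, ~0.82)
```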
82
cross entropy loss
measures the difference between the predicted probability distribution and the actual class labels; penalises incorrect classifications more heavily when the model is confident in a wrong prediction
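A small sketch of binary cross-entropy (hypothetical predictions; the clipping is an assumption to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident wrong prediction (first sample: true class 1 predicted at 0.01) dominates the loss.
print(binary_cross_entropy(np.array([1, 0]), np.array([0.01, 0.01])))   # ~2.31
print(binary_cross_entropy(np.array([1, 0]), np.array([0.60, 0.01])))   # ~0.26
```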
83
feature space
the set of all possible values for a chosen set of features from the chosen data; decision boundaries are drawn in this feature space to separate classes based on the features
84
degree of polynomial and model complexity
increasing the degree of the polynomial increases the flexibility of the model; however, it could overfit
85
steepest gradient descent
optimization technique that iteratively adjusts model parameters by following the steepest descent direction of the loss surface, aiming to minimize the error
86
steepest gradient descent only works with
continuous loss function
87
optimisation in ML
the goal is to find the set of parameters that minimises the loss function, similar to a parameter search through a space of possible parameter values; optimisation allows the algorithm to learn and adapt
88
loss surface
given a constant dataset, the loss J is evaluated on some hypothesis h(x,w) for some choice of parameters w, giving a function J(w) over the space of all possible parameters; the manifold J(w) makes in parameter space = the loss surface
89
linear in parameters and not
the parameters have a linear relationship with the output, e.g. y hat = w1·g1(x) + w2·g2(x) and also y hat = w1·sin(x), but not y hat = sin(x1·w1), as w1 has a sin relationship to y hat
90
convex loss function
has one (global) minimum and no local minima; guarantees gradient-based optimization methods will always find the best solution
91
global minima
point on the loss surface where the loss is at its absolute minimum
92
local minima
point on the loss surface where the loss is lower than at neighbouring points but not necessarily the lowest overall
93
limitations to SGD
can get stuck in local minima for non-convex functions; if the gradient is near 0 the descent slows down significantly; finding the right learning rate is critical to performance
94
GA for learning parameters
a genetic algorithm is a search heuristic that reflects the process of natural selection; it can be used to learn the parameters of a model, especially when the loss surface is non-convex or SGD struggles with local minima
95
regression
predicting numerical values
96
linear regression
models the relationship between a dependent variable (output) and 1 or more independent variables (inputs) using a straight line; uses a model hypothesis that is a weighted sum of the inputs
97
goal of regression
the goal is to find the optimal weights that minimise the difference between predicted and actual values, using a loss function
98
MSE and formula
average squared difference between predicted and actual values
J_MSE = (1/N) Σ (y − y hat)²
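A one-line illustration of that formula (hypothetical values):

```python
import numpy as np

def mse(y, y_hat):
    """Average squared difference between actual and predicted values."""
    return np.mean((y - y_hat) ** 2)

print(mse(np.array([3.0, 5.0, 7.0]), np.array([2.5, 5.0, 8.0])))   # (0.25 + 0 + 1) / 3 ≈ 0.417
```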
99
why do we care about MSE when using it for our model?
minimising MSE increases accuracy of regression model by decreasing prediction error
100
least squares fit
mathematical procedure for finding the best-fitting curve to a given set of points by minimising the sum of squared residuals; weights are adjusted iteratively using steepest gradient descent to find the optimal values that minimise the loss