Oldie Flashcards

1
Q

Shortcomings of Information Gain?

A
  • IG tends to prefer highly branching attributes
  • Subsets are more likely to be pure if an attribute has a large number of values
    • May result in overfitting

Solve with Gain Ratio

2
Q

What aspect of Information Gain is “Gain Ratio” intended to remedy? Explain with equation how to achieve this.

A
  • Fixes IG’s shortcoming of preferring highly branching attributes (e.g. an ID-like attribute has many values, each subset is nearly pure with low entropy, so the IG is high)
  • GR reduces the bias for IG toward highly branching attributes by normalising relative to the Split Information (SI).

GR(Ra | R) = IG(Ra | R) / SI(Ra | R)
           = [ H(R) - Σ P(xi) H(xi) ] / [ -Σ P(xi) log P(xi) ]
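A minimal Python sketch of this calculation (hypothetical helper names; assumes nominal attributes and base-2 logarithms):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum p(x) log2 p(x) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """GR = IG / SI for a single nominal attribute."""
    n = len(labels)
    ig, si = entropy(labels), 0.0
    for value, count in Counter(attribute_values).items():
        subset = [lab for av, lab in zip(attribute_values, labels) if av == value]
        p = count / n
        ig -= p * entropy(subset)      # IG = H(R) - sum P(xi) H(xi)
        si -= p * math.log2(p)         # SI = -sum P(xi) log P(xi)
    return ig / si if si > 0 else 0.0

# Toy example: one attribute vs. the class label
outlook = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]
play    = ["no",    "no",    "yes",  "yes",  "no",   "no"]
print(gain_ratio(outlook, play))
```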

3
Q

What’s the difference between classification and regression?

A
Classification = Predicting which class a data point is part of.
- The dependent variables are categorical

Regression = Predicting continuous values
- The dependent variables are numerical

4
Q

Outline the nature of “hill climbing” & provide example of hill climbing algorithm.

A
  • Iteratively moves towards a maximum by always taking an improving step - may get stuck in a local maximum rather than the global one, unless the function is convex (in which case the local maximum is also the global maximum)
    e.g. the EM algorithm - guaranteed positive hill climbing
5
Q

What basic approach does hill climbing contrast with?

A

???

6
Q

Advantages of “bagging”

A
  • Decrease variance
    • Highly effective over noisy data
  • Performance is generally better & never substantially worse
  • Possibility to parallelise computation of individual base classifiers
7
Q

What is sum of squared errors? Name one thing it’s applied to.

A

Method to evaluate the quality of clusters.

Applied to K-Means

8
Q

How does stratified cross validation contrast with non-stratified cross validation?

A

Stratification is generally better than regular (non-stratified) cross-validation.
- Both in terms of bias and variance

Achieves this by rearranging the data to ensure each fold is a good representation of the whole data set, i.e. each fold preserves the overall class distribution.
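A small illustrative sketch (assuming scikit-learn is available; the data is made up) showing how a stratified split keeps the class ratio in every fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.random.rand(100, 5)
y = np.array([0] * 80 + [1] * 20)        # imbalanced classes

# Non-stratified: some folds may contain very few (or no) positives.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
# Stratified: every fold keeps roughly the 80/20 class ratio.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for train_idx, test_idx in skf.split(X, y):
    print("positives in test fold:", y[test_idx].sum())
```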

9
Q

Outline the difference between supervised and unsupervised ML methods

A

Unsupervised:

  • No knowledge of labelled data
  • Needs to find clusters, patterns, etc by itself

Supervised:

  • Has knowledge of labelled data
  • Learning involves comparing its current predicted output with the correct outputs
    • Able to know what the error is
10
Q

Define “discretisation” with an example

A

Process of converting continuous attributes to nominal or ordinal attributes, e.g. binning a continuous "age" attribute into the ordinal values {child, young adult, middle-aged, senior}.

Some learners, such as Decision Trees, generally work better with nominal attributes.
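A small sketch of one way to discretise a continuous attribute with NumPy (the bin boundaries and labels below are made up for illustration):

```python
import numpy as np

ages = np.array([3, 17, 25, 41, 58, 73, 90])
bins = [18, 40, 65]                  # boundaries between the bins
labels = ["child", "young adult", "middle-aged", "senior"]

indices = np.digitize(ages, bins)    # which bin each value falls into
print([labels[i] for i in indices])  # ['child', 'child', 'young adult', ...]
```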

11
Q

Briefly describe what “overfitting” is

A

Overfitting is when the model fits errors and/or noise in the training data rather than the underlying pattern, often exacerbated by a lack of training data

  • High variance in classifier
  • Causes the training accuracy and test accuracy to not be similar - big gap between test and training curves
  • Occurs when the model is excessively complex
  • Struggles to generalise
  • Overreacts to minor fluctuations in the training data
12
Q

Briefly outline the “inductive learning hypothesis”

A

Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples

13
Q

Categorise each as either “parametric” or “non-parametric”

i) SVM
ii) Max Entropy Model
iii) C4.5-style DT

A

SVM - Depends on kernel
Max Entropy - Parametric
DT - Non-parametric

14
Q

What is a “consistent” classifier?

A

A classifier which is able to flawlessly predict the class of all TRAINING instances

15
Q

Outline the difference between “Lazy” and “eager” classifiers

A

Lazy (instance-based) -> KNN
- Stores data and waits until a query is made to do anything

Eager (decision tree, SVM)
- Constructs a classification model (generalise the training data) before receiving queries (new data to classify)

16
Q

What is Zero-R?

A

Classifies all instances as the most frequent class in the training data.

17
Q

Briefly outline the nature of conditional independence assumption in the context of NB

A

???

18
Q

What is “hidden” in a HMM?

A

???

19
Q

What is the role of “log likelihood” in EM algorithm?

A

???

20
Q

“Smoothing” is considered important when dealing with probability

i) why?
ii) name one smoothing method - explain how it works

A

???

21
Q

What are 2 parameters required to generate a Gaussian probability density function?

A

???

22
Q

How are missing values in test instances dealt with in NB?

A

If missing in training -> don’t include that value in the counts for the classifier
If missing in a test instance -> ignore that attribute when computing the result
In short: just drop the missing values

23
Q

Briefly explain how to generate a training & test learning curve

A

???

24
Q

Explain with equations what PRECISION and RECALL are

A

Use Precision and Recall if we want to know how well we identify positives, not just what proportion of all instances we classify correctly.

Precision = positive predictive value - proportion of positive predictions that are correct
= TP / (TP + FP)

Recall = sensitivity - accuracy with respect to positive cases
= TP / (TP + FN)
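A tiny worked sketch in Python (the confusion-matrix counts are made up):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 40, 10, 20
print(precision(tp, fp))   # 0.8   -> 80% of positive predictions were correct
print(recall(tp, fn))      # ~0.67 -> 67% of actual positives were found
```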

25
What sort of task is "root relative squared error" used to evaluate
Don't think this is part of course
26
Describe with equation how the "overlap" metric is used to calculate distance between 2 continuous feature values in instance based learning
Don't think this is part of course
27
Advantage of "overlap" metric over Euclidean Distance?
??? don't think relevant
28
Describe how instance based learning combines distances for discrete and continuous features
??? don't think relevant
29
Describe with use of an equation how "inverse distance" is used in weighting voting for instance based learning
??? don't think relevant
30
Outline each step you would go through in calculating classification accuracy based on "10-fold stratified cross-validation" for a given dataset and supervised algo.
???
31
Why is stratified cross-validation superior to "holdout" method
???
32
Outline steps involved in the "bagging" algorithm
???
33
Give an example of an "unstable" algorithm
???
34
Classify as either supervised or unsupervised i) C4.5 DT ii) K-means iii) boosting iv) zero-R
i) Supervised ii) Unsupervised iii) Supervised iv) Supervised
35
What is an "incremental" algo? When is this a desirable property?
???
36
What is the relationship between similarity and distance?
???
37
What is a nearest neighbour?
Closest point: maximum similarity or minimum distance - d(x, y) = min{ d(z, y) : z ∈ C }
38
Big O of brute force nearest neighbour?
For N samples in D dimensions: O(DN^2) to query all N points (O(DN) per query). Good for small datasets, but quickly becomes infeasible as N grows
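A minimal brute-force 1-NN sketch in NumPy (toy data), computing the distance from each query to every training point:

```python
import numpy as np

def nn_predict(X_train, y_train, X_query):
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance to every training point
        preds.append(y_train[np.argmin(dists)])       # label of the closest point
    return np.array(preds)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["a", "a", "b"])
print(nn_predict(X_train, y_train, np.array([[0.2, 0.1], [4.0, 4.5]])))  # ['a' 'b']
```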
39
PROS and CONS of Nearest Neighbours
PROS - Simple - Can handle arbitrarily many classes - Incremental (can add extra data to the classifier on the fly) CONS - Prone to bias - Arbitrary k-value - Need a useful distance function - Need an averaging function - Expensive - Lazy learner
40
What is the specificity equation
Accuracy with respect to negative cases: Specificity = TN / (TN + FP)
41
What is the (training) bias of a classifier?
The difference between correct and predicted value, averaged over training sets - Bias is large if the learning method produces classifiers that are consistently wrong (underfitting) - Bias is small if consistently correct or different training sets cause errors on different test sets
42
What is the (test) variance of a classifier?
The variation in the predictions of learned classifiers across training sets - Measures inconsistency not accuracy - Variance is large if different training sets give rise to very different classifiers (overfitting) - Variance is small if different training sets have a minor effect on the classification decisions made, regardless of whether those decisions are correct or incorrect
43
Pros and Cons of Holdout Strategy?
Pros - Simple to work with - High reproducibility Cons - Tradeoff between more training and more test data (variance vs bias) - Representativeness of the training and test data
44
Pros and Cons of Random Sampling?
PROS - Reduction in variance and bias over "holdout" strategy Cons - Lack of reproducibility
45
Pros and Cons of leave-one-out?
Pros - No sampling bias in evaluating the system and results will be unique and repeatable - Generally gives high accuracy Cons - Very expensive if working with a large dataset
46
Pros and Cons of M-Fold Cross Validation
Pros - Need to only train the model M times (rather than N times like in leave-one-out) M is partitions, N is all instances - Can measure stability of the system across different training/test combinations Cons - How data is distributed among the M partitions due to sampling can lead to bias (training data differs from test) - The results will not be unique unless we always partition the data identically - Slightly lower accuracy because (M-1) / M of data is used for training
47
What is M-Fold CV similar to?
Similar to leave-one-out, instead of doing over all N instances, you partition into M and only do over M
48
What is a baseline and what is a benchmark?
Baseline - Naïve method which we would expect any reasonably well-developed classifier to outperform (the bare minimum) Benchmark - Established rival technique which we are pitching our method against (the goal)
49
Discuss the importance of baselines, give examples
- Establishing whether any proposed method is doing better than "dumb and simple" - Valuable in getting a sense for the intrinsic difficulty of a given task e.g. Random Baseline, Zero-R, One-R
50
What is One-R? How does it work?
Baseline method - creates one rule for each attribute in the training data, then selects the rule with the smallest error rate as its "one rule".
How it works:
- Create a decision stump for each attribute, with a branch for each attribute value
- Populate each leaf with the majority class at that leaf (i.e. predict the majority class for every instance reaching that leaf - if the majority is YES, make all instances YES)
- Select the decision stump which leads to the lowest error rate over the training data
Example - weather data, Outlook attribute (9 Yes / 5 No overall):
- Sunny: 2 Yes, 3 No -> predict NO (2 errors)
- Overcast: 4 Yes, 0 No -> predict YES (0 errors)
- Rainy: 3 Yes, 2 No -> predict YES (2 errors)
Total errors = 4 / 14
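A minimal One-R sketch in Python (hypothetical toy data; assumes nominal attributes):

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """Build a decision stump per attribute, keep the one with fewest training errors."""
    best_attr, best_rule, best_errors = None, None, len(labels) + 1
    for attr in instances[0]:
        counts = defaultdict(Counter)          # class counts per attribute value
        for inst, lab in zip(instances, labels):
            counts[inst[attr]][lab] += 1
        rule = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    return best_attr, best_rule, best_errors

data = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "overcast"},
        {"outlook": "rainy"}, {"outlook": "rainy"}]
labels = ["no", "no", "yes", "yes", "no"]
print(one_r(data, labels))   # ('outlook', {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}, 1)
```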
51
What to be aware of / sensitive of when formulating a baseline?
Sensitive to the importance of positive and negatives in the classification task.
52
Pros & cons of One-R?
Pros - Simple to understand and implement - Simple to comprehend results - Good results Cons - Unable to capture attribute interactions - Bias towards high-arity attributes
53
What is the "convergence" criterion in k-means?
No change over the past 2 iterations, or the difference between iterations falls below a specified threshold => stable state
54
What is the name of the process of converting nominal -> continuous values? explain how this works with e.g.
Similarities / Hamming Distance
55
What is "meta-classification"?
???
56
What is the range in entropy values?
0 <= Entropy <= log(n), where n = number of outcomes. Entropy is 0 when one outcome is certain, and log(n) when all n outcomes are equally likely.
57
Formula for Bayes Rule? How to apply? Give example
???
58
What is F-Score? How to apply? Give example
???
59
Derive Naïve Bayes
???
60
K-Means algorithm?
???
61
How to calculate base probabilities in Naïve Bayes
???
62
Nearest Neighbour algorithm?
???
63
Nearest Prototype Algorithm?
???
64
What is semi-supervised learning and when is it desirable?
TL;DR do both supervised and unsupervised. Used when we have a small number of labelled instances and a large number of unlabelled instances (labelled << unlabelled). This is not enough data to train a RELIABLE classifier (purely supervised), but we can potentially leverage the labelled instances to build a better classifier than a purely unsupervised method.
65
What is self training?
Self Training - Train the learner on the currently labelled instances - Use the learner to predict the labels of unlabelled instances - Where the learner is very confident, add newly labelled instances to the training set - Repeat until all instances are labelled, or no new instances can be labelled confidently
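A minimal self-training loop sketch (hypothetical function names; assumes a scikit-learn-style classifier exposing predict_proba):

```python
import numpy as np

def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        clf.fit(X_lab, y_lab)                        # train on currently labelled data
        probs = clf.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold   # only keep confident predictions
        if not confident.any():
            break                                    # nothing more can be labelled confidently
        new_y = clf.classes_[probs[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_y])
        X_unlab = X_unlab[~confident]
    return clf.fit(X_lab, y_lab)
```

It could be called as, for instance, self_train(LogisticRegression(), X_lab, y_lab, X_unlab) with any classifier that provides predict_proba.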
66
What is co-training?
More or less the same as self-training, but with two learners operating (hopefully) independently.
67
What is active learning? And what are some methods to choose instances?
In active learning, the learner is allowed to choose a small number of instances to be labelled by a human judge (an oracle). The idea is that many instances are easy to classify, while a small number are difficult to classify but would become easy with more training data (a luxury we don't have). Some methods to choose instances include: - The learner generating its own difficult instances - Selecting the instances which are most difficult to classify from a fixed, unlabelled set
68
What is a structured classification?
Supervised Machine Learning technique that involves predicting structured objects, rather than discrete or real values - Sequential structure - Hierarchical structure - Graph structure
69
Give an example of a structured classification
Translating a natural language sentence into a syntactic representation of a parse tree
70
What is a Hidden Markov Model?
Variant of a finite state machine having a set of hidden states Current state is not observable (hidden) Each state produces an output with a certain probability
71
What are the 3 canonical problems to solve with HMM?
1) Evaluation: compute the PROBABILITY of a particular output sequence - forward & backward algorithms
2) Decoding: find the MOST LIKELY SEQUENCE OF STATES which could have generated a given output sequence - Viterbi & posterior decoding
3) Training: given an output sequence, find the most likely set of state transition and output probabilities - Baum-Welch
72
How to evaluate HMM?
1) Likelihood of test data - Keep some test data and compute the likelihood of the test sequences by using the forward algo 2) Predicting parts of the data given other parts
73
How to decode HMM?
1) Enumerate all the hidden state sequences, brute force & sort 2) Viterbi
74
How to train HMM (unsupervised)?
Baum-Welch
75
Limitations of HMM?
Similar to Naïve Bayes, suffers from floating point underflow. As with generative models, hard to add ad-hoc features
76
Difference between max entropy markov model and HMM?
Max Entropy Markov Model is able to add extra features indiscriminately, as well as capturing unidirectional tag interactions
77
What are some alternatives to HMM for structured classification?
Maximum Entropy Markov Models - just logistic regression model where we condition on the tag for the preceding instance Conditional Random Fields
78
What is clustering?
Method for unsupervised learning. Grouping a set of objects in such a way that the objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters)
79
The 2 types of clustering?
Hard clustering - Each data point either belongs to a cluster or not Soft clustering - Probability or likelihood of that data point belonging to that cluster is assigned
80
Types of clustering models?
Connectivity models - hierarchical clustering
Centroid models - k-means & soft k-means
Distribution models - Expectation Maximisation (EM)
Density models - DBSCAN
81
Incremental vs Batch
???
82
What is k-means? How does it work?
Iterative clustering algorithm which converges to a local optimum.
1) If not given k, specify k (the number of centroids)
2) Randomly assign each data point to one of the k clusters
3) Compute the centroid of each cluster
4) Reassign each data point to the closest cluster centroid
5) Recompute the cluster centroids
6) Repeat 4 and 5 until no more updates are made -> convergence criterion reached
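A minimal k-means sketch in NumPy (random initial assignment, Euclidean distance; assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=len(X))             # randomly assign points to clusters
    for _ in range(max_iter):
        centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)             # reassign to closest centroid
        if np.array_equal(new_assign, assign):        # convergence: no change
            break
        assign = new_assign
    return centroids, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```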
83
What is soft k-means? How does it work?
K-means with a softmax assignment function. 1) Set t = 0, randomly initialise the centroids 2) Softly assign each instance Xj to every cluster 3) Update each centroid 4) Set t = t+1 and go back to step 2, until the centroids stabilise
84
Overlapping, probabilistic, partitioning, batch clustering. Which k-means?
Soft k-means
85
Exclusive, deterministic, partitioning, batch clustering. Which k-means?
K-means
86
What is EM clustering?
Expectation Maximisation. - A quasi-Newton parameter estimation method with guaranteed positive hill-climbing behaviour with respect to the log likelihood * used to estimate hidden parameter values or cluster membership
87
How does EM work?
Iteratively alternates between performing 2 steps:
E-step (expectation) - calculate the expected log-likelihood based on the current estimates of the parameters
M-step (maximisation) - compute a new parameter distribution, maximising the expected log-likelihood found in the E-step
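A hedged sketch of these two steps for a 1-D mixture of two Gaussians (synthetic data and made-up starting values, not the course's exact formulation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibility of each component for each point
    dens = pi * norm.pdf(x[:, None], mu, sigma)          # shape (N, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    # Convergence: relative change in log likelihood
    ll = np.log(dens.sum(axis=1)).sum()
    if abs(ll - prev_ll) < 1e-6 * abs(prev_ll):
        break
    prev_ll = ll

print(mu, sigma, pi)
```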
88
How to measure convergence?
Relative difference in log likelihood from one iteration to the next; once this falls below a predefined level, the estimate can be considered to have converged.
89
WTF Is Baum Welch?
Application of EM to HMM (unsupervised)
90
PROS and CONS of EM?
PROS - Guaranteed positive hill climbing - Fast to converge - Results in probabilistic cluster assignment - Relatively simple but powerful CONS - Can get stuck in a local maximum - Relies on arbitrary k - Tends to overfit
91
What is the output of hard/soft k-means and EM sensitive to?
* The initial random seed centroids
* The initial class assignments
92
What are the two types of cluster evaluation measures?
Unsupervised: - Cohesion and Separation Supervised: - Purity and entropy
93
Discuss Unsupervised Cluster Evaluation
A good cluster analysis should have one or both: - High cluster cohesion - High cluster separation
94
What will the implementation detail for unsupervised cluster evaluation depend on?
Whether clustering method is: - prototype or graph-based - deterministic or probabilistic - etc.
95
WTF is cluster compactness (squared errors)?
Sum of Squared Errors (SSE) is a method to evaluate the quality of clusters (especially k-means)
96
WTF is silhouette coefficient?
Combines cohesion and separation to compute a figure of merit for the overall clustering output, individual clusters and individual points.
97
Discuss Supervised cluster evaluation
Measures the degree to which predicted cluster labels MATCH the actual class labels * purity and entropy
98
What assumptions is co-training based on?
That there is a feature split which leads to independent classifiers. If given this, can lead to significant improvement in classifier accuracy.
99
What are the main sampling strategies in active learning?
1) Membership query synthesis - synthesises queries for labelling 2) Stream-based selective sampling - determines for each instance in a stream whether to have it labelled or not 3) Pool-based sampling - selects from a fixed pool of instances what it wants to have labelled
100
Outline a selection of query strategies in active learning
1) query those that the classifier is least confident on 2) perform margin sampling 3) query-by-committee
101
What is classifier combination?
Constructs a set of base classifiers from a given set of training data and aggregates their outputs into a single meta-classifier
102
What are the motivations behind classifier combination?
1) the sum of lots of weak classifiers can be at least as good as one strong classifier 2) the sum of a selection of strong classifiers is usually at least as good as the best of the base classifiers
103
What is bagging?
Simple classifier combination method based on sampling and voting - Can parallelise computation of individual base classifiers - Highly effective over noisy data - Performance is generally significantly better and never substantially worse - Decreases variance
104
Thinking behind Bagging? How does it work?
- Construct a novel dataset through random sampling with replacement (a bootstrap sample) * Generate k such samples of the training set and build a classifier for each * Combine the classifiers via voting
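A minimal bagging sketch (assuming scikit-learn decision trees as the unstable base learner; synthetic data for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(len(X), size=len(X))   # bootstrap sample (with replacement)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])
    # Majority vote over the base classifiers for each instance
    return np.array([np.bincount(col).argmax() for col in votes.T])

X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]), y[:5])
```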
105
What is boosting? How does it work?
Tunes classifiers to focus on the HARD TO CLASSIFY instances.
- Mathematically complicated but computationally cheap per iteration
- Overall more computationally expensive than bagging (the iterations are sequential, so cannot be parallelised)
- Tendency to overfit
How it works: iteratively change the distribution and weights of training instances to reflect the performance of the classifier in the previous iteration
1) Each instance starts with a probability of 1/N of being included in the sample
2) Over T iterations, train a classifier and update the weight of each instance according to whether it is correctly classified
3) Combine the base classifiers via WEIGHTED voting
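A hedged AdaBoost-style sketch (binary labels in {-1, +1}, decision stumps as base learners; simplified, not the exact course formulation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    n = len(X)
    w = np.full(n, 1.0 / n)                  # start with uniform instance weights
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)       # up-weight misclassified instances
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)                   # weighted vote of the base classifiers

X = np.random.rand(300, 2)
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)
models, alphas = adaboost_fit(X, y)
print((adaboost_predict(models, alphas, X) == y).mean())   # training accuracy
```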
106
WTF is random forest?
Bagging of Decision Trees (typically with a random subset of features considered at each split) - an extension of DTs
107
Is bagging prone to overfitting?
No
108
What does Bagging/RF minimise?
Variance
109
What does Boosting minimise?
Bias
110
What is stacking?
Smooths errors over a range of algorithms with different biases Method 1) Simple voting - assumes the classifiers have equal performance Method 2) Train a classifier over the outputs of the base classifiers (meta-classification)
111
What is a meta classifier?
A classifier trained over the outputs of the base classifiers, aggregating them into a single prediction.
112
What is error analysis?
Legit just analysing the errors, figuring out if the issue is to do with quantity of data or something else - Identifying different classes of errors that the system makes - Hypothesising what caused the errors - Feeding the hypothesis back into feature/model engineering
113
How to carry out error analysis?
1) Confusion matrix 2) Random subsample of misclassified instances 3) Come up with hypothesis for error 4) Test hypothesis against data 5) Where possible, use the model to guide the error analysis
114
What are models HYPERPARAMETERS and PARAMETERS?
Hyperparameter: - Parameters which define/bias/constrain the learning process Parameters: - What is learnt when a given learner with a given set of hyperparameters is applied to a particular training set and is then used to classify test instances
115
What can a model trained with a given set of hyperparameters be interpreted relative to?
Interpreted relative to the parameters associated with a given test instance
116
Hyperparams for Nearest Neighbour?
- K (neighbourhood size) - Distance / Similarity metric - Feature weighting / selection
117
Parameters for Nearest Neighbour?
NONE - because it's a lazy learner - doesn't abstract away from the training instances in any way
118
How can Nearest Neighbour model be interpreted?
Relative to the training instances that give rise to a given classification and their geometric distribution
119
Hyperparameters for Nearest Prototype?
- Distance / similarity metric
- Feature weighting / selection
120
Parameters for Nearest Prototype?
- Prototype for each class - Size = O(|C| |F|) c = set of classes f = set of features
121
How can Nearest Prototype be interpreted?
relative to the geometric distribution of the prototypes and distance to each for a given test instance
122
Hyperparameters for Naïve Bayes?
- Choice of smoothing method
- Optionally, the choice of distribution used to model the features - binomial or multinomial
123
Parameters for Naïve Bayes
- Class priors and conditional probability for each feature-value-class combination - Size = O(|C| + |C||FV|) C = set of classes, FV = set of feature-value pairs
124
How can Naïve Bayes be interpreted?
Usually based on the most positively weighted features associated with a given instance
125
Hyperparameters for Decision Trees?
- Choice of function used for attribute selection - Information Gain or Gain Ratio - Convergence Criterion
126
Parameters for Decision Trees?
- The decision tree itself
- Worst-case size O(V^|Tr|), average-case size O(|FV|^2 |Tr|)
V = average branching factor, Tr = set of training instances, FV = set of feature-value pairs
127
How can decision trees be interpreted?
Based directly on the path through the decision tree
128
Hyperparameters for SVMs?
- Penalty term for soft-margin SVM - Feature value scaling - Choice of kernel (and any hyperparameters associated with it)
129
Parameters for SVM?
- Vector of feature weights + bias term
- Size = O(|C|^2 |F|), assuming one-vs-rest
130
Hyperparameter for Logistic Regression?
Choice of weight of regulariser
131
Parameters of logistic regression?
- Weight associated with each feature function and the bias term - Size = O(|C||FV|) c = set of classes fv = feature-value pairs
132
How can logistic regression be interpreted?
Based on the most highly (absolutely) weighted features associated with a given instance - high positive weight = correlation - high negative weight = anti-correlation
133
What is dimensionality reduction?
Reducing the number of features or attributes, e.g. so that the data can be visualised/graphed more easily
134
What is the caveat to dimensionality reduction?
Going to be lossy - not possible to reproduce the original data FROM the reduced version
135
WTF is PCA? How does PCA work?
Principal Component Analysis - a method of dimensionality reduction. Transforms the data to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables
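A minimal PCA sketch in NumPy via the SVD of the centred data (synthetic data for illustration):

```python
import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                   # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]            # directions of maximum variance
    return Xc @ components.T                  # project onto the principal components

X = np.random.rand(100, 10)
print(pca(X, n_components=2).shape)           # (100, 2)
```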
136
Wtf is nearest prototype?
Also known as nearest centroid classifier, assigns to instances the label of the class of training samples whose mean (centroid) is closest to the observation.
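A minimal nearest-prototype (nearest-centroid) sketch in NumPy (toy data):

```python
import numpy as np

def fit_prototypes(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}   # one centroid per class

def predict(prototypes, X_query):
    classes = list(prototypes)
    cents = np.array([prototypes[c] for c in classes])
    dists = np.linalg.norm(X_query[:, None, :] - cents[None, :, :], axis=2)
    return np.array([classes[i] for i in dists.argmin(axis=1)])

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array(["a", "a", "b", "b"])
protos = fit_prototypes(X, y)
print(predict(protos, np.array([[0.5, 0.2], [5.5, 4.8]])))    # ['a' 'b']
```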