Oldie Flashcards

Question

What sort of task is "root relative squared error" used to evaluate

Answer 1

Don't think this is part of course

Answer 2

Don't think this is part of course

Answer 3

??? don't think relevant

Answer 4

??? don't think relevant

Answer 5

??? don't think relevant

Answer 6

i) Supervised ii) Unsupervised iii) ? iv) Supervised

Answer 7

Closest point: Maximum similarity or Minimum distance | d(x,y) = min(d(z,y) | z belonging to C)

Answer 8

N samples, D dimensions - O(DN^2) Good for small datasets, but quickly becomes infeasible as N grows

Answer 9

PROS - Simple - Can handle arbitrarily many classes - Incremental (can add extra data to the classifier on the fly) CONS - Prone to bias - Arbitrary k-value - Need a useful distance function - Need an averaging function - Expensive - Lazy learner

Answer 10

Accuracy with respect to negative cases Specificity = TN / TN+FP

Answer 11

The difference between correct and predicted value, averaged over training sets - Bias is large if the learning method produces classifiers that are consistently wrong (underfitting) - Bias is small if consistently correct or different training sets cause errors on different test sets

Answer 12

The validation in the prediction of learned classifiers across training sets - Measures inconsistency not accuracy - Variance is large if different training sets give rise to very different classifiers (overfitting) - Variance is small if different training sets have a minor effect on the classification decisions made, don't care if correct or incorrect

Answer 13

Pros - Simple to work with - High reproducibility Cons - Tradeoff between more training and more test data (variance vs bias) - Representativeness of the training and test data

Answer 14

PROS - Reduction in variance and bias over "holdout" strategy Cons - Lack of reproducibility

Answer 15

Pros - No sampling bias in evaluating the system and results will be unique and repeatable - Generally gives high accuracy Cons - Very expensive if working with a large dataset

Answer 16

Pros - Need to only train the model M times (rather than N times like in leave-one-out) M is partitions, N is all instances - Can measure stability of the system across different training/test combinations Cons - How data is distributed among the M partitions due to sampling can lead to bias (training data differs from test) - The results will not be unique unless we always partition the data identically - Slightly lower accuracy because (M-1) / M of data is used for training

Answer 17

Similar to leave-one-out, instead of doing over all N instances, you partition into M and only do over M

Answer 18

Baseline - Naïve method which we would expect any reasonable well-developed classifier to perform better (the bare minimum) Benchmark - Established rival technique which we are pitching our method against (goal)

Answer 19

- Establishing whether any proposed method is doing better than "dumb and simple" - Valuable in getting a sense for the intrinsic difficulty of a given task e. g. Random Baseline, Zero-R, One-R

Answer 20

Baseline method - Creates one rule for each attribute in the training data - Then selects the rule with the smallest error rate as its "one rule" How it works - Create a decision stump for each attribute with branches for each value (attribute) - Populate leaf with the majority class at that leaf (i.e. make all instances the majority class in this leaf - if majority is YES, make all instances YES) - Select decision stump which leads to lowest error rate over training Weather (Outlook) 9 Yes 5 No - Sunny 2 Yes 3 NO (Replace all with NO - 2 errors) - O'cast 4 YES 0 No (replace all with YES) - Rainy 3 YES 2 No (replace all with YES - 2 errors) Total errors = 4 / 14

Answer 21

Sensitive to the importance of positive and negatives in the classification task.

Answer 22

Pros - Simple to understand and implement - Simple to comprehend results - Good results Cons - Unable to capture attribute interactions - Bias towards high-arity attributes

Answer 23

No change for past 2 iterations or if difference between iterations falls below a threshold specified. => Stable state

Answer 24

Similarities / Hamming Distance

Answer 25

0 <= Entropy =< log(n) n = number of outcomes

Answer 26

TL;DR do both supervised and unsupervised When we have a small number of labelled instances, large number of unlabelled instances. labelled << unlabelled This is not enough data to train a RELIABLE classifier (purely supervised), but we can potentially leverage the labelled instances to build a better classifier than a purely unsupervised method.

Answer 27

Self Training - Train the learner on the currently labelled instances - Use the learner to predict the labels of unlabelled instances - Where the learner is very confident, add newly labelled instances to the training set - Repeat until all instances are labelled, or no new instances can be labelled confidently

Answer 28

More or less the same as self-training, but with two learners operating (hopefully) independently.

Answer 29

In active learning, the learner is allowed to choose a small number of instances to be labelled by a human judge (an oracle). The idea is that, many instances are easy to classify and a small number of instances are difficult to classify, BUT, would be easy to classify with more training data (however we don't have that luxury). Some methods to choose instances include; - The learner generating its own difficult instances - Instances are selected as the ones which are most difficult to classify in a fixed, unlabelled set

Answer 30

Supervised Machine Learning technique that involves predicting structured objects, rather than discrete or real values - Sequential structure - Hierarchical structure - Graph structure

Answer 31

Translating a natural language sentence into a syntactic representation of a parse tree

Answer 32

Variant of a finite state machine having a set of hidden states Current state is not observable (hidden) Each state produces an output with a certain probability

Answer 33

1) Evaluation: Compute the PROBABILITY of a particular output sequence - forward & backward algo 2) Decode: Find the MOST LIKELY SEQUENCE OF STATES which could have generated a given output sequence - Viterbi & posterior decoding 3) Train: Given an output sequence, find the most likely set of state transitions and output probability - Baum Welch

Answer 34

1) Likelihood of test data - Keep some test data and compute the likelihood of the test sequences by using the forward algo 2) Predicting parts of the data given other parts

Answer 35

1) Enumerate all the hidden state sequences, brute force & sort 2) Viterbi

Answer 36

Baum-Welch

Answer 37

Similar to Naïve Bayes, suffers from floating point underflow. As with generative models, hard to add ad-hoc features

Answer 38

Max Entropy Markov Model is able to add extra features indiscriminately, as well as capturing unidirectional tag interactions

Answer 39

Maximum Entropy Markov Models - just logistic regression model where we condition on the tag for the preceding instance Conditional Random Fields

Answer 40

Method for unsupervised learning. Grouping a set of objects in such a way that the objects in the same group (called a cluster) are more similar to each other than those in other groups (clusters)

Answer 41

Hard clustering - Each data point either belongs to a cluster or not Soft clustering - Probability or likelihood of that data point belonging to that cluster is assigned

Answer 42

``` Connectivity models - Hierarchical Centroid models - K-means & soft k-means Distribution models - Expectation Maximisation (EM) Density models - DBSCAN ```

Answer 43

Iterative clustering algorithm to find the local maxima in each iteration 1) if not given k, specify k (number of centroids) 1. 5) Randomly assign k 2) randomly assign each data point to a cluster 3) compute centroid point of cluster 4) reassign each data point to the closest cluster centroid 5) recompute cluster centroids 6) repeat 4 and 5 until no more updates can be made -> convergence criterion reached

Answer 44

K-means with softmax function 1) Set t = 0, randomly initialise the centroids 2) Soft assign each instance Xj to a cluster 3) Update each centroid 4) Set t = t+1, go back to step 2, until centroids stabalise

Answer 45

Soft k-means

Answer 46

Expectation Maximisation. - Quasi-newton parameter estimation method with guaranteed positive hill climbing characteristics to the gradient of log likelihood * used to estimate hidden parameter values or cluster membership

Answer 47

Iteratively alternates between performing 2 steps ``` Step E (expectation) - Calculate the expected log-likelihood based on the current estimates of the parameters ``` ``` Step M (maximisation) - Compute new parameter distribution, maximising the log-likelihood found in step E. ```

Answer 48

Relative difference in log likelihood from one iteration to the next, once this falls below a certain predefined level, can consider that the estimate to have converged.

Answer 49

Application of EM to HMM (unsupervised)

Answer 50

PROS - Guaranteed positive hill climbing - Fast to converge - Results in probabilistic cluster assignment - Relatively simple but powerful CONS - Can get stuck in a local maximum - Relies on arbitrary k - Tends to overfit

Answer 51

* The initial random seed centroids | * Initial class assignments

Answer 52

Unsupervised: - Cohesion and Separation Supervised: - Purity and entropy

Answer 53

A good cluster analysis should have one or both: - High cluster cohesion - High cluster separation

Answer 54

Whether clustering method is: - prototype or graph-based - deterministic or probabilistic - etc.

Answer 55

Sum of Squared Errors (SSE) is a method to evaluate the quality of clusters (especially k-means)

Answer 56

Combines cohesion and separation to compute a figure of merit for the overall clustering output, individual clusters and individual points.

Answer 57

``` Measures the degree to which predicted class labels MATCH the actual class labels * purity and entropy ```

Answer 58

That there is a feature split which leads to independent classifiers. If given this, can lead to significant improvement in classifier accuracy.

Answer 59

1) membership query synthesis - Synthesises queries for labelling 2) Stream-based selecting sampling - Determines for each stream of instances whether to have them labelled or not 3) Pool based sampling - Selects from a fixed set of instances what it wants to have labelled

Answer 60

1) query those that the classifier is least confident on 2) perform margin sampling 3) query-by-committee

Answer 61

Constructs a set of base classifiers from a given set of training data and aggregates the outputs into a signle meta classifier

Answer 62

1) the sum of lots of weak classifiers can be at least as good as one strong classifier 2) the sum of a selection of strong classifiers is usually at least as good as the best of the base classifiers

Answer 63

Simple classifier combination method based on sampling and voting - Can parallelise computation of individual base classifiers - Highly effective over noisy data - Performance is generally significantly better and never substantially worse - Decreases variance

Answer 64

- Construct a novel dataset through random sampling with replacement * Generate k permutations of the training set, build classifier for each * Combine classifiers via voting

Answer 65

Tunes classifiers to focus on the HARD TO CLASSIFY instances. - Mathematically complicated, computationally cheap - More computuationally expensive than bagging - Tendency to overfit How it works? Iteratively change the distribution and weights of training instances to reflect performance of classifier in previous iteration 1) Each instance has probability of 1/N of being included in sample 2) Over T iterations, train classifier and update the weight of each instance according to whether it is correctly classified 3) Combine base classifiers via WEIGHTED voting

Answer 66

Bagging of Decision Trees | - An extention of DT

Answer 67

Smooths errors over a range of algorithm with different biases Method 1) Simple voting - assumes the classifiers have equal performance Method 2) Train a classifier over the outputs of the base classifiers (meta classification)

Answer 68

A classifier that has the outputs of base classifiers aggregated into one single classifier, A META CLASSIFIER

Answer 69

Legit just analysing the errors, figure out it the issue is to do with quantity of data or something else - Identifying different classes of errors that the system makes - Hypothesising what caused the errors - Feed hypothesis back into feature/model engineering

Answer 70

1) Confusion matrix 2) Random subsample of misclassified instances 3) Come up with hypothesis for error 4) Test hypothesis against data 5) Where possible, use the model to guide the error analysis

Answer 71

Hyperparameter: - Parameters which define/bias/constrain the learning process Parameters: - What is learnt when a given learner with a given set of hyperparameters is applied to a particular training set and is then used to classify test instances

Answer 72

Interpreted relative to the parameters associated with a given test instance

Answer 73

- K (neighbourhood size) - Distance / Similarity metric - Feature weighting / selection

Answer 74

NONE - because it's a lazy learner - doesn't abstract away from the training instances in anyway

Answer 75

Relative to the training instances that give rise to a given classification and their geometric distribution

Answer 76

- Distance / similarity metric | - Feature weighting / selection

Answer 77

- Prototype for each class - Size = O(|C| |F|) c = set of classes f = set of features

Answer 78

relative to the geometric distribution of the prototypes and distance to each for a given test instance

Answer 79

- Choice of smoothing method | - Optionally the choice of distribution to used to model the features - binomial or multinomial

Answer 80

- Class priors and conditional probability for each feature-value-class combination - Size = O(|C| +|C| |FV|) c = set of clases fv = set of feature value pairs

Answer 81

Usually based on the most positively weighted features associated with a given instance

Answer 82

- Choice of function used for attribute selection - Information Gain or Gain Ratio - Convergence Criterion

Answer 83

- The decision tree itself - Worst case size O(V^|Tr|) - Average case size O(|FV|^2|Tr|) ``` V = average branching factor tr = set of training instances fv = set of feature-value pairs ```

Answer 84

Based directly on the path through the decision tree

Answer 85

- Penalty term for soft marging SVM - Feature value scaling - Choice of kernel (and any hyperparameters associated with it)

Answer 86

- Vector of feature weights + bias term | - Size = O(|C|^2|F|) - assuming one-vs-rest

Answer 87

Choice of weight of regulariser

Answer 88

- Weight associated with each feature function and the bias term - Size = O(|C||FV|) c = set of classes fv = feature-value pairs

Answer 89

based on the absolute-weighted features associated with a given instance - high positive = correlation - high negative = anti-correlation

Answer 90

reducing the number of features or attributes to be more easily graphable

Answer 91

Going to be lossy | - Not possible to reproduce the original data FROM the reduced version

Answer 92

Principal Component Analysis - a method of dimensionality reduction Transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few return retain most of the variation present in all of the original variables

Answer 93

Also known as nearest centroid classifier, assigns to instances the label of the class of training samples whose mean (centroid) is closest to the observation.

Oldie Flashcards

(136 cards)