Exam 2017 Flashcards

1
Q

What is the primary difference between “supervised” and “unsupervised” learning?

A

whether the training instances are explicitly labelled or not

2
Q

Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: blood pressure level, with possible values {low, medium, high}

A

ordinal

3
Q

Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: age, with possible values [0,120]

A

numeric

4
Q

Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: weather, with possible values {clear, rain, snow}

A

categorical

5
Q

Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: abalone sex, with possible values {male, female, infant}

A

categorical

6
Q

Describe a strategy for measuring the distance between two data points composed of “categorical” features

A

Hamming distance OR cosine similarity OR Jaccard similarity OR Dice coefficient
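A minimal sketch of the Hamming-distance option (pure Python; the feature values below are invented for illustration):

# Hamming distance between two categorical feature vectors:
# the number of positions at which their values differ.
def hamming_distance(x, y):
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

# Two toy instances over (weather, sex, blood pressure) features.
print(hamming_distance(["clear", "male", "low"],
                       ["rain", "male", "high"]))   # -> 2 (differ on weather and pressure)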

7
Q

What is the relationship between “accuracy” and “error rate” in evaluation?

A

accuracy = 1 − error rate
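For example, a classifier that gets 40 of 50 test instances right has accuracy 40/50 = 0.8 and error rate 10/50 = 0.2, and indeed 0.8 = 1 − 0.2.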

8
Q

With the aid of a diagram, describe what is meant by “maximum margin” in the context of training a “support vector machine”.

A

the width of the margin (= the distance between the separating hyperplane and the support vectors) should be maximised
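For a linear SVM with separating hyperplane w·x + b = 0 in canonical form (where the support vectors satisfy |w·x + b| = 1), the margin width is 2/‖w‖, so maximising the margin amounts to minimising ‖w‖.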

9
Q

What makes a feature “good”, i.e. worth keeping in a feature representation? How might we measure that “goodness”?

A

good = correlation/association with the category of interest (and non-redundant with other features); goodness can be measured with e.g. (pointwise) mutual information, a chi-square test, or a correlation coefficient
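As a rough sketch of one such measure (pure Python; the toy feature and label values are invented), the mutual information between a categorical feature and the class is high when knowing the feature value tells us a lot about the class:

from collections import Counter
from math import log2

def mutual_information(feature_values, labels):
    # I(X; Y) between one categorical feature X and the class labels Y.
    n = len(labels)
    px, py = Counter(feature_values), Counter(labels)
    pxy = Counter(zip(feature_values, labels))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

weather = ["clear", "clear", "rain", "snow", "rain", "clear"]
labels = ["play", "play", "stay", "stay", "stay", "play"]
print(mutual_information(weather, labels))   # 1.0 bit: the feature fully determines the class here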

10
Q

For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-layer perceptron with a softmax final layer

A

classification

11
Q

For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: soft k-means

A

clustering

12
Q

For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-response linear regression

A

classification

13
Q

For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: logistic regression

A

classification

14
Q

For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: model tree

A

regression

15
Q

For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: support vector regression

A

regression

16
Q

With the aid of an example, briefly describe what a “hyperparameter” is.

A

a top-level setting for a given model (which is set prior to training)
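One concrete example, as a small sketch (scikit-learn assumed available; the data are made up): the number of neighbours k in k-NN is a hyperparameter because it is fixed before training rather than learned from the data.

from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_train = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)   # hyperparameter: chosen by us, before fitting
model.fit(X_train, y_train)                   # "training" (k-NN simply stores the instances)
print(model.predict([[1.5, 1.5]]))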

17
Q

With the use of an example, outline what “stacking” is.

A

combining the output of a number of base classifiers as input to a further supervised learning model
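A rough sketch of the idea (scikit-learn assumed available; the data and the choice of base/meta classifiers are purely illustrative, and a real setup would use cross-validated base predictions):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Level 0: train the base classifiers.
base_models = [GaussianNB().fit(X, y), DecisionTreeClassifier(max_depth=3).fit(X, y)]

# Level 1: the base classifiers' outputs become the meta-classifier's input features.
meta_features = np.column_stack([m.predict(X) for m in base_models])
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_model.predict(meta_features[:5]))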

18
Q

What is the convergence criterion for the “EM algorithm”?

A

convergence of the (maximised) log-likelihood, i.e. its change between iterations falls within a small epsilon
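In code terms the stopping test looks roughly like this (a sketch; the log-likelihood values and epsilon below are invented to simulate a converging run):

# Skeleton of an EM loop: stop when the log-likelihood improves by less than epsilon.
log_likelihoods = iter([-120.0, -80.0, -65.0, -61.0, -60.9, -60.899])
epsilon = 1e-2
old_ll = float("-inf")
for new_ll in log_likelihoods:
    # In a real implementation: E-step, M-step, then recompute the log-likelihood.
    if new_ll - old_ll < epsilon:
        print("converged at log-likelihood", new_ll)
        break
    old_ll = new_ll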

19
Q

Outline the basis of “purity” as a form of cluster evaluation.

A

the proportion of instances in the cluster that belong to that cluster’s majority class
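A small sketch of the overall calculation (pure Python; the cluster assignments and gold labels are made up):

from collections import Counter, defaultdict

def purity(cluster_ids, gold_labels):
    # Sum the majority-class count of each cluster, then divide by the number of instances.
    clusters = defaultdict(list)
    for c, y in zip(cluster_ids, gold_labels):
        clusters[c].append(y)
    majority_total = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return majority_total / len(gold_labels)

cluster_ids = [0, 0, 0, 1, 1, 1, 1]
gold_labels = ["a", "a", "b", "b", "b", "b", "a"]
print(purity(cluster_ids, gold_labels))   # (2 + 3) / 7 ≈ 0.71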

20
Q

What is the underlying assumption behind active learning based on “query-by-committee”?

A

disagreement between base classifiers indicates that the instance is hard to classify, and thus will have high utility as a training instance
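One common way to quantify that disagreement is vote entropy over the committee's predictions (a sketch; the committee votes below are invented):

from collections import Counter
from math import log2

def vote_entropy(votes):
    # Entropy of the committee's label votes for a single unlabelled instance.
    n = len(votes)
    return sum(-(c / n) * log2(c / n) for c in Counter(votes).values())

print(vote_entropy(["spam", "spam", "spam", "spam"]))   # 0.0 -> committee agrees, low utility
print(vote_entropy(["spam", "ham", "spam", "ham"]))     # 1.0 -> maximal disagreement, query this one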

21
Q

“Random forests” are based on decision trees under different dimensions of “randomisation”. With reference to the following toy training dataset, provide a brief outline of two (2) such “random processes” used in training a random forest. (You should give examples as necessary; it is not necessary to draw the resulting trees, although you may do so if you wish.)

A
  1. random sampling of training instances (similar to bagging)
  2. random subsampling of attributes for a given decision tree
  3. random construction of new features based on linear combinations of numeric features
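A rough sketch of the first two processes (NumPy assumed available; the toy dataset referred to in the question is not reproduced here, so the sizes below are invented):

import numpy as np

rng = np.random.default_rng(0)
n_instances, n_features = 10, 4

# 1. Bootstrap sample of the training instances (sampling with replacement, as in bagging).
instance_idx = rng.choice(n_instances, size=n_instances, replace=True)

# 2. Random subsample of the attributes for this particular tree (e.g. sqrt(n_features) of them).
feature_idx = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

print("instances used for this tree:", instance_idx)
print("attributes used for this tree:", feature_idx)
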
22
Q

HMMs

In the “forward algorithm”, α_t(j) is used to “memoise” a particular value for each state j and observation t. Describe what each α_t(j) represents.

A

the probability of observing all observations up to and including t and ending up in state j
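A compact sketch of the forward recursion (NumPy assumed available; the HMM parameters and observation sequence are invented):

import numpy as np

pi = np.array([0.6, 0.4])                  # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # A[i, j] = P(next state j | current state i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[j, o] = P(observation o | state j)
obs = [0, 1, 0]

T, N = len(obs), len(pi)
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]               # alpha_1(j): start in state j and emit the first observation
for t in range(1, T):
    # alpha_t(j): probability of the observations up to and including t AND being in state j at t
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

print(alpha)            # alpha[t, j] corresponds to alpha_t(j)
print(alpha[-1].sum())  # total probability of the observation sequence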

23
Q

HMMs

In the “Viterbi algorithm”, two memoisation variables are used for each combination of state j and observation t:

  1. β_t(j) (which plays a similar role to α_t(j) in the forward algorithm)
  2. φ_t(j). Describe what each φ_t(j) represents.
A

the most probable immediately preceding state for state j, given the observations up to and including t
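A sketch of the Viterbi recursion with both memoisation tables (same invented toy HMM as above; note it works with log probabilities, which is the subject of the next card):

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]

T, N = len(obs), len(pi)
beta = np.zeros((T, N))              # beta_t(j): score of the best path ending in state j at t
phi = np.zeros((T, N), dtype=int)    # phi_t(j): best immediately preceding state for state j at t
beta[0] = np.log(pi) + np.log(B[:, obs[0]])
for t in range(1, T):
    scores = beta[t - 1][:, None] + np.log(A)   # scores[i, j]: best path into i, then move to j
    phi[t] = scores.argmax(axis=0)
    beta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])

# Backtrace the most probable state sequence via the phi backpointers.
path = [int(beta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(phi[t][path[-1]]))
print(path[::-1])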

24
Q

HMMs

Why do we tend to use “log probabilities” in the Viterbi algorithm but not the forward
algorithm?

A

Viterbi based on multiplication, so can convert to sum of log probabilities; forward based on sum of product of probabilities,
so logging the probabilities doesn’t help in the calculation
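Concretely (writing a_ij for the transition probability and b_j(o_t) for the emission probability), the Viterbi update max_i(β_{t−1}(i) · a_ij · b_j(o_t)) becomes max_i(log β_{t−1}(i) + log a_ij + log b_j(o_t)), since log(xy) = log x + log y and log is monotonic so the argmax is unchanged; the forward update Σ_i α_{t−1}(i) · a_ij · b_j(o_t) involves a sum, and the log of a sum does not decompose in the same way.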

25
Q

Model Learning

Is our primary objective in machine learning to derive a model that fits the subset of the data
that we do have? Why or why not?

A

No. Want to build a model that generalises to new data

26
Q

Model Learning

Explain how we can use our limited data, in a machine learning context, to demonstrate
whether or not our objective has been met.

A

Split data into training/dev/test. Need to measure generalisation from “best” (tuned) model to unseen data
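A minimal sketch of such a split (scikit-learn's train_test_split assumed available; the data are random placeholders), holding the test portion back until the very end:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First carve off a held-out test set, then split the remainder into train/dev.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune hyperparameters against X_dev; evaluate generalisation on X_test only once, at the end.
print(len(X_train), len(X_dev), len(X_test))   # 60 20 20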

27
Q

Model Learning

Identify and explain one important problem that can emerge with respect to this primary
objective, even if we are successful in deriving a good model for the data that we have.
Name one specific technique discussed in class that can be applied to mitigate this problem,
and explain how it does so.
A

overfitting; model does well on training data but poorly on test data

technique: L1 (lasso) or L2 (ridge) regularisation
how: penalises large weights, reducing the complexity of the model / constraining the model function
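For example (a scikit-learn sketch with invented data), L2-regularised logistic regression penalises large weights; in scikit-learn the strength is controlled by C, where a smaller C means stronger regularisation:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 20)
y = (X[:, 0] > 0.5).astype(int)

weak = LogisticRegression(penalty="l2", C=100.0).fit(X, y)   # weak regularisation
strong = LogisticRegression(penalty="l2", C=0.01).fit(X, y)  # strong regularisation

# Stronger regularisation shrinks the learned weights towards zero, constraining the model.
print(np.abs(weak.coef_).mean(), np.abs(strong.coef_).mean())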

28
Q

Model Learning

Define “bias” and “variance”, indicating how we might detect each one. Discuss how bias
and variance relate to each other in the context of our primary objective

A

bias: how well our model approximates the (training) data; approximation error (high bias=consistently poor performance)
variance: how well our model generalises to held-out (test) data; estimation error (high variance=sensitive to training data)
Relationship: Increasing the complexity of our model tends to reduce bias (by fitting the data better) but increase variance
(because the model will be more strongly fit to the training data); in order to obtain generalisation we seek to minimise both
bias and variance in order to have a good model that generalises well (isn’t overfit to training). Increase training examples
and control for overfitting to lower variance

29
Q

Briefly explain — in at most two sentences — the basic logic behind the “ID3” algorithmic
approach toward building decision trees. This should be focussed on labelling the nodes
and leaves; you do not have to explain edge-cases.

A

Recursively determine which feature has the highest information gain (i.e. does the best job of partitioning the data into pure
subsets) over the subset of training instances selected by the path to that node, and add a branch for each value of that feature;
continue until every leaf node is pure (and label with corresponding class)
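A sketch of the core calculation (pure Python; the toy weather/label data are invented): at each node, ID3 computes the information gain of every candidate feature over the instances reaching that node and splits on the feature with the highest gain.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Entropy of the parent node minus the weighted entropy of the child nodes.
    n = len(labels)
    children = {}
    for v, y in zip(feature_values, labels):
        children.setdefault(v, []).append(y)
    remainder = sum(len(child) / n * entropy(child) for child in children.values())
    return entropy(labels) - remainder

weather = ["clear", "clear", "rain", "snow", "rain", "clear"]
play = ["yes", "yes", "no", "no", "no", "yes"]
print(information_gain(weather, play))   # 1.0: this feature splits the node into pure subsets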

30
Q

The criterion for labelling a node, as explained in the lectures, was based around the idea of
“entropy” — what does entropy tell us about a node of a decision tree?

A

how skewed (“pure”) the label distribution is (lower entropy → more skewed → better)
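For instance, under H = −Σ_i p_i log2 p_i, a pure node (all instances of one class) has entropy 0, while a node with a 50/50 two-class split has entropy 1 bit; the lower the entropy, the more skewed (purer) the label distribution.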