Exam 2017 Flashcards
(30 cards)
What is the primary difference between “supervised” and “unsupervised” learning?
whether the training instances are explicitly labelled or not
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: blood pressure level, with possible values {low, medium, high}
ordinal
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: age, with possible values [0,120]
numeric
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: weather, with possible values {clear, rain, snow}
categorical
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: abalone sex, with possible values {male, female, infant}
categorical
Describe a strategy for measuring the distance between two data points consisting of “categorical” features
Hamming distance OR cosine similarity OR Jaccard similarity OR Dice coefficient
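As a rough illustration of two of the options above, the sketch below computes Hamming distance and Jaccard similarity over categorical feature vectors. The function names and the two instances are hypothetical, and treating each (feature index, value) pair as a set element is only one assumed way of applying Jaccard to categorical data.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length feature vectors differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def jaccard_similarity(a, b):
    """Jaccard similarity over the sets of (feature index, value) pairs."""
    set_a = set(enumerate(a))
    set_b = set(enumerate(b))
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty instances count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical instances with features (weather, abalone sex, colour)
p = ("clear", "male", "red")
q = ("rain", "male", "red")

print(hamming_distance(p, q))    # the instances differ in one feature
print(jaccard_similarity(p, q))  # 2 shared pairs out of 4 distinct pairs
```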
What is the relationship between “accuracy” and “error rate” in evaluation?
accuracy = 1 − error rate
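A small worked check of this identity, using made-up gold labels and predictions (both lists are purely illustrative):

```python
import math

# Hypothetical gold-standard labels and classifier predictions
gold = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam"]

correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)                   # proportion classified correctly
error_rate = (len(gold) - correct) / len(gold)   # proportion classified incorrectly

print(accuracy, error_rate)
print(math.isclose(accuracy, 1 - error_rate))
```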
With the aid of a diagram, describe what is meant by “maximal marginal” in the context of training a “support vector machine”.
the width of the margin (= the distance between the separating hyperplane and the support vectors) should be maximised
What makes a feature “good”, i.e. worth keeping in a feature representation? How might we measure that “goodness”?
good = correlation/association with category of interest (and non-redundant)
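The card does not name a specific measure. One commonly used option, assumed here purely for illustration, is the mutual information between a feature and the class label. The sketch below estimates it from counts; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, class_labels):
    """Estimate I(feature; class) in bits from co-occurrence counts."""
    n = len(class_labels)
    joint = Counter(zip(feature_values, class_labels))
    feature_counts = Counter(feature_values)
    class_counts = Counter(class_labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = feature_counts[x] / n
        p_y = class_counts[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical toy data: one feature predicts the class perfectly, one is independent of it
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

A higher score suggests a stronger association with the class; checking redundancy between features would need a separate comparison of features against each other.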
For the model below state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-layer perceptron with a softmax final layer
classification
For the model below state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: soft k-means
clustering
For the model below state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-response linear regression
classification
For the model below state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: logistic regression
classification
For the model below state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: model tree
regression
For the model below state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: support vector regression
regression
With the aid of an example, briefly describe what a “hyperparameter” is.
a top-level setting for a given model (which is set prior to training), e.g. the value of k in k-nearest neighbours
With the use of an example, outline what “stacking” is.
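combining the output of a number of base classifiers as input to a further supervised learning model, e.g. feeding the predictions of a decision tree, naive Bayes and k-nearest neighbours into a logistic regression meta-classifier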
What is the convergence criterion for the “EM algorithm”?
convergence of the log-likelihood, i.e. the change between successive iterations falls within a small epsilon
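To make the stopping rule concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture that halts once the log-likelihood changes by less than epsilon. The unit variance, the function names, the default epsilon, and the toy data are all simplifying assumptions, not part of the card.

```python
import math


def normal_pdf(x, mean):
    """Density of a Gaussian with variance fixed to 1 (a simplifying assumption)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)


def log_likelihood(data, means, weights):
    return sum(
        math.log(sum(w * normal_pdf(x, m) for w, m in zip(weights, means)))
        for x in data
    )


def em(data, means, weights, epsilon=1e-6, max_iter=1000):
    means, weights = list(means), list(weights)  # do not mutate the caller's lists
    prev_ll = log_likelihood(data, means, weights)
    for iteration in range(1, max_iter + 1):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m) for w, m in zip(weights, means)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate means and mixing weights from the responsibilities
        for k in range(len(means)):
            n_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            weights[k] = n_k / len(data)
        # Convergence criterion: log-likelihood changed by less than epsilon
        ll = log_likelihood(data, means, weights)
        if abs(ll - prev_ll) < epsilon:
            break
        prev_ll = ll
    return means, weights, ll, iteration


# Hypothetical data drawn around two well-separated centres
data = [-0.1, 0.0, 0.2, 4.9, 5.0, 5.1]
means, weights, ll, iterations = em(data, means=[1.0, 4.0], weights=[0.5, 0.5])
print(means, weights, iterations)
```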
Outline the basis of “purity” as a form of cluster evaluation.
what proportion of instances in the cluster correspond to the majority class
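A minimal sketch of the calculation, assuming each cluster is given as a list of the true class labels of its members. The function names and the example clusters are hypothetical, and the overall score shown is the usual size-weighted version.

```python
from collections import Counter


def cluster_purity(labels):
    """Proportion of instances in one cluster that belong to its majority class."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def overall_purity(clusters):
    """Size-weighted purity: total majority-class instances over all instances."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total


# Hypothetical clustering, listing the true class label of each member
clusters = [
    ["a", "a", "a", "b"],
    ["b", "b", "c"],
    ["c", "c", "c"],
]

print(cluster_purity(clusters[0]))
print(overall_purity(clusters))
```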
What is the underlying assumption behind active learning based on “query-by-committee”?
disagreement between base classifiers indicates that the instance is hard to classify, and thus will have high utility as a training instance
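One way to quantify committee disagreement, assumed here for illustration, is vote entropy: the entropy of the label distribution across the committee's votes. The instance with the highest entropy is queried next. The instance names and votes below are made up.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the labels predicted by the committee members."""
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(votes).values())


# Hypothetical committee of four classifiers voting on three unlabelled instances
committee_votes = {
    "x1": ["spam", "spam", "spam", "spam"],  # full agreement
    "x2": ["spam", "ham", "spam", "ham"],    # maximal disagreement
    "x3": ["spam", "spam", "spam", "ham"],   # mild disagreement
}

# Query the instance on which the committee disagrees most
query = max(committee_votes, key=lambda x: vote_entropy(committee_votes[x]))
print(query)
```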
“Random forests” are based on decision trees under different dimensions of “randomisation”. With reference to the following toy training dataset, provide a brief outline of two (2) such “random processes” used in training a random forest. (You should give examples as necessary; it is not necessary to draw the resulting trees, although you may do so if you wish.)
1. random sampling of training instances (similarly to bagging)
2. random subsampling of attributes for a given decision tree
3. random construction of new features based on linear combinations of numeric features
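The first two random processes can be sketched as follows. The exam's toy dataset is not reproduced in this deck, so the attributes and instances below are hypothetical stand-ins, as are the function names. The sketch draws the attribute subset once per tree, as the answer describes; many implementations instead draw a fresh subset at every split.

```python
import random


def bootstrap_sample(instances, rng):
    """Sample len(instances) instances with replacement, as in bagging."""
    return [rng.choice(instances) for _ in instances]


def random_attribute_subset(attributes, k, rng):
    """Pick k distinct attributes that one tree is allowed to use."""
    return rng.sample(attributes, k)


# Hypothetical stand-in for the toy dataset: (outlook, temperature, humidity, windy, label)
attributes = ["outlook", "temperature", "humidity", "windy"]
instances = [
    ("sunny", "hot", "high", False, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
]

rng = random.Random(0)  # fixed seed so that runs are repeatable
for tree_id in range(3):
    sample = bootstrap_sample(instances, rng)
    subset = random_attribute_subset(attributes, 2, rng)
    print(tree_id, subset, len(sample))
```

Each tree would then be trained on its own sample using only its own attribute subset.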
HMMs
In the “forward algorithm”, αt(j) is used to “memoise” a particular value for each state j and observation t. Describe what each αt(j) represents.
the probability of observing all observations up to and including t and ending up in state j
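The recurrence behind this value can be sketched as below, where alpha[t][j] holds the joint probability of the first t+1 observations and of being in state j at that point (lists are 0-indexed). The two-state HMM and all of its probabilities are hypothetical.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return alpha, where alpha[t][j] = P(observations[0..t], state at t is j)."""
    # Initialisation: start in state j and emit the first observation
    alpha = [{j: start_p[j] * emit_p[j][observations[0]] for j in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        # Sum over all predecessor states i, then emit the current observation from j
        alpha.append({
            j: sum(prev[i] * trans_p[i][j] for i in states) * emit_p[j][obs]
            for j in states
        })
    return alpha


# Hypothetical HMM with states H and C and output symbols x and y
states = ["H", "C"]
start_p = {"H": 0.6, "C": 0.4}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {"x": 0.5, "y": 0.5}, "C": {"x": 0.1, "y": 0.9}}

alpha = forward(["x", "y"], states, start_p, trans_p, emit_p)
# Summing the final column gives the probability of the whole observation sequence
print(sum(alpha[-1].values()))
```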
HMMs
In the “Viterbi algorithm”, two memoisation variables are used to describe for each combination of state j and observation t:
- βt(j) (which plays a similar role to αt(j) in the forward algorithm)
- φt(j). Describe what each φt(j) represents.
the most probable immediately preceding state for state j given observations up to and including t
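A minimal sketch of how the two memoisation variables work together, following the card's naming: beta[t][j] stores the probability of the best path ending in state j at t, and phi[t][j] stores the back-pointer to that path's previous state. The two-state HMM and its probabilities are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    beta = [{j: start_p[j] * emit_p[j][observations[0]] for j in states}]
    phi = [{j: None for j in states}]  # no predecessor at the first step
    for obs in observations[1:]:
        prev = beta[-1]
        beta_t, phi_t = {}, {}
        for j in states:
            # Best predecessor i of state j: max instead of the forward algorithm's sum
            best_i = max(states, key=lambda i: prev[i] * trans_p[i][j])
            beta_t[j] = prev[best_i] * trans_p[best_i][j] * emit_p[j][obs]
            phi_t[j] = best_i
        beta.append(beta_t)
        phi.append(phi_t)
    # Backtrace from the best final state by following the phi back-pointers
    last = max(states, key=lambda j: beta[-1][j])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(phi[t][path[-1]])
    path.reverse()
    return path, beta[-1][last]


# Hypothetical HMM with states H and C and output symbols x and y
states = ["H", "C"]
start_p = {"H": 0.6, "C": 0.4}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {"x": 0.5, "y": 0.5}, "C": {"x": 0.1, "y": 0.9}}

path, prob = viterbi(["x", "y", "y"], states, start_p, trans_p, emit_p)
print(path, prob)
```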
HMMs
Why do we tend to use “log probabilities” in the Viterbi algorithm but not the forward algorithm?
Viterbi is based on multiplication (taking the maximum over products of probabilities), so it can be converted to a sum of log probabilities; the forward algorithm is based on sums of products of probabilities, so taking logs doesn’t simplify the calculation
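The practical motivation can be seen in a short sketch: a long product of small probabilities underflows to zero in floating point, while the equivalent sum of logs stays finite. The per-step probability of 1e-5 is a made-up value. The `log_sum_exp` helper is an illustrative assumption showing the extra work a log-space forward algorithm would need, since the log of a sum does not decompose into a sum of logs.

```python
import math

# Hypothetical per-step probabilities along one path through a long sequence
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # 1e-500 is below the smallest positive double, so this underflows

log_score = sum(math.log(p) for p in probs)  # same quantity, kept in log space

print(product)    # underflowed product
print(log_score)  # finite log-probability


def log_sum_exp(log_values):
    """Compute log(sum(exp(v))) stably; needed to add probabilities held as logs."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values))


# Adding 0.2 and 0.3 while staying in log space
print(math.exp(log_sum_exp([math.log(0.2), math.log(0.3)])))
```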