Machine Learning Flashcards

- What is (machine) learning? - What different types of datasets exist? - Which learning paradigms result from these types? - How does the Perceptron learning rule work? - What is a common method for supervised learning?

1
Q

What is learning?

A

Learning is the acquisition of new information or knowledge or the process of doing so by trial and error.

2
Q

What is Machine Learning?

A

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

3
Q

What are the four components of machine learning?

A
  • Dataset S: set of samples generated by some process; either labelled or unlabelled
  • Model M: a representation of input/output relationships intended to model the process that generates S
  • Objective Function L: a function that encodes the current performance of M (e.g. loss or reward)
  • Algorithm A: the learning algorithm that adjusts M based on S and L
4
Q

What is the cognitive function of Autonomy within a cognitive system?

A

The ability to dynamically adapt to changes in the environment (e.g. continuous online learning from a live data stream)

5
Q

What is the cognitive function of Perception within a cognitive system?

A

The ability to learn how to detect and categorize perceptual stimuli (e.g. unsupervised learning of visual features)

6
Q

Why are conventional programming methods not possible in complex dynamic environments?

A
  • The environment is changing continuously
  • Dynamics and objects are too complex to be modelled explicitly
  • The system itself is subject to change
7
Q

What is a feature?

A

A feature is an individual measurable property or characteristic of a phenomenon being observed.

They abstract from raw data and represent semantic information in the data.

8
Q

What is feature engineering?

A

Feature engineering is the process of using domain knowledge to extract features from raw data.

Selected features are grouped into a feature vector.

Designing features is one of the most challenging tasks in machine learning. Manual feature engineering can become extremely complex for real-world datasets.

9
Q

What is the definition of the Machine Learning Task?

A

Train a model M in a hypothesis space H using a learning algorithm A so that M minimizes loss L for dataset S
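
The definition above can be sketched in a few lines. This is only an illustration: the dataset, the hypothesis space (lines through the origin with a handful of candidate slopes) and the exhaustive-search algorithm are assumptions chosen for brevity, not part of the card.

```python
# Dataset S: labelled samples (x, y) from the hypothetical process y = 2x.
S = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Hypothesis space H: models y = m * x for a few candidate slopes m.
H = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]


def loss(m, samples):
    """Objective function L: mean squared error of the model y = m * x."""
    return sum((m * x - y) ** 2 for x, y in samples) / len(samples)


def learn(hypotheses, samples):
    """Algorithm A: exhaustive search for the model in H with minimal loss."""
    return min(hypotheses, key=lambda m: loss(m, samples))


best_m = learn(H, S)
print(best_m, loss(best_m, S))
```

Real hypothesis spaces are far too large for exhaustive search, which is why practical algorithms rely on iterative optimization instead.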

10
Q

Why can many real-world tasks not be modelled as pairs of input and desired outputs?

A

Many real-world tasks have complex dynamics and unknown goal representations.

11
Q

What is the goal of Unsupervised Learning?

A

To discover structural features in the data set, given solely unlabelled data. Often applied as a pre-processing tool for initial data analysis

12
Q

What is semi-supervised learning?

A

Semi-supervised learning combines a small amount of labelled data with a large amount of unlabelled data during training. A priori assumptions about the input data are required.

13
Q

What is a Model?

A

The model M of a machine learning system encapsulates the outcome of the learning process. The goal of the learning algorithm A is to find an optimal model M* = argmin_{M ∈ H} L(M), i.e. the model in the hypothesis space H with the lowest loss.

It is often not possible to find a global minimum; therefore local optima have to suffice.

14
Q

What is the hypothesis space H?

A

The set of all candidate models that the learning algorithm can choose from.

E.g. decision trees, polynomials, neural networks, …

15
Q

What are ensemble methods?

A

Ensemble methods combine multiple learning algorithms to obtain better predictive performance.

16
Q

What is Boosting?

A

Boosting is an ensemble method in supervised learning. By combining several weak learners, it is possible to build a strong learner.

Boosting algorithms compute a strong learner incrementally by constructing an ensemble of hypotheses and increasing the weights of misclassified samples. The training of new hypotheses focuses on samples with large weights.
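
A minimal sketch of one reweighting round, assuming the AdaBoost update rule (the card does not name a specific boosting algorithm) and labels in {-1, +1}. The function name `reweight` and the toy data are illustrative.

```python
import math


def reweight(weights, labels, predictions):
    """One AdaBoost-style round: return (alpha, new normalized weights)."""
    # Weighted error of the current weak hypothesis (assumes 0 < err < 1).
    err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # Vote of this hypothesis in the final ensemble.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Increase the weights of misclassified samples, decrease the others.
    new = [w * math.exp(alpha if y != p else -alpha)
           for w, y, p in zip(weights, labels, predictions)]
    total = sum(new)
    return alpha, [w / total for w in new]


labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]  # the second sample is misclassified
alpha, weights = reweight([0.25] * 4, labels, predictions)
print(alpha, weights)
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.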

17
Q

What is a weak learner?

A

A classifier that is only slightly better than random guessing.

18
Q

How can the performance of a hypothesis be judged?

A

By how well it predicts seen data from the training set (“fit”) and unseen data (generalization).

19
Q

What is Occam’s Razor?

A

It is desirable to choose a model that is as simple as possible (fewer parameters). “Of two competing theories, the simpler explanation of an entity is to be preferred.”

20
Q

What are generative and discriminative models?

A
  • Discriminative models learn the boundary between classes so they can discriminate between different kinds of data instances.
  • They are based on the posterior probabilities P(y|x).
  • Generative models learn the distribution of the data so they can generate new data instances.
  • They are based on the class-conditional probabilities P(x|y) together with the class priors P(y); predictions can be computed by applying Bayes’ theorem.
  • Compact representations of the training data set that have considerably fewer parameters than the dataset S
21
Q

How can overfitting be detected?

A

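By training on a training set while continuously assessing the performance and adjusting parameters based on a separate validation set. Overfitting shows up when the training error keeps falling while the validation error starts to rise. Finally, a held-out test set is used to assess the performance of the final model.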

22
Q

What is cross-validation?

A

The dataset is partitioned into k subsets. The model is trained in k iterations, each time holding out a different subset for validation and training on the remaining k-1 subsets. The overall performance corresponds to the averaged performance of the k iterations.
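
The partitioning can be sketched as follows. This is a minimal illustration without shuffling; `k_fold_splits`, `cross_validate` and the `evaluate` callback are hypothetical names, and `evaluate` stands in for "train on the training indices, score on the validation indices".

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, validation) over k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)


for train, val in k_fold_splits(6, 3):
    print(train, val)
```

Every sample is used for validation exactly once and for training k-1 times.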

23
Q

How can overfitting be avoided?

A
  • Regularization: include a regularization term in the objective function L that punishes complexity
  • More training data: increase the size and variety of the dataset
  • Dataset augmentation: transform training samples (add noise, shifts, rotations, etc.)
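
As an illustration of the first bullet, a regularized objective for a linear model might look like this. The squared L2 penalty and the names `mse`, `regularized_loss` and `lam` are assumptions; the card does not prescribe a particular regularizer.

```python
def mse(weights, samples):
    """Data term: mean squared error of the linear model sum_i w_i * x_i."""
    total = 0.0
    for x, y in samples:
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (prediction - y) ** 2
    return total / len(samples)


def regularized_loss(weights, samples, lam):
    """Objective L = data term + lam * squared L2 norm of the weights."""
    return mse(weights, samples) + lam * sum(w * w for w in weights)


samples = [((1.0, 0.0), 1.0)]
print(regularized_loss((1.0, 0.0), samples, lam=0.1))
```

The factor `lam` trades off fit against complexity: `lam = 0` recovers the unregularized loss.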
24
Q

How is an artificial neuron defined?

A

By its synaptic weights and an activation function.

25
Q

What is the output n(x) of a neuron n?

A

n(x) = A(w₁x₁ + … + wₙxₙ + b)

Here A is the activation function, w₁, …, wₙ are the synaptic weights, and b is the bias.
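
A direct transcription of the formula, as a sketch. The sigmoid is only one possible choice for the activation function A, and the inputs and weights are made up.

```python
import math


def sigmoid(z):
    """One common choice for the activation function A."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation):
    """Output of a single neuron: activation(w . x + b)."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(weighted_sum)


# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0.
print(neuron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid))
```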

26
Q

What is a perceptron?

A

A perceptron is a linear classifier that is based on a single neuron with a digital threshold function. The perceptron criterion punishes only incorrectly classified samples.

Invented by Frank Rosenblatt in 1958.

27
Q

What does the perceptron convergence theorem state?

A

If a solution exists (i.e. the data set is linearly separable), then the perceptron learning algorithm finds a solution within a finite number of steps.
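
A minimal sketch of the perceptron learning rule on a linearly separable toy problem (logical AND). The learning rate, the epoch limit and the function names are illustrative assumptions.

```python
def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, n_inputs, lr=1.0, max_epochs=100):
    """Update weights only on misclassified samples until none remain."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            output = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
            delta = target - output  # 0 if correct, +1 or -1 otherwise
            if delta != 0:
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
                b += lr * delta
                errors += 1
        if errors == 0:  # a full pass without mistakes: converged
            break
    return w, b


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND, n_inputs=2)
print(w, b)
```

For data that is not linearly separable (e.g. XOR) the loop would simply run until `max_epochs` without reaching zero errors.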

28
Q

What is the desired output of a regression task?

A

A regressor h that predicts yₙ for a given xₙ with minimum loss L.

29
Q

What is the difference between interpolation and regression?

A

The interpolation function f must be consistent with the samples S: f(xᵢ)=yᵢ

In contrast, in regression the hypothesis h should minimize L and generalize well to new samples.

30
Q

What does Maximum Likelihood mean?

A

For a given dataset, we want to optimize the weights so that they maximize the likelihood of the observations (target values yₙ) for the input xₙ.
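
As a worked example, assume a linear model with independent Gaussian noise, yₙ = wᵀxₙ + εₙ with εₙ ~ N(0, σ²). This noise model is an assumption made here for illustration. Under it, maximizing the likelihood is the same as minimizing the sum of squared errors:

```latex
\begin{aligned}
p(y_1,\dots,y_N \mid x_1,\dots,x_N, w)
  &= \prod_{n=1}^{N} \mathcal{N}\!\left(y_n \mid w^{\top} x_n, \sigma^2\right) \\
\ln p(y_1,\dots,y_N \mid x_1,\dots,x_N, w)
  &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - w^{\top} x_n\right)^2
     - \frac{N}{2} \ln\!\left(2\pi\sigma^2\right) \\
w_{\mathrm{ML}}
  &= \arg\max_{w} \ln p
   = \arg\min_{w} \sum_{n=1}^{N} \left(y_n - w^{\top} x_n\right)^2
\end{aligned}
```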

31
Q

What is the 1-of-K coding scheme?

A

𝐾 classes are represented using a 1-of-𝐾 coding scheme, where class 𝐶ₗ is encoded as a vector with a single 1 at position 𝑙 and 0 everywhere else:

𝐶ₗ = (0, 0, …, 1, …, 0) ∈ {0,1}ᴷ
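
A small sketch of the encoding, assuming classes are numbered from 0 to K-1 (the card does not fix the indexing). The function name `one_hot` is illustrative.

```python
def one_hot(label, num_classes):
    """Encode a class index as a 1-of-K vector."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range 0..num_classes-1")
    return [1 if i == label else 0 for i in range(num_classes)]


print(one_hot(2, 4))
```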

32
Q

What are generalized linear models for classification?

A

y_cls(xₙ, w) = f(y_reg(xₙ, w))

A generalized linear model applies a nonlinear activation function f to the output of a linear regression model.

Subspaces y_cls(xₙ, w) = c for a constant c are called decision surfaces.

33
Q

What is a linear discriminant function?

A

A linear discriminant function divides the feature space by a hyperplane decision surface.

34
Q

What is the Heaviside step function?

A

f(x) = 0 if x < 0, 1 if x ≥ 0

The resulting decision surface is wᵀxₙ = 0

35
Q

What is the One-Versus-the-Rest Classifier?

A

We separate K classes with K-1 binary discriminant functions. Every discriminant function separates one class from all others.

Some regions may be classified ambiguously!

36
Q

What is the One-Versus-One Classifier?

A

We separate K classes pairwise with K(K-1)/2 binary discriminant functions. The class of the data sample is assigned by a majority vote.

Some regions may be classified ambiguously!

37
Q

What is the problem with one-versus-the-rest and one-versus-one classifiers? How can that be fixed?

A

The input space regions defined by the discriminant functions may overlap or fail to cover the whole input space.

The solution is to determine a partition of the input space. Every class is assigned a linear function. A sample is assigned to a class if that class's function returns a larger value than the functions of all other classes.
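
The fix can be sketched as an argmax over K linear functions. The weight vectors and biases below are made-up values for three classes in a 2D input space, not learned parameters.

```python
def classify(x, weight_vectors, biases):
    """Assign x to the class whose linear function gives the largest value."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weight_vectors, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


W = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]  # one weight vector per class
B = [0.0, 0.0, 0.0]

print(classify((2.0, 1.0), W, B))
print(classify((0.0, 3.0), W, B))
print(classify((-2.0, -2.0), W, B))
```

Because every input gets exactly one largest score (up to ties on the boundaries), the regions partition the input space without gaps or overlaps.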

38
Q

What is the problem with drawing class identifiers from ordered sets? How can that be fixed?

A

Ordered sets implicitly impose a non-existing order on the classes that might be captured by the learned model.

Use one-hot encoding (0, 0, …, 1, …, 0)

39
Q

What are the issues with Least Squares Linear Classification?

A

Least squares classification is sensitive to outliers.

Also, if the input is linearly separable, the algorithm computes one of infinitely many solutions.

40
Q

What are Support Vector Machines?

A

SVMs minimize the generalization error by computing a hyperplane that maximizes the margin of the classifier.

Only support vectors define the decision boundary of the SVM and must be saved.

In general, the optimization function includes slack variables to allow for training samples that lie inside the margin (-> better generalization performance)

41
Q

What was ADALINE?

A

ADALINE was an early (1960) single-layer artificial neural network with multiple nodes, where each node accepts multiple inputs and generates one output.

42
Q

What was the AI winter?

A

The AI winter was a period of reduced funding and interest in artificial intelligence research. It began in 1969 with the proof that the perceptron only works for simple data sets that are linearly separable.

The AI winter ended with the development of new neural network architectures and the backpropagation algorithm.

43
Q

What is the universal approximation theorem?

A

A neural network with only a single hidden layer can approximate any continuous function on a compact domain arbitrarily well, provided the hidden layer has enough neurons.

44
Q

How does backpropagation work?

A

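Backpropagation computes the gradient of the loss with respect to every weight by applying the chain rule layer by layer, from the output back to the input. The weights are then adjusted along the negative gradient to reduce the loss.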
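
A minimal sketch of the chain rule on a tiny network with one hidden neuron, y = w2 · sigmoid(w1 · x), and the squared-error loss 0.5 · (y − t)². The architecture, the loss and all numeric values are assumptions chosen to keep the example short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    h = sigmoid(w1 * x)  # hidden activation
    y = w2 * h           # linear output
    return h, y


def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2


def backprop(w1, w2, x, t):
    """Return (dL/dw1, dL/dw2), computed backwards via the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - t                        # derivative of the loss w.r.t. y
    dL_dw2 = dL_dy * h                   # y = w2 * h
    dL_dh = dL_dy * w2                   # propagate back to the hidden layer
    dL_dw1 = dL_dh * h * (1.0 - h) * x   # sigmoid'(z) = h * (1 - h)
    return dL_dw1, dL_dw2


w1, w2, x, t = 0.5, -0.3, 1.5, 1.0
g1, g2 = backprop(w1, w2, x, t)
lr = 0.1
print(loss(w1, w2, x, t), loss(w1 - lr * g1, w2 - lr * g2, x, t))
```

The second printed value is the loss after one gradient-descent step with the (arbitrary) learning rate `lr`.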

45
Q

What is SIFT?

A

The scale-invariant feature transform (SIFT) is a feature detection algorithm in computer vision to detect and describe local features in images.

It was published by David Lowe in 1999.

The SIFT features are designed to be invariant to scale and translation, and partially invariant to illumination changes and projections. Specific neurons in the inferior temporal cortex share similarities with the SIFT features.

46
Q

What is a Deep Neural Network?

A

A deep neural network (DNN) is an artificial neural network (ANN) with multiple (hidden) layers between the input and output layers.

47
Q

What is the advantage of deep networks?

A

Deep neural networks can learn suitable feature space representations automatically.

Layers in deep neural networks correspond to features at different levels of abstraction.

48
Q

What do the “convolutions” in CNNs imply?

A

Convolutions are filters on discrete (=sampled) signals (images, audio, sensor data etc).

Discrete convolutions can be achieved with kernels.

The synaptic connectivity pattern of convolutional neural networks implements a neural version of the discrete convolution.

49
Q

What are convolutional kernels?

A

Matrices that define a signal filter. The entries of the kernel determine the influence of neighboring data points (e.g. pixels) of a sample.

The filtered data sample is computed by sliding the kernel along the input sample.

Kernels encode position-invariant features.
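
Sliding a kernel over an input can be sketched as below. Two assumptions: no padding is used (so the output is smaller than the input), and the kernel is not flipped, which is how CNN layers usually apply it. The image and the kernel are toy values.

```python
def slide_kernel(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# A vertical edge between the second and third column.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1]]  # horizontal difference: responds to vertical edges

for row in slide_kernel(image, kernel):
    print(row)
```

The same kernel responds to the edge in every row, which is what makes the detected feature position-invariant.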

50
Q

What are Gabor filters?

A

Gabor filters can detect image content with a specific frequency and direction.

Deep neural networks typically learn features similar to Gabor filters in the first layer.

51
Q

What was LeNet-5 (1998)?

A

A CNN for digit recognition

32 x 32 pixel input
60,000 trainable parameters

52
Q

What was AlexNet (2012)?

A

First modern deep CNN.

  • accelerated training on GPUs
  • outperformed all competitors in 2012
53
Q

What is weight sharing?

A

Weight sharing is a way to reduce the number of parameters (e.g. weights of a filter) while allowing for more robust feature detection.

54
Q

What is max-pooling?

A

Max-pooling is a sample-based discretization process. The objective is to down-sample an input image, reducing its dimensionality and allowing for assumptions to be made about features contained in the sub-regions binned.

  • reduces number of weights
  • makes network more robust against translations in the input image
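
A sketch of max-pooling with non-overlapping windows. The window size of 2 and the toy image are assumptions; rows and columns that do not fill a complete window are dropped.

```python
def max_pool(image, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    out_h = len(image) // size
    out_w = len(image[0]) // size
    return [[max(image[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1, 2, 0, 0],
         [3, 4, 0, 1],
         [0, 0, 5, 6],
         [0, 0, 7, 8]]

print(max_pool(image))
```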
55
Q

How can CNNs benefit from the Fourier transform?

A

Convolutions are computationally expensive.

When taking the Fourier transform (converting inputs and kernels into Fourier space), convolutions become simple multiplications: ℱ(𝑓∗𝑠) = ℱ(𝑓) ∙ ℱ(𝑠)

The challenge lies in keeping the computational cost for the FT low.
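
The identity can be checked numerically. This sketch assumes the discrete Fourier transform and circular convolution, for which the identity holds exactly; a linear convolution would need zero-padding first. The naive DFT below costs O(n²) and is for illustration only; in practice a fast Fourier transform would be used.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def circular_convolution(f, s):
    """Direct circular convolution of two equally long signals."""
    n = len(f)
    return [sum(f[m] * s[(t - m) % n] for m in range(n)) for t in range(n)]


f = [1.0, 2.0, 0.0, 0.0]
s = [0.5, 0.5, 0.0, 0.0]

direct = circular_convolution(f, s)
product = [a * b for a, b in zip(dft(f), dft(s))]  # pointwise multiplication
via_fourier = [v.real for v in inverse_dft(product)]
print(direct)
print(via_fourier)
```

Both lines agree up to floating-point rounding.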

56
Q

What are common simplifications in deep neural networks compared to biological neurons?

A
  • Single biological neurons can compute complex functions that would require several artificial neurons
  • The brain processes different streams of information in parallel and in different brain regions. Different data streams are fused into a coherent perception.
  • Information processing in the brain is more robust and versatile (data fusion, one-shot learning, memory etc.).
  • Deep neural networks suffer from catastrophic forgetting. Performance on a learned task decreases drastically as soon as a new task is trained.

example:
- adversarial patches (techniques to fool ML models)

57
Q

What is a capsule neural network (Hinton, 2017)?

A

In fully connected neural networks, all neurons process the complete input image; it is therefore not possible to separate the input into different aspects.

A Capsule Neural Network is an ANN that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization.

New type of neuron: the capsule. Encodes not only a single feature but also adds meta information on its instantiation.

The output of a capsule is a vector that indicates whether the visual feature represented by the capsule is present and how it differs from a prototype instance (pose, transformation, lighting etc.)

Lower-level capsules connect to higher-level capsules with a state that is in agreement with the current output.

➔ difficult to train!

58
Q

What is an advantage of spiking neural networks (SNNs)?

A

Energy efficiency

59
Q

What are the three basic machine learning paradigms?

A
  • supervised learning
  • unsupervised learning
  • reinforcement learning