ML and AI Flashcards

(197 cards)

1
Q

What is the difference between supervised, unsupervised, and reinforcement learning?

A

Supervised learning is task-driven (e.g., classification, regression), unsupervised is data-driven (e.g., clustering), and reinforcement learning involves learning from feedback (e.g., trial-and-error).

2
Q

Why is probabilistic machine learning important?

A

It helps quantify uncertainty, which is crucial for real-world decision-making (e.g., COVID-19’s R number). Probabilistic ML enables decisions under uncertainty.

3
Q

What are aleatoric and epistemic uncertainties?

A

Aleatoric uncertainty comes from inherent randomness (e.g., coin flips), while epistemic uncertainty arises from lack of knowledge (e.g., unknown coin side).

4
Q

What is probabilistic modelling?

A

It’s the use of statistical models to specify a probability distribution over data using parameters (e.g., θ). These models support prediction and uncertainty quantification.

5
Q

What distribution is known as the bell curve and what are its parameters?

A

The Gaussian (normal) distribution. Parameters: mean (μ) and standard deviation (σ).

6
Q

How is the likelihood function defined for IID data?

A

L(θ) = Πᵢ₌₁ⁿ p(yᵢ | θ). It shows how likely data y is for given θ. Different models yield different likelihoods.

7
Q

What is the goal of Maximum Likelihood Estimation (MLE)?

A

To find parameter values θ that maximise the likelihood function given the data.

8
Q

Why is log-likelihood used in MLE?

A

It simplifies mathematics, especially for product-based likelihoods, by turning them into sums.

9
Q

Give an example of using MLE with normally distributed data.

A

If p(yᵢ | θ) ~ N(μ, σ²) and σ is known, MLE is used to estimate μ by maximising likelihood based on data.

10
Q

How is MLE used for a Bernoulli distribution (e.g., coin flips)?

A

With heads as 1 and tails as 0, estimate the parameter θ (probability of heads) using the likelihood of observed outcomes.
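As a concrete sketch of this (the true coin bias and sample size below are made up for illustration), the Bernoulli MLE is just the sample proportion of heads, which can be checked against a grid search over the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=100)   # hypothetical coin flips, heads = 1

# Closed-form Bernoulli MLE: the sample mean (proportion of heads)
theta_closed = flips.mean()

# Numerical check: maximise the log-likelihood over a grid of theta values
thetas = np.linspace(0.001, 0.999, 999)
heads = flips.sum()
log_lik = heads * np.log(thetas) + (len(flips) - heads) * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_closed, theta_grid)          # both should be close to the true 0.7
```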

11
Q

Why is the choice of distribution critical in probabilistic modelling?

A

It influences parameter estimates and model behaviour. Different distributions represent different assumptions about the data.

12
Q

What is the Bayesian perspective in probabilistic inference?

A

It treats parameters as uncertain and models them with a prior distribution, combining it with the likelihood to get the posterior.

13
Q

Define Prior, Likelihood, and Posterior in Bayesian inference.

A

Prior (p(θ)): belief before data. Likelihood (p(y | θ)): probability of data given parameters. Posterior (p(θ | y)): updated belief after observing data.

14
Q

What is Bayes’ rule?

A

p(θ | y) = [p(y | θ) * p(θ)] / p(y). Posterior is proportional to likelihood times prior. p(y) is the normalisation constant.

15
Q

What is Maximum A Posteriori Estimation (MAP)?

A

MAP estimates the most probable parameter values by maximising the posterior: θ_MAP = argmax_θ [log p(y | θ) + log p(θ)].

16
Q

What is the difference between MLE and MAP?

A

MLE uses only the likelihood. MAP includes both the likelihood and the prior, favouring more plausible parameter values.

17
Q

Why can the product of likelihood and prior not be used directly as a probability?

A

It must be normalised to integrate to 1. Bayes’ rule achieves this by dividing by the marginal likelihood p(y).

18
Q

What distribution is typically used as a prior for Bernoulli likelihoods?

A

The Beta distribution, due to its conjugacy, simplifies posterior calculation.
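A minimal sketch of that conjugate update (the prior hyperparameters and data are illustrative): with a Beta(a, b) prior and k heads in n flips, the posterior is Beta(a + k, b + n − k).

```python
from scipy import stats

a, b = 2.0, 2.0          # illustrative Beta prior hyperparameters
heads, n = 13, 20        # hypothetical observed coin flips

# Conjugate update: Beta prior + Bernoulli likelihood -> Beta posterior
post = stats.beta(a + heads, b + n - heads)

print(post.mean())                     # posterior mean of theta
print(post.interval(0.95))             # 95% credible interval
```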

19
Q

What are conjugate priors and why are they useful?

A

A prior is conjugate to a likelihood if the posterior is in the same distribution family. It simplifies computations and avoids numerical integration.

20
Q

What are informative and non-informative priors?

A

Informative priors encode strong beliefs (low variance); non-informative priors are flat or weakly informative, allowing data to dominate inference.

21
Q

How do weakly informative priors differ from non-informative priors?

A

Weakly informative priors slightly constrain plausible values based on domain knowledge, while non-informative priors assume minimal knowledge.

22
Q

What are the advantages of using posterior distributions?

A

They allow prediction, quantification of uncertainty, and model checking using posterior predictive distributions.

23
Q

What approximation methods exist for Bayesian inference when analytical solutions are hard?

A

Common methods include Laplace approximation, variational inference, and Monte Carlo (e.g., MCMC).

24
Q

Summarise the Bayesian workflow in three key terms.

A

Prior (belief), Likelihood (data given belief), Posterior (updated belief after data). This trio defines Bayesian inference.

25
What is the general form of a simple linear regression model?
y = wx + ε, where ε ~ N(0, σ²). Here, y is the response variable, x is the explanatory variable, w is the coefficient, and ε is the noise.
26
How is multiple linear regression expressed?
y = w^T x + ε, where w and x are vectors, and ε ~ N(0, σ²). The model captures linear relationships in higher dimensions.
27
What is the likelihood function in multiple linear regression assuming normal noise?
p(y_i | x_i, w, σ²) = N(y_i | w^T x_i, σ²).
28
What are the components of Bayesian linear regression?
Prior: p(w) = N(w | w0, Σ0); Likelihood: p(y | X, w, σ²) = N(y | Xw, σ² I); Posterior: p(w | y, X, σ²) = N(w_n, Σ_n)
29
How is the posterior in Bayesian linear regression obtained?
By applying Bayes' rule: Posterior ∝ Likelihood × Prior. In log terms, it's log Posterior ∝ log Likelihood + log Prior.
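A small numpy sketch of this posterior in closed form (synthetic data, illustrative zero-mean prior), using the standard Gaussian update Σₙ = (Σ₀⁻¹ + (1/σ²)XᵀX)⁻¹ and wₙ = Σₙ(Σ₀⁻¹w₀ + (1/σ²)Xᵀy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with a bias column
w_true = np.array([1.0, 2.0])
sigma2 = 0.25
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Prior p(w) = N(w0, Sigma0); here an illustrative zero-mean isotropic prior
w0 = np.zeros(d)
Sigma0 = np.eye(d)

# Posterior covariance and mean of the weights
Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
w_n = Sigma_n @ (Sigma0_inv @ w0 + X.T @ y / sigma2)

print(w_n)        # posterior mean, should be close to [1, 2]
print(Sigma_n)    # posterior covariance
```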
30
Why might linear regression be inappropriate for certain data types?
If the output variable is count-based or categorical, the assumptions of linear regression (e.g., normality of errors) don't hold.
31
What distribution is more suitable for count data in regression?
The Poisson distribution, which has a single parameter λ (rate), where mean = variance = λ.
32
What is a link function in Generalised Linear Models (GLMs)?
A link function connects the mean of the distribution to a linear predictor. For Poisson, the canonical link is log(λ) = w x.
33
How is a Poisson GLM modelled mathematically?
λ_i = exp(w x_i), and thus log(λ_i) = w x_i. This ensures λ_i > 0.
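A minimal sketch of fitting such a model by maximising the Poisson log-likelihood numerically (synthetic data; scipy's general-purpose minimiser stands in for a proper GLM fitting routine):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=200)
w_true = 1.2
y = rng.poisson(np.exp(w_true * x))          # counts with log-link rate lambda_i = exp(w * x_i)

def neg_log_lik(w):
    lam = np.exp(w[0] * x)
    # Poisson log-likelihood (dropping the constant log(y!) term)
    return -(y * np.log(lam) - lam).sum()

w_hat = minimize(neg_log_lik, x0=[0.0]).x[0]
print(w_hat)                                  # should be close to 1.2
```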
34
What are the key takeaways about Generalised Linear Models?
GLMs extend linear models to handle non-normal data by applying a suitable link function to relate predictors to non-Gaussian responses.
35
What distinguishes generative and discriminative models in classification?
Generative models learn joint distribution p(x, y), while discriminative models learn conditional distribution p(y | x).
36
Give examples of discriminative and generative models.
Discriminative: Logistic regression, SVMs. Generative: Naive Bayes, Gaussian mixture models.
37
How does a discriminative model like logistic regression model class probabilities?
It models p(y_i | x_i, θ) directly and predicts class for new x′ using p(y′ | x′, θ).
38
In logistic regression, how is the likelihood function defined for binary data?
It's the likelihood of Bernoulli trials: each y_i is either 0 or 1, and θ_i = f(x_i, w) represents the probability.
39
Why can't linear functions be used directly for logistic regression probabilities?
Because probabilities must lie in [0, 1], and linear functions are unbounded.
40
What is the logistic function used in logistic regression?
η_i = w0 + w1 x_i; θ_i = 1 / (1 + exp(-η_i)). This maps η_i to [0, 1].
41
What is Bayesian logistic regression?
A model where the class probabilities follow a logistic function, and the weights w0, w1 have prior distributions. Posterior is approximated using MCMC or Laplace methods.
42
How does a Naive Bayes classifier work?
It estimates p(y | x) using Bayes' rule and assumes feature independence: p(x | y) = ∏ p(x_j | y).
43
What are common applications of Naive Bayes?
Text classification (spam detection), document categorisation, and sentiment analysis.
44
When is Naive Bayes not suitable?
When features are dependent (e.g., word pairs like "Boris Johnson" or correlated measurements like petal length/width).
45
Summarise the advantages and disadvantages of generative vs discriminative models.
Generative: better with missing data, small data, unlabeled data; Discriminative: better predictive performance when ample labeled data exists.
46
What is the main goal of Laplace approximation?
To find a Gaussian approximation to a continuous probability distribution that may be difficult to handle analytically, particularly when the posterior does not have a closed-form expression.
47
Why might Laplace approximation be needed in Bayesian inference?
Because the posterior distribution, especially in models like Bayesian logistic regression, often lacks a closed-form and must be approximated for inference.
48
What are the components of Bayesian linear regression?
The prior is p(w) = N(w | w₀, Σ₀), the likelihood is p(y | X, w, σ²) = N(y | Xw, σ²I), and the posterior is p(w | y, X, σ²) = N(wₙ, Σₙ).
49
How is the posterior computed in Bayesian linear regression?
Using Bayes’ rule: Σₙ = (Σ₀⁻¹ + (1/σ²) XᵗX)⁻¹ and wₙ = Σₙ (Σ₀⁻¹ w₀ + (1/σ²) Xᵗy); with a zero prior mean this reduces to wₙ = Σₙ (1/σ²) Xᵗy.
50
What is the key challenge in Bayesian logistic regression?
The posterior cannot be expressed in closed form due to the logistic likelihood, so approximation techniques like Laplace, MCMC, or variational inference are needed.
51
What is the general form of the Laplace approximation?
Given p(θ), the approximation is q(θ) ≈ N(θ | θ₀, A⁻¹), where θ₀ is the mode of p(θ) and A is the negative second derivative (curvature) of log p(θ) at the mode.
52
What does the normalisation constant Z represent in Laplace approximation?
Z = ∫ p̃(θ) dθ, the normaliser of an unnormalised density p̃(θ). It's generally intractable, and Laplace uses a Gaussian approximation to estimate it.
53
How is a 1D Laplace approximation derived?
By applying a second-order Taylor expansion to log p(θ) around its mode θ₀, and approximating p(θ) ≈ p(θ₀) exp(-A/2 (θ - θ₀)²).
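A numerical sketch of this recipe on an illustrative Gamma-shaped density: find the mode, measure the curvature of log p there by finite differences, and read off the Gaussian approximation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative unnormalised target: p_tilde(theta) ∝ theta^4 * exp(-2*theta)  (a Gamma(5, 2) shape)
def log_p(theta):
    return 4 * np.log(theta) - 2 * theta

# Step 1: find the mode theta_0 of log p
res = minimize_scalar(lambda t: -log_p(t), bounds=(1e-6, 20), method="bounded")
theta0 = res.x

# Step 2: curvature A = -d^2/dtheta^2 log p at the mode (finite differences)
h = 1e-4
A = -(log_p(theta0 + h) - 2 * log_p(theta0) + log_p(theta0 - h)) / h**2

# Laplace approximation: q(theta) = N(theta | theta0, 1/A)
print(theta0, 1 / A)   # mode should be 2.0 and variance 1/A should be 1.0 for this target
```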
54
What is the role of the second derivative in 1D Laplace approximation?
The negative second derivative of log p(θ) at θ₀ defines the precision (inverse variance) of the Gaussian approximation.
55
How is Laplace approximation extended to multiple dimensions?
By estimating the mode θ₀ and the matrix A of negative second derivatives (the negative Hessian) of the log posterior at the mode.
56
What are the limitations of Laplace approximation?
It struggles with multimodal distributions, small datasets, non-Gaussian shapes, and can only handle real-valued variables unless modified (e.g., via log-transform).
57
What is model evidence in Bayesian model comparison?
p(D | Mᵢ) = ∫ p(D | θᵢ, Mᵢ) p(θᵢ | Mᵢ) dθᵢ. It quantifies how well a model explains the data, integrating over all parameter values.
58
How is model evidence approximated using Laplace approximation?
log p(D) ≈ log p(D | θ_MAP) + log p(θ_MAP) + (M/2) log(2π) - (1/2) log |A|, where A is the Hessian and M is the number of parameters.
59
What is the Occam factor in Bayesian model selection?
It penalises complex models by adjusting the model evidence based on the curvature (log determinant of the Hessian) and number of parameters.
60
What is the Bayesian Information Criterion (BIC)?
BIC ≈ log p(D | θ_MAP) - (M log N)/2, where N is the number of data points and M is the number of parameters. It penalises model complexity.
61
When is BIC an appropriate approximation for model comparison?
When the prior is relatively flat and the data set is large, BIC provides a simple penalty-based criterion for comparing models.
62
What does the summary of Week 3 highlight about Laplace approximation?
That it provides a Gaussian approximation of non-Gaussian posteriors, useful for large data sets but limited for multimodal distributions or small sample sizes.
63
What is information in the context of a discrete random variable?
Information measures the degree of surprise when observing a value: more probable events carry less information, and improbable events carry more.
64
How is the information measure h(x) for a discrete event defined?
h(x) = -log2(p(x)). This ensures a positive information measure and supports additivity for independent events.
65
What is the definition of entropy for a discrete distribution?
Entropy is the expected information: H[x] = -∑ p(x) log2(p(x)).
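A tiny sketch computing this in bits for a uniform versus a skewed distribution (both distributions are made up):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -(p * np.log2(p)).sum()

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: maximum for 4 outcomes
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))       # lower entropy for a skewed distribution
```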
66
How does entropy differ between uniform and non-uniform distributions?
A uniform distribution has higher entropy because each outcome is equally likely, leading to maximum uncertainty.
67
How does entropy relate to code length?
Entropy is the theoretical lower bound for the average number of bits needed to encode messages from a source. Shorter codes can be used for more probable symbols.
68
What does Shannon’s noiseless coding theorem state?
Entropy is the minimum number of bits on average required to represent a random variable without loss.
69
Where did the concept of entropy originate?
It originated in thermodynamics (Boltzmann), representing disorder, and was adapted by Shannon to quantify information content.
70
How is maximum entropy achieved in a discrete distribution?
When all outcomes are equally probable, i.e., p(x_i) = 1/M, then entropy is maximized at H = ln(M).
71
What is differential entropy?
It is the entropy of a continuous variable: H = -∫ p(x) ln(p(x)) dx. It differs from discrete entropy and can be negative.
72
For which distribution is differential entropy maximized under mean and variance constraints?
The Gaussian distribution.
73
What is a functional in the context of entropy?
A functional maps a function to a real number, e.g., entropy as a function of the probability distribution: H[p(x)].
74
How is the maximum differential entropy of a distribution found?
By using calculus of variations to maximize the entropy functional under constraints such as normalization, mean, and variance.
75
What is the result of maximizing differential entropy with known mean and variance?
The result is a Gaussian distribution. This is proven via the calculus of variations with constraints.
76
What is the value of differential entropy for a Gaussian?
H = (1/2) * log(2 * π * e * σ²). Entropy increases with variance.
77
What is conditional entropy?
It measures the remaining uncertainty of a variable y given that x is known: H[y|x] = -∑ p(x, y) log(p(y|x)).
78
How does joint entropy relate to conditional and marginal entropies?
H[x, y] = H[x] + H[y|x], which expresses the total information needed to describe both variables.
79
What is Kullback-Leibler (KL) divergence?
It is a measure of dissimilarity between two distributions: KL(p||q) = ∑ p(x) log(p(x)/q(x)).
80
Why is KL divergence important in Bayesian inference?
It quantifies the information loss when using an approximate distribution q(x) instead of the true distribution p(x).
81
What are key properties of KL divergence?
It is non-negative and zero only when p(x) = q(x). It is not symmetric: KL(p||q) ≠ KL(q||p).
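A short sketch on two made-up discrete distributions, illustrating non-negativity and asymmetry:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

p = [0.5, 0.4, 0.1]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))              # both >= 0 and generally unequal
print(kl(p, p))                        # 0 when the distributions match
```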
82
What mathematical principle is used to prove KL divergence is non-negative?
Jensen’s inequality, derived from the convexity of the function -ln(x).
83
What is mutual information?
It quantifies the reduction in uncertainty about one variable given knowledge of another: I(x; y) = H(x) - H(x|y).
84
How is mutual information related to KL divergence?
It is the KL divergence between the joint distribution p(x, y) and the product of marginals p(x)p(y).
85
What is the Bayesian interpretation of mutual information?
It represents the reduction in uncertainty (entropy) of the prior p(x) after observing y and updating to the posterior p(x|y).
86
Can differential entropy be negative? Why?
Yes, especially when variance σ² < 1/(2πe). This contrasts with discrete entropy, which is always non-negative.
87
Why is approximation necessary in Bayesian inference?
Posterior distributions are analytically tractable only for simple conjugate models. For complex models, integrals involved in inference are often intractable, necessitating approximations.
89
What are the two main categories of approximation techniques in Bayesian inference?
Deterministic approximations (e.g., Laplace, Variational Inference) and stochastic approximations (e.g., Markov Chain Monte Carlo).
90
What is the Laplace approximation in Bayesian inference?
It approximates the posterior distribution with a Gaussian centered at the mode θ₀ of the posterior: q(θ) ≈ N(θ | θ₀, A⁻¹), where A = −∇∇ log p(θ) evaluated at θ₀, i.e., the negative curvature (Hessian) of the log density at the mode.
91
What are the limitations of the Laplace approximation?
It's not suitable for multimodal distributions or small datasets. It assumes Gaussian shape and real-valued variables.
92
What is the core idea behind variational inference?
Approximate the true posterior p(θ|x) with a tractable distribution q(θ), and optimize the parameters of q(θ) to make it close to p(θ|x).
93
Why can't we directly minimize KL(q||p)?
Because p(θ|x) is intractable. Instead, we optimize the Evidence Lower Bound (ELBO), which indirectly minimizes KL divergence.
94
What is Jensen's inequality and how is it relevant to ELBO?
Jensen's inequality: f(E[x]) ≤ E[f(x)] for convex f (equivalently, f(E[x]) ≥ E[f(x)] for concave f). The ELBO is derived by applying it to the concave log function.
95
What is the ELBO in variational inference?
ELBO = E_q[log p(x, θ)] - E_q[log q(θ)]. Maximizing ELBO is equivalent to minimizing KL(q||p).
96
How is KL(q||p) related to the evidence and ELBO?
KL(q||p) = log p(x) - ELBO. Since log p(x) is fixed, maximizing ELBO minimizes KL divergence.
97
What is the difference between forward and reverse KL divergence?
Forward KL (KL(p||q)) is zero avoiding: q must place mass wherever p is non-zero. Reverse KL (KL(q||p)) is zero forcing: q avoids regions where p is small.
98
What is meant by 'mode seeking' and 'moment matching' in the context of KL divergence?
Reverse KL leads to 'mode seeking' (focuses on high-density regions). Forward KL leads to 'moment matching' (covers all regions where p is nonzero).
99
What is the goal of variational inference in practice?
To maximize ELBO to obtain a good approximation q(θ) to the true posterior p(θ|x).
100
What are the two main strategies in variational inference?
Mean field approximation (factorized distributions) and parametric approximation (using parameterized distributions q(θ|λ)).
101
What is the mean field approximation?
It assumes independence among latent variables: q(θ) = ∏ q_j(θ_j), simplifying optimization.
102
How is optimization done in mean field variational inference?
Using block coordinate ascent. At each step, optimize q_j(θ_j) keeping others fixed.
103
What is the update rule for mean field variational inference?
q_j(θ_j) ∝ exp(E_{-j}[log p(x, θ)]), where E_{-j} is expectation over all variables except θ_j.
104
What are the convergence criteria for mean field variational inference?
The algorithm repeats until the ELBO converges, assuming updates can be computed analytically.
105
What is the condition for mean field updates to be analytically tractable?
Conditional conjugacy of the prior and likelihood with respect to each variable θ_j.
106
What is a parametric approximation in variational inference?
q(θ) = q(θ | λ), where λ are parameters optimized to match q to the true posterior. It enables nonlinear optimization techniques.
107
What is a potential downside of parametric approximation?
If q is too simple, it may fail to approximate the true posterior well. If too complex, it may be difficult to optimize effectively.
108
What does variational inference transform Bayesian inference into?
An optimization problem where the objective is to find parameters (or functions) that best approximate the posterior distribution.
109
What is the goal of variational inference and its limitation?
It seeks an approximate distribution q(θ) that maximizes the ELBO or minimizes KL divergence to approximate the posterior p(θ). Its limitation is that q(θ) might be a poor approximation.
110
What is Monte Carlo integration in Bayesian inference?
It approximates integrals using sample averages drawn from a distribution. It's key for evaluating expectations when analytical solutions are intractable.
111
What is importance sampling used for?
To approximate expectations when direct sampling from p(x) is hard. Instead, we sample from a proposal distribution q(x) and reweight the samples using p(x)/q(x).
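A minimal sketch (target, proposal, and sample size are illustrative): estimating E_p[x²] for a standard normal target by drawing from a wider normal proposal and reweighting by p(x)/q(x).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100_000

# Target p = N(0, 1); proposal q = N(0, 2^2), non-zero wherever p is
samples = rng.normal(0, 2, size=n)
weights = stats.norm(0, 1).pdf(samples) / stats.norm(0, 2).pdf(samples)

# Importance-sampling estimate of E_p[x^2] (true value is 1)
estimate = np.mean(weights * samples**2)
print(estimate)
```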
112
What is a key assumption of importance sampling?
The proposal distribution q(x) must be non-zero wherever the target distribution p(x) is non-zero.
113
What affects the success of importance sampling?
The closer q(x) is to p(x), the more efficient the sampling. Mismatch causes high variance and potentially large weights for few samples.
114
What is rejection sampling?
A method for sampling from a target distribution p(x) using a proposal distribution q(x) and a constant k such that kq(x) ≥ p̃(x), the unnormalised target. Samples are accepted probabilistically.
115
What is the acceptance condition in rejection sampling?
A sample x₀ drawn from q is accepted if u₀ ≤ p̃(x₀), where u₀ is drawn from Uniform(0, kq(x₀)); equivalently, it is accepted with probability p̃(x₀)/(kq(x₀)).
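A sketch of this acceptance rule, using an unnormalised Beta(2, 5)-shaped target and a uniform proposal on [0, 1]; the constant k is chosen so that k·q(x) bounds p̃(x).

```python
import numpy as np

rng = np.random.default_rng(4)

def p_tilde(x):
    return x * (1 - x) ** 4            # unnormalised Beta(2, 5) shape on [0, 1]

k = 0.1                                 # satisfies k * q(x) >= p_tilde(x) with q = Uniform(0, 1)
accepted = []
for _ in range(20_000):
    x0 = rng.uniform(0, 1)             # draw from the proposal q
    u0 = rng.uniform(0, k * 1.0)       # u0 ~ Uniform(0, k q(x0))
    if u0 <= p_tilde(x0):              # accept if the point falls under the target curve
        accepted.append(x0)

print(len(accepted) / 20_000)           # acceptance rate
print(np.mean(accepted))                # close to the Beta(2, 5) mean, 2/7 ≈ 0.286
```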
116
Compare importance sampling and rejection sampling.
Importance sampling uses all samples with weights, while rejection sampling gives exact samples but may reject many. Both suffer in high dimensions.
117
What is Markov Chain Monte Carlo (MCMC)?
A framework for sampling from high-dimensional distributions using a Markov chain whose stationary distribution is the target distribution p(x).
118
What is a first-order Markov chain?
A process where the next state depends only on the current state, not on the past history.
119
What is an ergodic Markov chain?
A chain where the distribution converges to a unique invariant distribution, regardless of the initial state.
120
What does it mean for a distribution to be invariant under a Markov chain?
The distribution remains unchanged under the transition dynamics of the chain: π(x) = ∑_{x'} π(x') T(x | x').
121
What is the property of detailed balance?
A condition ensuring that π(x) T(x' | x) = π(x') T(x | x') holds, helping guarantee the invariant distribution is maintained.
122
What is the Metropolis algorithm?
A method where a new state x' is proposed from a symmetric distribution and accepted with probability r = min(1, p(x')/p(x)). Otherwise, the chain remains in x.
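A compact sketch of this loop with a symmetric Gaussian random-walk proposal and a made-up unnormalised target (a two-component Gaussian mixture); the step size and chain length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

def p_tilde(x):
    # Unnormalised target: a two-component Gaussian mixture
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

x = 0.0
chain = []
for _ in range(50_000):
    x_prop = x + rng.normal(scale=1.0)            # symmetric random-walk proposal
    r = min(1.0, p_tilde(x_prop) / p_tilde(x))    # Metropolis acceptance probability
    if rng.uniform() < r:
        x = x_prop                                # accept the move
    chain.append(x)                               # rejected proposals repeat the current state

chain = np.array(chain[5_000:])                   # discard burn-in
print(chain.mean(), chain.std())
```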
123
What happens to rejected proposals in the Metropolis algorithm?
Unlike rejection sampling, rejected proposals don't get discarded; instead, the current state is repeated in the chain.
124
When is the Metropolis-Hastings algorithm used?
When the proposal distribution q(x' | x) is asymmetric. Acceptance probability becomes r = min(1, [p(x') q(x | x')] / [p(x) q(x' | x)]).
125
Why is MCMC better suited for high-dimensional inference?
It decomposes high-dimensional sampling into simpler, often univariate steps, which is computationally more tractable.
126
What is Gibbs sampling?
A special case of MCMC where each variable is sampled in turn from its full conditional distribution, assuming the rest are fixed.
127
Why does Gibbs sampling always accept proposed states?
Because the proposal distribution is the full conditional, ensuring that each sample follows the target distribution. Hence, acceptance probability is 1.
128
What is the benefit of Gibbs sampling in multivariate distributions?
It simplifies sampling by reducing the task to a series of univariate conditional updates, making implementation easier for certain models.
129
What are the main takeaways from Week 6?
Bayesian inference often relies on sampling due to intractable integrals. Importance and rejection sampling are basic tools; MCMC (including Metropolis-Hastings and Gibbs sampling) scales better for complex, high-dimensional problems.
130
How does conventional machine learning differ from deep learning?
Conventional ML uses hand-engineered features, while deep learning learns representations directly from raw data in a data-driven manner.
131
What is LeNet and what was it trained on?
LeNet is an early CNN with ~1M parameters, trained on the MNIST dataset (70,000 images) using a CPU.
132
What is ImageNet and what is ILSVRC?
ImageNet is a dataset with 14M+ images across 20K+ categories. ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is an object classification competition with 1,000 classes and 1.2M training images.
133
What major breakthrough did AlexNet achieve?
AlexNet, with 60M parameters, achieved state-of-the-art performance on ImageNet, trained on GPU, marking the deep learning boom.
134
What inspired the design of neural networks?
Biological neurons inspired artificial neurons, where perceptrons mimic synaptic behavior using weights and activation functions.
135
What is a perceptron and its limitation?
A perceptron is a linear classifier defined by weights w and bias b. Its limitation is inability to model non-linear decision boundaries.
136
Why is a single perceptron insufficient for real-world data?
Because it only models linear functions, which cannot approximate complex non-linear relationships in real-world data.
137
How does a neural network perform multi-class classification?
Using multiple perceptrons with a weight matrix W (e.g., 10×784 for MNIST), computing Wx + b to get a 10-dimensional output vector.
138
What is the role of bias in neural networks and how is it handled?
Bias allows shifting of activation thresholds. It's often included as an extra weight on a fixed input value of 1 for convenience.
139
Why do we need non-linearity in neural networks?
Without non-linearity, a network becomes a composition of linear functions, which collapses to a single linear function.
140
What is the most common non-linear activation function used?
ReLU (Rectified Linear Unit), defined as f(x) = max(0, x), introduces non-linearity and mitigates vanishing gradients.
141
What is a Multi-Layer Perceptron (MLP)?
An MLP is a fully connected feed-forward neural network with input, hidden, and output layers using non-linear activation functions.
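A bare-bones numpy sketch of the forward pass through such a network (layer sizes and the random, untrained weights are purely illustrative), showing the alternation of affine maps and ReLU non-linearities:

```python
import numpy as np

rng = np.random.default_rng(6)

def relu(z):
    return np.maximum(0, z)

# Illustrative MLP: 784 -> 128 -> 10 (an MNIST-sized input), random untrained weights
W1, b1 = rng.normal(scale=0.01, size=(128, 784)), np.zeros(128)
W2, b2 = rng.normal(scale=0.01, size=(10, 128)), np.zeros(10)

x = rng.normal(size=784)            # one flattened input image
h = relu(W1 @ x + b1)               # hidden layer: affine map + non-linearity
logits = W2 @ h + b2                # output layer (pre-softmax scores)
print(logits.shape)                 # (10,)
```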
142
What is the function of hidden layers in an MLP?
Hidden layers learn intermediate representations. Each layer transforms its input via weighted sums and non-linear activations.
143
What are hyperparameters in an MLP?
They include number of hidden layers, number of hidden units, and architecture decisions like activation functions and learning rate.
144
How is the architecture of an MLP typically chosen?
Using cross-validation or hyperparameter search. More layers/units increase representational power but risk overfitting.
145
What are the main components of training data for a neural network?
Training set (to learn weights), validation set (to tune hyperparameters), and test set (to measure generalization).
146
What is a loss function in neural networks?
A loss function measures the difference between predicted outputs and true labels. Common examples include MSE and cross-entropy.
147
What is the Delta Rule used for?
It updates weights in simple perceptrons based on the error between predicted and actual output, using gradient descent.
148
What issues can arise during learning in neural networks?
Convergence issues, getting stuck in local minima, and sensitivity to learning rate choice can hinder training.
149
How does learning rate affect training in neural networks?
A high learning rate may overshoot minima; too low may lead to slow convergence or getting stuck.
150
What is stochastic gradient descent (SGD)?
An optimization algorithm that updates weights using gradients from random subsets (mini-batches) of training data to reduce computational cost and improve convergence.
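A minimal sketch of mini-batch SGD on least-squares linear regression (synthetic data; the learning rate and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of the mini-batch MSE
        w -= lr * grad                                   # SGD update
print(w)                                                 # should approach [1, -2, 0.5]
```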
151
What are examples of sequential data in real-world applications?
Examples include speech recognition (acoustic features over time), weather prediction (daily rainfall), DNA base sequences, and text processing (sequences of words).
152
What is the difference between stationary and non-stationary sequential data?
Stationary data has a fixed generative distribution over time, while non-stationary data has evolving distributions. Stationary models assume consistent dependence across time.
153
Why is treating sequential data as i.i.d. not ideal?
It fails to model dependencies between time steps and captures only marginal frequencies, ignoring sequence structure.
154
What is a first-order Markov chain?
A sequence where each state only depends on the immediately preceding state. This simplifies the joint distribution using the Markov assumption.
155
What is the Markov assumption in probabilistic modeling?
That the probability of a current state depends only on the previous state: p(o_i | o_1,...,o_{i-1}) = p(o_i | o_{i-1}).
156
What is a homogeneous Markov chain?
A Markov chain where transition probabilities are time-invariant, i.e., p(o_i | o_{i−1}) remains constant over time.
157
What is a Hidden Markov Model (HMM)?
A statistical model where observable events are generated by hidden states that follow a Markov process. Each observation depends only on the current hidden state.
158
What are the key components of a Hidden Markov Model (HMM)?
States (S), Observations (O), Transition probabilities (A), Emission probabilities (B), and Initial state probabilities (π). An HMM is denoted as λ = (A, B, π).
159
How is the transition probability matrix (A) defined in HMMs?
a_ij represents the probability of transitioning from state i to j. The sum of each row in A must equal 1.
160
What is the emission probability matrix (B) in HMMs?
b_ij denotes the probability of observing j given the model is in hidden state i.
161
What is the initial probability vector (π) in HMMs?
π_i gives the probability of starting in state i. The sum of all π_i values must be 1.
162
What are the three core problems in HMMs?
1) Likelihood: computing p(O | λ), 2) Decoding: finding the most likely hidden state sequence, and 3) Learning: estimating model parameters A and B from data.
163
What is the naive method for computing the likelihood p(O | λ)?
By summing over the joint probability of all possible hidden state sequences. However, this approach is computationally infeasible for large T.
164
How many possible state sequences exist in an HMM with N states and T time steps?
N^T sequences, which grows exponentially and makes brute-force likelihood calculation intractable.
165
What is the forward algorithm in HMMs?
A dynamic programming technique that efficiently computes the likelihood of an observation sequence using intermediate forward probabilities α.
166
What is the recursive formula used in the forward algorithm?
α_t(j) = ∑_{i=1}^N α_{t−1}(i) · a_{ij} · b_j(o_t), where α_t(j) is the probability of being in state j at time t after observing the first t observations.
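A small numpy sketch of this recursion on a made-up two-state HMM with three observation symbols:

```python
import numpy as np

# Illustrative HMM: 2 hidden states, 3 observation symbols
A = np.array([[0.7, 0.3],              # transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],         # emission probabilities b_j(o)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])              # initial state probabilities

obs = [0, 1, 2, 1]                     # an observation sequence

# Forward recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

print(alpha.sum())                     # p(O | lambda), the likelihood of the sequence
```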
167
What is the time complexity of the forward algorithm?
O(N²T), where N is the number of states and T is the number of observations. This is significantly more efficient than the brute-force method.
168
What is the decoding problem in HMMs?
It involves finding the most probable hidden state sequence that could have generated a given observation sequence.
169
What is the Viterbi algorithm?
A dynamic programming method that efficiently computes the most likely state sequence in an HMM given an observation sequence.
170
What is the recursive formula used in the Viterbi algorithm?
v_t(j) = max_{i=1}^N [v_{t−1}(i) · a_{ij} · b_j(o_t)]. It stores backpointers to reconstruct the optimal state path.
171
How does the Viterbi algorithm differ from the forward algorithm?
Forward computes total observation likelihood, summing over all paths. Viterbi computes the most probable path, using max operations.
172
What is the learning problem in HMMs?
Given observations, estimate the transition (A) and emission (B) matrices. This involves maximizing the likelihood of data under the model.
173
What algorithm is used to learn HMM parameters?
The Baum-Welch algorithm (a special case of Expectation-Maximization) iteratively updates A and B to increase the likelihood of the observed sequence.
174
What is the role of the forward-backward procedure in Baum-Welch?
It computes expected counts of transitions and emissions using both forward and backward probabilities, which are used to update model parameters.
175
What are key applications of HMMs?
Speech recognition, part-of-speech tagging, biological sequence analysis, handwriting recognition, and time series forecasting.
176
What distinguishes reinforcement learning (RL) from supervised and unsupervised learning?
RL involves learning from a reward signal without a supervisor; feedback is delayed, the data are sequential (time matters), and the agent's actions influence the data it sees next.
177
What are some real-world applications of reinforcement learning?
Examples include game playing (chess, Go), robotic control, financial portfolio management, advertisement selection, and autonomous navigation.
178
What is the credit assignment problem in RL?
It refers to the difficulty of determining which actions led to a reward, especially when feedback is delayed across multiple time steps.
179
What is the reward hypothesis in reinforcement learning?
It assumes all goals can be described by the maximization of expected cumulative reward.
180
What components define an RL setting?
Agent, environment, actions (a), states (s), rewards (r), policy (π), and value (v).
181
What is a policy (π) in reinforcement learning?
A policy is a strategy that maps states to actions: π(s) = a.
182
What is the value function in reinforcement learning?
It estimates the expected cumulative (discounted) reward from a given state under a policy.
183
What is the exploration vs exploitation trade-off?
Exploitation uses the best known action to gain rewards, while exploration tries new actions to gather information and potentially find better strategies.
184
What is a Markov Decision Process (MDP)?
A formalism to model RL problems with states S, actions A, reward function r(s, a), and state transition function δ(s, a).
185
What does the discount factor γ represent?
It determines how much future rewards are worth relative to immediate rewards. γ close to 1 values long-term rewards.
186
What is the Bellman equation for value functions?
V(s) = max_a [ r(s, a) + γ V(s') ], where s' = δ(s, a). It recursively defines the value of a state in terms of the values of its successor states.
187
What is the Q-function in Q-learning?
Q(s, a) = r(s, a) + γ max_a' Q(s', a') — it estimates the expected return for taking action a in state s and following the optimal policy thereafter.
188
What is temporal difference (TD) learning?
TD learning updates estimates of future rewards using observed rewards and estimates of subsequent values.
189
How is the Q-value updated in Q-learning?
Qₜ(s, a) = Qₜ₋₁(s, a) + α [ r(s, a) + γ max_a' Q(s', a') - Qₜ₋₁(s, a) ], where α is the learning rate.
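A toy sketch of this update with a tabular Q on an invented three-state chain environment (the dynamics and hyperparameters exist purely to exercise the rule):

```python
import numpy as np

rng = np.random.default_rng(8)
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    # Hypothetical dynamics: action 1 moves right, action 0 stays; reward 1 when in the last state
    s_next = min(s + a, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for t in range(20):
        a = rng.integers(n_actions) if rng.uniform() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q-learning (temporal-difference) update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # action 1 (move right) should get the higher Q-value in states 0 and 1
```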
190
What is the optimal policy in terms of the Q-function?
π*(s) = argmax_a Q(s, a). The action with the highest Q-value is chosen.
191
What is a multi-armed bandit problem?
A scenario where an agent must choose between multiple options (arms) to maximize rewards, balancing exploration and exploitation.
192
What does the term 'regret' mean in multi-armed bandits?
Regret is the difference between the reward received and the best possible reward. Total regret is cumulative opportunity loss over time.
193
What is the ε-greedy algorithm?
An algorithm that with probability 1−ε selects the best known action and with probability ε selects a random action to ensure exploration.
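A small simulation of this rule on a made-up three-armed Bernoulli bandit, with incremental running-mean reward estimates per arm:

```python
import numpy as np

rng = np.random.default_rng(9)
true_probs = np.array([0.3, 0.5, 0.7])   # hypothetical reward probabilities per arm
eps, n_steps = 0.1, 10_000

estimates = np.zeros(3)                  # running mean reward estimate per arm
counts = np.zeros(3)

for t in range(n_steps):
    if rng.uniform() < eps:
        arm = rng.integers(3)            # explore: random arm
    else:
        arm = int(estimates.argmax())    # exploit: best arm so far
    reward = rng.binomial(1, true_probs[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean update

print(estimates)                         # approaches the true probabilities for arms that get pulled
print(counts)                            # most pulls should go to the best arm (index 2)
```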
194
Why might the greedy algorithm perform poorly in bandit problems?
It can get stuck exploiting a suboptimal action indefinitely if initial estimates are misleading due to lack of exploration.
195
What are other types of multi-armed bandits beyond the basic model?
Stochastic (stationary), Bayesian, adversarial, and contextual bandits, each with different assumptions about reward distributions and environments.
196
What are some applications of multi-armed bandit algorithms?
Recommender systems, A/B testing, clinical trials, online advertising, robotics, and network communication systems.
197