ML and AI Flashcards

(197 cards)

1
Q

What is the difference between supervised, unsupervised, and reinforcement learning?

A

Supervised learning is task-driven (e.g., classification, regression), unsupervised is data-driven (e.g., clustering), and reinforcement learning involves learning from feedback (e.g., trial-and-error).

2
Q

Why is probabilistic machine learning important?

A

It helps quantify uncertainty, which is crucial for real-world decision-making (e.g., COVID-19’s R number). Probabilistic ML enables decisions under uncertainty.

3
Q

What are aleatoric and epistemic uncertainties?

A

Aleatoric uncertainty comes from inherent randomness (e.g., coin flips), while epistemic uncertainty arises from lack of knowledge (e.g., unknown coin side).

4
Q

What is probabilistic modelling?

A

It’s the use of statistical models to specify a probability distribution over data using parameters (e.g., θ). These models support prediction and uncertainty quantification.

5
Q

What distribution is known as the bell curve and what are its parameters?

A

The Gaussian (normal) distribution. Parameters: mean (μ) and standard deviation (σ).

6
Q

How is the likelihood function defined for IID data?

A

L(θ) = Πᵢ₌₁ⁿ p(yᵢ | θ). It shows how likely data y is for given θ. Different models yield different likelihoods.

7
Q

What is the goal of Maximum Likelihood Estimation (MLE)?

A

To find parameter values θ that maximise the likelihood function given the data.

8
Q

Why is log-likelihood used in MLE?

A

It simplifies mathematics, especially for product-based likelihoods, by turning them into sums.

9
Q

Give an example of using MLE with normally distributed data.

A

If p(yᵢ | θ) ~ N(μ, σ²) and σ is known, MLE is used to estimate μ by maximising likelihood based on data.

10
Q

How is MLE used for a Bernoulli distribution (e.g., coin flips)?

A

With heads as 1 and tails as 0, estimate the parameter θ (probability of heads) using the likelihood of observed outcomes.
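As a concrete sketch of this (the true coin bias and sample size below are made up for illustration), the Bernoulli MLE is just the sample proportion of heads, which can be checked against a grid search over the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=100)   # hypothetical coin flips, heads = 1

# Closed-form Bernoulli MLE: the sample mean (proportion of heads)
theta_closed = flips.mean()

# Numerical check: maximise the log-likelihood over a grid of theta values
thetas = np.linspace(0.001, 0.999, 999)
heads = flips.sum()
log_lik = heads * np.log(thetas) + (len(flips) - heads) * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_closed, theta_grid)          # both should be close to the true 0.7
```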

11
Q

Why is the choice of distribution critical in probabilistic modelling?

A

It influences parameter estimates and model behaviour. Different distributions represent different assumptions about the data.

12
Q

What is the Bayesian perspective in probabilistic inference?

A

It treats parameters as uncertain and models them with a prior distribution, combining it with the likelihood to get the posterior.

13
Q

Define Prior, Likelihood, and Posterior in Bayesian inference.

A

Prior (p(θ)): belief before data. Likelihood (p(y | θ)): probability of data given parameters. Posterior (p(θ | y)): updated belief after observing data.

14
Q

What is Bayes’ rule?

A

p(θ | y) = [p(y | θ) * p(θ)] / p(y). Posterior is proportional to likelihood times prior. p(y) is the normalisation constant.

15
Q

What is Maximum A Posteriori Estimation (MAP)?

A

MAP estimates the most probable parameter values by maximising the posterior: θ_MAP = argmax_θ [log p(y | θ) + log p(θ)].

16
Q

What is the difference between MLE and MAP?

A

MLE uses only the likelihood. MAP includes both the likelihood and the prior, favouring more plausible parameter values.

17
Q

Why can the product of likelihood and prior not be used directly as a probability?

A

It must be normalised to integrate to 1. Bayes’ rule achieves this by dividing by the marginal likelihood p(y).

18
Q

What distribution is typically used as a prior for Bernoulli likelihoods?

A

The Beta distribution, due to its conjugacy, simplifies posterior calculation.
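A minimal sketch of that conjugate update (the prior hyperparameters and data are illustrative): with a Beta(a, b) prior and k heads in n flips, the posterior is Beta(a + k, b + n − k).

```python
from scipy import stats

a, b = 2.0, 2.0          # illustrative Beta prior hyperparameters
heads, n = 13, 20        # hypothetical observed coin flips

# Conjugate update: Beta prior + Bernoulli likelihood -> Beta posterior
post = stats.beta(a + heads, b + n - heads)

print(post.mean())                     # posterior mean of theta
print(post.interval(0.95))             # 95% credible interval
```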

19
Q

What are conjugate priors and why are they useful?

A

A prior is conjugate to a likelihood if the posterior is in the same distribution family. It simplifies computations and avoids numerical integration.

20
Q

What are informative and non-informative priors?

A

Informative priors encode strong beliefs (low variance); non-informative priors are flat or weakly informative, allowing data to dominate inference.

21
Q

How do weakly informative priors differ from non-informative priors?

A

Weakly informative priors slightly constrain plausible values based on domain knowledge, while non-informative priors assume minimal knowledge.

22
Q

What are the advantages of using posterior distributions?

A

They allow prediction, quantification of uncertainty, and model checking using posterior predictive distributions.

23
Q

What approximation methods exist for Bayesian inference when analytical solutions are hard?

A

Common methods include Laplace approximation, variational inference, and Monte Carlo (e.g., MCMC).

24
Q

Summarise the Bayesian workflow in three key terms.

A

Prior (belief), Likelihood (data given belief), Posterior (updated belief after data). This trio defines Bayesian inference.

25
What is the general form of a simple linear regression model?
y = wx + ε, where ε ~ N(0, σ²). Here, y is the response variable, x is the explanatory variable, w is the coefficient, and ε is the noise.
26
How is multiple linear regression expressed?
y = w^T x + ε, where w and x are vectors, and ε ~ N(0, σ²). The model captures linear relationships in higher dimensions.
27
What is the likelihood function in multiple linear regression assuming normal noise?
p(y_i | x_i, w, σ²) = N(y_i | w^T x_i, σ²).
28
What are the components of Bayesian linear regression?
Prior: p(w) = N(w | w0, Σ0); Likelihood: p(y | X, w, σ²) = N(y | Xw, σ² I); Posterior: p(w | y, X, σ²) = N(w_n, Σ_n)
29
How is the posterior in Bayesian linear regression obtained?
By applying Bayes' rule: Posterior ∝ Likelihood × Prior. In log terms, it's log Posterior ∝ log Likelihood + log Prior.
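A small numpy sketch of this posterior in closed form (synthetic data, illustrative zero-mean prior), using the standard Gaussian update Σₙ = (Σ₀⁻¹ + (1/σ²)XᵀX)⁻¹ and wₙ = Σₙ(Σ₀⁻¹w₀ + (1/σ²)Xᵀy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with a bias column
w_true = np.array([1.0, 2.0])
sigma2 = 0.25
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Prior p(w) = N(w0, Sigma0); here an illustrative zero-mean isotropic prior
w0 = np.zeros(d)
Sigma0 = np.eye(d)

# Posterior covariance and mean of the weights
Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
w_n = Sigma_n @ (Sigma0_inv @ w0 + X.T @ y / sigma2)

print(w_n)        # posterior mean, should be close to [1, 2]
print(Sigma_n)    # posterior covariance
```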
30
Why might linear regression be inappropriate for certain data types?
If the output variable is count-based or categorical, the assumptions of linear regression (e.g., normality of errors) don't hold.
31
What distribution is more suitable for count data in regression?
The Poisson distribution, which has a single parameter λ (rate), where mean = variance = λ.
32
What is a link function in Generalised Linear Models (GLMs)?
A link function connects the mean of the distribution to a linear predictor. For Poisson, the canonical link is log(λ) = w x.
33
How is a Poisson GLM modelled mathematically?
λ_i = exp(w x_i), and thus log(λ_i) = w x_i. This ensures λ_i > 0.
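A minimal sketch of fitting such a model by maximising the Poisson log-likelihood numerically (synthetic data; scipy's general-purpose minimiser stands in for a proper GLM fitting routine):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=200)
w_true = 1.2
y = rng.poisson(np.exp(w_true * x))          # counts with log-link rate lambda_i = exp(w * x_i)

def neg_log_lik(w):
    lam = np.exp(w[0] * x)
    # Poisson log-likelihood (dropping the constant log(y!) term)
    return -(y * np.log(lam) - lam).sum()

w_hat = minimize(neg_log_lik, x0=[0.0]).x[0]
print(w_hat)                                  # should be close to 1.2
```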
34
What are the key takeaways about Generalised Linear Models?
GLMs extend linear models to handle non-normal data by applying a suitable link function to relate predictors to non-Gaussian responses.
35
What distinguishes generative and discriminative models in classification?
Generative models learn joint distribution p(x, y), while discriminative models learn conditional distribution p(y | x).
36
Give examples of discriminative and generative models.
Discriminative: Logistic regression, SVMs. Generative: Naive Bayes, Gaussian mixture models.
37
How does a discriminative model like logistic regression model class probabilities?
It models p(y_i | x_i, θ) directly and predicts class for new x′ using p(y′ | x′, θ).
38
In logistic regression, how is the likelihood function defined for binary data?
It's the likelihood of Bernoulli trials: each y_i is either 0 or 1, and θ_i = f(x_i, w) represents the probability.
39
Why can't linear functions be used directly for logistic regression probabilities?
Because probabilities must lie in [0, 1], and linear functions are unbounded.
40
What is the logistic function used in logistic regression?
η_i = w0 + w1 x_i; θ_i = 1 / (1 + exp(-η_i)). This maps η_i to [0, 1].
41
What is Bayesian logistic regression?
A model where the class probabilities follow a logistic function, and the weights w0, w1 have prior distributions. Posterior is approximated using MCMC or Laplace methods.
42
How does a Naive Bayes classifier work?
It estimates p(y | x) using Bayes' rule and assumes feature independence: p(x | y) = ∏ p(x_j | y).
43
What are common applications of Naive Bayes?
Text classification (spam detection), document categorisation, and sentiment analysis.
44
When is Naive Bayes not suitable?
When features are dependent (e.g., word pairs like "Boris Johnson" or correlated measurements like petal length/width).
45
Summarise the advantages and disadvantages of generative vs discriminative models.
Generative: better with missing data, small data, unlabeled data; Discriminative: better predictive performance when ample labeled data exists.
46
What is the main goal of Laplace approximation?
To find a Gaussian approximation to a continuous probability distribution that may be difficult to handle analytically, particularly when the posterior does not have a closed-form expression.
47
Why might Laplace approximation be needed in Bayesian inference?
Because the posterior distribution, especially in models like Bayesian logistic regression, often lacks a closed-form and must be approximated for inference.
48
What are the components of Bayesian linear regression?
The prior is p(w) = N(w | w₀, Σ₀), the likelihood is p(y | X, w, σ²) = N(y | Xw, σ²I), and the posterior is p(w | y, X, σ²) = N(wₙ, Σₙ).
49
How is the posterior computed in Bayesian linear regression?
Using Bayes’ rule: Σₙ = (Σ₀⁻¹ + (1/σ²) XᵗX)⁻¹ and wₙ = Σₙ (Σ₀⁻¹ w₀ + (1/σ²) Xᵗy); with a zero prior mean this reduces to wₙ = Σₙ (1/σ²) Xᵗy.
50
What is the key challenge in Bayesian logistic regression?
The posterior cannot be expressed in closed form due to the logistic likelihood, so approximation techniques like Laplace, MCMC, or variational inference are needed.
51
What is the general form of the Laplace approximation?
Given p(θ), the approximation is q(θ) ≈ N(θ | θ₀, A⁻¹), where θ₀ is the mode of p(θ) and A is the negative second derivative (curvature) of log p(θ) at the mode.
52
What does the normalisation constant Z represent in Laplace approximation?
Z = ∫ p̃(θ) dθ, the normaliser of an unnormalised density p̃(θ). It's generally intractable, and Laplace uses a Gaussian approximation to estimate it.
53
How is a 1D Laplace approximation derived?
By applying a second-order Taylor expansion to log p(θ) around its mode θ₀, and approximating p(θ) ≈ p(θ₀) exp(-A/2 (θ - θ₀)²).
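A numerical sketch of this recipe on an illustrative Gamma-shaped density: find the mode, measure the curvature of log p there by finite differences, and read off the Gaussian approximation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative unnormalised target: p_tilde(theta) ∝ theta^4 * exp(-2*theta)  (a Gamma(5, 2) shape)
def log_p(theta):
    return 4 * np.log(theta) - 2 * theta

# Step 1: find the mode theta_0 of log p
res = minimize_scalar(lambda t: -log_p(t), bounds=(1e-6, 20), method="bounded")
theta0 = res.x

# Step 2: curvature A = -d^2/dtheta^2 log p at the mode (finite differences)
h = 1e-4
A = -(log_p(theta0 + h) - 2 * log_p(theta0) + log_p(theta0 - h)) / h**2

# Laplace approximation: q(theta) = N(theta | theta0, 1/A)
print(theta0, 1 / A)   # mode should be 2.0 and variance 1/A should be 1.0 for this target
```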
54
What is the role of the second derivative in 1D Laplace approximation?
The negative second derivative of log p(θ) at θ₀ defines the precision (inverse variance) of the Gaussian approximation.
55
How is Laplace approximation extended to multiple dimensions?
By estimating the mode θ₀ and the matrix A of negative second derivatives (the negative Hessian) of the log posterior at the mode.
56
What are the limitations of Laplace approximation?
It struggles with multimodal distributions, small datasets, non-Gaussian shapes, and can only handle real-valued variables unless modified (e.g., via log-transform).
57
What is model evidence in Bayesian model comparison?
p(D | Mᵢ) = ∫ p(D | θᵢ, Mᵢ) p(θᵢ | Mᵢ) dθᵢ. It quantifies how well a model explains the data, integrating over all parameter values.
58
How is model evidence approximated using Laplace approximation?
log p(D) ≈ log p(D | θ_MAP) + log p(θ_MAP) + (M/2) log(2π) - (1/2) log |A|, where A is the Hessian and M is the number of parameters.
59
What is the Occam factor in Bayesian model selection?
It penalises complex models by adjusting the model evidence based on the curvature (log determinant of the Hessian) and number of parameters.
60
What is the Bayesian Information Criterion (BIC)?
BIC ≈ log p(D | θ_MAP) - (M log N)/2, where N is the number of data points and M is the number of parameters. It penalises model complexity.
61
When is BIC an appropriate approximation for model comparison?
When the prior is relatively flat and the data set is large, BIC provides a simple penalty-based criterion for comparing models.
62
What does the summary of Week 3 highlight about Laplace approximation?
That it provides a Gaussian approximation of non-Gaussian posteriors, useful for large data sets but limited for multimodal distributions or small sample sizes.
63
What is information in the context of a discrete random variable?
Information measures the degree of surprise when observing a value: more probable events carry less information, and improbable events carry more.
64
How is the information measure h(x) for a discrete event defined?
h(x) = -log2(p(x)). This ensures a positive information measure and supports additivity for independent events.
65
What is the definition of entropy for a discrete distribution?
Entropy is the expected information: H[x] = -∑ p(x) log2(p(x)).
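A tiny sketch computing this in bits for a uniform versus a skewed distribution (both distributions are made up):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -(p * np.log2(p)).sum()

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: maximum for 4 outcomes
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))       # lower entropy for a skewed distribution
```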
66
How does entropy differ between uniform and non-uniform distributions?
A uniform distribution has higher entropy because each outcome is equally likely, leading to maximum uncertainty.
67
How does entropy relate to code length?
Entropy is the theoretical lower bound for the average number of bits needed to encode messages from a source. Shorter codes can be used for more probable symbols.
68
What does Shannon’s noiseless coding theorem state?
Entropy is the minimum number of bits on average required to represent a random variable without loss.
69
Where did the concept of entropy originate?
It originated in thermodynamics (Boltzmann), representing disorder, and was adapted by Shannon to quantify information content.
70
How is maximum entropy achieved in a discrete distribution?
When all outcomes are equally probable, i.e., p(x_i) = 1/M, then entropy is maximized at H = ln(M).
71
What is differential entropy?
It is the entropy of a continuous variable: H = -∫ p(x) ln(p(x)) dx. It differs from discrete entropy and can be negative.
72
For which distribution is differential entropy maximized under mean and variance constraints?
The Gaussian distribution.
73
What is a functional in the context of entropy?
A functional maps a function to a real number, e.g., entropy as a function of the probability distribution: H[p(x)].
74
How is the maximum differential entropy of a distribution found?
By using calculus of variations to maximize the entropy functional under constraints such as normalization, mean, and variance.
75
What is the result of maximizing differential entropy with known mean and variance?
The result is a Gaussian distribution. This is proven via the calculus of variations with constraints.
76
What is the value of differential entropy for a Gaussian?
H = (1/2) * log(2 * π * e * σ²). Entropy increases with variance.
77
What is conditional entropy?
It measures the remaining uncertainty of a variable y given that x is known: H[y|x] = -∑ p(x, y) log(p(y|x)).
78
How does joint entropy relate to conditional and marginal entropies?
H[x, y] = H[x] + H[y|x], which expresses the total information needed to describe both variables.
79
What is Kullback-Leibler (KL) divergence?
It is a measure of dissimilarity between two distributions: KL(p||q) = ∑ p(x) log(p(x)/q(x)).
80
Why is KL divergence important in Bayesian inference?
It quantifies the information loss when using an approximate distribution q(x) instead of the true distribution p(x).
81
What are key properties of KL divergence?
It is non-negative and zero only when p(x) = q(x). It is not symmetric: KL(p||q) ≠ KL(q||p).
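A short sketch on two made-up discrete distributions, illustrating non-negativity and asymmetry:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

p = [0.5, 0.4, 0.1]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))              # both >= 0 and generally unequal
print(kl(p, p))                        # 0 when the distributions match
```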
82
What mathematical principle is used to prove KL divergence is non-negative?
Jensen’s inequality, derived from the convexity of the function -ln(x).
83
What is mutual information?
It quantifies the reduction in uncertainty about one variable given knowledge of another: I(x; y) = H(x) - H(x|y).
84
How is mutual information related to KL divergence?
It is the KL divergence between the joint distribution p(x, y) and the product of marginals p(x)p(y).
85
What is the Bayesian interpretation of mutual information?
It represents the reduction in uncertainty (entropy) of the prior p(x) after observing y and updating to the posterior p(x|y).
86
Can differential entropy be negative? Why?
Yes, especially when variance σ² < 1/(2πe). This contrasts with discrete entropy, which is always non-negative.
87
Why is approximation necessary in Bayesian inference?
Posterior distributions are analytically tractable only for simple conjugate models. For complex models, integrals involved in inference are often intractable, necessitating approximations.
89
What are the two main categories of approximation techniques in Bayesian inference?
Deterministic approximations (e.g., Laplace, Variational Inference) and stochastic approximations (e.g., Markov Chain Monte Carlo).
90
What is the Laplace approximation in Bayesian inference?
It approximates the posterior distribution with a Gaussian centered at the mode θ₀ of the posterior: q(θ) ≈ N(θ | θ₀, A⁻¹), where A = −∇∇ log p(θ) evaluated at θ₀, i.e., the negative curvature (Hessian) of the log density at the mode.
91
What are the limitations of the Laplace approximation?
It's not suitable for multimodal distributions or small datasets. It assumes Gaussian shape and real-valued variables.
92
What is the core idea behind variational inference?
Approximate the true posterior p(θ|x) with a tractable distribution q(θ), and optimize the parameters of q(θ) to make it close to p(θ|x).
93
Why can't we directly minimize KL(q||p)?
Because p(θ|x) is intractable. Instead, we optimize the Evidence Lower Bound (ELBO), which indirectly minimizes KL divergence.
94
What is Jensen's inequality and how is it relevant to ELBO?
Jensen's inequality: f(E[x]) ≤ E[f(x)] for convex f (equivalently, f(E[x]) ≥ E[f(x)] for concave f). The ELBO is derived by applying it to the concave log function.
95
What is the ELBO in variational inference?
ELBO = E_q[log p(x, θ)] - E_q[log q(θ)]. Maximizing ELBO is equivalent to minimizing KL(q||p).
96
How is KL(q||p) related to the evidence and ELBO?
KL(q||p) = log p(x) - ELBO. Since log p(x) is fixed, maximizing ELBO minimizes KL divergence.
97
What is the difference between forward and reverse KL divergence?
Forward KL (KL(p||q)) is zero avoiding: q must place mass wherever p is non-zero. Reverse KL (KL(q||p)) is zero forcing: q avoids regions where p is small.
98
What is meant by 'mode seeking' and 'moment matching' in the context of KL divergence?
Reverse KL leads to 'mode seeking' (focuses on high-density regions). Forward KL leads to 'moment matching' (covers all regions where p is nonzero).
99
What is the goal of variational inference in practice?
To maximize ELBO to obtain a good approximation q(θ) to the true posterior p(θ|x).
100
What are the two main strategies in variational inference?
Mean field approximation (factorized distributions) and parametric approximation (using parameterized distributions q(θ|λ)).
101
What is the mean field approximation?
It assumes independence among latent variables: q(θ) = ∏ q_j(θ_j), simplifying optimization.
102
How is optimization done in mean field variational inference?
Using block coordinate ascent. At each step, optimize q_j(θ_j) keeping others fixed.
103
What is the update rule for mean field variational inference?
q_j(θ_j) ∝ exp(E_{-j}[log p(x, θ)]), where E_{-j} is expectation over all variables except θ_j.
104
What are the convergence criteria for mean field variational inference?
The algorithm repeats until the ELBO converges, assuming updates can be computed analytically.
105
What is the condition for mean field updates to be analytically tractable?
Conditional conjugacy of the prior and likelihood with respect to each variable θ_j.
106
What is a parametric approximation in variational inference?
q(θ) = q(θ | λ), where λ are parameters optimized to match q to the true posterior. It enables nonlinear optimization techniques.
107
What is a potential downside of parametric approximation?
If q is too simple, it may fail to approximate the true posterior well. If too complex, it may be difficult to optimize effectively.
108
What does variational inference transform Bayesian inference into?
An optimization problem where the objective is to find parameters (or functions) that best approximate the posterior distribution.
109
What is the goal of variational inference and its limitation?
It seeks an approximate distribution q(θ) that maximizes the ELBO or minimizes KL divergence to approximate the posterior p(θ). Its limitation is that q(θ) might be a poor approximation.
110
What is Monte Carlo integration in Bayesian inference?
It approximates integrals using sample averages drawn from a distribution. It's key for evaluating expectations when analytical solutions are intractable.
111
What is importance sampling used for?
To approximate expectations when direct sampling from p(x) is hard. Instead, we sample from a proposal distribution q(x) and reweight the samples using p(x)/q(x).
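A minimal sketch (target, proposal, and sample size are illustrative): estimating E_p[x²] for a standard normal target by drawing from a wider normal proposal and reweighting by p(x)/q(x).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100_000

# Target p = N(0, 1); proposal q = N(0, 2^2), non-zero wherever p is
samples = rng.normal(0, 2, size=n)
weights = stats.norm(0, 1).pdf(samples) / stats.norm(0, 2).pdf(samples)

# Importance-sampling estimate of E_p[x^2] (true value is 1)
estimate = np.mean(weights * samples**2)
print(estimate)
```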
112
What is a key assumption of importance sampling?
The proposal distribution q(x) must be non-zero wherever the target distribution p(x) is non-zero.
113
What affects the success of importance sampling?
The closer q(x) is to p(x), the more efficient the sampling. Mismatch causes high variance and potentially large weights for few samples.
114
What is rejection sampling?
A method for sampling from a target distribution p(x) using a proposal distribution q(x) and a constant k such that kq(x) ≥ p̃(x), the unnormalised target. Samples are accepted probabilistically.
115
What is the acceptance condition in rejection sampling?
A sample x₀ drawn from q is accepted if u₀ ≤ p̃(x₀), where u₀ is drawn from Uniform(0, kq(x₀)); equivalently, it is accepted with probability p̃(x₀)/(kq(x₀)).
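A sketch of this acceptance rule, using an unnormalised Beta(2, 5)-shaped target and a uniform proposal on [0, 1]; the constant k is chosen so that k·q(x) bounds p̃(x).

```python
import numpy as np

rng = np.random.default_rng(4)

def p_tilde(x):
    return x * (1 - x) ** 4            # unnormalised Beta(2, 5) shape on [0, 1]

k = 0.1                                 # satisfies k * q(x) >= p_tilde(x) with q = Uniform(0, 1)
accepted = []
for _ in range(20_000):
    x0 = rng.uniform(0, 1)             # draw from the proposal q
    u0 = rng.uniform(0, k * 1.0)       # u0 ~ Uniform(0, k q(x0))
    if u0 <= p_tilde(x0):              # accept if the point falls under the target curve
        accepted.append(x0)

print(len(accepted) / 20_000)           # acceptance rate
print(np.mean(accepted))                # close to the Beta(2, 5) mean, 2/7 ≈ 0.286
```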
116
Compare importance sampling and rejection sampling.
Importance sampling uses all samples with weights, while rejection sampling gives exact samples but may reject many. Both suffer in high dimensions.
117
What is Markov Chain Monte Carlo (MCMC)?
A framework for sampling from high-dimensional distributions using a Markov chain whose stationary distribution is the target distribution p(x).
118
What is a first-order Markov chain?
A process where the next state depends only on the current state, not on the past history.
119
What is an ergodic Markov chain?
A chain where the distribution converges to a unique invariant distribution, regardless of the initial state.
120
What does it mean for a distribution to be invariant under a Markov chain?
The distribution remains unchanged under the transition dynamics of the chain: π(x) = ∑_{x'} π(x') T(x | x').
121
What is the property of detailed balance?
A condition ensuring that π(x) T(x' | x) = π(x') T(x | x') holds, helping guarantee the invariant distribution is maintained.
122
What is the Metropolis algorithm?
A method where a new state x' is proposed from a symmetric distribution and accepted with probability r = min(1, p(x')/p(x)). Otherwise, the chain remains in x.
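A compact sketch of this loop with a symmetric Gaussian random-walk proposal and a made-up unnormalised target (a two-component Gaussian mixture); the step size and chain length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

def p_tilde(x):
    # Unnormalised target: a two-component Gaussian mixture
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

x = 0.0
chain = []
for _ in range(50_000):
    x_prop = x + rng.normal(scale=1.0)            # symmetric random-walk proposal
    r = min(1.0, p_tilde(x_prop) / p_tilde(x))    # Metropolis acceptance probability
    if rng.uniform() < r:
        x = x_prop                                # accept the move
    chain.append(x)                               # rejected proposals repeat the current state

chain = np.array(chain[5_000:])                   # discard burn-in
print(chain.mean(), chain.std())
```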
123
What happens to rejected proposals in the Metropolis algorithm?
Unlike rejection sampling, rejected proposals don't get discarded; instead, the current state is repeated in the chain.
124
When is the Metropolis-Hastings algorithm used?
When the proposal distribution q(x' | x) is asymmetric. Acceptance probability becomes r = min(1, [p(x') q(x | x')] / [p(x) q(x' | x)]).
125
Why is MCMC better suited for high-dimensional inference?
It decomposes high-dimensional sampling into simpler, often univariate steps, which is computationally more tractable.
126
What is Gibbs sampling?
A special case of MCMC where each variable is sampled in turn from its full conditional distribution, assuming the rest are fixed.
127
Why does Gibbs sampling always accept proposed states?
Because the proposal distribution is the full conditional, ensuring that each sample follows the target distribution. Hence, acceptance probability is 1.
128
What is the benefit of Gibbs sampling in multivariate distributions?
It simplifies sampling by reducing the task to a series of univariate conditional updates, making implementation easier for certain models.
129
What are the main takeaways from Week 6?
Bayesian inference often relies on sampling due to intractable integrals. Importance and rejection sampling are basic tools; MCMC (including Metropolis-Hastings and Gibbs sampling) scales better for complex, high-dimensional problems.
130
How does conventional machine learning differ from deep learning?
Conventional ML uses hand-engineered features, while deep learning learns representations directly from raw data in a data-driven manner.
131
What is LeNet and what was it trained on?
LeNet is an early CNN with ~1M parameters, trained on the MNIST dataset (70,000 images) using a CPU.
132
What is ImageNet and what is ILSVRC?
ImageNet is a dataset with 14M+ images across 20K+ categories. ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is an object classification competition with 1,000 classes and 1.2M training images.
133
What major breakthrough did AlexNet achieve?
AlexNet, with 60M parameters, achieved state-of-the-art performance on ImageNet, trained on GPU, marking the deep learning boom.
134
What inspired the design of neural networks?
Biological neurons inspired artificial neurons, where perceptrons mimic synaptic behavior using weights and activation functions.
135
What is a perceptron and its limitation?
A perceptron is a linear classifier defined by weights w and bias b. Its limitation is inability to model non-linear decision boundaries.
136
Why is a single perceptron insufficient for real-world data?
Because it only models linear functions, which cannot approximate complex non-linear relationships in real-world data.
137
How does a neural network perform multi-class classification?
Using multiple perceptrons with a weight matrix W (e.g., 10×784 for MNIST), computing Wx + b to get a 10-dimensional output vector.
138
What is the role of bias in neural networks and how is it handled?
Bias allows shifting of activation thresholds. It's often included as an extra weight on a fixed input value of 1 for convenience.
139
Why do we need non-linearity in neural networks?
Without non-linearity, a network becomes a composition of linear functions, which collapses to a single linear function.
140
What is the most common non-linear activation function used?
ReLU (Rectified Linear Unit), defined as f(x) = max(0, x), introduces non-linearity and mitigates vanishing gradients.
141
What is a Multi-Layer Perceptron (MLP)?
An MLP is a fully connected feed-forward neural network with input, hidden, and output layers using non-linear activation functions.
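A bare-bones numpy sketch of the forward pass through such a network (layer sizes and the random, untrained weights are purely illustrative), showing the alternation of affine maps and ReLU non-linearities:

```python
import numpy as np

rng = np.random.default_rng(6)

def relu(z):
    return np.maximum(0, z)

# Illustrative MLP: 784 -> 128 -> 10 (an MNIST-sized input), random untrained weights
W1, b1 = rng.normal(scale=0.01, size=(128, 784)), np.zeros(128)
W2, b2 = rng.normal(scale=0.01, size=(10, 128)), np.zeros(10)

x = rng.normal(size=784)            # one flattened input image
h = relu(W1 @ x + b1)               # hidden layer: affine map + non-linearity
logits = W2 @ h + b2                # output layer (pre-softmax scores)
print(logits.shape)                 # (10,)
```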
142
What is the function of hidden layers in an MLP?
Hidden layers learn intermediate representations. Each layer transforms its input via weighted sums and non-linear activations.
143
What are hyperparameters in an MLP?
They include number of hidden layers, number of hidden units, and architecture decisions like activation functions and learning rate.
144
How is the architecture of an MLP typically chosen?
Using cross-validation or hyperparameter search. More layers/units increase representational power but risk overfitting.
145
What are the main components of training data for a neural network?
Training set (to learn weights), validation set (to tune hyperparameters), and test set (to measure generalization).
146
What is a loss function in neural networks?
A loss function measures the difference between predicted outputs and true labels. Common examples include MSE and cross-entropy.
147
What is the Delta Rule used for?
It updates weights in simple perceptrons based on the error between predicted and actual output, using gradient descent.
148
What issues can arise during learning in neural networks?
Convergence issues, getting stuck in local minima, and sensitivity to learning rate choice can hinder training.
149
How does learning rate affect training in neural networks?
A high learning rate may overshoot minima; too low may lead to slow convergence or getting stuck.
150
What is stochastic gradient descent (SGD)?
An optimization algorithm that updates weights using gradients from random subsets (mini-batches) of training data to reduce computational cost and improve convergence.
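A minimal sketch of mini-batch SGD on least-squares linear regression (synthetic data; the learning rate and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of the mini-batch MSE
        w -= lr * grad                                   # SGD update
print(w)                                                 # should approach [1, -2, 0.5]
```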
151
What are examples of sequential data in real-world applications?
Examples include speech recognition (acoustic features over time), weather prediction (daily rainfall), DNA base sequences, and text processing (sequences of words).
152
What is the difference between stationary and non-stationary sequential data?
Stationary data has a fixed generative distribution over time, while non-stationary data has evolving distributions. Stationary models assume consistent dependence across time.
153
Why is treating sequential data as i.i.d. not ideal?
It fails to model dependencies between time steps and captures only marginal frequencies, ignoring sequence structure.
154
What is a first-order Markov chain?
A sequence where each state only depends on the immediately preceding state. This simplifies the joint distribution using the Markov assumption.
155
What is the Markov assumption in probabilistic modeling?
That the probability of a current state depends only on the previous state: p(o_i | o_1,...,o_{i-1}) = p(o_i | o_{i-1}).
156
What is a homogeneous Markov chain?
A Markov chain where transition probabilities are time-invariant, i.e., p(o_i | o_{i−1}) remains constant over time.
157
What is a Hidden Markov Model (HMM)?
A statistical model where observable events are generated by hidden states that follow a Markov process. Each observation depends only on the current hidden state.
158
What are the key components of a Hidden Markov Model (HMM)?
States (S), Observations (O), Transition probabilities (A), Emission probabilities (B), and Initial state probabilities (π). An HMM is denoted as λ = (A, B, π).
159
How is the transition probability matrix (A) defined in HMMs?
a_ij represents the probability of transitioning from state i to j. The sum of each row in A must equal 1.
160
What is the emission probability matrix (B) in HMMs?
b_ij denotes the probability of observing j given the model is in hidden state i.
161
What is the initial probability vector (π) in HMMs?
π_i gives the probability of starting in state i. The sum of all π_i values must be 1.
162
What are the three core problems in HMMs?
1) Likelihood: computing p(O | λ), 2) Decoding: finding the most likely hidden state sequence, and 3) Learning: estimating model parameters A and B from data.
163
What is the naive method for computing the likelihood p(O | λ)?
By summing over the joint probability of all possible hidden state sequences. However, this approach is computationally infeasible for large T.
164
How many possible state sequences exist in an HMM with N states and T time steps?
N^T sequences, which grows exponentially and makes brute-force likelihood calculation intractable.
165
What is the forward algorithm in HMMs?
A dynamic programming technique that efficiently computes the likelihood of an observation sequence using intermediate forward probabilities α.
166
What is the recursive formula used in the forward algorithm?
α_t(j) = ∑_{i=1}^N α_{t−1}(i) · a_{ij} · b_j(o_t), where α_t(j) is the probability of being in state j at time t after observing the first t observations.
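A small numpy sketch of this recursion on a made-up two-state HMM with three observation symbols:

```python
import numpy as np

# Illustrative HMM: 2 hidden states, 3 observation symbols
A = np.array([[0.7, 0.3],              # transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],         # emission probabilities b_j(o)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])              # initial state probabilities

obs = [0, 1, 2, 1]                     # an observation sequence

# Forward recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

print(alpha.sum())                     # p(O | lambda), the likelihood of the sequence
```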
167
What is the time complexity of the forward algorithm?
O(N²T), where N is the number of states and T is the number of observations. This is significantly more efficient than the brute-force method.
168
What is the decoding problem in HMMs?
It involves finding the most probable hidden state sequence that could have generated a given observation sequence.
169
What is the Viterbi algorithm?
A dynamic programming method that efficiently computes the most likely state sequence in an HMM given an observation sequence.
170
What is the recursive formula used in the Viterbi algorithm?
v_t(j) = max_{i=1}^N [v_{t−1}(i) · a_{ij} · b_j(o_t)]. It stores backpointers to reconstruct the optimal state path.
171
How does the Viterbi algorithm differ from the forward algorithm?
Forward computes total observation likelihood, summing over all paths. Viterbi computes the most probable path, using max operations.
172
What is the learning problem in HMMs?
Given observations, estimate the transition (A) and emission (B) matrices. This involves maximizing the likelihood of data under the model.
173
What algorithm is used to learn HMM parameters?
The Baum-Welch algorithm (a special case of Expectation-Maximization) iteratively updates A and B to increase the likelihood of the observed sequence.
174
What is the role of the forward-backward procedure in Baum-Welch?
It computes expected counts of transitions and emissions using both forward and backward probabilities, which are used to update model parameters.
175
What are key applications of HMMs?
Speech recognition, part-of-speech tagging, biological sequence analysis, handwriting recognition, and time series forecasting.
176
What distinguishes reinforcement learning (RL) from supervised and unsupervised learning?
RL involves learning from a reward signal without a supervisor; feedback is delayed, the data are sequential (time matters), and the agent's actions influence the data it sees next.
177
What are some real-world applications of reinforcement learning?
Examples include game playing (chess, Go), robotic control, financial portfolio management, advertisement selection, and autonomous navigation.
178
What is the credit assignment problem in RL?
It refers to the difficulty of determining which actions led to a reward, especially when feedback is delayed across multiple time steps.
179
What is the reward hypothesis in reinforcement learning?
It assumes all goals can be described by the maximization of expected cumulative reward.
180
What components define an RL setting?
Agent, environment, actions (a), states (s), rewards (r), policy (π), and value (v).
181
What is a policy (π) in reinforcement learning?
A policy is a strategy that maps states to actions: π(s) = a.
182
What is the value function in reinforcement learning?
It estimates the expected cumulative (discounted) reward from a given state under a policy.
183
What is the exploration vs exploitation trade-off?
Exploitation uses the best known action to gain rewards, while exploration tries new actions to gather information and potentially find better strategies.
184
What is a Markov Decision Process (MDP)?
A formalism to model RL problems with states S, actions A, reward function r(s, a), and state transition function δ(s, a).
185
What does the discount factor γ represent?
It determines how much future rewards are worth relative to immediate rewards. γ close to 1 values long-term rewards.
186
What is the Bellman equation for value functions?
V(s) = max_a [ r(s, a) + γ V(s') ], where s' = δ(s, a). It recursively defines the value of a state in terms of the values of its successor states.
187
What is the Q-function in Q-learning?
Q(s, a) = r(s, a) + γ max_a' Q(s', a') — it estimates the expected return for taking action a in state s and following the optimal policy thereafter.
188
What is temporal difference (TD) learning?
TD learning updates estimates of future rewards using observed rewards and estimates of subsequent values.
189
How is the Q-value updated in Q-learning?
Qₜ(s, a) = Qₜ₋₁(s, a) + α [ r(s, a) + γ max_a' Q(s', a') - Qₜ₋₁(s, a) ], where α is the learning rate.
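A toy sketch of this update with a tabular Q on an invented three-state chain environment (the dynamics and hyperparameters exist purely to exercise the rule):

```python
import numpy as np

rng = np.random.default_rng(8)
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    # Hypothetical dynamics: action 1 moves right, action 0 stays; reward 1 when in the last state
    s_next = min(s + a, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for t in range(20):
        a = rng.integers(n_actions) if rng.uniform() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q-learning (temporal-difference) update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # action 1 (move right) should get the higher Q-value in states 0 and 1
```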
190
What is the optimal policy in terms of the Q-function?
π*(s) = argmax_a Q(s, a). The action with the highest Q-value is chosen.
191
What is a multi-armed bandit problem?
A scenario where an agent must choose between multiple options (arms) to maximize rewards, balancing exploration and exploitation.
192
What does the term 'regret' mean in multi-armed bandits?
Regret is the difference between the reward received and the best possible reward. Total regret is cumulative opportunity loss over time.
193
What is the ε-greedy algorithm?
An algorithm that with probability 1−ε selects the best known action and with probability ε selects a random action to ensure exploration.
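A small simulation of this rule on a made-up three-armed Bernoulli bandit, with incremental running-mean reward estimates per arm:

```python
import numpy as np

rng = np.random.default_rng(9)
true_probs = np.array([0.3, 0.5, 0.7])   # hypothetical reward probabilities per arm
eps, n_steps = 0.1, 10_000

estimates = np.zeros(3)                  # running mean reward estimate per arm
counts = np.zeros(3)

for t in range(n_steps):
    if rng.uniform() < eps:
        arm = rng.integers(3)            # explore: random arm
    else:
        arm = int(estimates.argmax())    # exploit: best arm so far
    reward = rng.binomial(1, true_probs[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean update

print(estimates)                         # approaches the true probabilities for arms that get pulled
print(counts)                            # most pulls should go to the best arm (index 2)
```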
194
Why might the greedy algorithm perform poorly in bandit problems?
It can get stuck exploiting a suboptimal action indefinitely if initial estimates are misleading due to lack of exploration.
195
What are other types of multi-armed bandits beyond the basic model?
Stochastic (stationary), Bayesian, adversarial, and contextual bandits, each with different assumptions about reward distributions and environments.
196
What are some applications of multi-armed bandit algorithms?
Recommender systems, A/B testing, clinical trials, online advertising, robotics, and network communication systems.
197