Logistic Regression Flashcards
Which classification algorithms were mentioned in the syllabus?
Bayes classifier, Logistic Regression, K-Nearest Neighbors, and Support Vector Machines.
How many of these classification algorithms are probabilistic?
Two are probabilistic: the Bayes classifier and logistic regression.
How many of these classification algorithms are non-probabilistic?
Two are non-probabilistic: K-nearest neighbors and Support Vector Machines.
What is the key difference between the Bayes classifier and logistic regression in modeling?
The Bayes classifier models each class separately (via class-conditional densities) and applies Bayes' rule, whereas logistic regression directly models the probability P(t_new = k | x_new).
Why can’t we simply use a linear function like w^T x as a probability?
Because w^T x is unbounded and can produce values outside the range [0, 1].
What is the ‘squashing’ function used in logistic regression for binary classification?
It is the sigmoid function h(w^T x) = 1 / (1 + exp(-w^T x)).
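A minimal NumPy sketch of this function (the array-based form and the overflow-avoiding evaluation are implementation details added here, not part of the card):

```python
import numpy as np

def sigmoid(z):
    """h(z) = 1 / (1 + exp(-z)); output always lies in (0, 1)."""
    z = np.asarray(z, dtype=float)
    # Evaluate exp only on the side where it cannot overflow for large |z|.
    return np.where(z >= 0,
                    1.0 / (1.0 + np.exp(-np.maximum(z, 0.0))),
                    np.exp(np.minimum(z, 0.0)) / (1.0 + np.exp(np.minimum(z, 0.0))))

w, x = np.array([2.0, -1.0]), np.array([0.5, 0.3])
print(sigmoid(w @ x))   # P(t = 1 | x); 1 - sigmoid(w @ x) gives P(t = 0 | x)
```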
What is the probability output for the negative class (t=0) in logistic regression?
It is 1 - h(w^T x) = exp(-w^T x) / (1 + exp(-w^T x)).
Why do we use the likelihood p(t|X, w) in logistic regression?
To measure how well the parameters w predict the observed binary labels t given the training data X.
What is the form of the likelihood p(t|X, w) for logistic regression?
The product over all N observations: p(t | X, w) = ∏_n h(w^T x_n)^{t_n} (1 - h(w^T x_n))^{1 - t_n}; each factor reduces to h(w^T x_n) when t_n = 1 and to 1 - h(w^T x_n) when t_n = 0.
What is the Cross Entropy (negative log-likelihood) in logistic regression?
J(w) = -Σ[t_n log(h(w^T x_n)) + (1 - t_n) log(1 - h(w^T x_n))].
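A short NumPy sketch of this loss; the simple sigmoid and the eps clipping (a safeguard against log(0)) are assumptions added here:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # simple version of the earlier sketch

def cross_entropy(w, X, t, eps=1e-12):
    """J(w) = -sum_n [t_n log h_n + (1 - t_n) log(1 - h_n)], h_n = h(w^T x_n)."""
    h = np.clip(sigmoid(X @ w), eps, 1 - eps)  # clip so the logs stay finite
    return -np.sum(t * np.log(h) + (1 - t) * np.log(1 - h))
```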
How do we find the parameters w that minimize the Cross Entropy?
Setting the gradient ∇J(w) = 0 has no closed-form solution for logistic regression, so we use an iterative method such as Gradient Descent or Newton-Raphson.
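A gradient-descent sketch for this minimization; the learning rate and iteration count are illustrative assumptions, and the gradient ∇J(w) = Σ_n (h(w^T x_n) - t_n) x_n follows from differentiating the Cross Entropy above:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.1, n_iter=1000):
    """Minimize J(w) by gradient descent; X is N x D, t holds 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ w)
        w -= lr * (X.T @ (h - t))   # step along -∇J(w) = -Σ_n (h_n - t_n) x_n
    return w
```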
Why is the Cross Entropy in logistic regression convex in w?
Because J(w) is a sum of terms -log h(w^T x_n) and -log(1 - h(w^T x_n)), each of which is convex in w (log h and log(1 - h) are concave everywhere, not just in certain ranges), and a sum of convex functions is convex.
What is the main idea behind multiclass classification in logistic regression?
Use the softmax function to model P(t_n = k | x_n) across all K classes.
How are labels represented in multiclass logistic regression?
Using a one-hot encoding vector, where each class corresponds to a 1 in one dimension and 0 in others.
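A small sketch of this encoding, assuming the labels arrive as integer class indices 0..K-1:

```python
import numpy as np

def one_hot(labels, K):
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0   # a single 1 per row, zeros elsewhere
    return T

print(one_hot(np.array([0, 2, 1]), K=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```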
What is the softmax function for class k in multiclass logistic regression?
P(t_n = k | x_n) = exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n), with the sum running over ℓ = 1..K.
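A sketch of the softmax for one input, with the K weight vectors stacked as rows of a matrix W; the stacking convention and the max-shift stabilization are assumptions added here:

```python
import numpy as np

def softmax(W, x):
    """P(t = k | x) for k = 1..K, with W of shape (K, D)."""
    scores = W @ x                 # one score w^(k)T x per class
    scores -= scores.max()         # shifting all scores leaves the ratios unchanged
    e = np.exp(scores)
    return e / e.sum()
```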
What is the Cross Entropy loss for multiclass logistic regression?
J(w) = -Σ_n Σ_k t_n,k log( exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) ).
How do we compute the gradient of the multiclass Cross Entropy loss w.r.t w^(k)?
∂J/∂w^(k)_j = -Σ_n [ t_n,k - exp(w^(k)T x_n) / Σ_ℓ exp(w^(ℓ)T x_n) ] x_n,j, i.e. the sum over the data of (predicted probability minus true label) times feature j.
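A sketch computing the loss and this gradient together, with X of shape (N, D), one-hot labels T of shape (N, K), and W of shape (K, D); the shapes are assumptions for illustration:

```python
import numpy as np

def multiclass_ce_and_grad(W, X, T):
    scores = X @ W.T                              # (N, K) matrix of w^(k)T x_n
    scores -= scores.max(axis=1, keepdims=True)   # stabilize exp
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # softmax probabilities, row by row
    J = -np.sum(T * np.log(P + 1e-12))
    grad = (P - T).T @ X                          # row k holds -Σ_n (t_n,k - P_n,k) x_n
    return J, grad
```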
What is Bayesian logistic regression trying to achieve?
It places a prior on w, defines a likelihood, and seeks the posterior p(w|X,t) to make predictive distributions.
Why is there no closed-form solution for the posterior in Bayesian logistic regression?
Because the likelihood (sigmoid-based) is not conjugate to the Gaussian prior, making the integral intractable.
What is the MAP (Maximum A Posteriori) estimate in Bayesian logistic regression?
It is the w that maximizes p(w|X,t), which is equivalent to maximizing the product of the likelihood and the prior.
Why do we often use numerical optimization for MAP in logistic regression?
Because we cannot solve ∂J/∂w = 0 analytically for logistic regression with a prior, so we rely on iterative methods like Gradient Ascent or Newton-Raphson.
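A gradient-descent sketch for the MAP estimate under an assumed Gaussian prior w ~ N(0, σ²I); the prior variance, step size, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fit_map(X, t, sigma2=1.0, lr=0.1, n_iter=2000):
    """Minimize the negative log posterior: cross entropy + w^T w / (2 sigma2)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ w)
        w -= lr * (X.T @ (h - t) + w / sigma2)  # likelihood term plus prior term
    return w
```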
What is the geometric interpretation of the decision boundary in logistic regression?
It is the set of x where w^T x = 0, which corresponds to P(t=1|x)=0.5.
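A tiny numerical check of this claim; w and x here are made-up values chosen so that w^T x = 0:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -2.0])
x = np.array([2.0, 1.0])      # w^T x = 1*2 - 2*1 = 0, so x lies on the boundary
print(sigmoid(w @ x))         # prints 0.5
```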
What does the Laplace approximation do in Bayesian logistic regression?
It approximates the posterior p(w|X,t) with a Gaussian N(µ,Σ) centered at the mode of the posterior.
How do we choose µ and Σ in the Laplace approximation?
µ is the MAP estimate, and Σ^(-1) is the negative Hessian of the log posterior evaluated at the MAP.
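A sketch of these two choices for logistic regression with an assumed N(0, σ²I) prior: the Hessian of the negative log posterior then has the closed form X^T R X + I/σ², where R is diagonal with entries h_n(1 - h_n). Here w_map is assumed to come from a MAP routine like the sketch above:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def laplace_approx(X, w_map, sigma2=1.0):
    """Return (mu, Sigma) of the Gaussian approximation N(mu, Sigma)."""
    mu = w_map                                   # mode of the posterior (MAP estimate)
    h = sigmoid(X @ mu)
    R = h * (1 - h)                              # diagonal of the sigmoid-variance matrix
    H = (X * R[:, None]).T @ X + np.eye(X.shape[1]) / sigma2
    return mu, np.linalg.inv(H)                  # Sigma = H^{-1}
```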