Week 2: Decision & Information Theory Basics Flashcards

(15 cards)

1
Q

What is meant by a "Probabilistic Perspective"?

A

Core Idea:
Modeling uncertainty explicitly using probability distributions rather than making deterministic predictions.
Key Aspects:

Uncertainty quantification: Express confidence levels in predictions (e.g., “90% confident this is spam”)
Parameters as distributions: Model parameters have probability distributions, not fixed values
Bayesian inference: Update beliefs using Bayes’ theorem: P(θ|data) ∝ P(data|θ) × P(θ)
Everything probabilistic: Inputs, outputs, and parameters all modeled with distributions

2
Q

Explain and write out Bayes' Theorem.

A

Derives the probability of an event based on prior knowledge of conditions that might be related to it.

p(A|B) = [p(A) × p(B|A)] / p(B)

Where:
p(A|B) = posterior probability (probability of A given B)
p(A) = prior probability of A
p(B|A) = likelihood (probability of B given A)
p(B) = marginal probability of B (evidence)
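
A minimal numeric sketch in Python (the disease/test numbers below are made up for illustration):

```python
# Hypothetical numbers: a disease with 1% prevalence and a test that is
# 95% sensitive, with a 5% false-positive rate.
p_A = 0.01                  # prior: p(disease)
p_B_given_A = 0.95          # likelihood: p(positive | disease)
p_B_given_not_A = 0.05      # p(positive | no disease)

# Evidence via the law of total probability
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: p(A|B) = p(A) * p(B|A) / p(B)
p_A_given_B = p_A * p_B_given_A / p_B
print(f"p(disease | positive) = {p_A_given_B:.3f}")  # ≈ 0.161
```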

3
Q

What is MLE (Maximum Likelihood Estimation)?

A

Find the θ that makes our observed training data most likely.

θ_MLE = argmax_θ p(D|θ) = argmax_θ ∏(n=1 to N) p(y_n|x_n, θ)

In practice, we maximize the log-likelihood (equivalently, minimize the negative log-likelihood): the log turns the product into a sum and does not change the argmax.

MLE ignores the prior p(θ) and just focuses on fitting the data as well as possible.
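
A minimal sketch, assuming a Bernoulli coin-flip model (an illustrative choice, not from the card): minimizing the negative log-likelihood numerically recovers the closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: 10 coin flips, 7 heads, under a Bernoulli(theta) model
data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def neg_log_likelihood(theta):
    # -log p(D|theta) = -Σ_n log p(y_n|theta); minimizing this maximizes the likelihood
    return -np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)        # ≈ 0.7, the numerical MLE
print(data.mean())     # 0.7, the closed-form MLE for a Bernoulli model
```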

4
Q

What is MAP (Maximum A Posteriori) estimation?

A

MAP = MLE + Prior regularization
θ_MAP = argmax p(θ|D) = argmax [log p(D|θ) + log p(θ)]

Finds the single most probable parameter value given data AND prior beliefs
Unlike MLE (which ignores priors), MAP includes prior p(θ) as regularization
Helps prevent overfitting by incorporating prior knowledge
When prior is uniform, MAP = MLE
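
A minimal sketch extending the coin-flip example with an assumed Beta(2,2) prior (an illustrative choice): MAP just adds log p(θ) to the MLE objective.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # 7 heads in 10 flips

def neg_log_posterior(theta):
    log_lik = np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))
    log_prior = beta.logpdf(theta, 2, 2)   # Beta(2,2) prior, peaked at 0.5
    return -(log_lik + log_prior)          # MAP maximizes log p(D|θ) + log p(θ)

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ≈ 0.667: pulled from the MLE (0.7) toward the prior mode (0.5)
```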

5
Q

How can you describe entropy?

A

The expected degree of surprise: if outcomes carry little surprise, entropy is low. Entropy measures the expected level of information (surprise) in a random variable.
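
A minimal sketch, using illustrative coin distributions, of entropy in bits:

```python
import numpy as np

def entropy(p):
    # H(X) = -Σ p(x) log2 p(x); terms with p(x) = 0 contribute 0
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))    # 1.0 bit: fair coin, maximum surprise
print(entropy([0.99, 0.01]))  # ≈ 0.081 bits: nearly certain, low surprise
```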

6
Q

What is surprise (self-information) and how do you calculate it?

A

Surprise measures how unexpected an event is.
sur(x) = -log₂ p(x) (measured in bits when the log is base 2)

If p(x) = 1 → sur(x) = 0 (no surprise)
If p(x) = 0.5 → sur(x) = 1 bit

7
Q

What is cross entropy H(p,q) and what does it measure?

A

Cross entropy measures how well a predicted distribution q approximates the true distribution p:

H(p,q) = -∑ p(x) log q(x)

Cross entropy ≥ entropy: H(p,q) ≥ H(p). It is minimized when q = p (perfect predictions).
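
A minimal sketch with made-up distributions, checking that H(p,q) ≥ H(p) with the minimum at q = p:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -Σ p(x) log2 q(x)
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log2(q))

p = np.array([0.7, 0.2, 0.1])            # "true" distribution
print(cross_entropy(p, p))                # ≈ 1.157 bits = H(p), the minimum
print(cross_entropy(p, [0.4, 0.4, 0.2]))  # ≈ 1.422 bits > H(p): q ≠ p costs extra
```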

8
Q

How is cross entropy used in classification?

A

Cross entropy is the standard loss function for classification:

True labels: p (one-hot encoded)
Model predictions: q (softmax probabilities)

Goal: Minimize H(p,q) to make predictions match true labels

Connection to MLE: Minimizing cross entropy = Maximizing likelihood w.r.t. θ
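
A minimal sketch with made-up logits and a one-hot label:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())   # shift for numerical stability
    return z / z.sum()

# Hypothetical 3-class example: true class is 1
p = np.array([0.0, 1.0, 0.0])           # true label, one-hot
q = softmax(np.array([1.0, 2.5, 0.5]))  # model predictions

loss = -np.sum(p * np.log(q))           # H(p, q); with one-hot p this is -log q[true]
print(loss)                             # ≈ 0.31; shrinks toward 0 as q[1] → 1
```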

9
Q

What is joint entropy H(X,Y)?

A

Joint entropy measures the total uncertainty in two random variables X and Y together:

H(X,Y) = -∑∑ p(x,y) log p(x,y)

If X and Y are independent, H(X,Y) = H(X) + H(Y).

10
Q

What’s the difference between entropy H(p) and cross entropy H(p,q)?

A

Entropy H(p): Uncertainty in true distribution p
Cross entropy H(p,q): Cost of using wrong distribution q to encode data from p

11
Q

What is KL Divergence D_KL(p||q) and what does it measure?

A

KL divergence measures how different distribution q is from distribution p. It is the gap between the cross entropy of p and q and the entropy of p:

D_KL(p||q) = ∑ p(x) log(p(x)/q(x)) = H(p,q) - H(p)

It is ≥ 0, and equals 0 only when q = p.
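
A minimal sketch (same made-up distributions as in the cross-entropy card) verifying D_KL(p||q) = H(p,q) - H(p):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log2(p / q))                                  # D_KL(p||q)
cross_minus_ent = -np.sum(p * np.log2(q)) + np.sum(p * np.log2(p))  # H(p,q) - H(p)
print(kl, cross_minus_ent)  # both ≈ 0.265 bits; 0 iff q = p
```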

12
Q

What is the principle of Bayesian thinking?

A

Use probability distributions to represent uncertainty and update beliefs with evidence.

13
Q

What is Bayesian decision making?

A

Use probabilities + loss functions to make optimal decisions under uncertainty.
Process:

Get posterior probabilities p(state|evidence)
Define loss function L(action, true_state)
Choose action that minimizes expected loss: argmin ∑ p(state|evidence) × L(action, state)

Example: Medical diagnosis - consider both the probability of disease AND the cost of different treatment decisions; see the sketch below.
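
A minimal sketch of the medical example; all probabilities and losses are made up:

```python
import numpy as np

# Hypothetical posterior over states and loss matrix L[action, state]
posterior = np.array([0.2, 0.8])   # p(disease), p(healthy), given evidence
loss = np.array([
    [1.0, 5.0],    # action "treat": small cost if sick, side effects if healthy
    [50.0, 0.0],   # action "don't treat": disastrous if sick, free if healthy
])

expected_loss = loss @ posterior    # Σ_state p(state|evidence) × L(action, state)
best = np.argmin(expected_loss)
print(expected_loss)                    # [4.2, 10.0]
print(["treat", "don't treat"][best])   # "treat": disease is unlikely, but missing it is costly
```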

14
Q

What do Entropy and Mutual Information measure?

A

Entropy H(X):
Uncertainty/information content in random variable X

H(X) = -∑ p(x) log p(x)
Higher entropy = more uncertainty

Mutual Information I(X;Y):
How much information X and Y share

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Measures dependence between variables
I(X;Y) = 0 when X,Y independent
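
A minimal sketch computing I(X;Y) from a small made-up joint distribution:

```python
import numpy as np

# Hypothetical joint p(x, y) over two binary variables
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)
mi = H(px) + H(py) - H(pxy.flatten())
print(mi)  # ≈ 0.278 bits; would be 0 if the joint factorized (independence)
```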

15
Q

Why is it intractable to compute a posterior predictive distribution?

A

p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ

(The posterior p(θ|D) itself requires the evidence p(D) = ∫ p(D|θ) p(θ) dθ, another intractable integral.)

High-dimensional integration: Neural networks have millions of parameters θ. We would need to integrate over this entire high-dimensional parameter space, which is computationally infeasible.
Complex posterior shape: p(θ|D) is usually non-Gaussian, multimodal, and has no analytical form, so the integral cannot be solved analytically.
Continuous parameter space: Unlike discrete cases where we could sum over a few values, θ lives in a continuous space, requiring true integration rather than summation.
Example: Even if we knew p(θ|D) perfectly, averaging predictions over millions of possible parameter combinations is computationally infeasible.
Solution: We use approximation methods such as variational inference, MCMC, or point estimates (MAP/MLE) instead of computing the full integral; see the sketch below.
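
A minimal sketch of the Monte Carlo idea behind those approximations (the "posterior samples" and model here are stand-ins, not a real inference method): replace the integral with an average over samples of θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in: pretend these are samples theta_s ~ p(theta|D) from MCMC or
# variational inference; here they are just drawn from a made-up Gaussian.
theta_samples = rng.normal(loc=1.0, scale=0.3, size=1000)

def predict(x, theta):
    # Stand-in predictive model; here a simple deterministic function of theta
    return theta * x

x_new = 2.0
# p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ  ≈  (1/S) Σ_s p(y|x, θ_s)
predictions = np.array([predict(x_new, t) for t in theta_samples])
print(predictions.mean(), predictions.std())  # predictive mean and uncertainty
```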
