Week 2: Decision & Information Theory Basics Flashcards

(15 cards)

1
Q

What is meant by a "Probabilistic Perspective"?

A

Core Idea:
Modeling uncertainty explicitly using probability distributions rather than making deterministic predictions.
Key Aspects:

Uncertainty quantification: Express confidence levels in predictions (e.g., “90% confident this is spam”)
Parameters as distributions: Model parameters have probability distributions, not fixed values
Bayesian inference: Update beliefs using Bayes’ theorem: P(θ|data) ∝ P(data|θ) × P(θ)
Everything probabilistic: Inputs, outputs, and parameters all modeled with distributions

2
Q

Explain and write out Bayes' Theorem.

A

Derives the probability of an event based on prior knowledge of conditions that might be related to it.

p(A|B) = [p(A) × p(B|A)] / p(B)

Where:
p(A|B) = posterior probability (probability of A given B)
p(A) = prior probability of A
p(B|A) = likelihood (probability of B given A)
p(B) = marginal probability of B (evidence)
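
A minimal numeric sketch in Python (the disease/test numbers below are made up for illustration):

```python
# Hypothetical numbers: a disease with 1% prevalence and a test that is
# 95% sensitive, with a 5% false-positive rate.
p_A = 0.01                  # prior: p(disease)
p_B_given_A = 0.95          # likelihood: p(positive | disease)
p_B_given_not_A = 0.05      # p(positive | no disease)

# Evidence via the law of total probability
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: p(A|B) = p(A) * p(B|A) / p(B)
p_A_given_B = p_A * p_B_given_A / p_B
print(f"p(disease | positive) = {p_A_given_B:.3f}")  # ≈ 0.161
```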

3
Q

What is MLE (Maximum Likelihood Estimation)?

A

Find the θ that makes our observed training data most likely.

θ_MLE = argmax_θ p(D|θ) = argmax_θ ∏(n=1 to N) p(y_n|x_n, θ)

In practice, we maximize the log-likelihood (equivalently, minimize the negative log-likelihood): the log turns the product into a sum and does not change the argmax.

MLE ignores the prior p(θ) and just focuses on fitting the data as well as possible.
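
A minimal sketch, assuming a Bernoulli coin-flip model (an illustrative choice, not from the card): minimizing the negative log-likelihood numerically recovers the closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: 10 coin flips, 7 heads, under a Bernoulli(theta) model
data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def neg_log_likelihood(theta):
    # -log p(D|theta) = -Σ_n log p(y_n|theta); minimizing this maximizes the likelihood
    return -np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)        # ≈ 0.7, the numerical MLE
print(data.mean())     # 0.7, the closed-form MLE for a Bernoulli model
```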

4
Q

What is MAP (Maximum A Posteriori) estimation?

A

MAP = MLE + Prior regularization
θ_MAP = argmax p(θ|D) = argmax [log p(D|θ) + log p(θ)]

Finds the single most probable parameter value given data AND prior beliefs
Unlike MLE (which ignores priors), MAP includes prior p(θ) as regularization
Helps prevent overfitting by incorporating prior knowledge
When prior is uniform, MAP = MLE
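
A minimal sketch extending the coin-flip example with an assumed Beta(2,2) prior (an illustrative choice): MAP just adds log p(θ) to the MLE objective.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # 7 heads in 10 flips

def neg_log_posterior(theta):
    log_lik = np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))
    log_prior = beta.logpdf(theta, 2, 2)   # Beta(2,2) prior, peaked at 0.5
    return -(log_lik + log_prior)          # MAP maximizes log p(D|θ) + log p(θ)

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ≈ 0.667: pulled from the MLE (0.7) toward the prior mode (0.5)
```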

5
Q

How can you describe entropy?

A

The expected degree of surprise: if outcomes carry little surprise, entropy is low. Entropy measures the expected level of information (surprise) in a random variable.
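
A minimal sketch, using illustrative coin distributions, of entropy in bits:

```python
import numpy as np

def entropy(p):
    # H(X) = -Σ p(x) log2 p(x); terms with p(x) = 0 contribute 0
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))    # 1.0 bit: fair coin, maximum surprise
print(entropy([0.99, 0.01]))  # ≈ 0.081 bits: nearly certain, low surprise
```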

6
Q

What is surprise (self-information) and how do you calculate it?

A

Surprise measures how unexpected an event is.
sur(x) = -log₂ p(x) (measured in bits when the log is base 2)

If p(x) = 1 → sur(x) = 0 (no surprise)
If p(x) = 0.5 → sur(x) = 1 bit

7
Q

What is cross entropy H(p,q) and what does it measure?

A

Cross entropy measures how well a predicted distribution q approximates the true distribution p:

H(p,q) = -∑ p(x) log q(x)

Cross entropy ≥ entropy: H(p,q) ≥ H(p). It is minimized when q = p (perfect predictions).
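
A minimal sketch with made-up distributions, checking that H(p,q) ≥ H(p) with the minimum at q = p:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -Σ p(x) log2 q(x)
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log2(q))

p = np.array([0.7, 0.2, 0.1])            # "true" distribution
print(cross_entropy(p, p))                # ≈ 1.157 bits = H(p), the minimum
print(cross_entropy(p, [0.4, 0.4, 0.2]))  # ≈ 1.422 bits > H(p): q ≠ p costs extra
```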

8
Q

How is cross entropy used in classification?

A

Cross entropy is the standard loss function for classification:

True labels: p (one-hot encoded)
Model predictions: q (softmax probabilities)

Goal: Minimize H(p,q) to make predictions match true labels

Connection to MLE: Minimizing cross entropy = Maximizing likelihood w.r.t. θ
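
A minimal sketch with made-up logits and a one-hot label:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())   # shift for numerical stability
    return z / z.sum()

# Hypothetical 3-class example: true class is 1
p = np.array([0.0, 1.0, 0.0])           # true label, one-hot
q = softmax(np.array([1.0, 2.5, 0.5]))  # model predictions

loss = -np.sum(p * np.log(q))           # H(p, q); with one-hot p this is -log q[true]
print(loss)                             # ≈ 0.31; shrinks toward 0 as q[1] → 1
```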

9
Q

What is joint entropy H(X,Y)?

A

Joint entropy measures the total uncertainty in two random variables X and Y together:

H(X,Y) = -∑∑ p(x,y) log p(x,y)

If X and Y are independent, H(X,Y) = H(X) + H(Y).

10
Q

What’s the difference between entropy H(p) and cross entropy H(p,q)?

A

Entropy H(p): Uncertainty in true distribution p
Cross entropy H(p,q): Cost of using wrong distribution q to encode data from p

11
Q

What is KL Divergence D_KL(p||q) and what does it measure?

A

KL divergence measures how different distribution q is from distribution p. It is the gap between the cross entropy of p and q and the entropy of p:

D_KL(p||q) = ∑ p(x) log(p(x)/q(x)) = H(p,q) - H(p)

It is ≥ 0, and equals 0 only when q = p.
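
A minimal sketch (same made-up distributions as in the cross-entropy card) verifying D_KL(p||q) = H(p,q) - H(p):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log2(p / q))                                  # D_KL(p||q)
cross_minus_ent = -np.sum(p * np.log2(q)) + np.sum(p * np.log2(p))  # H(p,q) - H(p)
print(kl, cross_minus_ent)  # both ≈ 0.265 bits; 0 iff q = p
```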

12
Q

What is the principle of Bayesian thinking?

A

Use probability distributions to represent uncertainty and update beliefs with evidence.

13
Q

What is Bayesian decision making?

A

Use probabilities + loss functions to make optimal decisions under uncertainty.
Process:

Get posterior probabilities p(state|evidence)
Define loss function L(action, true_state)
Choose action that minimizes expected loss: argmin ∑ p(state|evidence) × L(action, state)

Example: Medical diagnosis - consider both the probability of disease AND the cost of different treatment decisions; see the sketch below.
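
A minimal sketch of the medical example; all probabilities and losses are made up:

```python
import numpy as np

# Hypothetical posterior over states and loss matrix L[action, state]
posterior = np.array([0.2, 0.8])   # p(disease), p(healthy), given evidence
loss = np.array([
    [1.0, 5.0],    # action "treat": small cost if sick, side effects if healthy
    [50.0, 0.0],   # action "don't treat": disastrous if sick, free if healthy
])

expected_loss = loss @ posterior    # Σ_state p(state|evidence) × L(action, state)
best = np.argmin(expected_loss)
print(expected_loss)                    # [4.2, 10.0]
print(["treat", "don't treat"][best])   # "treat": disease is unlikely, but missing it is costly
```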

14
Q

What do Entropy and Mutual Information measure?

A

Entropy H(X):
Uncertainty/information content in random variable X

H(X) = -∑ p(x) log p(x)
Higher entropy = more uncertainty

Mutual Information I(X;Y):
How much information X and Y share

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Measures dependence between variables
I(X;Y) = 0 when X,Y independent
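
A minimal sketch computing I(X;Y) from a small made-up joint distribution:

```python
import numpy as np

# Hypothetical joint p(x, y) over two binary variables
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)
mi = H(px) + H(py) - H(pxy.flatten())
print(mi)  # ≈ 0.278 bits; would be 0 if the joint factorized (independence)
```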

15
Q

Why is it intractable to compute a posterior predictive distribution?

A

p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ

(The posterior p(θ|D) itself requires the evidence p(D) = ∫ p(D|θ) p(θ) dθ, another intractable integral.)

High-dimensional integration: Neural networks have millions of parameters θ. We would need to integrate over this entire high-dimensional parameter space, which is computationally infeasible.
Complex posterior shape: p(θ|D) is usually non-Gaussian, multimodal, and has no analytical form, so the integral cannot be solved analytically.
Continuous parameter space: Unlike discrete cases where we could sum over a few values, θ lives in a continuous space, requiring true integration rather than summation.
Example: Even if we knew p(θ|D) perfectly, averaging predictions over millions of possible parameter combinations is computationally infeasible.
Solution: We use approximation methods such as variational inference, MCMC, or point estimates (MAP/MLE) instead of computing the full integral; see the sketch below.
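
A minimal sketch of the Monte Carlo idea behind those approximations (the "posterior samples" and model here are stand-ins, not a real inference method): replace the integral with an average over samples of θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in: pretend these are samples theta_s ~ p(theta|D) from MCMC or
# variational inference; here they are just drawn from a made-up Gaussian.
theta_samples = rng.normal(loc=1.0, scale=0.3, size=1000)

def predict(x, theta):
    # Stand-in predictive model; here a simple deterministic function of theta
    return theta * x

x_new = 2.0
# p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ  ≈  (1/S) Σ_s p(y|x, θ_s)
predictions = np.array([predict(x_new, t) for t in theta_samples])
print(predictions.mean(), predictions.std())  # predictive mean and uncertainty
```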
