Topic 2: Decision & Information Theory Basics Flashcards

(19 cards)

1
Q

What uncertainty does the probabilistic perspective address?

A

To classify, one can try to find exactly the relevant features that define a decision boundary, but this is uncertain: there will be noise, transformations, and so on.

In the real world, you deal with data that is uncertain (many outliers; not everything is identical). The data can be fuzzy, ambiguous, noisy, transformed, occluded, and so on.

Instead of hard decision boundaries, we build models that capture the conditional probability distribution:
the conditional probability distribution over classes is obtained by applying the softmax function to the logits, a = f(x; θ).
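A minimal sketch of this in Python; the logits are made-up stand-ins for the output of some model f(x; θ):

```python
import numpy as np

def softmax(a):
    """Convert logits a into a probability distribution over classes."""
    a = a - np.max(a)          # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Hypothetical logits a = f(x; theta) for one input x
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)        # ~[0.79, 0.18, 0.04], sums to 1
print(probs, probs.sum())
```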

2
Q

What is the core idea of Bayesian thinking in machine learning?

A

Bayes’ rule/theorem: understanding the relationship between two events.

Bayes’ theorem derives the probability of an event based on prior knowledge of conditions that might be related to the event.

Let A and B be two events.
p(A and B) = p(A, B) = p(A) * p(B), if A and B are independent.

In general:
p(A, B) = p(A|B) * p(B) = p(B|A) * p(A)

Dividing through gives Bayes’ theorem; for parameters θ and data D:
p(θ|D) = (p(θ) * p(D|θ)) / p(D)
posterior = (likelihood × prior) / evidence
It calculates the probability of one event given another using the prior, the likelihood, and the evidence.

(Come back to this to ask questions.)
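A tiny numeric sketch of the theorem; the diagnostic-test numbers are made up for illustration:

```python
# Bayes' theorem on a made-up diagnostic example:
# prior p(disease), likelihood p(positive | disease), evidence p(positive).
p_disease = 0.01                      # prior
p_pos_given_disease = 0.95            # likelihood (test sensitivity)
p_pos_given_healthy = 0.05            # false-positive rate

# evidence via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# posterior = likelihood * prior / evidence
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)            # ~0.161: still fairly unlikely
```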

3
Q

What is Bayesian Decision Theory?

A

Bayesian inference: an optimal (yet intractable) way to update beliefs about hidden quantities, p(H|x).
The next step is to choose a decision from those beliefs → Bayesian Decision Theory: pick the action that minimises the expected loss under those beliefs.

IN OTHER WORDS: we derive decisions for tomorrow’s actions based on today’s Bayesian probabilities.

The ideal case is that the probability structure is perfectly known.
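A minimal sketch of the decision step, assuming a known posterior over two hidden states and a made-up loss table:

```python
import numpy as np

# Posterior beliefs over hidden states, e.g. p(H | x) for H in {sick, healthy}
posterior = np.array([0.3, 0.7])

# loss[a, h]: loss of taking action a when the true state is h (made-up numbers)
# actions: 0 = treat, 1 = don't treat
loss = np.array([[0.0, 10.0],     # treating: free if sick, costly if healthy
                 [100.0, 0.0]])   # not treating: disastrous if sick

expected_loss = loss @ posterior          # expected loss per action
best_action = int(np.argmin(expected_loss))
print(expected_loss, best_action)         # [7.0, 30.0] -> action 0 (treat)
```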

4
Q

What is a reject option in classification (Bayesian Decision Theory)?

A

When the model is uncertain (the highest class probability is below a threshold), it can reject making a decision (“I don’t know”).
This gives classification with the option to “reject”.

It’s usually modelled as an action, a, from a set such as A = {treat, no treat} ∪ {0}.

The {0} action represents “reject”/“I don’t know”, and is chosen when the probability of the most probable class is below λ* = 1 − λ_r / λ_e (λ_r is the cost of rejecting, λ_e the cost of an error).

An example is the game show Jeopardy:
Action: a correct answer costs 0 (you lose nothing).
Action: a wrong answer costs λ_e (you lose 100).
Action: no answer costs λ_r (you lose 50).
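A short sketch of the reject rule with the Jeopardy costs above:

```python
import numpy as np

lam_e, lam_r = 100.0, 50.0          # cost of an error, cost of rejecting
lam_star = 1 - lam_r / lam_e        # reject threshold: 0.5

def decide(class_probs):
    """Return the most probable class, or 'reject' if too uncertain."""
    best = int(np.argmax(class_probs))
    if class_probs[best] < lam_star:
        return "reject"
    return best

print(decide(np.array([0.8, 0.2])))         # confident -> class 0
print(decide(np.array([0.45, 0.55])))       # 0.55 >= 0.5 -> class 1
print(decide(np.array([0.4, 0.35, 0.25])))  # 0.4 < 0.5 -> 'reject'
```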

5
Q

Why can’t we always use exact Bayesian inference?

A

Because posterior predictive distributions require marginalisation over all parameters, which is often intractable (it’s an NP-hard problem).
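The integral in question, in the notation used elsewhere in these cards:

```latex
% Posterior predictive: average the model's prediction over all
% parameter values, weighted by their posterior probability.
p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
```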

6
Q

What approximations are used instead of exact Bayesian inference?

A

Instead, we use approximations → MLE, MAP, and KL-divergence.

7
Q

What is Maximum Likelihood Estimation (MLE)?

A

How do we estimate parameters? We pick the parameters that assign the highest probability to the training data, i.e. the $\theta$ that maximises the likelihood.
MLE: https://docs.google.com/document/d/1mxsSkeFXkP5p7zsEv1d96AZE-RTbwC5IjZ6v2gIzlZg/edit?tab=t.0
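A tiny sketch in the Bernoulli coin-toss setting (the same data as card 18), where the likelihood-maximising θ has a closed form:

```python
import numpy as np

# Coin-toss data: 1 = heads, 0 = tails
D = np.array([0, 1, 0, 0, 1, 0, 0, 1])

# For a Bernoulli likelihood, the theta maximising p(D | theta)
# is simply the empirical fraction of heads, N_heads / N.
theta_mle = D.mean()
print(theta_mle)   # 3/8 = 0.375
```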

8
Q

What is Maximum A Posteriori (MAP) estimation?

A

MLE is great, but it’s prone to overfitting the evidence (it overfits to the training data).

MAP estimation is MLE with regularisation from the prior, which reduces reliance on noisy training data.
For MAP, we work with logs and maximise the posterior probability, i.e. the likelihood combined with the prior probability, instead of the likelihood alone.

MAP:
https://docs.google.com/document/d/1H71JvHrQg_u9INR-oWWKKuot2Q3QQ5lbdJBJ2sJ2YQE/edit?tab=t.0

There are different approximations for MAP; it depends on the problem at hand.
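The objective written out; the evidence p(D) is dropped because it does not depend on θ:

```latex
\theta_{\text{MAP}}
  = \arg\max_{\theta} \, p(\theta \mid \mathcal{D})
  = \arg\max_{\theta} \, \big[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \big]
```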

9
Q

What is entropy in information theory?

A

Entropy → the level/degree of surprise.

Consider the weather in Denmark. If the forecast says it will rain, we won’t be surprised when it actually does rain, because such predictions are usually true for Denmark (low surprise, low entropy).

When the surprise is low, we have low entropy.
When the surprise is high, we have high entropy.
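Formally, entropy is the expected surprise: H(X) = −Σ_x p(x) log p(x). A quick sketch with made-up rain probabilities:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum p(x) log2 p(x), in bits; 0 log 0 is taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.9, 0.1]))   # ~0.47 bits: rain is a safe bet, low surprise
print(entropy([0.5, 0.5]))   # 1.0 bit: maximally uncertain weather
```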

10
Q

What is cross-entropy?

A

Say we have two distributions, p and q, and we want to find the cross-entropy between them.

p is our true class distribution, whilst $q$ is the predicted class distribution.
Minimising the cross-entropy is equivalent to maximising the likelihood with respect to θ.
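In symbols, H(p, q) = −Σ_x p(x) log q(x). A minimal sketch with a one-hot true label, which is the usual classification loss:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); eps guards against log(0)."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])        # true class distribution (one-hot: class 1)
q_good = np.array([0.1, 0.8, 0.1])   # confident, correct prediction
q_bad = np.array([0.7, 0.2, 0.1])    # confident, wrong prediction

print(cross_entropy(p, q_good))      # ~0.22: low loss
print(cross_entropy(p, q_bad))       # ~1.61: high loss
```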

11
Q

What is conditional entropy H(Y|X)?

A

Here, the entropy is conditional in the sense that H(Y|X), Y given X, means: “the surprise left in Y after seeing X”.

This allows us to determine the information gain, i.e. the reduction of uncertainty about Y given knowledge of X.

When is H(Y|X) = 0? When X determines Y perfectly, i.e. when Y is a function of X: after seeing X, there is no surprise left in Y.
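The definition, averaging the remaining entropy of Y over the values of X:

```latex
H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x)
            = -\sum_{x,\,y} p(x, y) \log p(y \mid x)
```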

12
Q

What is KL-divergence (Kullback-Leibler)?

A

We use KL-divergence to compare our model’s prediction q with the true data distribution p_D, and train the model to minimise that difference.

It’s a measure of how similar or divergent two distributions, p and q, are (not a true distance metric, since it is asymmetric):
KL(p ‖ q) = Σ_x p(x) log (p(x) / q(x))

It measures the predictive power a sample brings on average when distinguishing p(x) from q(x), sampling from p(x).
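A small sketch computing KL(p ‖ q) for two made-up discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.4, 0.1])        # "true" distribution
q = np.array([0.3, 0.4, 0.3])        # model's approximation

print(kl_divergence(p, q))           # ~0.15
print(kl_divergence(q, p))           # ~0.18: asymmetric, KL(p||q) != KL(q||p)
```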

13
Q

What’s the difference between forward and reverse KL-divergence?

A

p is the true distribution
q is the (often unimodal) approximation

Forward KL, KL(p ‖ q):
- compares the true distribution, p, to the model, q; the terms are weighted by p
- wherever p is high, q must be high too: putting q ≈ 0 where p > 0 blows the divergence up, so q is pushed to cover all of p’s modes (mass-covering)
- it is used less in inference, because evaluating it requires knowing (or sampling from) the true distribution

Reverse KL, KL(q ‖ p):
- compares the model, q, to the true distribution, p; the terms are weighted by q
- wherever q is high, p must be high too: q is penalised for putting mass where p ≈ 0, so q tends to lock onto a single mode (mode-seeking)
- more common in inference (see the sketch below)
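A toy sketch with a bimodal p and two candidate approximations, illustrating why forward KL is mass-covering and reverse KL is mode-seeking (all numbers are made up):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

p       = np.array([0.45, 0.05, 0.05, 0.45])  # bimodal "true" distribution
q_broad = np.array([0.25, 0.25, 0.25, 0.25])  # covers both modes
q_mode  = np.array([0.85, 0.05, 0.05, 0.05])  # locks onto one mode

# Forward KL(p||q) prefers the mass-covering q ...
print(kl(p, q_broad), kl(p, q_mode))   # ~0.37 < ~0.70
# ... while reverse KL(q||p) prefers the mode-seeking q.
print(kl(q_broad, p), kl(q_mode, p))   # ~0.51 > ~0.43
```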

14
Q

What is mutual information?

A

The amount of information one variable contains about another; it quantifies the reduction in uncertainty.

Mutual information = the reduction in “surprise” about Y after seeing X = information gain.

Mutual information: the information the two variables share (what they have in common).
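Equivalent ways to write it, consistent with the entropy definitions above:

```latex
I(X; Y) = H(Y) - H(Y \mid X)
        = \sum_{x,\,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
        = \mathrm{KL}\!\left( p(x, y) \,\|\, p(x)\, p(y) \right)
```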

15
Q

How is mutual information used in decision trees?

A

Used to select the best feature for splitting, by maximising the information gain I(A; Y).

For each input attribute A and target Y, we compute I(A; Y); we split on the attribute with the largest value, and then recurse on the resulting subsets.

In the classic iris example, each split narrows down which species a flower might belong to.
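A compact sketch of the feature-scoring step on made-up binary data (the recursion and stopping criteria of a full decision-tree learner are omitted):

```python
import numpy as np

def entropy(labels):
    """H(Y) over an array of discrete labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """I(A; Y) = H(Y) - H(Y | A) for one discrete feature column."""
    h_y = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    h_y_given_a = sum(
        (c / len(labels)) * entropy(labels[feature == v])
        for v, c in zip(values, counts)
    )
    return h_y - h_y_given_a

# Made-up data: feature A1 predicts Y perfectly, A2 is uninformative
Y  = np.array([0, 0, 1, 1])
A1 = np.array([0, 0, 1, 1])
A2 = np.array([0, 1, 0, 1])
print(information_gain(A1, Y))  # 1.0 bit  -> split on A1
print(information_gain(A2, Y))  # 0.0 bits -> useless split
```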

16
Q

Why is modeling uncertainty important in machine learning?

A

Because real-world data is often noisy, ambiguous, transformed, or incomplete. Modeling uncertainty allows for more robust predictions and better handling of outliers, occlusions, and variability in data.

17
Q

What is feature selection?

A

Feature selection: choose an optimal subset of features according to the task at hand, e.g. via:
- feature ranking algorithms
- minimum-subset selection algorithms

Napoleon and Celine Dion share a lot of similarities (according to the particular features they do share), but there are probably other features on which they differ, so which features we select matters.

Some observations are uncertain: not all observations are equally reliable, so we have to take uncertainty into account.

We can then use cross-entropy as a loss function to measure how well the model matches the data.

18
Q

Give an example of MAP estimation

A

A classic example is coin tosses modeled with a Bernoulli distribution.

Let’s say we observe outcomes of a coin toss, where
y = 1 means heads,
y = 0 means tails.
We let D = {y_1, …, y_N} be the data: D = {0, 1, 0, 0, 1, 0, 0, 1}, so N = 8 tosses.
We use a prior p(θ) = Beta(θ|a, b) with a = b = 2.
Then the MAP estimate of the coin’s bias, with N_1 = 3 heads and N_0 = 5 tails, is:
θ_MAP = (N_1 + a − 1) / (N_1 + N_0 + a + b − 2) = (3 + 2 − 1) / (3 + 5 + 2 + 2 − 2) = 4/10 = 0.4

The MAP estimate is 0.4: it lies between 0.375 (the MLE, from the data alone) and 0.5 (the prior mean).
This means MAP doesn’t fully trust the small dataset; it pulls the estimate towards the prior.
This is useful when:
- we don’t have much data
- we want to avoid extreme estimates (like 0 or 1) too early
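A quick sketch verifying the arithmetic:

```python
import numpy as np

D = np.array([0, 1, 0, 0, 1, 0, 0, 1])   # 1 = heads, 0 = tails
a, b = 2, 2                               # Beta(2, 2) prior, centred on 0.5

n1 = D.sum()              # heads: 3
n0 = len(D) - n1          # tails: 5

theta_mle = n1 / len(D)                            # 0.375: trusts data alone
theta_map = (n1 + a - 1) / (n1 + n0 + a + b - 2)   # 0.4: pulled towards prior
print(theta_mle, theta_map)
```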

19
Q

Why is it intractable to compute a posterior predictive distribution?

A

Computing, for example, the entropy of the posterior predictive requires access to the full posterior predictive distribution. That is intractable, because it requires integrating over all possible parameter values, weighted by their posterior probability (in general an NP-hard problem). We can instead approximate it, e.g. using Monte Carlo sampling (repeated random sampling): draw parameters from the posterior and average the resulting predictions.
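A minimal Monte Carlo sketch in the Beta-Bernoulli setting from card 18, where the posterior happens to be known in closed form (Beta(N_1 + a, N_0 + b)), so we can check the approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n1, n0 = 2, 2, 3, 5            # Beta(2, 2) prior + data from card 18

# The posterior over theta is Beta(n1 + a, n0 + b); draw S parameter samples
S = 100_000
thetas = rng.beta(n1 + a, n0 + b, size=S)

# Monte Carlo posterior predictive: p(y=1 | D) ~ (1/S) sum_s p(y=1 | theta_s)
p_heads_mc = thetas.mean()
p_heads_exact = (n1 + a) / (n1 + n0 + a + b)   # closed form: 5/12
print(p_heads_mc, p_heads_exact)               # ~0.4167 vs 0.41666...
```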