Topic 2: Decision & Information Theory Basics Flashcards
(19 cards)
What is the uncertainty in the probabilistic perspective?
To classify, one could try to find the exact relevant features that define a decision boundary, but this is uncertain, as there will be noise, transformations and so on.
In the real world, you deal with data that is uncertain (many outliers, not everything is identical). The data can be fuzzy, ambiguous, noisy, transformed, occluded and so on.
Instead, we fit models to the data and try to capture the conditional probability distribution p(y|x).
The conditional probability distribution is obtained by applying a softmax function to the logits, a = f(x; θ).
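Concretely, with the standard softmax over the logits:
p(y = c | x) = softmax(a)_c = exp(a_c) / Σ_j exp(a_j), where a = f(x; θ)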
What is the core idea of Bayesian thinking in machine learning?
Bayes’ rule/theorem: Understanding the relationship between two events.
Bayes’ theorem: derives the probability of an event, based on prior knowledge of conditions that might be related to the event.
Let A and B be two events.
p(A and B) = p(A, B) = p(A) · p(B)
if A and B are independent
p(A, B) = p(A|B) · p(B) = p(B|A) · p(A)
Dividing by p(B) gives Bayes' theorem; with parameters θ and data D:
p(θ|D) = (p(θ) · p(D|θ)) / p(D)
posterior = (likelihood × prior) / evidence
It calculates the probability of event A given event B using prior, likelihood, and evidence.
Come back to this later to pose questions.
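As a quick numeric illustration, here is a minimal Python sketch of Bayes' rule on a disease-test example (the numbers are hypothetical, not from the course):

```python
# Hypothetical disease-test example of Bayes' rule.
prior = 0.01          # p(disease): prior
sensitivity = 0.90    # p(positive | disease): likelihood
false_pos = 0.096     # p(positive | no disease)

# Evidence p(positive) via the law of total probability.
evidence = sensitivity * prior + false_pos * (1 - prior)

# posterior = likelihood * prior / evidence
posterior = sensitivity * prior / evidence
print(f"p(disease | positive) = {posterior:.3f}")  # ~0.087
```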
What is Bayesian Decision Theory?
Bayesian inference: It’s an optimal (yet intractable) way to update the beliefs about hidden quantities p(H|x).
The next step is to choose an action based on those beliefs → Bayesian Decision Theory.
IN OTHER WORDS: We derive decisions for tomorrow’s actions, based on today’s Bayesian probabilities.
The ideal case is that the probability structure is perfectly known.
What is a reject option in classification (Bayesian Decision Theory)?
When the model is uncertain (probability below a threshold), it can reject making a decision (“I don’t know”).
Now we can do a classification with the option to “reject”.
It’s usually modelled as an action, a, from set, A={treat, no treat} U with {0}.
The {0} set represents the “reject”/”idk” action that is drawn when the most probable class $p$ is: < lambda* = 1 - lambda_r / lambda_e
An example is the gameshow Jeopardy.
Action: Correct answer results in 0 (you lose nothing)
Action: Wrong answer results in λ_e (you lose 100)
Action: No answer results in λ_r (you lose 50)
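A minimal Python sketch of the reject rule, using the Jeopardy costs above (so λ* = 1 − 50/100 = 0.5); the interface is assumed, not from the course:

```python
# Reject option: answer only when confident enough.
lambda_e, lambda_r = 100, 50          # cost of a wrong answer / of passing
threshold = 1 - lambda_r / lambda_e   # lambda* = 0.5

def decide(class_probs):
    """Return the most probable class, or 'reject' if its probability
    falls below the threshold lambda*."""
    best = max(class_probs, key=class_probs.get)
    return best if class_probs[best] >= threshold else "reject"

# With many possible answers (as in Jeopardy), the best guess can be weak:
print(decide({"A": 0.7, "B": 0.2, "C": 0.1}))    # 'A' (confident enough)
print(decide({"A": 0.4, "B": 0.35, "C": 0.25}))  # 'reject' (0.4 < 0.5)
```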
Why can’t we always use exact Bayesian inference?
Because posterior predictive distributions require marginalisation over all parameters, which is often intractable (it’s an NP-hard problem).
What approximations are used instead of exact Bayesian inference?
Instead, we use approximations → MLE, MAP, KL-divergence.
What is Maximum Likelihood Estimation (MLE)?
How do we estimate parameters? We pick the parameters that assign the highest probability to the training data, i.e. the θ that maximises the likelihood.
MLE: https://docs.google.com/document/d/1mxsSkeFXkP5p7zsEv1d96AZE-RTbwC5IjZ6v2gIzlZg/edit?tab=t.0
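As a minimal sketch (an assumed Bernoulli coin example, not taken from the linked document), MLE has the closed form θ_MLE = N_1/N:

```python
# Bernoulli MLE: the theta that maximises the likelihood of the tosses.
data = [0, 1, 0, 0, 1, 0, 0, 1]    # 1 = heads, 0 = tails
theta_mle = sum(data) / len(data)  # closed form: N_1 / N
print(theta_mle)  # 0.375
```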
What is Maximum A Posteriori (MAP) estimation?
MLE is great, but it's prone to overfitting the evidence (i.e. it overfits to the training data).
MAP estimation is MLE with regularisation from the prior, which reduces reliance on noisy training data.
For MAP, we work with logs and maximise the posterior probability, combining the prior probability with the likelihood instead of using the likelihood alone.
MAP:
https://docs.google.com/document/d/1H71JvHrQg_u9INR-oWWKKuot2Q3QQ5lbdJBJ2sJ2YQE/edit?tab=t.0
There are different approximations for MAP, depending on the problem at hand.
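For a Bernoulli likelihood with a Beta(a, b) prior (the standard conjugate pair), the MAP estimate has the closed form used in the worked example below:
θ_MAP = (N_1 + a − 1) / (N_1 + N_0 + a + b − 2)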
What is entropy in information theory?
Entropy → the expected level/degree of surprise
Consider the case of weather in Denmark. If the prediction says it will rain, we won't be surprised when it actually does rain, because these predictions are usually true for Denmark (low surprise, low entropy).
When the surprise is low, we have low entropy
When we have high surprise, we will have high entropy.
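Formally, entropy is the expected surprise, where the surprise of an outcome x is −log p(x):
H(X) = −Σ_x p(x) log p(x)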
What is cross-entropy?
Say we have two distributions, p and q, and we want to find the cross entropy between the two distributions.
p is our true class distribution, while q is the predicted class distribution.
Minimising the cross-entropy is equivalent to maximising the likelihood with regards to θ.
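Formally:
H(p, q) = −Σ_x p(x) log q(x)
Since the training data is sampled from p, minimising H(p, q) over θ maximises the log-probability that q assigns to that data.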
What is conditional entropy H(Y|X)?
Here, the entropy is conditional in the sense that H(Y|X) means: "surprise in Y after seeing X".
This allows us to determine the information gain, i.e. the reduction of uncertainty about Y given knowledge of X.
When is H(Y|X) = 0? When Y is a deterministic function of X: once X is known, there is no remaining uncertainty about Y, so the model captures the data perfectly.
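Formally:
H(Y|X) = −Σ_{x,y} p(x, y) log p(y|x)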
What is KL-divergence (Kullback-Leibler)?
We use KL-divergence to compare our model’s prediction q with the true data distribution p_D, and train the model to minimize that difference.
It’s the distance metric of how similar OR divergent two distributions, p and q are:
It measures the predictive power (a sample brings on average) when distringuishing p(x) from q(x) and sampling from p(x)
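Formally:
KL(p‖q) = Σ_x p(x) log (p(x) / q(x)) = H(p, q) − H(p)
i.e. the cross-entropy minus the entropy of p.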
What’s the difference between forward and reverse KL-divergence?
p is the true distribution
q is the unimodal approximation
Forward KL, KL(p‖q):
- compares the true distribution, p, to the model, q; the expectation is weighted by p
- wherever p > 0 but q ≈ 0, the divergence is heavily penalised, so q is forced to cover all regions where p has mass (zero-avoiding / mass-covering)
- it is used less in inference because it requires sampling from the true distribution
Reverse KL, KL(q‖p):
- compares the model, q, to the true distribution, p; the expectation is weighted by q
- wherever q > 0 but p ≈ 0, the divergence is heavily penalised, so q avoids regions where p has no mass (zero-forcing / mode-seeking)
- more common in inference
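In formulas:
Forward: KL(p‖q) = E_p[log p(x) − log q(x)]
Reverse: KL(q‖p) = E_q[log q(x) − log p(x)]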
What is mutual information?
The amount of information one variable contains about another; it quantifies the reduction in uncertainty.
Mutual information = reduction in “surprise” in Y after seeing X = information gain
Mutual information: the information the two variables share (what they have in common).
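Formally:
I(X; Y) = H(Y) − H(Y|X) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )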
How is mutual information used in decision trees?
Used to select the best features for splitting by maximizing the information gain I(A;Y).
For each input attribute A and target Y, we compute I(A;Y). We split the tree on the attribute with the largest value, and then recurse (see the sketch below).
This way we make decisions on, e.g., which species a flower might belong to.
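A minimal Python sketch of this split criterion on hypothetical toy data (the attribute and labels are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, labels):
    """I(A; Y) = H(Y) - H(Y | A) for a categorical attribute A."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond

# Toy data: petal size perfectly predicts the species here.
petal = ["small", "small", "large", "large"]
species = ["setosa", "setosa", "virginica", "virginica"]
print(information_gain(petal, species))  # 1.0 bit: a perfect split
```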
Why is modeling uncertainty important in machine learning?
Because real-world data is often noisy, ambiguous, transformed, or incomplete. Modeling uncertainty allows for more robust predictions and better handling of outliers, occlusions, and variability in data.
What is feature selection?
Feature selection: choose an optimal subset of features according to the task at hand:
- e.g. feature ranking algorithms
- minimal-subset algorithms (finding the smallest subset of features that suffices for the task)
Napoleon and Céline Dion share a lot of similarities (according to the mentioned features that they do share), but there are probably other differences that exist as well.
There are some uncertain observations, as not all observations are equally reliable, so we have to consider uncertainty.
We can use cross-entropy as a loss function to compare the model's predicted distribution with the true one.
Come with an example of MAP
A classic example is coin tosses modeled with a Bernoulli distribution.
Let’s say we observe outcomes of a coin toss where
y = 1, means heads,
y = 0, means tails.
We let D = {y_1, …, y_N} be the data. D = {0, 1, 0, 0, 1, 0, 0, 1}, so N = 8 tosses.
We use a prior p(θ) = Beta(θ|a, b) with a = b = 2 (a weak preference for a fair coin).
Then the MAP estimate of the coin's bias, with N_1 = 3 heads and N_0 = 5 tails, is:
θ_MAP = (N_1 + a − 1) / (N_1 + N_0 + a + b − 2) = (3 + 2 − 1) / (3 + 5 + 2 + 2 − 2) = 4/10 = 0.4
The MAP estimate is 0.4, which lies between 0.375 (the MLE, from the data alone) and 0.5 (the prior mean).
This means the MAP doesn’t fully trust the small dataset. It pulls the result closer to the prior.
This is useful when:
We don’t have much data.
We want to avoid extreme estimates (like 0 or 1) too early.
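A minimal Python sketch of this computation, mirroring the numbers above:

```python
# MAP for a Bernoulli coin with a Beta(a, b) prior (here a = b = 2).
data = [0, 1, 0, 0, 1, 0, 0, 1]   # 1 = heads, 0 = tails
a, b = 2, 2
n1 = sum(data)                    # heads: 3
n0 = len(data) - n1               # tails: 5
theta_map = (n1 + a - 1) / (n1 + n0 + a + b - 2)
theta_mle = n1 / (n1 + n0)
print(theta_map, theta_mle)       # 0.4 vs 0.375: MAP is pulled toward the prior
```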
Why is it intractable to compute a posterior predictive distribution?
Computing the posterior predictive distribution requires integrating over all possible parameter values, weighted by their posterior probability. This marginalisation is intractable in general (an NP-hard problem). We can instead approximate it, e.g. using Monte Carlo sampling (repeated random sampling).
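In symbols: p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ, which Monte Carlo approximates as (1/S) Σ_{s=1}^{S} p(y|x, θ_s), with samples θ_s ~ p(θ|D).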