Machine Learning and AI Flashcards
(114 cards)
What is probabilistic machine learning?
A probabilistic machine learning model assigns probabilities to its outcomes based on how likely the model considers each prediction to be correct
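A minimal sketch of what this looks like in practice (the scores and class count below are made up for illustration): instead of a single hard label, the model outputs a probability for each class, here via a softmax over raw scores.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical raw scores for three classes produced by some model
scores = np.array([2.0, 0.5, -1.0])
probs = softmax(scores)
print(probs)  # roughly [0.79, 0.18, 0.04]: the model's confidence in each class
```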
What are the two types of uncertainty and their definitions?
Aleatoric uncertainty:
Uncertainty due to inherent randomness or noise in the data (cannot be reduced by collecting more data)
Epistemic uncertainty:
Uncertainty due to lack of knowledge (can be reduced with more data or a better model)
How can we define a statistical model? (Notation)
p(Y | θ)
Where Y is the dataset
Where θ is the set of model parameters
What is another name for Gaussian distribution?
Univariate normal distribution
What does it mean when samples are identically distributed?
They all come from the same probability distribution
Why might we use capital Pi (Π) instead of capital Sigma (Σ)?
Capital Pi (Π) indicates a product of the elements, while capital Sigma (Σ) indicates a sum. The likelihood of i.i.d. samples is a product of the individual probabilities, which is why Π appears there.
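A small numerical sketch of why this matters for likelihoods (the data values and parameters below are made up): the likelihood of i.i.d. samples is a product (Π) of individual densities, and taking the log turns that product into a sum (Σ).

```python
import numpy as np
from scipy.stats import norm

data = np.array([1.2, 0.7, 1.9, 1.1])          # made-up i.i.d. samples
densities = norm.pdf(data, loc=1.0, scale=1.0)  # p(y_i | mu=1, sigma=1)

likelihood = np.prod(densities)             # capital Pi: product of densities
log_likelihood = np.sum(np.log(densities))  # capital Sigma: sum of log densities

print(likelihood, log_likelihood)
assert np.isclose(np.log(likelihood), log_likelihood)
```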
Try maximum likelihood for different distributions (slides, lecture 1, page 29)
I did it and I'm a good boy!!
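A minimal sketch of that exercise, assuming Gaussian and Poisson samples (the data here is simulated, not from the slides): for both distributions the maximum-likelihood estimates have closed forms, so they can be computed directly from the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian: MLE of mu is the sample mean, MLE of sigma^2 is the (biased) sample variance
x = rng.normal(loc=3.0, scale=2.0, size=1000)
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

# Poisson: MLE of the rate lambda is also the sample mean
k = rng.poisson(lam=4.0, size=1000)
lam_hat = k.mean()

print(mu_hat, sigma2_hat, lam_hat)  # should be close to 3.0, 4.0, 4.0
```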
What are the three types of prior?
Non- informative
Weakly informative
Informative
When should we use a Non-Informative Prior?
When we have no prior knowledge and want the data to dominate
For example, a normal distribution with a constant mean and a very large variance
p(θ) is then approximately the same for all values of θ
When should we use a Weakly Informative Prior?
A weakly informative prior is used when you have some general knowledge or reasonable assumptions about a parameter, but you don’t want the prior to dominate the data — it’s a compromise between a non-informative prior and a strongly informative one.
A weakly informative prior provides mild constraints on parameter values based on domain knowledge.
It helps prevent implausible or extreme values while still allowing the data to heavily influence the posterior.
Unlike non-informative priors (which treat all values equally likely), weakly informative priors rule out obviously wrong values.
For example, a Normal distribution with a constant mean and a moderately large variance.
p(θ) is roughly the same for all plausible values of θ and small for implausible ones
When should we use Informative Priors?
When you have strong, reliable prior knowledge and can enforce constraints on possible parameter values based on domain knowledge
For example, a normal distribution with a constant mean and a very small variance (see the sketch below comparing all three prior types)
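A small sketch tying the three prior cards above together (the specific variances chosen are arbitrary): all three priors can be written as a Normal distribution with the same mean, differing only in how concentrated they are.

```python
from scipy.stats import norm

theta_values = [-5.0, 0.0, 5.0]

non_informative    = norm(loc=0.0, scale=100.0)  # very large variance: p(theta) ~ flat
weakly_informative = norm(loc=0.0, scale=10.0)   # moderate variance: rules out extremes
informative        = norm(loc=0.0, scale=0.5)    # very small variance: strong constraint

for theta in theta_values:
    print(theta,
          non_informative.pdf(theta),
          weakly_informative.pdf(theta),
          informative.pdf(theta))
# The non-informative prior gives nearly identical densities for all three values,
# while the informative prior heavily favours theta near its mean.
```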
Why is the Poisson distribution more suitable for count data?
The Poisson distribution is more suitable for count data because it naturally handles discrete, non-negative values and aligns with the way events are expected to occur over fixed intervals, often capturing the mean-variance relationship seen in such data.
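A brief numerical illustration (the rate used is arbitrary): the Poisson puts probability mass only on the non-negative integers, and its mean equals its variance.

```python
from scipy.stats import poisson

dist = poisson(mu=3.5)  # scipy calls the rate parameter "mu"

# Probability mass sits only on the non-negative integers 0, 1, 2, ...
for k in range(6):
    print(k, dist.pmf(k))

print(dist.mean(), dist.var())  # both 3.5: the mean equals the variance
```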
Define a Conjugate prior
In Bayesian statistics, a conjugate prior is a prior distribution that, when combined with a likelihood from a particular family of distributions, results in a posterior distribution of the same family.
So if the prior and posterior have the same functional form, the prior is said to be conjugate to the likelihood.
In other words: if you pick a prior that, after seeing the data, gives you a posterior in the same kind of distribution, that prior is called conjugate.
e.g. a Beta prior combined with a Binomial likelihood gives a Beta posterior, so the Beta prior is conjugate to the Binomial likelihood.
How can we prove a prior is conjugate?
A conjugate prior is one where the posterior belongs to the same distribution family as the prior. To prove conjugacy, show that the product of the prior and likelihood results in a distribution of the same form as the prior (ignoring constants)
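As a worked instance of that recipe, here is the classic Beta-Binomial case (a successes-and-failures setup with k successes in n trials):

```latex
% Prior: Beta(a, b);  Likelihood: Binomial with k successes in n trials
p(\theta) \propto \theta^{a-1} (1-\theta)^{b-1}
\qquad
p(D \mid \theta) \propto \theta^{k} (1-\theta)^{n-k}

% Multiply and drop constants:
p(\theta \mid D) \propto \theta^{a+k-1} (1-\theta)^{b+n-k-1}

% This has the functional form of a Beta(a+k, b+n-k) distribution,
% so the Beta prior is conjugate to the Binomial likelihood.
```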
How do we calculate the likelihood for Bayesian linear regression?
Likelihood: p(y | X, σ², w) = N(y | Xw, σ²I)
How can we calculate the posterior in the context of Bayesian linear regression?
Posterior: p(w | y, X, σ²) = N(w | w_n, Σ_n)
where, for a Gaussian prior p(w) = N(w | w_0, Σ_0), the posterior covariance is Σ_n = (Σ_0⁻¹ + σ⁻²XᵀX)⁻¹ and the posterior mean is w_n = Σ_n(Σ_0⁻¹w_0 + σ⁻²Xᵀy)
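A minimal numerical sketch of that update, assuming a zero-mean isotropic Gaussian prior p(w) = N(0, α⁻¹I) and known noise variance σ² (all data below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y = X w_true + Gaussian noise
n, d = 50, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -0.8])
sigma2 = 0.25
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

alpha = 1.0  # prior precision: p(w) = N(0, (1/alpha) * I)

# Posterior covariance and mean (conjugate Gaussian update)
Sigma_n = np.linalg.inv(alpha * np.eye(d) + (X.T @ X) / sigma2)
w_n = Sigma_n @ (X.T @ y) / sigma2

print(w_n)      # posterior mean: close to w_true, shrunk slightly toward 0
print(Sigma_n)  # posterior covariance: how uncertain we still are about w
```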
What does Bayes’ rule state in the context of updating beliefs?
Bayes’ rule states that the posterior (our updated belief after seeing data) is proportional to the likelihood (how well the model explains the data) multiplied by the prior (our initial belief about the parameters).
How is the posterior mathematically expressed using Bayes’ rule?
It is expressed as:
P(θ∣D) ∝ P(D∣θ) × P(θ)
where
θ are the parameters
D is the data
P(θ|D) is the posterior
P(D|θ) is the likelihood
P(θ) is the Prior
Why do we take the logarithm of the posterior, likelihood, and prior?
Taking the logarithm simplifies the calculations because it converts products into sums, making the math easier to handle, especially for optimization.
What does the equation look like after taking the log of Bayes’ rule? Write this down, the answer is on the back
log P(θ∣D) ∝ log P(D∣θ) + log P(θ)
What is the Maximum A Posteriori (MAP)?
MAP is a method for estimating the most plausible value of a parameter after seeing data, while also taking into account your prior belief about that parameter
How is the Maximum A Posteriori (MAP) estimate different from the Maximum Likelihood Estimate (MLE)?
Maximum Likelihood Estimation (MLE):
-Only uses the likelihood: how well the parameter fits the data
-Doesn’t use a prior.
Maximum A Posteriori (MAP):
-Uses likelihood AND prior.
-Balances what the data says with what you already believe.
Likelihood is like what the data is telling you.
Prior is like what you already believed.
MAP is like a compromise between the two — it pulls the estimate toward your prior unless the data strongly disagrees.
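A minimal sketch of that difference, assuming we estimate the mean of a Gaussian with known variance and put a Gaussian prior on that mean (the numbers and prior are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=10)  # a small, made-up sample
sigma2 = 4.0          # known data variance
mu0, tau2 = 0.0, 1.0  # prior belief about the mean: N(0, 1)

# MLE: only the data speaks
mle = data.mean()

# MAP: precision-weighted compromise between prior mean and sample mean
n = len(data)
map_est = (mu0 / tau2 + data.sum() / sigma2) / (1 / tau2 + n / sigma2)

print(mle, map_est)  # MAP is pulled away from the sample mean toward the prior mean 0
```

With only 10 points the pull toward the prior is visible; with thousands of points the two estimates would nearly coincide, which is the "unless the data strongly disagrees" part of the card above.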
How does a Gaussian likelihood relate to Ordinary Least Squares (OLS)?
When the likelihood is Gaussian, maximizing the likelihood (or minimizing the negative log likelihood) is equivalent to minimizing the sum of squared errors, which is exactly what OLS does.
A Gaussian likelihood assumes that the errors (or noise) in a linear regression model are normally distributed around the predictions. When we take the log of the Gaussian likelihood, the result is an expression involving the sum of squared errors between the predicted and actual values.
Ordinary Least Squares (OLS) minimizes this sum of squared errors. So, maximizing the Gaussian likelihood is mathematically equivalent to minimizing the OLS loss function.
In other words, OLS is the Maximum Likelihood Estimator (MLE) when the data is assumed to have Gaussian noise.
e.g. OLS and Gaussian likelihood lead to the same solution because both aim to minimize the squared difference between predictions and true values.
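One way to see the algebra behind this equivalence is to write out the negative log of the Gaussian likelihood:

```latex
-\log p(\mathbf{y} \mid X, \mathbf{w}, \sigma^2)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^\top \mathbf{w} \right)^2
    + \frac{n}{2}\log\!\left(2\pi\sigma^2\right)
```

The second term does not depend on w, so minimising the negative log-likelihood over w is exactly minimising the sum of squared errors, i.e. the OLS objective.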
What role does a Gaussian prior play in model building?
A Gaussian prior on the parameters implies that we expect the parameters to be centered around zero and not be too large. Its logarithm introduces a penalty for large parameter values, serving as a regularization term.
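Concretely, for a zero-mean Gaussian prior p(w) = N(0, τ²I) (this particular parameterisation is an assumption for the sketch), taking the negative log gives:

```latex
-\log p(\mathbf{w}) = \frac{1}{2\tau^2} \lVert \mathbf{w} \rVert^2 + \text{const}
```

Added to the negative log-likelihood, this is the L2 (ridge) penalty, with regularisation strength λ = σ²/τ²: the smaller the prior variance τ², the stronger the pull of the parameters toward zero.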