Variational Inference Flashcards
The goal of inference is to learn about latent (unknown) variables through the posterior. Why is an analytic solution usually not an option?
The marginal likelihood (evidence) p(x) = ∫ p(x, z) dz is usually an intractable integral, so the posterior p(z|x) = p(x, z)/p(x) has no closed form.
Name some options for posterior inference
- MCMC sampling
- Laplace approximation
- Expectation propagation
- Variational inference
What is the main advantage of variational inference?
It is the most scalable of these methods, because it turns posterior inference into an optimization problem.
What is the main idea behind variational inference?
Approximate the true posterior by defining a family of approximate distributions q(z|v) and optimizing the variational parameters v so that q(z|v) is close to the posterior.
What is the KL (Kullback Leibler) divergence?
KL(p(x) || q(x)) = ∫ p(x) log(p(x)/q(x)) dx = E_p[log(p(x)/q(x))]
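An added numeric check (toy distributions chosen for illustration, not from the cards): the closed-form KL between two univariate Gaussians versus a Monte Carlo estimate of E_p[log(p(x)/q(x))].

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    # log density of N(x | mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

mu_p, sigma_p = 0.0, 1.0      # p = N(0, 1)   (arbitrary example values)
mu_q, sigma_q = 1.0, 2.0      # q = N(1, 2^2)

# closed-form KL(p || q) for two univariate Gaussians
kl_exact = (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q) ** 2) / (2 * sigma_q**2) - 0.5)

# Monte Carlo estimate of E_p[log p(x) - log q(x)] using samples from p
x = np.random.default_rng(0).normal(mu_p, sigma_p, size=100_000)
kl_mc = np.mean(gauss_logpdf(x, mu_p, sigma_p) - gauss_logpdf(x, mu_q, sigma_q))

print(f"closed form: {kl_exact:.4f}   Monte Carlo: {kl_mc:.4f}")
```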
What is differential entropy?
H[q(x)] = -E_q[log q(x)]
How is the KL divergence often used in variational inference?
Use KL(q(z) || p(z|x)), the divergence from the approximation q to the true posterior, as the objective function to minimize.
What does Jensen's inequality state, and for what function do we often use this inequality?
For concave functions f:
f(E[x]) >= E[f(x)]
This is often used for the logarithm, which is concave:
log E[x] >= E[log x]
What can we do instead of minimizing KL(q(z) || p(z|x))?
We can maximize the ELBO (Evidence Lower Bound):
ELBO(v) = E_q[log p(x|z)] - KL(q(z) || p(z))
        = E_q[log p(x, z)] + H[q(z|v)]
        = E_q[log p(x, z)] - E_q[log q(z|v)]
        = E_q[log(p(x, z) / q(z|v))]
Maximizing the ELBO is equivalent to minimizing KL(q(z) || p(z|x)), because log p(x) = ELBO(v) + KL(q(z) || p(z|x)) and log p(x) does not depend on v.
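As an added aside, the ELBO also follows from the Jensen step on the previous card, applied to the log evidence:

```latex
\log p(x) = \log \int p(x, z)\, dz
          = \log \mathbb{E}_{q(z|v)}\!\left[\frac{p(x, z)}{q(z|v)}\right]
          \ge \mathbb{E}_{q(z|v)}\!\left[\log \frac{p(x, z)}{q(z|v)}\right]
          = \mathrm{ELBO}(v)
```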
What is a mean field approximation?
In a mean field approximation q(z|x) is fully factorized, meaning q(z|x) = prod_i q(z_i). With global parameters beta and local latent variables z_n (local variational parameters phi_n), the resulting distribution is:
q(beta, z | v) = q(beta | lambda) prod_n q(z_n | phi_n), with
v = [lambda, phi_1, phi_2, ..., phi_n]
In the mean field approximation the q-factors don’t depend directly on the data. How is the family of q’s connected to the data?
Through the maximization of the ELBO: the data enter via the expected log joint E_q[log p(x, z)].
What is the algorithm for mean field approximation?
- Initialize the variational parameters randomly
- Update the local variational parameters phi_n
- Update the global variational parameters lambda
- Repeat until the ELBO converges (a small sketch of this loop follows below).
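A minimal sketch of the loop above, on an assumed toy model (univariate Gaussian data with unknown mean mu and precision tau, independent Normal and Gamma priors); hyperparameter values and variable names are illustrative choices, not from the cards.

```python
# Toy CAVI loop:
#   mu ~ N(mu0, 1/lam0),  tau ~ Gamma(a0, b0),  x_i | mu, tau ~ N(mu, 1/tau)
# with mean-field family q(mu, tau) = N(mu | mu_N, 1/lam_N) * Gamma(tau | a_N, b_N).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)   # synthetic data; true precision = 4
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0         # arbitrary weakly informative priors

a_N = a0 + N / 2.0                             # this update is fixed by the model
E_tau = rng.gamma(1.0)                         # random initialization of E_q[tau]

for _ in range(100):
    # update q(mu) holding q(tau) fixed
    lam_N = lam0 + N * E_tau
    mu_N = (lam0 * mu0 + E_tau * N * xbar) / lam_N
    # update q(tau) holding q(mu) fixed
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N)
    E_tau = a_N / b_N

print("E_q[mu] =", mu_N, "  E_q[tau] =", a_N / b_N)
```

Each pass updates one factor with the other held fixed, which is exactly the coordinate update q*(z_i) ∝ exp(E_{-i}[log p(x, z)]) from the card further down.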
What are the limitations of mean field approximation?
Mean field approximations tend to be too compact (they underestimate the posterior variance) because they ignore correlations between the latent variables; capturing those dependencies requires a richer class of approximating distributions.
Classical mean field approximation has to evaluate all data points to update the parameters, making it unscalable to large datasets. How can we alleviate this problem?
Use stochastic variational inference (SVI), updating the global parameters with noisy gradients computed from a random subset (mini-batch) of the data.
How do we maximize the ELBO?
Coordinate ascent: set each factor to q*(z_i) ∝ exp(E_{-i}[log p(x, z)]), the exponentiated expected log joint under all the other factors, and cycle through the factors.
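Stated as the standard mean-field result (added for reference; E_{-i} denotes the expectation over all factors except q(z_i)):

```latex
\log q^*(z_i) = \mathbb{E}_{-i}\big[\log p(x, z)\big] + \mathrm{const}
\quad\Longleftrightarrow\quad
q^*(z_i) \propto \exp\!\big(\mathbb{E}_{-i}\big[\log p(x, z)\big]\big)
```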
What is an important criterion for using mean field approximation with stochastic (noisy) gradients?
The gradient estimates should be unbiased (equal to the true gradient in expectation).
What is a problem with maximizing the ELBO?
The ELBO is non-convex, so we are only guaranteed to reach a local optimum.
What is the natural gradient?
The gradient pre-multiplied by the inverse Fisher information matrix of q(z|v). It gives the direction of steepest ascent when distance is measured by KL divergence between distributions (for example, when taking the gradient of the ELBO with respect to v) rather than by Euclidean distance in parameter space.
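In symbols (added for concreteness, using the standard definition with F(v) the Fisher information matrix of q(z|v)):

```latex
\tilde{\nabla}_v \,\mathrm{ELBO}(v) = F(v)^{-1}\, \nabla_v \,\mathrm{ELBO}(v),
\qquad
F(v) = \mathbb{E}_{q(z|v)}\!\big[\nabla_v \log q(z|v)\, \nabla_v \log q(z|v)^{\top}\big]
```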
What are some advantages of the natural gradient?
- Invariant to the parametrization of q, for example variance vs. precision.
- In SVI, the noisy natural gradient with respect to the global parameters can be estimated from a single data point (or mini-batch).
How do we update the global parameter using noisy gradients?
Use a running (weighted) average:
lambda_t = (1 - rho_t) * lambda_{t-1} + rho_t * lambda_hat_t
where lambda_hat_t is the noisy estimate of the global parameter computed from the sampled data point, and rho_t is a step size satisfying the Robbins-Monro conditions (sum_t rho_t = infinity, sum_t rho_t^2 < infinity).
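A sketch of that schedule as code (pseudocode-style Python; `local_update` and `intermediate_global` are hypothetical placeholders for the model-specific computations):

```python
# Sketch of the stochastic variational inference (SVI) outer loop.
# `local_update` fits the local parameters phi_n for one data point given the current lambda;
# `intermediate_global` returns the global estimate as if that point were replicated N times.
import numpy as np

def svi(x, lam_init, local_update, intermediate_global, n_steps=1000, kappa=0.7, delay=1.0):
    N = len(x)
    lam = lam_init
    for t in range(1, n_steps + 1):
        n = np.random.randint(N)                        # sample one data point
        phi_n = local_update(x[n], lam)                 # optimize its local parameters
        lam_hat = intermediate_global(x[n], phi_n, N)   # noisy estimate of lambda
        rho = (t + delay) ** (-kappa)                   # Robbins-Monro step size
        lam = (1 - rho) * lam + rho * lam_hat           # running weighted average
    return lam
```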
What happens if we try to use CAVI to do Bayesian logistic regression?
The coordinate updates involve an expectation (mean) that cannot be calculated in closed form, because the logistic likelihood is not conjugate to the prior.
Why can't we use Monte Carlo approximation to calculate the intractable mean when doing inference on logistic regression?
We can't push gradients through Monte Carlo sampling. (We can optimize a lower bound instead, but that lower bound is model specific.)
Why do we have to swap the order of integration (for the ELBO expectations) and differentiation in BBVI (Black Box Variational Inference)?
The expectations are intractable for non-conjugate models, so the gradient cannot be computed analytically; after moving the derivative inside the expectation, it can be estimated by Monte Carlo sampling from q.
What is the idea behind score function gradients?
Switch the order of integration and differentiation and use the identity grad_v q(z|v) = q(z|v) grad_v log q(z|v) (the score function), so the gradient becomes an expectation under q that can be estimated with Monte Carlo samples.
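A small added example of the resulting estimator: grad_v E_q[f(z)] = E_q[f(z) grad_v log q(z|v)], checked numerically on a Gaussian q where the exact gradient is known. In BBVI, f(z) would be log p(x, z) - log q(z|v); the estimator is unbiased (as the earlier card requires) but often high variance.

```python
# Score-function (REINFORCE-style) gradient estimator on a toy case:
#   q(z|v) = N(z | m, 1),  f(z) = z^2,  so E_q[f(z)] = m^2 + 1 and d/dm = 2m.
import numpy as np

rng = np.random.default_rng(0)
m = 1.5                                   # variational parameter (mean of q)
z = rng.normal(m, 1.0, size=200_000)      # samples from q(z|v)

f = z ** 2                                # stand-in for log p(x,z) - log q(z|v)
score = z - m                             # grad_m log N(z | m, 1) = (z - m)/1
grad_est = np.mean(f * score)             # unbiased score-function estimate

print(f"estimate: {grad_est:.3f}   exact gradient: {2 * m:.3f}")
```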