Variational Inference Flashcards

1
Q

The goal of inference is to learn about latent (unknown) variables through the posterior. Why is an analytic solution usually not an option?

A

The marginal likelihood (evidence) integral p(x) = ∫ p(x,z) dz is usually intractable.

2
Q

Name some options for posterior inference

A

MCMC sampling
Laplace approximation
Expectation propagation
Variational inference

3
Q

What is the main advantage of variational inference?

A

It is currently the most scalable approach to approximate posterior inference, since it turns inference into an optimization problem.

4
Q

What is the main idea behind variational inference?

A

Approximate the true posterior by defining a family of approximate distributions q_v and optimizing the variational parameters v so that q_v is close to the posterior.

5
Q

What is the KL (Kullback-Leibler) divergence?

A

KL(p(x) || q(x)) =
∫ p(x) log(p(x)/q(x)) dx =
Ep[log(p(x)/q(x))]
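
For intuition, a minimal NumPy sketch: estimate KL(p || q) between two univariate Gaussians by Monte Carlo and compare it with the closed-form Gaussian KL (the particular distributions below are just an illustrative choice).

import numpy as np

rng = np.random.default_rng(0)

# Two univariate Gaussians: p = N(0, 1), q = N(1, 2^2)
mu_p, sd_p = 0.0, 1.0
mu_q, sd_q = 1.0, 2.0

def log_normal(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2

# Monte Carlo estimate of KL(p||q) = Ep[log p(x) - log q(x)]
x = rng.normal(mu_p, sd_p, size=200_000)
kl_mc = np.mean(log_normal(x, mu_p, sd_p) - log_normal(x, mu_q, sd_q))

# Closed form for two Gaussians
kl_exact = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

print(kl_mc, kl_exact)   # the two numbers should be close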

6
Q

What is differential entropy?

A

H[q(x)]= -Eq[log q(x)]

7
Q

How is the KL divergence often used in variational inference?

A

Use KL(q(z) || p(z|x)) as the objective function to minimize.

8
Q

What does Jensen's inequality state, and for which function do we often use it?

A

For concave functions f:
f(E[x]) >= E[f(x)]

This is often used with the logarithm (which is concave):
log E[x] >= E[log x]
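
A tiny numeric check of the log case (the positive samples below are just an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive samples

lhs = np.log(np.mean(x))      # log E[x]
rhs = np.mean(np.log(x))      # E[log x]
print(lhs, rhs, lhs >= rhs)   # Jensen: log E[x] >= E[log x], so this prints True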

9
Q

What can we do instead of minimizing KL(q(z) || p(z|x))?

A

We can maximize the ELBO (Evidence Lower Bound):
Eq[log p(x|z)] - KL(q(z|v) || p(z)) =
Eq[log p(x,z)] + H[q(z|v)] =
Eq[log p(x,z)] - Eq[log q(z|v)] =
Eq[log(p(x,z) / q(z|v))]
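
The link to the previous card, written out: since log p(x,z) = log p(z|x) + log p(x),

log p(x) = Eq[log(p(x,z) / q(z|v))] + KL(q(z|v) || p(z|x)) = ELBO + KL(q(z|v) || p(z|x))

The evidence log p(x) does not depend on v, so maximizing the ELBO over v is the same as minimizing KL(q(z|v) || p(z|x)); and because KL >= 0, the ELBO is a lower bound on the log evidence.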
10
Q

What is a mean field approximation?

A

In a mean field approximation the variational distribution is fully factorized: q(z) = prod_i q(z_i). With global parameters beta and local latent variables z_n, the resulting distribution is:
q(beta, z | v) = q(beta | lambda) prod_n q(z_n | phi_n), with
v = [lambda, phi_1, phi_2, ..., phi_n]

11
Q

In the mean field approximation the q-factors don’t depend directly on the data. How is the family of q’s connected to the data?

A

Through the maximization of the ELBO, whose Eq[log p(x,z)] term involves the data.

12
Q

What is the algorithm for mean field approximation?

A

  1. Initialize the variational parameters randomly
  2. Update the local variational parameters
  3. Update the global variational parameters
  4. Repeat steps 2-3 until the ELBO converges (see the sketch below).
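
A minimal CAVI sketch, assuming a toy target: the "posterior" is a correlated bivariate Gaussian and q(z_1)q(z_2) is a factorized Gaussian, for which the coordinate updates (each factor set proportional to exp of the expected log joint, see card 15) have a classical closed form.

import numpy as np

# Toy "posterior": a correlated bivariate Gaussian N(mu, Sigma)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)           # precision matrix

# Factorized approximation q(z) = q(z1) q(z2), each factor Gaussian.
# For a Gaussian target the coordinate updates have a classical closed form:
#   q(z_i) = N(m_i, 1 / Lam[i, i]),  m_i = mu_i - Lam[i, j] / Lam[i, i] * (m_j - mu_j)
m = np.zeros(2)                      # initialize the variational means
for it in range(50):                 # repeat until convergence (50 sweeps is plenty here)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])   # coordinate update for q(z1)
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])   # coordinate update for q(z2)

print("variational means :", m)                   # converges to mu = [1, -1]
print("variational vars  :", 1 / np.diag(Lam))    # about 0.19 each
print("true marginal vars:", np.diag(Sigma))      # 1.0 each

Note the variances: the factorized q recovers the posterior mean, but its marginal variances (1/Lambda_ii, about 0.19 here) are far smaller than the true ones (1.0), which is exactly the "too compact" behaviour of card 13.
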
13
Q

What are the limitations of mean field approximation?

A

Mean field approximations tend to be too compact (they underestimate the posterior variance and cannot capture correlations between the factors); a richer approximating class is needed.

14
Q

Classical mean field approximation has to evaluate all datapoints to update the global parameters, making it unscalable to large datasets. How can we alleviate this problem?

A

Use stochastic variational inference: update the parameters using a random subset (minibatch) of the data.

15
Q

How do we maximize the ELBO?

A

In coordinate ascent (CAVI), set each factor to q*(z_i) ∝ exp(E_{-i}[log p(x,z)]), where E_{-i} denotes the expectation over all the other factors q(z_j), j ≠ i.

16
Q

What is an important criterion for using mean field approximation with stochastic (noisy) gradients?

A

The gradient estimates should be unbiased (equal to the true gradient in expectation).

17
Q

What is a problem with maximizing the ELBO?

A

The ELBO is non-convex, so we only find local optima.

18
Q

What is the natural gradient?

A

The gradient taken with respect to the geometry of the distribution q(z|v) rather than of the raw parameters: the ordinary gradient is premultiplied by the inverse Fisher information of q(z|v), so a step corresponds to a fixed amount of change in the distribution (e.g., measured by symmetrized KL divergence) rather than a fixed change in v. In SVI it is estimated from a single datapoint (or minibatch).

19
Q

What are some advantages of the natural gradient?

A
  • Invariant to the parametrization of q, for example variance vs. precision.
  • Can be estimated from a single datapoint (or minibatch).

20
Q

How do we update the global parameter using noisy gradients?

A

Use a running average:

lambda_t = (1 - rho_t) * lambda_{t-1} + rho_t * lambda_hat_t

where lambda_hat_t is the noisy estimate of the optimal global parameter computed from the current minibatch, and rho_t is a decreasing step size (Robbins-Monro conditions: sum rho_t = infinity, sum rho_t^2 < infinity).
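
A toy sketch of that update, where the noisy per-step estimates are simulated as the true value plus Gaussian noise, and rho_t = (t + tau)^(-kappa) is the usual SVI step-size schedule:

import numpy as np

rng = np.random.default_rng(0)

true_lam = 3.0         # the value the noisy estimates fluctuate around
lam = 0.0              # current global parameter
tau, kappa = 1.0, 0.7  # delay and forgetting rate, 0.5 < kappa <= 1

for t in range(1, 2001):
    lam_hat = true_lam + rng.normal(0.0, 2.0)   # noisy estimate from one minibatch
    rho = (t + tau) ** (-kappa)                 # decreasing step size
    lam = (1 - rho) * lam + rho * lam_hat       # running-average update

print(lam)   # close to 3.0 despite the very noisy individual estimates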

21
Q

What happens if we try to use CAVI to do Bayesian logistic regression?

A

The CAVI update requires an expectation (a mean) that cannot be calculated in closed form, because the model is non-conjugate.

22
Q

Why can't we simply use a Monte Carlo approximation to calculate the intractable expectation when doing inference for logistic regression?

A

We can't push gradients through the Monte Carlo sampling step. (We could optimize a further, looser lower bound instead, but such bounds are model-specific.)

23
Q

Why do we have to swap the order of integration (the ELBO expectations) and differentiation in BBVI (Black Box VI)?

A

The expectations in the ELBO are intractable for non-conjugate models, so the gradient cannot be computed analytically; moving the derivative inside the expectation lets us estimate it with samples.

24
Q

What is the idea behind score function gradients?

A

Swap the order of integration and differentiation, then use the log-derivative trick, grad_v q(z|v) = q(z|v) grad_v log q(z|v), to write the gradient as an expectation under q that can be estimated by sampling.

25
Q

How can we practically calculate the score function gradient?

A

With a Monte Carlo estimate: sample z ~ q(z|v) and average the integrand over the samples.

26
Q

What do we need to calculate the score function gradients?

A

  1. Samples from q(z|v)
  2. The score function grad_v log q(z|v), evaluated at the samples
  3. The values of log q(z|v) and log p(x,z) at the samples

27
Q

What is a problem with the score function gradients?

A

Because they rely on MC sampling they are noisy (high variance). This can be alleviated with, for example, control variates (see the sketch below).

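A compact sketch covering the last three cards, on a toy objective with a known answer: the gradient of E_{z ~ N(mu, 1)}[z^2] with respect to mu is 2*mu; the score function estimator recovers it from samples, and subtracting a constant baseline (a simple control variate) keeps it unbiased while lowering the variance. The choice f(z) = z^2 and the baseline value are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

mu = 2.0                      # variational parameter of q(z|mu) = N(mu, 1)
n = 100_000
z = rng.normal(mu, 1.0, n)    # 1. samples from q

f = z ** 2                    # toy "objective" whose expectation we differentiate
score = z - mu                # 2. score function: d/dmu log N(z | mu, 1) = z - mu

# Score function estimator of d/dmu E_q[f(z)]  (true value: 2*mu = 4)
grad_plain = np.mean(f * score)

# Same estimator with a constant baseline b (a simple control variate).
# Subtracting b keeps the estimator unbiased because E_q[score] = 0.
b = mu ** 2 + 1.0             # here: the known value of E_q[f], just for illustration
grad_cv = np.mean((f - b) * score)

print(grad_plain, grad_cv)                         # both are close to 4
print(np.var(f * score), np.var((f - b) * score))  # the baseline version has lower variance
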
28
Q

If f: X -> Y is invertible with y = f(x), how are the volume elements dx and dy related?

A

dy = |det(df/dx)| dx: the volume element is scaled by the absolute value of the determinant of the Jacobian. Densities transform correspondingly, q_Y(y) = q_X(x) / |det(df/dx)|.

29
Q

What are the properties of pathwise gradients compared to score function gradients?

A

Lower variance, but a more restricted model class: the model must be differentiable in z, and z must be expressible as a transformation z = t(e, v) of fixed noise (the reparameterization trick).

30
Q

What are the score function ELBO gradient and the pathwise ELBO gradient?

A

Score function: grad_v ELBO = Eq[ g(z,v) * grad_v log q(z|v) ]
Pathwise: grad_v ELBO = E_{p(e)}[ grad_z g(z,v) * grad_v t(e,v) ]
with g(z,v) = log p(x,z) - log q(z|v) and z = t(e,v).

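A sketch of the pathwise estimator on a case with a known answer: for z ~ N(mu, sigma^2), the gradient of E[z^2] with respect to (mu, sigma) is (2*mu, 2*sigma). Writing z = t(e, v) = mu + sigma*e with e ~ N(0, 1) lets the gradient flow through the samples.

import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 1.5
n = 100_000

eps = rng.normal(0.0, 1.0, n)   # noise from a fixed base distribution p(e)
z = mu + sigma * eps            # reparameterization: z = t(e, v) with v = (mu, sigma)

# f(z) = z^2, so grad_z f = 2z; the chain rule through t gives the pathwise estimator
grad_mu = np.mean(2 * z * 1.0)      # dt/dmu = 1
grad_sigma = np.mean(2 * z * eps)   # dt/dsigma = eps

print(grad_mu, grad_sigma)   # close to (2*mu, 2*sigma) = (4.0, 3.0)
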
31
Q

What is the idea behind amortized inference?

A

Learn a mapping f(x_n, theta) = phi_n from datapoints to their local variational parameters (sketch below). This means:
  1. We do not need to optimize the local parameters individually
  2. There is no longer a separate alternation between local and global updates
  3. theta can be found using SGD on minibatches

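A minimal sketch of such a mapping, assuming 1-D data and a Gaussian q(z_n | phi_n) with phi_n = (mu_n, log sigma_n); here the "inference network" is just an affine map, standing in for the neural network used in practice.

import numpy as np

rng = np.random.default_rng(0)

# theta: parameters of the shared mapping f(x_n, theta) -> phi_n
theta = {"W": rng.normal(size=(2, 1)) * 0.1, "b": np.zeros(2)}

def encode(x_n, theta):
    """Amortized inference: map a datapoint to its local variational
    parameters phi_n = (mu_n, log_sigma_n)."""
    h = theta["W"] @ np.atleast_1d(x_n) + theta["b"]
    return h[0], h[1]

x = rng.normal(size=5)                   # a minibatch of datapoints
phis = [encode(x_n, theta) for x_n in x]
print(phis)   # one (mu_n, log_sigma_n) pair per datapoint, with no per-point optimization
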
32
Q

What is the idea of autoregressive distributions?

A

We make each z_i depend on all previous z_j with j < i, i.e. q(z|v) = prod_i q(z_i | z_1, ..., z_{i-1}, v).

33
Q

What is the idea of normalizing flows?

A

We apply K invertible transformations to a simple base distribution q(z|v) to obtain a more flexible approximate posterior.
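
The density of the transformed variable follows from the Jacobian rule in card 28: if z_K = T_K(... T_1(z_0) ...) with z_0 ~ q_0(z|v), then

log q_K(z_K) = log q_0(z_0) - sum_{k=1..K} log |det(dT_k / dz_{k-1})|

so each invertible transformation adds a log-determinant correction, and the transformations are chosen so that these determinants are cheap to compute.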