Variational Inference Flashcards

1
Q

The goal of inference is to learn about latent (unknown) variables through the posterior. Why is an analytic solution usually not an option?

A

The marginal likelihood (evidence) integral p(x) = ∫ p(x,z) dz is usually intractable.

2
Q

Name some options for posterior inference

A

MCMC sampling
Laplace approximation
Expectation propagation
Variational inference

3
Q

What is the main advantage of variational inference?

A

It is currently the most scalable approach to approximate posterior inference, since it turns inference into an optimization problem.

4
Q

What is the main idea behind variational inference?

A

Approximate the true posterior by defining a family of approximate distributions q_v and optimizing the variational parameters v so that q_v is close to the posterior.

5
Q

What is the KL (Kullback-Leibler) divergence?

A

KL(p(x) || q(x)) =
∫ p(x) log(p(x)/q(x)) dx =
Ep[log(p(x)/q(x))]
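
For intuition, a minimal NumPy sketch: estimate KL(p || q) between two univariate Gaussians by Monte Carlo and compare it with the closed-form Gaussian KL (the particular distributions below are just an illustrative choice).

import numpy as np

rng = np.random.default_rng(0)

# Two univariate Gaussians: p = N(0, 1), q = N(1, 2^2)
mu_p, sd_p = 0.0, 1.0
mu_q, sd_q = 1.0, 2.0

def log_normal(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2

# Monte Carlo estimate of KL(p||q) = Ep[log p(x) - log q(x)]
x = rng.normal(mu_p, sd_p, size=200_000)
kl_mc = np.mean(log_normal(x, mu_p, sd_p) - log_normal(x, mu_q, sd_q))

# Closed form for two Gaussians
kl_exact = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

print(kl_mc, kl_exact)   # the two numbers should be close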

6
Q

What is differential entropy?

A

H[q(x)]= -Eq[log q(x)]

7
Q

How is the KL divergence often used in variational inference?

A

Use KL(q(z) || p(z|x)) as the objective function to minimize.

8
Q

What does Jensen's inequality state, and for which function do we often use it?

A

For concave functions f:
f(E[x]) >= E[f(x)]

This is often used with the logarithm (which is concave):
log E[x] >= E[log x]
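
A tiny numeric check of the log case (the positive samples below are just an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive samples

lhs = np.log(np.mean(x))      # log E[x]
rhs = np.mean(np.log(x))      # E[log x]
print(lhs, rhs, lhs >= rhs)   # Jensen: log E[x] >= E[log x], so this prints True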

9
Q

What can we do instead of minimizing KL(q(z) || p(z|x))?

A

We can maximize the ELBO (Evidence Lower Bound):
Eq[log p(x|z)] - KL(q(z|v) || p(z)) =
Eq[log p(x,z)] + H[q(z|v)] =
Eq[log p(x,z)] - Eq[log q(z|v)] =
Eq[log(p(x,z) / q(z|v))]
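
The link to the previous card, written out: since log p(x,z) = log p(z|x) + log p(x),

log p(x) = Eq[log(p(x,z) / q(z|v))] + KL(q(z|v) || p(z|x)) = ELBO + KL(q(z|v) || p(z|x))

The evidence log p(x) does not depend on v, so maximizing the ELBO over v is the same as minimizing KL(q(z|v) || p(z|x)); and because KL >= 0, the ELBO is a lower bound on the log evidence.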
10
Q

What is a mean field approximation?

A

In a mean field approximation the variational distribution is fully factorized: q(z) = prod_i q(z_i). With global parameters beta and local latent variables z_n, the resulting distribution is:
q(beta, z | v) = q(beta | lambda) prod_n q(z_n | phi_n), with
v = [lambda, phi_1, phi_2, ..., phi_n]

11
Q

In the mean field approximation the q-factors don’t depend directly on the data. How is the family of q’s connected to the data?

A

Through the maximization of the ELBO, whose Eq[log p(x,z)] term involves the data.

12
Q

What is the algorithm for mean field approximation?

A

  1. Initialize the variational parameters randomly
  2. Update the local variational parameters
  3. Update the global variational parameters
  4. Repeat steps 2-3 until the ELBO converges (see the sketch below).
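
A minimal CAVI sketch, assuming a toy target: the "posterior" is a correlated bivariate Gaussian and q(z_1)q(z_2) is a factorized Gaussian, for which the coordinate updates (each factor set proportional to exp of the expected log joint, see card 15) have a classical closed form.

import numpy as np

# Toy "posterior": a correlated bivariate Gaussian N(mu, Sigma)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)           # precision matrix

# Factorized approximation q(z) = q(z1) q(z2), each factor Gaussian.
# For a Gaussian target the coordinate updates have a classical closed form:
#   q(z_i) = N(m_i, 1 / Lam[i, i]),  m_i = mu_i - Lam[i, j] / Lam[i, i] * (m_j - mu_j)
m = np.zeros(2)                      # initialize the variational means
for it in range(50):                 # repeat until convergence (50 sweeps is plenty here)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])   # coordinate update for q(z1)
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])   # coordinate update for q(z2)

print("variational means :", m)                   # converges to mu = [1, -1]
print("variational vars  :", 1 / np.diag(Lam))    # about 0.19 each
print("true marginal vars:", np.diag(Sigma))      # 1.0 each

Note the variances: the factorized q recovers the posterior mean, but its marginal variances (1/Lambda_ii, about 0.19 here) are far smaller than the true ones (1.0), which is exactly the "too compact" behaviour of card 13.
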
13
Q

What are the limitations of mean field approximation?

A

Mean field approximations tend to be too compact (they underestimate the posterior variance and cannot capture correlations between the factors); a richer approximating class is needed.

14
Q

Classical mean field approximation has to evaluate all datapoints to update the global parameters, making it unscalable to large datasets. How can we alleviate this problem?

A

Use stochastic variational inference: update the parameters using a random subset (minibatch) of the data.

15
Q

How do we maximize the ELBO?

A

In coordinate ascent (CAVI), set each factor to q*(z_i) ∝ exp(E_{-i}[log p(x,z)]), where E_{-i} denotes the expectation over all the other factors q(z_j), j ≠ i.

16
Q

What is an important criterion for using mean field approximation with stochastic (noisy) gradients?

A

The gradient estimates should be unbiased (equal to the true gradient in expectation).

17
Q

What is a problem with maximizing the ELBO?

A

The ELBO is non-convex, so we only find local optima.

18
Q

What is the natural gradient?

A

The gradient taken with respect to the geometry of the distribution q(z|v) rather than of the raw parameters: the ordinary gradient is premultiplied by the inverse Fisher information of q(z|v), so a step corresponds to a fixed amount of change in the distribution (e.g., measured by symmetrized KL divergence) rather than a fixed change in v. In SVI it is estimated from a single datapoint (or minibatch).

19
Q

What are some advantages of the natural gradient?

A
  • Invariant to the parametrization of q, for example variance vs. precision.
  • Can be estimated from a single datapoint (or minibatch).

20
Q

How do we update the global parameter using noisy gradients?

A

Use a running average:

lambda_t = (1 - rho_t) * lambda_{t-1} + rho_t * lambda_hat_t

where lambda_hat_t is the noisy estimate of the optimal global parameter computed from the current minibatch, and rho_t is a decreasing step size (Robbins-Monro conditions: sum rho_t = infinity, sum rho_t^2 < infinity).
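
A toy sketch of that update, where the noisy per-step estimates are simulated as the true value plus Gaussian noise, and rho_t = (t + tau)^(-kappa) is the usual SVI step-size schedule:

import numpy as np

rng = np.random.default_rng(0)

true_lam = 3.0         # the value the noisy estimates fluctuate around
lam = 0.0              # current global parameter
tau, kappa = 1.0, 0.7  # delay and forgetting rate, 0.5 < kappa <= 1

for t in range(1, 2001):
    lam_hat = true_lam + rng.normal(0.0, 2.0)   # noisy estimate from one minibatch
    rho = (t + tau) ** (-kappa)                 # decreasing step size
    lam = (1 - rho) * lam + rho * lam_hat       # running-average update

print(lam)   # close to 3.0 despite the very noisy individual estimates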

21
Q

What happens if we try to use CAVI to do Bayesian logistic regression?

A

The CAVI update requires an expectation (a mean) that cannot be calculated in closed form, because the model is non-conjugate.

22
Q

Why can't we simply use a Monte Carlo approximation to calculate the intractable expectation when doing inference for logistic regression?

A

We can't push gradients through the Monte Carlo sampling step. (We could optimize a further, looser lower bound instead, but such bounds are model-specific.)

23
Q

Why do we have to swap the order of integration (the ELBO expectations) and differentiation in BBVI (Black Box VI)?

A

The expectations in the ELBO are intractable for non-conjugate models, so the gradient cannot be computed analytically; moving the derivative inside the expectation lets us estimate it with samples.

24
Q

What is the idea behind score function gradients?

A

Swap the order of integration and differentiation, then use the log-derivative trick, grad_v q(z|v) = q(z|v) grad_v log q(z|v), to write the gradient as an expectation under q that can be estimated by sampling.

25
Q

How can we practically calculate the score function gradient?

A

With a Monte Carlo estimate: sample z ~ q(z|v) and average the integrand over the samples.

26
Q

What do we need to calculate the score function gradients?

A

  1. Samples from q(z|v)
  2. The score function grad_v log q(z|v), evaluated at the samples
  3. The values of log q(z|v) and log p(x,z) at the samples

27
Q

What is a problem with the score function gradients?

A

Because they rely on MC sampling they are noisy (high variance). This can be alleviated with, for example, control variates (see the sketch below).

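A compact sketch covering the last three cards, on a toy objective with a known answer: the gradient of E_{z ~ N(mu, 1)}[z^2] with respect to mu is 2*mu; the score function estimator recovers it from samples, and subtracting a constant baseline (a simple control variate) keeps it unbiased while lowering the variance. The choice f(z) = z^2 and the baseline value are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

mu = 2.0                      # variational parameter of q(z|mu) = N(mu, 1)
n = 100_000
z = rng.normal(mu, 1.0, n)    # 1. samples from q

f = z ** 2                    # toy "objective" whose expectation we differentiate
score = z - mu                # 2. score function: d/dmu log N(z | mu, 1) = z - mu

# Score function estimator of d/dmu E_q[f(z)]  (true value: 2*mu = 4)
grad_plain = np.mean(f * score)

# Same estimator with a constant baseline b (a simple control variate).
# Subtracting b keeps the estimator unbiased because E_q[score] = 0.
b = mu ** 2 + 1.0             # here: the known value of E_q[f], just for illustration
grad_cv = np.mean((f - b) * score)

print(grad_plain, grad_cv)                         # both are close to 4
print(np.var(f * score), np.var((f - b) * score))  # the baseline version has lower variance
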
28
Q

If f: X -> Y is invertible with y = f(x), how are the volume elements dx and dy related?

A

dy = |det(df/dx)| dx: the volume element is scaled by the absolute value of the determinant of the Jacobian. Densities transform correspondingly, q_Y(y) = q_X(x) / |det(df/dx)|.

29
Q

What are the properties of pathwise gradients compared to score function gradients?

A

Lower variance, but a more restricted model class: the model must be differentiable in z, and z must be expressible as a transformation z = t(e, v) of fixed noise (the reparameterization trick).

30
Q

What are the score function ELBO gradient and the pathwise ELBO gradient?

A

Score function: grad_v ELBO = Eq[ g(z,v) * grad_v log q(z|v) ]
Pathwise: grad_v ELBO = E_{p(e)}[ grad_z g(z,v) * grad_v t(e,v) ]
with g(z,v) = log p(x,z) - log q(z|v) and z = t(e,v).

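A sketch of the pathwise estimator on a case with a known answer: for z ~ N(mu, sigma^2), the gradient of E[z^2] with respect to (mu, sigma) is (2*mu, 2*sigma). Writing z = t(e, v) = mu + sigma*e with e ~ N(0, 1) lets the gradient flow through the samples.

import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 1.5
n = 100_000

eps = rng.normal(0.0, 1.0, n)   # noise from a fixed base distribution p(e)
z = mu + sigma * eps            # reparameterization: z = t(e, v) with v = (mu, sigma)

# f(z) = z^2, so grad_z f = 2z; the chain rule through t gives the pathwise estimator
grad_mu = np.mean(2 * z * 1.0)      # dt/dmu = 1
grad_sigma = np.mean(2 * z * eps)   # dt/dsigma = eps

print(grad_mu, grad_sigma)   # close to (2*mu, 2*sigma) = (4.0, 3.0)
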
31
Q

What is the idea behind amortized inference?

A

Learn a mapping f(x_n, theta) = phi_n from datapoints to their local variational parameters (sketch below). This means:
  1. We do not need to optimize the local parameters individually
  2. There is no longer a separate alternation between local and global updates
  3. theta can be found using SGD on minibatches

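A minimal sketch of such a mapping, assuming 1-D data and a Gaussian q(z_n | phi_n) with phi_n = (mu_n, log sigma_n); here the "inference network" is just an affine map, standing in for the neural network used in practice.

import numpy as np

rng = np.random.default_rng(0)

# theta: parameters of the shared mapping f(x_n, theta) -> phi_n
theta = {"W": rng.normal(size=(2, 1)) * 0.1, "b": np.zeros(2)}

def encode(x_n, theta):
    """Amortized inference: map a datapoint to its local variational
    parameters phi_n = (mu_n, log_sigma_n)."""
    h = theta["W"] @ np.atleast_1d(x_n) + theta["b"]
    return h[0], h[1]

x = rng.normal(size=5)                   # a minibatch of datapoints
phis = [encode(x_n, theta) for x_n in x]
print(phis)   # one (mu_n, log_sigma_n) pair per datapoint, with no per-point optimization
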
32
Q

What is the idea of autoregressive distributions?

A

We make each z_i depend on all previous z_j with j < i, i.e. q(z|v) = prod_i q(z_i | z_1, ..., z_{i-1}, v).

33
Q

What is the idea of normalizing flows?

A

We apply K invertible transformations to a simple base distribution q(z|v) to obtain a more flexible approximate posterior.
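
The density of the transformed variable follows from the Jacobian rule in card 28: if z_K = T_K(... T_1(z_0) ...) with z_0 ~ q_0(z|v), then

log q_K(z_K) = log q_0(z_0) - sum_{k=1..K} log |det(dT_k / dz_{k-1})|

so each invertible transformation adds a log-determinant correction, and the transformations are chosen so that these determinants are cheap to compute.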