Variational Autoencoder (VAE)
Let $p_X(\cdot; \theta)$ be the generative model over $\mathbb{R}^d$ of the form $p_{X}(x; \theta) = \int_z p_Z(z)p_{X\mid Z}(x \mid z; \theta) dz$, where $p_Z$ is the latent distribution over $\mathbb{R}^\ell$ (with $\ell < d$) and $p_{X\mid Z}$ is the conditional distribution, both of which are assumed to be tractable (and simple, for example, a parameterized Gaussian or Bernoulli). The objective of generative modelling is \(\begin{equation} \max_\theta {\rm LL}(\theta) = \max_\theta \mathbb{E}_{x \sim p_X} [\ln p_X(x; \theta)], \end{equation}\) where, with a slight abuse of notation, $x \sim p_X$ denotes sampling from the data distribution (in practice, the training set).
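As a concrete illustration of this two-stage generative process, here is a minimal ancestral-sampling sketch in PyTorch, assuming a standard normal prior $p_Z = \mathcal{N}(0, I_\ell)$ and a Bernoulli conditional $p_{X \mid Z}$; `decoder` is a hypothetical network (not defined here) that maps a latent $z$ to logits over $\mathbb{R}^d$:

```python
import torch

def sample_from_model(decoder, n_samples, latent_dim):
    # Draw z ~ p_Z, assumed here to be a standard normal over R^ell.
    z = torch.randn(n_samples, latent_dim)
    # Map latents to the parameters of p_{X|Z}(. | z; theta);
    # for a Bernoulli conditional, the decoder outputs logits over R^d.
    probs = torch.sigmoid(decoder(z))
    # Draw x ~ p_{X|Z}(. | z; theta).
    return torch.bernoulli(probs)
```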
For a data point $x \in \mathbb{R}^d$ and any variational distribution over the latents $q_{Z \mid X}(\cdot \mid x; \varphi)$, we have \(\begin{equation} \ln p_X(x; \theta) = {\rm ELBO}(\theta, \varphi; x) + {\rm KL}(q_{Z\mid X}(\cdot \mid x; \varphi) \Vert p_{Z \mid X}(\cdot \mid x; \theta)) \ge {\rm ELBO}(\theta, \varphi; x), \end{equation}\) which holds for any $\varphi$ due to the non-negativity of the KL divergence. Therefore, we can instead maximize the lower bound \(\begin{equation} \max_\theta {\rm LL}(\theta) \ge \max_{\theta, \varphi} {\rm ELBO}(\theta, \varphi), \qquad {\rm ELBO}(\theta, \varphi) \triangleq \mathbb{E}_{x \sim p_X}[{\rm ELBO}(\theta, \varphi; x)]. \end{equation}\)
This lower bound is called the evidence lower bound (ELBO) and is given by \(\begin{align} {\rm ELBO}(\theta, \varphi; x) &\triangleq \int_z q_{Z\mid X}(z\mid x; \varphi) \ln \left( \frac{p_{X\mid Z}(x\mid z; \theta) p_{Z}(z) }{q_{Z\mid X}(z\mid x; \varphi)}\right) dz \\ &= \mathbb{E}_{z \sim q_{Z\mid X}(\cdot \mid x; \varphi)}[\ln p_{X\mid Z}(x \mid z; \theta)] - {\rm KL}(q_{Z\mid X}(\cdot \mid x; \varphi) \Vert p_Z(\cdot)). \end{align}\)
Let the variational distribution be a Gaussian with independent components, i.e., $q_{Z \mid X}(z \mid x; \varphi) = \prod_{j=1}^\ell \mathcal{N}(z_j \mid \mu_j(x; \varphi), \sigma_j^2(x; \varphi))$, and let the prior be standard normal, $p_Z(z) = \mathcal{N}(z \mid 0, I_\ell)$. Then, the KL term can be computed analytically as
\[\begin{align} {\rm KL}(q_{Z\mid X}(\cdot \mid x; \varphi) \Vert p_Z(\cdot)) = -\frac{1}{2} \sum_{j=1}^\ell \left(1+\ln (\sigma_j^2(x; \varphi)) - \mu_j^2(x;\varphi) - \sigma_j^2(x; \varphi)\right), \end{align}\] and the first term can be written, using the reparametrization trick, as \(\begin{align} \mathbb{E}_{z \sim q_{Z\mid X}(\cdot \mid x; \varphi)}[\ln p_{X\mid Z}(x \mid z; \theta)] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(\cdot \mid 0, I_\ell)}[\ln p_{X\mid Z}(x \mid \mu(x; \varphi) +\varepsilon \odot \sigma(x; \varphi); \theta)]. \end{align}\) Therefore, the overall objective can be written as \(\begin{align} \max_{\theta, \varphi} {\rm ELBO}(\theta, \varphi) = \max_{\theta, \varphi} \mathbb{E}_{x \sim p_X, \varepsilon \sim \mathcal{N}(\cdot\mid 0, I_\ell)} \left[ \ln p_{X\mid Z}(x \mid \mu(x; \varphi) +\varepsilon \odot \sigma(x; \varphi); \theta) + \frac{1}{2} \sum_{j=1}^\ell \left(1+\ln (\sigma_j^2(x; \varphi)) - \mu_j^2(x;\varphi) - \sigma_j^2(x; \varphi)\right) \right]. \end{align}\)
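The following is a minimal sketch of a Monte-Carlo estimate of the negative ELBO for one batch, written in PyTorch under the same assumptions (Bernoulli decoder, standard normal prior); `encoder` and `decoder` are hypothetical modules, with `encoder` returning $\mu(x; \varphi)$ and $\ln \sigma^2(x; \varphi)$:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, encoder, decoder):
    # Variational parameters mu(x; phi) and log sigma^2(x; phi), each of shape (batch, ell).
    mu, log_var = encoder(x)

    # Reparametrization trick: z = mu + eps * sigma with eps ~ N(0, I_ell).
    eps = torch.randn_like(mu)
    z = mu + eps * torch.exp(0.5 * log_var)

    # Reconstruction term ln p_{X|Z}(x | z; theta) for a Bernoulli decoder over {0,1}^d.
    logits = decoder(z)
    recon_log_lik = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none"
    ).sum(dim=1)

    # Closed-form KL(q_{Z|X} || N(0, I_ell)) = -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2).
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=1)

    # Negative ELBO averaged over the batch; minimizing it maximizes the ELBO.
    return (kl - recon_log_lik).mean()
```

Gradients with respect to $\varphi$ flow through $z$ only via the deterministic map $z = \mu(x;\varphi) + \varepsilon \odot \sigma(x;\varphi)$, which is exactly what the reparametrization trick buys.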
KL of Gaussian distributions
For two univariate Gaussians, the KL divergence can be computed as
\[\begin{align} {\rm KL}(\mathcal{N}(\cdot \mid \mu_1, \sigma_1^2) \Vert \mathcal{N}(\cdot \mid \mu_2, \sigma_2^2)) &= \mathbb{E}_{z \sim \mathcal{N}(\cdot \mid \mu_1, \sigma_1^2)}\left[\ln \frac{\sigma_2}{\sigma_1} + \frac{(z-\mu_2)^2}{2\sigma_2^2}-\frac{(z-\mu_1)^2}{2\sigma_1^2} \right] \\ &= \ln \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}. \end{align}\]
Setting $\mu_2 = 0$ and $\sigma_2 = 1$ recovers the per-component KL term used in the VAE objective above.
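As a quick sanity check of this closed form, one can compare it against a Monte-Carlo estimate of the expectation above; a small NumPy sketch with arbitrarily chosen parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.3, 0.8, -0.5, 1.5  # arbitrary example parameters

# Closed form: KL(N(mu1, s1^2) || N(mu2, s2^2)).
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# Monte-Carlo estimate of E_{z ~ N(mu1, s1^2)}[ln q1(z) - ln q2(z)].
z = rng.normal(mu1, s1, size=1_000_000)
log_q1 = -np.log(s1) - 0.5 * np.log(2 * np.pi) - (z - mu1) ** 2 / (2 * s1**2)
log_q2 = -np.log(s2) - 0.5 * np.log(2 * np.pi) - (z - mu2) ** 2 / (2 * s2**2)
kl_mc = np.mean(log_q1 - log_q2)

print(kl_closed, kl_mc)  # the two estimates should closely agree
```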
Connection to Expectation-Maximization (EM)
Consider an EM algorithm to find the parameters $\theta$ that maximize the data log-likelihood \({\rm LL}(\theta) = \mathbb{E}_{x \sim p_X}[\ln p_X(x; \theta)]\).
Let $q_{Z \mid X}(\cdot \mid x; \varphi)$ be a (parameterized) distribution over the latent variables. To derive the E-step, start from the identity \(\mathbb{E}_{x \sim p_X}[\ln p_X(x; \theta)] = \mathbb{E}_{x \sim p_X, z \sim q_{Z \mid X}(\cdot \mid x; \varphi)}[\ln p_X(x; \theta)],\) which holds for any $\varphi$ because the integrand does not depend on $z$.
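One standard way to continue from this identity (a sketch of the usual EM argument, not necessarily the derivation intended here) is to write $p_X(x; \theta) = p_{X \mid Z}(x \mid z; \theta)\, p_Z(z) / p_{Z \mid X}(z \mid x; \theta)$ inside the inner expectation, which gives
\[\begin{align} \mathbb{E}_{z \sim q_{Z\mid X}(\cdot \mid x; \varphi)}[\ln p_X(x; \theta)] &= \mathbb{E}_{z \sim q_{Z\mid X}(\cdot \mid x; \varphi)}\left[\ln \frac{p_{X\mid Z}(x \mid z; \theta)\, p_Z(z)}{q_{Z\mid X}(z \mid x; \varphi)}\right] + \mathbb{E}_{z \sim q_{Z\mid X}(\cdot \mid x; \varphi)}\left[\ln \frac{q_{Z\mid X}(z \mid x; \varphi)}{p_{Z\mid X}(z \mid x; \theta)}\right] \\ &= {\rm ELBO}(\theta, \varphi; x) + {\rm KL}(q_{Z\mid X}(\cdot \mid x; \varphi) \Vert p_{Z\mid X}(\cdot \mid x; \theta)), \end{align}\]
which recovers the decomposition above. In this view, the E-step sets $q_{Z\mid X}(\cdot \mid x; \varphi) = p_{Z\mid X}(\cdot \mid x; \theta)$, making the KL term zero and the bound tight, and the M-step maximizes the resulting ELBO over $\theta$; a VAE instead optimizes $\theta$ and $\varphi$ jointly by stochastic gradient ascent on the ELBO.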