VAE Study Notes
VAE (Variational Autoencoder)
Overview
The Variational Autoencoder (VAE) is a deep generative model proposed by Kingma and Welling in 2013. Before this, traditional autoencoders (AE) were mainly used to learn low-dimensional representations of data. Their encoder-decoder structure could reconstruct inputs, but because the encoding produces a fixed deterministic vector, it cannot be used directly to generate entirely new samples. In other words, ordinary autoencoders lack constraints on the distribution of the latent space, so there is no guarantee that a point sampled at random from the latent space will decode into meaningful data — limiting autoencoders as generative models.
An ordinary AE only focuses on how accurately it compresses and restores the input, without forcing the latent representations of different samples to form any particular distribution in space. For example, the model might scatter the latent codes of similar images far apart, while codes of images from different categories might be interleaved; neighboring points in the latent space may not be semantically similar. There is also a lack of continuity, meaning the latent space may have "holes."
VAE's motivation is precisely to address these issues: combining autoencoders with probabilistic graphical models and introducing variational inference to learn the underlying distribution of the data, thereby enabling generative modeling. By imposing a distributional constraint on the encoder output (e.g., assuming the latent variable follows a standard normal distribution), VAE transforms an autoencoder into a true generative model, allowing us to generate new data by sampling latent variables. This key improvement enables VAE to learn the latent space distribution while avoiding overfitting — accurately reconstructing the original input and generating new samples similar to the training data.
Basic Principles
A traditional autoencoder mainly consists of two parts: an encoder and a decoder.
In the model above, after repeated training, the input data $x$ is ultimately encoded into a representation vector $z$, where each dimension of $z$ represents a feature learned about the data, and the value in each dimension represents how strongly $x$ expresses that feature. The decoder network then receives the values of $z$ and attempts to reconstruct the original input.
Example:
Suppose any portrait image can be uniquely determined by a few feature values such as expression, skin tone, gender, and hair style. After feeding a portrait into an autoencoder, we obtain a vector containing the image's values on features like expression and skin tone. The decoder then reconstructs the original portrait from these feature values.
In the example above, we use a single value to describe the input image's value on each latent feature. In practice, however, we may prefer to represent each latent feature as a range of possible values. For example, if the input is the Mona Lisa, setting the "smile" feature to a specific single value (asserting definitively that she is or isn't smiling) is clearly less appropriate than setting it to a range of values, where some values represent smiling and others represent not smiling. A VAE is a model that replaces single values with probability distributions over values to describe observations of features, as shown on the right side of the figure below. After encoding by a VAE, the smile feature of each image is no longer a single value as in the autoencoder, but a probability distribution.
Through this approach, we now represent each latent feature for a given input as a probability distribution. When decoding from a latent state, we randomly sample from each latent distribution to generate a vector as the input to the decoder model.
Through this encode-decode process, we effectively implement a continuous, smooth latent space representation. For all samples drawn from the latent distribution, we expect the decoder model to accurately reconstruct the input. Therefore, values that are close to each other in the latent space should also reconstruct into similar original inputs.
The above is the principle behind the VAE's construction. Let's now look at its concrete structure.
As shown above, VAE uses two neural networks to model two probability distributions:
- One for variational inference of the original input data, generating a variational probability distribution of the latent variables — called the inference network
- Another for restoring the approximate probability distribution of the original data from the generated latent variable distribution — called the generative network
Model Architecture
VAE's model architecture includes three core parts (Encoder, Decoder, and Latent Variables), each playing a distinct role in data reconstruction and generation. The VAE encoder maps input data to the parameters of a probability distribution in the latent space (rather than a single deterministic encoding), a latent variable $z$ is sampled from that distribution, and the decoder then generates output based on $z$.
Encoder
In VAE, the encoder no longer outputs a fixed hidden representation; instead, it outputs the parameters of a latent distribution (typically the mean and log-variance of a Gaussian). The encoder network is denoted $q_\phi(z|x)$, controlled by parameters $\phi$, and approximates the posterior distribution. It receives observed data $x$, extracts its main features, and maps them to the latent space, producing the parameters of the latent variable's distribution (e.g., $\mu$ and $\log\sigma^2$). The encoder plays the role of an "inference network": given data, it predicts the distribution of the latent variable $z$. This design gives the latent variable uncertainty, improving the model's generalization ability.
Latent Variable
$z$: The latent variable lies in the latent space and represents the underlying factors of data generation. Unlike the deterministic encoding of a traditional autoencoder, VAE treats the latent variable as a random variable. We typically assume the latent variable follows a simple prior distribution $p(z)$ (e.g., a standard normal distribution $\mathcal{N}(0, I)$ of dimension $d$), so the encoder learns distribution parameters that make the posterior close to the prior. Introducing a stochastic latent variable makes the model a probabilistic generative model: different values of $z$ correspond to different generated results in the data space. The probabilistic representation of the latent variable ensures the continuity and structure of the latent space, supporting the generation of new data by sampling from the latent space.
Decoder
The decoder receives the latent variable $z$ (a sampled value) as input and generates output in the original data space. The decoder defines the likelihood distribution $p_\theta(x|z)$, controlled by parameters $\theta$, and is typically implemented as a neural network. The decoder acts as a "generative network": it learns to map from the latent space back to the data space, attempting to reconstruct the original input or generate samples consistent with the training data distribution. Intuitively, the decoder "imagines" what the corresponding input might look like given $z$, implementing the data generation process.
During training, the encoder and decoder are connected through the shared latent variable $z$ to form an autoencoder structure, except that the intermediate representation becomes a probability distribution. The encoder ensures $z$ contains enough information about $x$ to reconstruct the input; the decoder attempts to accurately recover $x$ from $z$. Since we force the latent variable distribution to be close to a preset prior (e.g., standard normal), any $z$ sampled from the prior can potentially generate a reasonable sample through the decoder. This is the key distinction between VAE and ordinary autoencoders: the encoder outputs a continuous probability distribution rather than a fixed encoding. Thus, when generating new data, we can directly sample $z$ from the prior distribution and feed it to the decoder, which will produce generated samples. VAE thus achieves explicit modeling of the data distribution, becoming a true generative model.
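To make the roles above concrete, here is a minimal PyTorch sketch of an encoder/decoder pair. It is a sketch only: the 784-dimensional flattened input (e.g., 28×28 images), hidden width, 20-dimensional latent space, and all names (`VAE`, `encode`, `decode`, `reparameterize`) are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder outputs (mu, log_var), the decoder maps z back to x-space."""

    def __init__(self, x_dim: int = 784, h_dim: int = 400, z_dim: int = 20):
        super().__init__()
        # Inference network q_phi(z|x): x -> (mu, log_var)
        self.enc_hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_phi(z|x)
        # Generative network p_theta(x|z): z -> Bernoulli parameters in x-space
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def encode(self, x: torch.Tensor):
        h = self.enc_hidden(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor):
        std = torch.exp(0.5 * logvar)   # sigma = exp(log_var / 2)
        eps = torch.randn_like(std)     # eps ~ N(0, I)
        return mu + std * eps           # z = mu + sigma * eps

    def decode(self, z: torch.Tensor):
        return self.dec(z)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```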
Probabilistic Graphical Modeling and Generative Process
VAE can be viewed as a probabilistic graphical model with latent variables. From this perspective, VAE defines the relationship between observed data $x$ and latent variable $z$: the latent variable $z$ generates the observed data $x$ through the generative distribution $p_\theta(x|z)$. That is, we assume there exists a hidden random variable $z$ that generates the data $x$ through the following process: first sample a latent vector $z$ from the prior $p(z)$, then generate the observed sample $x$ according to $p_\theta(x|z)$. In graphical model notation, this is a simple directed graph: $z \rightarrow x$. This generative process can be formalized as:
- Prior distribution: $p(z)$, typically $\mathcal{N}(0, I)$, indicating our prior assumption that the latent variable's dimensions are independent standard Gaussians. Choosing a normal distribution as the prior is partly because Gaussians are easy to work with (e.g., the KL divergence has an analytic form), and partly because a sufficiently complex function can map a random variable with a simple distribution into an arbitrarily complex one.
- Conditional generative distribution: $p_\theta(x|z)$, which describes the probability distribution over the observed data given latent variable $z$. We typically choose an appropriate distribution form based on the data type (e.g., Bernoulli for binary pixels, Gaussian for continuous data) and parameterize it with the decoder network.
In summary, VAE defines a parameterized joint distribution $p_\theta(x, z) = p(z)\, p_\theta(x|z)$. The generative process samples from this joint distribution in the above order to obtain $(z, x)$ pairs, yielding new data points $x$. VAE's objective is to maximize the marginal likelihood $p_\theta(x)$ over the parameters $\theta$. The marginal probability of the observed data is:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$
This integral is typically intractable in high dimensions. Therefore, approximate inference is needed to optimize model parameters — leading to the core idea of VAE: variational inference and the Evidence Lower Bound (ELBO).
Variational Inference and the ELBO
Since the marginal likelihood $p_\theta(x)$ is difficult to compute and maximize directly, we turn to variational inference: approximate the posterior and construct a lower bound on the marginal likelihood to optimize. Specifically, for a given data point $x$, we want to estimate the posterior $p_\theta(z|x)$. But directly computing $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$ also involves the intractable $p_\theta(x)$.
VAE's elegance lies in introducing an approximate posterior $q_\phi(z|x)$ defined by the encoder, and indirectly maximizing the log-likelihood by minimizing the difference between $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$.
We measure the difference between two distributions using the Kullback-Leibler divergence (KL divergence): $D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\big[\log q(z) - \log p(z)\big]$. KL divergence is non-negative and equals 0 only when $q = p$. Applying it to the approximate and true posteriors, and rearranging with Bayes' theorem, we obtain

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$$

Since the last KL term is non-negative, dropping it gives

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
The left side is the log-likelihood we care about, and the right side is its lower bound. This inequality states that for any approximate posterior distribution $q_\phi(z|x)$, the log marginal likelihood is greater than or equal to the right-hand expression. We call the right-hand quantity the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
It is a lower bound on $\log p_\theta(x)$ for any $q_\phi$: $\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$. The two are equal if and only if the approximate posterior exactly equals the true posterior. Thus, we transform the original problem of "maximizing $\log p_\theta(x)$" into "maximizing the ELBO": by maximizing the ELBO, we simultaneously improve the data log-likelihood and drive the approximate posterior toward the true posterior.
KL Divergence and Reconstruction Error: VAE Loss Function
From the derivation above, the ELBO for a single sample contains two parts:
- Reconstruction term: $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$, the expected log-likelihood of $x$ under the decoder, reflecting the decoder's ability to reconstruct the original sample from the latent variable $z$. Maximizing the ELBO requires this term to be as large as possible, i.e., the decoder assigns as high a probability as possible to $x$ given $z$, which is equivalent to reconstructing the input as accurately as possible.
- Regularization term (KL term): $-D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$, the negative KL divergence, or equivalently a penalty term in the loss. Maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior $q_\phi(z|x)$ produced by the encoder and the prior $p(z)$. Intuitively, the KL term constrains the shape of the latent space, ensuring that the latent representations of different inputs are not scattered too far apart and that they cover the prior's support, which guarantees meaningful interpolation and random sampling in the latent space.
The VAE loss function is typically the negative ELBO (since optimization minimizes a loss):

$$\mathcal{L}_{\text{VAE}}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
This loss function consists exactly of "reconstruction error + KL regularization." By minimizing $\mathcal{L}_{\text{VAE}}$, we equivalently maximize the ELBO, teaching the model to simultaneously reconstruct inputs and regularize the latent space.
When choosing a Gaussian prior $p(z) = \mathcal{N}(0, I)$ and a diagonal Gaussian posterior $q_\phi(z|x) = \mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big)$, the KL divergence has an analytic solution. For a single dimension $i$:

$$D_{KL}\big(\mathcal{N}(\mu_i, \sigma_i^2)\,\|\,\mathcal{N}(0, 1)\big) = \frac{1}{2}\big(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\big)$$
Sum over all dimensions for the multi-dimensional case. In practice, the KL divergence loss is computed directly from this closed form, while the reconstruction term uses the appropriate loss for the output distribution (e.g., cross-entropy for Bernoulli assumptions, MSE or negative log-likelihood for Gaussian assumptions).
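As a sketch of how this loss is commonly assembled in practice, assuming a Bernoulli (binary cross-entropy) reconstruction term and the closed-form Gaussian KL above; the function name `vae_loss` and the reduction choices are illustrative, not a fixed convention:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    """Negative ELBO = reconstruction error + KL(q_phi(z|x) || N(0, I)).

    Assumes x lies in [0, 1] and the decoder outputs Bernoulli parameters,
    so the reconstruction term is binary cross-entropy summed over pixels.
    """
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL for a diagonal Gaussian against the standard normal:
    # 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return recon + kl
```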
The Reparameterization Trick
Principle
The encoder outputs the parameters of the latent variable's distribution (e.g., $\mu$ and $\sigma$, or $\log\sigma^2$), and we need to sample $z$ from this distribution to pass to the decoder. If we sample $z$ directly from $\mathcal{N}(\mu, \sigma^2)$, the sampling operation is random and non-differentiable, so gradients cannot be backpropagated to $\mu$ and $\sigma$ (and hence not to the encoder parameters $\phi$). The reparameterization trick solves this:
The core idea is to decouple the sampling of the random variable from the network parameters, "extracting" the randomness out of the network structure so gradients can flow back. Specifically, for Gaussian distributions, we introduce an auxiliary standard normal noise $\epsilon \sim \mathcal{N}(0, I)$ and express the sampling of $z$ as:

$$z = \mu + \sigma \odot \epsilon$$
Here $\mu$ and $\sigma$ are the mean and standard deviation vectors output by the encoder network, $\epsilon$ is a random noise vector independent of the model parameters (each dimension independently standard normal), and $\odot$ denotes element-wise multiplication. Through this reparameterization, $z$ is expressed as a deterministic function $z = g_\phi(x, \epsilon) = \mu(x) + \sigma(x) \odot \epsilon$: for a given input $x$, different $\epsilon$ samples yield different $z$ values, but for any specific $\epsilon$, $z$ is an explicit function of $\mu$ and $\sigma$. Therefore, an expectation over $z \sim q_\phi(z|x)$ can be converted into an expectation over $\epsilon$, allowing the derivative and the expectation to be exchanged:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z|x)}\big[f(z)\big] = \nabla_\phi\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f(\mu + \sigma \odot \epsilon)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi\, f(\mu + \sigma \odot \epsilon)\big]$$
Gradients can now be computed with respect to $\phi$ (through $\mu$ and $\sigma$) via this deterministic transformation. Intuitively, the reparameterization trick makes stochastic nodes differentiable: sampling is treated as drawing parameter-free noise and then transforming it through a differentiable mapping. This is the most critical ingredient enabling VAE to be trained end-to-end; without reparameterization, standard backpropagation could not be applied directly to the sampling step.
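As a minimal sketch of the trick in isolation (the function name and tensor shapes are illustrative), the key point is that the randomness enters only through `eps`, which does not depend on any network parameter:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2) as a differentiable function of (mu, logvar)."""
    std = torch.exp(0.5 * logvar)   # sigma, differentiable in logvar
    eps = torch.randn_like(std)     # parameter-free noise, eps ~ N(0, I)
    return mu + std * eps           # gradients flow through mu and std, not eps
```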
Full Example
Setup (from encoder output)
- Mean vector $\mu$
- Log-variance $\log\sigma^2$
Thus the standard deviation is $\sigma = \exp\!\big(\tfrac{1}{2}\log\sigma^2\big)$.
Reparameterized sampling
Sample standard normal noise $\epsilon \sim \mathcal{N}(0, I)$ and take

$$z = \mu + \sigma \odot \epsilon = \mu + \exp\!\big(\tfrac{1}{2}\log\sigma^2\big) \odot \epsilon$$

This converts the originally non-differentiable sampling into a deterministic transformation that is differentiable with respect to $\mu$ and $\log\sigma^2$.
Gradients during backpropagation
Suppose the gradient of some loss $\mathcal{L}$ with respect to $z$ is $g = \partial\mathcal{L}/\partial z$.
Since $z = \mu + \sigma \odot \epsilon$ and $\sigma = \exp\!\big(\tfrac{1}{2}\log\sigma^2\big)$:

$$\frac{\partial\mathcal{L}}{\partial\mu} = g, \qquad \frac{\partial\mathcal{L}}{\partial\sigma} = g \odot \epsilon, \qquad \frac{\partial\mathcal{L}}{\partial\log\sigma^2} = \tfrac{1}{2}\, g \odot \epsilon \odot \sigma$$

Gradients flow directly to $\mu$ and $\sigma$; this is the key to the reparameterization trick.
KL divergence numerical value
For the diagonal Gaussian $\mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big)$ vs. the standard normal $\mathcal{N}(0, I)$:

$$D_{KL} = \frac{1}{2}\sum_{i=1}^{d}\big(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\big)$$
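To make the gradient relations above concrete, here is a small autograd check; the 3-dimensional latent size and the values of `mu` and `logvar` are illustrative choices (not taken from these notes), and the toy loss is picked only so that its gradient with respect to `z` is easy to read off:

```python
import torch

# Illustrative encoder outputs for a 3-dimensional latent variable.
mu = torch.tensor([0.5, -1.0, 0.0], requires_grad=True)
logvar = torch.tensor([0.0, -0.5, 1.0], requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn(3)                  # one fixed noise sample
z = mu + std * eps                    # reparameterized sample

# Toy loss whose gradient w.r.t. z is simply g = z.
loss = 0.5 * (z ** 2).sum()
loss.backward()

g = z.detach()
print(torch.allclose(mu.grad, g))                                  # dL/dmu == g
print(torch.allclose(logvar.grad, 0.5 * g * eps * std.detach()))   # dL/dlogvar == 0.5 * g * eps * sigma

# Closed-form KL( N(mu, sigma^2) || N(0, I) ) for these illustrative values.
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
print(kl.item())
```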
Model Training
Training Process
- Initialize parameters: Randomly initialize the encoder parameters $\phi$ and decoder parameters $\theta$.
- Forward pass and sampling: For each training sample $x$, the encoder outputs $\mu$ and $\log\sigma^2$. Sample $\epsilon \sim \mathcal{N}(0, I)$ and compute $z = \mu + \sigma \odot \epsilon$. Feed $z$ to the decoder to obtain the reconstructed output distribution $p_\theta(x|z)$.
- Compute loss: Calculate reconstruction error and KL divergence; sum them to get the sample loss.
- Backpropagation: Thanks to reparameterization, gradients flow to the decoder parameters $\theta$ directly and to the encoder parameters $\phi$ via $\mu$ and $\sigma$.
- Parameter update: Use an optimizer (e.g., SGD or Adam) to update parameters.
- Iterate: Repeat over multiple epochs until convergence.
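Putting the steps together, here is a minimal end-to-end training sketch. It assumes the `VAE` module and `vae_loss` function sketched earlier in these notes; the random stand-in dataset, batch size, learning rate, and epoch count are illustrative choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in data: in practice this would be e.g. binarized MNIST images.
fake_images = torch.rand(256, 784)
train_loader = DataLoader(TensorDataset(fake_images), batch_size=32, shuffle=True)

model = VAE()                                    # encoder/decoder module sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10
for epoch in range(num_epochs):
    for (x,) in train_loader:                    # unsupervised: no labels needed
        x_recon, mu, logvar = model(x)           # forward pass: encode, sample z, decode
        loss = vae_loss(x_recon, x, mu, logvar)  # negative ELBO for the mini-batch
        optimizer.zero_grad()
        loss.backward()                          # reparameterization lets gradients reach the encoder
        optimizer.step()
```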
Sampling Strategy
During training, the expectation in the reconstruction term $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$ is approximated by Monte Carlo sampling. In practice, a single sample of $z$ per data point typically suffices: it keeps computation cheap, and because the noise differs across the samples within a mini-batch, the gradient estimate averaged over the batch still has acceptably low variance.
Generation Phase
After training, we obtain the parameters $\theta$ and $\phi$. To generate new data: sample $z$ from the prior $p(z) = \mathcal{N}(0, I)$, feed it to the trained decoder $p_\theta(x|z)$, and obtain a generated sample. Because the latent space has learned to correspond to the data distribution, samples drawn from the prior, after passing through the decoder, produce outputs similar to the training distribution.
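Under the same assumptions as the sketches above (the trained `model` and its 20-dimensional latent space), generation reduces to a few lines:

```python
import torch

model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)      # 16 latent vectors sampled from the prior N(0, I)
    samples = model.decode(z)    # decoder maps them to data space (e.g., 16 images)
```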
Applications
VAE, as a powerful generative and feature learning model, has broad applications:
- Image generation: VAE is commonly used to generate images such as handwritten digits and faces. Although pure VAE-generated images may be less sharp than GAN outputs, they have good diversity and avoid mode collapse. VAE can also achieve continuous image transformation and interpolation, e.g., linearly interpolating the latent vectors $z$ of two face images to generate intermediate transitional faces.
- Representation learning: Because the encoder compresses data into a latent variable $z$ with a standard-normal prior structure, VAE is often used for unsupervised feature extraction and dimensionality reduction. The latent space learned by VAE tends to be more structured and continuous than that of an ordinary autoencoder, which is useful for visualizing high-dimensional data or as input features for downstream tasks.
- Semi-supervised learning: VAE can naturally combine labeled and unlabeled data for semi-supervised learning. A typical approach is Conditional VAE (CVAE) or adding a class variable to the latent variable — using unlabeled data to learn data distribution structure and limited labeled data to guide the latent space distribution to correlate with categories.
- Anomaly detection: Using VAE to learn data's probability distribution enables detection of anomalous samples. Train VAE on normal data; for a new data point, if VAE cannot reconstruct it well (reconstruction error far exceeds the normal range) or its posterior distribution diverges significantly from the prior, the data likely deviates from the training distribution and is flagged as anomalous. This approach is used in industrial sensor data anomaly detection, network intrusion detection, financial fraud detection, and more.
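Following the anomaly-detection idea in the last bullet, here is a minimal scoring sketch. It assumes the `VAE` module sketched earlier and flattened inputs; scoring by per-sample reconstruction error is the point, and the decision threshold itself must be calibrated on held-out normal data.

```python
import torch
import torch.nn.functional as F

def anomaly_score(model, x):
    """Score flattened inputs by reconstruction error under a VAE trained on normal data."""
    model.eval()
    with torch.no_grad():
        x_recon, mu, logvar = model(x)
        # Per-sample reconstruction error; high values suggest the input is off-distribution.
        return F.binary_cross_entropy(x_recon, x, reduction="none").sum(dim=1)

# scores = anomaly_score(model, batch)
# flagged = scores > threshold   # threshold calibrated on normal validation data
```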
VAE is also used in data augmentation (generating synthetic samples in data-scarce scenarios), image denoising, and reinforcement learning (as a world model to assist agent decision-making).
Limitations and Improvements
Posterior Collapse
Posterior collapse (also called KL vanishing or latent variable collapse) is a notorious problem in VAE training, especially in high-dimensional sequence modeling such as NLP. It refers to the encoder's approximate posterior $q_\phi(z|x)$ degenerating almost to the prior $p(z)$ and ignoring the input information, so that the latent variable $z$ becomes nearly useless for reconstruction. This manifests as the KL divergence approaching 0 during training, with the model ignoring the latent variables and the decoder relying entirely on its own high capacity to reconstruct the input.
The intuitive cause is that the decoder is too powerful or the training strategy is inappropriate. Once in this state, the model's latent variables carry almost no information, completely losing the purpose of representation learning.
Improvement strategies:
- KL annealing: Start the KL loss weight at 0 and gradually increase it, so the model first learns to reconstruct and only then gradually applies the prior constraint (see the sketch after this list).
- Free Bits: Set a minimum contribution threshold per KL term dimension, guaranteeing the encoder transmits at least some information in each latent variable dimension.
- Adjust network capacity: Limit the decoder's capacity (fewer layers, dropout) to force it to rely on the information in $z$.
- Improved objective functions: e.g., Info-VAE, which adds extra information-based constraints to the ELBO; VAEs with skip connections; Cyclic-VAE; and others.
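As referenced in the KL-annealing item above, a minimal sketch of a linear warm-up schedule; the schedule shape and the 10,000-step warm-up length are illustrative choices:

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """Linearly anneal the KL weight from 0 to 1 over the first warmup_steps updates."""
    return min(1.0, step / warmup_steps)

# In the training loop, the weighted loss becomes:
# loss = recon + kl_weight(step) * kl
```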
Development of VAE
β-VAE
β-VAE was proposed by Higgins et al. in 2017 for learning more disentangled latent representations. The core idea is simple: introduce a weight factor $\beta$ to adjust the weight of the KL divergence term:

$$\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + \beta\, D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
When $\beta = 1$, this recovers the standard VAE. Taking $\beta > 1$ means we place greater emphasis on making the encoder output comply with the prior. By forcing the model to reconstruct data under a stricter information bottleneck, β-VAE encourages each dimension of $z$ to learn an independent latent factor. Experiments show that when the data has reasonably independent generative factors, increasing $\beta$ can cause different coordinates of $z$ to correspond to different semantic factors, achieving unsupervised factor disentanglement. The trade-off is reduced reconstruction quality: an overly large $\beta$ may cause the model to sacrifice reconstruction detail to satisfy the prior constraint. β-VAE inspired a series of subsequent studies on quantifying and improving disentangled representations.
VAE-GAN
VAE-GAN, proposed by Larsen et al. in 2015, fuses VAE and GAN to leverage the strengths of both. In addition to the standard VAE architecture, a GAN discriminator is introduced to perform adversarial training on the reconstructed and generated samples.
Specifically, VAE-GAN retains the VAE encoder and decoder. A discriminator network receives real data $x$, reconstructed data $\hat{x}$, or purely generated data $\tilde{x}$ (decoded from $z$ sampled from the prior), and is trained to distinguish real from generated. The decoder is simultaneously trained to fool the discriminator.
Benefits:
- Sharper generated samples: The discriminator pushes the decoder to generate sharper, more detailed images — significantly improving the blurriness issue of vanilla VAE outputs.
- More stable adversarial training: Because the discriminator sees both reconstructed and purely sampled outputs, and the encoder provides a structured starting point for the decoder, training is more stable than vanilla GAN.
- Preserves inference capability: Unlike pure GAN, VAE-GAN retains the encoder, so the model can still infer latent variables from given data, maintaining representation learning capability.
VAE-GAN's main drawback is greater complexity and more hyperparameters. But when successfully trained, it combines both worlds' advantages: generation quality close to GAN with VAE's inference capability and training stability.
Summary
The Variational Autoencoder (VAE), since its proposal, has developed rich theoretical and practical results. Fundamentally, VAE combines probabilistic graphical models with deep learning, introduces variational inference to approximate the posterior, and cleverly implements differentiable sampling through reparameterization — enabling gradient descent to train deep models with stochastic units. Its loss consists of reconstruction error and KL divergence, learning the data distribution while regularizing the latent space structure, making generation possible. VAE has been widely applied to generation of images, text, and more, as well as feature learning and semi-supervised tasks. Many improvements have been proposed to enhance VAE: addressing posterior collapse to fully utilize latent variables, obtaining more disentangled representations via β-VAE, improving generation quality by combining with GAN, introducing discrete latents with VQ-VAE, and more. Although compared to newer generative models, VAE has limitations such as blurry generated samples, its efficient inference mechanism, stable training process, and clear probabilistic interpretation ensure it continues to play an important role in deep generative modeling.