VAE Study Notes
VAE (Variational Autoencoder)
Overview
The Variational Autoencoder (VAE) is a deep generative model proposed by Kingma and Welling in 2013. Before this, traditional autoencoders (AE) were mainly used to learn low-dimensional representations of data. Their encoder-decoder structure could reconstruct inputs, but because the encoding produces a fixed deterministic vector, it cannot be used directly to generate entirely new samples. In other words, ordinary autoencoders lack constraints on the distribution of the latent space, so there is no guarantee that a point sampled at random from the latent space will decode into meaningful data — limiting autoencoders as generative models.
An ordinary AE only focuses on how accurately it compresses and restores the input, without forcing the latent representations of different samples to form any particular distribution in space. For example, the model might scatter the latent codes of similar images far apart, while codes of images from different categories might be interleaved; neighboring points in the latent space may not be semantically similar. There is also a lack of continuity, meaning the latent space may have "holes."
VAE's motivation is precisely to address these issues: combining autoencoders with probabilistic graphical models and introducing variational inference to learn the underlying distribution of the data, thereby enabling generative modeling. By imposing a distributional constraint on the encoder output (e.g., assuming the latent variable follows a standard normal distribution), VAE transforms an autoencoder into a true generative model, allowing us to generate new data by sampling latent variables. This key improvement enables VAE to learn the latent space distribution while avoiding overfitting — accurately reconstructing the original input and generating new samples similar to the training data.
Basic Principles
A traditional autoencoder mainly consists of two parts: an encoder and a decoder.
In the model above, after repeated training, the input data $x$ is ultimately encoded into a representation vector $z$, where each dimension of $z$ represents a feature learned about the data, and the value in each dimension represents how strongly $x$ expresses that feature. The decoder network then receives the values of $z$ and attempts to reconstruct the original input.
Example:
Suppose any portrait image can be uniquely determined by a few feature values such as expression, skin tone, gender, and hair style. After feeding a portrait into an autoencoder, we obtain a vector containing the image's values on features like expression and skin tone. The decoder then reconstructs the original portrait from these feature values.
In the example above, we use a single value to describe the input image's value on each latent feature. In practice, however, we may prefer to represent each latent feature as a range of possible values. For example, if the input is the Mona Lisa, setting the "smile" feature to a specific single value (asserting definitively that she is or isn't smiling) is clearly less appropriate than setting it to a range of values, where some values represent smiling and others represent not smiling. A VAE is a model that replaces single values with probability distributions over values to describe observations of features, as shown on the right side of the figure below. After encoding by a VAE, the smile feature of each image is no longer a single value as in the autoencoder, but a probability distribution.
Through this approach, we now represent each latent feature for a given input as a probability distribution. When decoding from a latent state, we randomly sample from each latent distribution to generate a vector as the input to the decoder model.
Through this encode-decode process, we effectively implement a continuous, smooth latent space representation. For all samples drawn from the latent distribution, we expect the decoder model to accurately reconstruct the input. Therefore, values that are close to each other in the latent space should also reconstruct into similar original inputs.
The above is the principle behind the VAE's construction. Let's now look at its concrete structure.
As shown above, VAE uses two neural networks to model two probability distributions:
- One for variational inference of the original input data, generating a variational probability distribution of the latent variables — called the inference network
- Another for restoring the approximate probability distribution of the original data from the generated latent variable distribution — called the generative network
Model Architecture
VAE's model architecture includes three core parts (Encoder, Decoder, and Latent Variables), each playing a distinct role in data reconstruction and generation. The VAE encoder maps input data to the parameters of a probability distribution in the latent space (rather than a single deterministic encoding), a latent variable $z$ is sampled from that distribution, and the decoder then generates output based on $z$.
Encoder
In VAE, the encoder no longer outputs a fixed hidden representation; instead, it outputs the parameters of a latent distribution (typically the mean and log-variance of a Gaussian). The encoder network is denoted $q_\phi(z|x)$, controlled by parameters $\phi$, and approximates the posterior distribution. It receives observed data $x$, extracts its main features, and maps them to the latent space, producing the parameters of the latent variable's distribution (e.g., $\mu$ and $\log\sigma^2$). The encoder plays the role of an "inference network": given data, it predicts the distribution of the latent variable $z$. This design gives the latent variable uncertainty, improving the model's generalization ability.
Latent Variable
$z$: The latent variable lies in the latent space and represents the underlying factors of data generation. Unlike the deterministic encoding of a traditional autoencoder, VAE treats the latent variable as a random variable. We typically assume the latent variable follows a simple prior distribution $p(z)$ (e.g., a standard normal distribution $\mathcal{N}(0, I)$ of dimension $d$), so the encoder learns distribution parameters that make the posterior close to the prior. Introducing a stochastic latent variable makes the model a probabilistic generative model: different values of $z$ correspond to different generated results in the data space. The probabilistic representation of the latent variable ensures the continuity and structure of the latent space, supporting the generation of new data by sampling from the latent space.
Decoder
The decoder receives the latent variable $z$ (a sampled value) as input and generates output in the original data space. The decoder defines the likelihood distribution $p_\theta(x|z)$, controlled by parameters $\theta$, and is typically implemented as a neural network. The decoder acts as a "generative network": it learns to map from the latent space back to the data space, attempting to reconstruct the original input or generate samples consistent with the training data distribution. Intuitively, the decoder "imagines" what the corresponding input might look like given $z$, implementing the data generation process.
During training, the encoder and decoder are connected through the shared latent variable $z$ to form an autoencoder structure, except that the intermediate representation becomes a probability distribution. The encoder ensures $z$ contains enough information about $x$ to reconstruct the input; the decoder attempts to accurately recover $x$ from $z$. Since we force the latent variable distribution to be close to a preset prior (e.g., standard normal), any $z$ sampled from the prior can potentially generate a reasonable sample through the decoder. This is the key distinction between VAE and ordinary autoencoders: the encoder outputs a continuous probability distribution rather than a fixed encoding. Thus, when generating new data, we can directly sample $z$ from the prior distribution and feed it to the decoder, which will produce generated samples. VAE thus achieves explicit modeling of the data distribution, becoming a true generative model.
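To make the roles above concrete, here is a minimal PyTorch sketch of an encoder/decoder pair. It is a sketch only: the 784-dimensional flattened input (e.g., 28×28 images), hidden width, 20-dimensional latent space, and all names (`VAE`, `encode`, `decode`, `reparameterize`) are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder outputs (mu, log_var), the decoder maps z back to x-space."""

    def __init__(self, x_dim: int = 784, h_dim: int = 400, z_dim: int = 20):
        super().__init__()
        # Inference network q_phi(z|x): x -> (mu, log_var)
        self.enc_hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_phi(z|x)
        # Generative network p_theta(x|z): z -> Bernoulli parameters in x-space
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def encode(self, x: torch.Tensor):
        h = self.enc_hidden(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor):
        std = torch.exp(0.5 * logvar)   # sigma = exp(log_var / 2)
        eps = torch.randn_like(std)     # eps ~ N(0, I)
        return mu + std * eps           # z = mu + sigma * eps

    def decode(self, z: torch.Tensor):
        return self.dec(z)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```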
Probabilistic Graphical Modeling and Generative Process
VAE can be viewed as a probabilistic graphical model with latent variables. From this perspective, VAE defines the relationship between observed data $x$ and latent variable $z$: the latent variable $z$ generates the observed data $x$ through the generative distribution $p_\theta(x|z)$. That is, we assume there exists a hidden random variable $z$ that generates the data $x$ through the following process: first sample a latent vector $z$ from the prior $p(z)$, then generate the observed sample $x$ according to $p_\theta(x|z)$. In graphical model notation, this is a simple directed graph: $z \rightarrow x$. This generative process can be formalized as:
- Prior distribution: $p(z)$, typically $\mathcal{N}(0, I)$, indicating our prior assumption that the latent variable's dimensions are independent standard Gaussians. Choosing a normal distribution as the prior is partly because Gaussians are easy to work with (e.g., the KL divergence has an analytic form), and partly because a sufficiently complex function can map a random variable with a simple distribution into an arbitrarily complex one.
- Conditional generative distribution: $p_\theta(x|z)$, which describes the probability distribution over the observed data given latent variable $z$. We typically choose an appropriate distribution form based on the data type (e.g., Bernoulli for binary pixels, Gaussian for continuous data) and parameterize it with the decoder network.
In summary, VAE defines a parameterized joint distribution $p_\theta(x, z) = p(z)\, p_\theta(x|z)$. The generative process samples from this joint distribution in the above order to obtain $(z, x)$ pairs, yielding new data points $x$. VAE's objective is to maximize the marginal likelihood $p_\theta(x)$ over the parameters $\theta$. The marginal probability of the observed data is:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$
This integral is typically intractable in high dimensions. Therefore, approximate inference is needed to optimize model parameters — leading to the core idea of VAE: variational inference and the Evidence Lower Bound (ELBO).
Variational Inference and the ELBO
Since the marginal likelihood $p_\theta(x)$ is difficult to compute and maximize directly, we turn to variational inference: approximate the posterior and construct a lower bound on the marginal likelihood to optimize. Specifically, for a given data point $x$, we want to estimate the posterior $p_\theta(z|x)$. But directly computing $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$ also involves the intractable $p_\theta(x)$.
VAE's elegance lies in introducing an approximate posterior $q_\phi(z|x)$ defined by the encoder, and indirectly maximizing the log-likelihood by minimizing the difference between $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$.
We measure the difference between two distributions using the Kullback-Leibler divergence (KL divergence): $D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\big[\log q(z) - \log p(z)\big]$. KL divergence is non-negative and equals 0 only when $q = p$. Applying it to the approximate and true posteriors, and rearranging with Bayes' theorem, we obtain

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$$

Since the last KL term is non-negative, dropping it gives

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
The left side is the log-likelihood we care about, and the right side is its lower bound. This inequality states that for any approximate posterior distribution $q_\phi(z|x)$, the log marginal likelihood is greater than or equal to the right-hand expression. We call the right-hand quantity the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
It is a lower bound on $\log p_\theta(x)$ for any $q_\phi$: $\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$. The two are equal if and only if the approximate posterior exactly equals the true posterior. Thus, we transform the original problem of "maximizing $\log p_\theta(x)$" into "maximizing the ELBO": by maximizing the ELBO, we simultaneously improve the data log-likelihood and drive the approximate posterior toward the true posterior.
KL Divergence and Reconstruction Error: VAE Loss Function
From the derivation above, the ELBO for a single sample contains two parts:
- Reconstruction term: $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$, the expected log-likelihood of $x$ under the decoder, reflecting the decoder's ability to reconstruct the original sample from the latent variable $z$. Maximizing the ELBO requires this term to be as large as possible, i.e., the decoder assigns as high a probability as possible to $x$ given $z$, which is equivalent to reconstructing the input as accurately as possible.
- Regularization term (KL term): $-D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$, the negative KL divergence, or equivalently a penalty term in the loss. Maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior $q_\phi(z|x)$ produced by the encoder and the prior $p(z)$. Intuitively, the KL term constrains the shape of the latent space, ensuring that the latent representations of different inputs are not scattered too far apart and that they cover the prior's support, which guarantees meaningful interpolation and random sampling in the latent space.
The VAE loss function is typically the negative ELBO (since optimization minimizes a loss):

$$\mathcal{L}_{\text{VAE}}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
This loss function consists exactly of "reconstruction error + KL regularization." By minimizing $\mathcal{L}_{\text{VAE}}$, we equivalently maximize the ELBO, teaching the model to simultaneously reconstruct inputs and regularize the latent space.
When choosing a Gaussian prior $p(z) = \mathcal{N}(0, I)$ and a diagonal Gaussian posterior $q_\phi(z|x) = \mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big)$, the KL divergence has an analytic solution. For a single dimension $i$:

$$D_{KL}\big(\mathcal{N}(\mu_i, \sigma_i^2)\,\|\,\mathcal{N}(0, 1)\big) = \frac{1}{2}\big(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\big)$$
Sum over all dimensions for the multi-dimensional case. In practice, the KL divergence loss is computed directly from this closed form, while the reconstruction term uses the appropriate loss for the output distribution (e.g., cross-entropy for Bernoulli assumptions, MSE or negative log-likelihood for Gaussian assumptions).
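As a sketch of how this loss is commonly assembled in practice, assuming a Bernoulli (binary cross-entropy) reconstruction term and the closed-form Gaussian KL above; the function name `vae_loss` and the reduction choices are illustrative, not a fixed convention:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    """Negative ELBO = reconstruction error + KL(q_phi(z|x) || N(0, I)).

    Assumes x lies in [0, 1] and the decoder outputs Bernoulli parameters,
    so the reconstruction term is binary cross-entropy summed over pixels.
    """
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL for a diagonal Gaussian against the standard normal:
    # 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return recon + kl
```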
The Reparameterization Trick
Principle
The encoder outputs the parameters of the latent variable's distribution (e.g., $\mu$ and $\sigma$, or $\log\sigma^2$), and we need to sample $z$ from this distribution to pass to the decoder. If we sample $z$ directly from $\mathcal{N}(\mu, \sigma^2)$, the sampling operation is random and non-differentiable, so gradients cannot be backpropagated to $\mu$ and $\sigma$ (and hence not to the encoder parameters $\phi$). The reparameterization trick solves this:
The core idea is to decouple the sampling of the random variable from the network parameters, "extracting" the randomness out of the network structure so gradients can flow back. Specifically, for Gaussian distributions, we introduce an auxiliary standard normal noise $\epsilon \sim \mathcal{N}(0, I)$ and express the sampling of $z$ as:

$$z = \mu + \sigma \odot \epsilon$$
Here $\mu$ and $\sigma$ are the mean and standard deviation vectors output by the encoder network, $\epsilon$ is a random noise vector independent of the model parameters (each dimension independently standard normal), and $\odot$ denotes element-wise multiplication. Through this reparameterization, $z$ is expressed as a deterministic function $z = g_\phi(x, \epsilon) = \mu(x) + \sigma(x) \odot \epsilon$: for a given input $x$, different $\epsilon$ samples yield different $z$ values, but for any specific $\epsilon$, $z$ is an explicit function of $\mu$ and $\sigma$. Therefore, an expectation over $z \sim q_\phi(z|x)$ can be converted into an expectation over $\epsilon$, allowing the derivative and the expectation to be exchanged:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z|x)}\big[f(z)\big] = \nabla_\phi\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f(\mu + \sigma \odot \epsilon)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi\, f(\mu + \sigma \odot \epsilon)\big]$$
Gradients can now be computed with respect to $\phi$ (through $\mu$ and $\sigma$) via this deterministic transformation. Intuitively, the reparameterization trick makes stochastic nodes differentiable: sampling is treated as drawing parameter-free noise and then transforming it through a differentiable mapping. This is the most critical ingredient enabling VAE to be trained end-to-end; without reparameterization, standard backpropagation could not be applied directly to the sampling step.
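As a minimal sketch of the trick in isolation (the function name and tensor shapes are illustrative), the key point is that the randomness enters only through `eps`, which does not depend on any network parameter:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2) as a differentiable function of (mu, logvar)."""
    std = torch.exp(0.5 * logvar)   # sigma, differentiable in logvar
    eps = torch.randn_like(std)     # parameter-free noise, eps ~ N(0, I)
    return mu + std * eps           # gradients flow through mu and std, not eps
```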
Full Example
Setup (from encoder output)
- Mean vector $\mu$
- Log-variance $\log\sigma^2$
Thus the standard deviation is $\sigma = \exp\!\big(\tfrac{1}{2}\log\sigma^2\big)$.
Reparameterized sampling
Sample standard normal noise $\epsilon \sim \mathcal{N}(0, I)$ and take

$$z = \mu + \sigma \odot \epsilon = \mu + \exp\!\big(\tfrac{1}{2}\log\sigma^2\big) \odot \epsilon$$

This converts the originally non-differentiable sampling into a deterministic transformation that is differentiable with respect to $\mu$ and $\log\sigma^2$.
Gradients during backpropagation
Suppose the gradient of some loss $\mathcal{L}$ with respect to $z$ is $g = \partial\mathcal{L}/\partial z$.
Since $z = \mu + \sigma \odot \epsilon$ and $\sigma = \exp\!\big(\tfrac{1}{2}\log\sigma^2\big)$:

$$\frac{\partial\mathcal{L}}{\partial\mu} = g, \qquad \frac{\partial\mathcal{L}}{\partial\sigma} = g \odot \epsilon, \qquad \frac{\partial\mathcal{L}}{\partial\log\sigma^2} = \tfrac{1}{2}\, g \odot \epsilon \odot \sigma$$

Gradients flow directly to $\mu$ and $\sigma$; this is the key to the reparameterization trick.
KL divergence numerical value
For the diagonal Gaussian $\mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big)$ vs. the standard normal $\mathcal{N}(0, I)$:

$$D_{KL} = \frac{1}{2}\sum_{i=1}^{d}\big(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\big)$$
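To make the gradient relations above concrete, here is a small autograd check; the 3-dimensional latent size and the values of `mu` and `logvar` are illustrative choices (not taken from these notes), and the toy loss is picked only so that its gradient with respect to `z` is easy to read off:

```python
import torch

# Illustrative encoder outputs for a 3-dimensional latent variable.
mu = torch.tensor([0.5, -1.0, 0.0], requires_grad=True)
logvar = torch.tensor([0.0, -0.5, 1.0], requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn(3)                  # one fixed noise sample
z = mu + std * eps                    # reparameterized sample

# Toy loss whose gradient w.r.t. z is simply g = z.
loss = 0.5 * (z ** 2).sum()
loss.backward()

g = z.detach()
print(torch.allclose(mu.grad, g))                                  # dL/dmu == g
print(torch.allclose(logvar.grad, 0.5 * g * eps * std.detach()))   # dL/dlogvar == 0.5 * g * eps * sigma

# Closed-form KL( N(mu, sigma^2) || N(0, I) ) for these illustrative values.
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
print(kl.item())
```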
Model Training
Training Process
- Initialize parameters: Randomly initialize the encoder parameters $\phi$ and decoder parameters $\theta$.
- Forward pass and sampling: For each training sample $x$, the encoder outputs $\mu$ and $\log\sigma^2$. Sample $\epsilon \sim \mathcal{N}(0, I)$ and compute $z = \mu + \sigma \odot \epsilon$. Feed $z$ to the decoder to obtain the reconstructed output distribution $p_\theta(x|z)$.
- Compute loss: Calculate reconstruction error and KL divergence; sum them to get the sample loss.
- Backpropagation: Thanks to reparameterization, gradients flow to the decoder parameters $\theta$ directly and to the encoder parameters $\phi$ via $\mu$ and $\sigma$.
- Parameter update: Use an optimizer (e.g., SGD or Adam) to update parameters.
- Iterate: Repeat over multiple epochs until convergence.
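Putting the steps together, here is a minimal end-to-end training sketch. It assumes the `VAE` module and `vae_loss` function sketched earlier in these notes; the random stand-in dataset, batch size, learning rate, and epoch count are illustrative choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in data: in practice this would be e.g. binarized MNIST images.
fake_images = torch.rand(256, 784)
train_loader = DataLoader(TensorDataset(fake_images), batch_size=32, shuffle=True)

model = VAE()                                    # encoder/decoder module sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10
for epoch in range(num_epochs):
    for (x,) in train_loader:                    # unsupervised: no labels needed
        x_recon, mu, logvar = model(x)           # forward pass: encode, sample z, decode
        loss = vae_loss(x_recon, x, mu, logvar)  # negative ELBO for the mini-batch
        optimizer.zero_grad()
        loss.backward()                          # reparameterization lets gradients reach the encoder
        optimizer.step()
```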
Sampling Strategy
During training, the expectation in the reconstruction term $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$ is approximated by Monte Carlo sampling. In practice, a single sample of $z$ per data point typically suffices: it keeps computation cheap, and because the noise differs across the samples within a mini-batch, the gradient estimate averaged over the batch still has acceptably low variance.
Generation Phase
After training, we obtain the parameters $\theta$ and $\phi$. To generate new data: sample $z$ from the prior $p(z) = \mathcal{N}(0, I)$, feed it to the trained decoder $p_\theta(x|z)$, and obtain a generated sample. Because the latent space has learned to correspond to the data distribution, samples drawn from the prior, after passing through the decoder, produce outputs similar to the training distribution.
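Under the same assumptions as the sketches above (the trained `model` and its 20-dimensional latent space), generation reduces to a few lines:

```python
import torch

model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)      # 16 latent vectors sampled from the prior N(0, I)
    samples = model.decode(z)    # decoder maps them to data space (e.g., 16 images)
```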
Applications
VAE, as a powerful generative and feature learning model, has broad applications:
- Image generation: VAE is commonly used to generate images such as handwritten digits and faces. Although pure VAE-generated images may be less sharp than GAN outputs, they have good diversity and avoid mode collapse. VAE can also achieve continuous image transformation and interpolation, e.g., linearly interpolating the latent vectors $z$ of two face images to generate intermediate transitional faces.
- Representation learning: Because the encoder compresses data into a latent variable $z$ with a standard-normal prior structure, VAE is often used for unsupervised feature extraction and dimensionality reduction. The latent space learned by VAE tends to be more structured and continuous than that of an ordinary autoencoder, which is useful for visualizing high-dimensional data or as input features for downstream tasks.
- Semi-supervised learning: VAE can naturally combine labeled and unlabeled data for semi-supervised learning. A typical approach is Conditional VAE (CVAE) or adding a class variable to the latent variable — using unlabeled data to learn data distribution structure and limited labeled data to guide the latent space distribution to correlate with categories.
- Anomaly detection: Using VAE to learn data's probability distribution enables detection of anomalous samples. Train VAE on normal data; for a new data point, if VAE cannot reconstruct it well (reconstruction error far exceeds the normal range) or its posterior distribution diverges significantly from the prior, the data likely deviates from the training distribution and is flagged as anomalous. This approach is used in industrial sensor data anomaly detection, network intrusion detection, financial fraud detection, and more.
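Following the anomaly-detection idea in the last bullet, here is a minimal scoring sketch. It assumes the `VAE` module sketched earlier and flattened inputs; scoring by per-sample reconstruction error is the point, and the decision threshold itself must be calibrated on held-out normal data.

```python
import torch
import torch.nn.functional as F

def anomaly_score(model, x):
    """Score flattened inputs by reconstruction error under a VAE trained on normal data."""
    model.eval()
    with torch.no_grad():
        x_recon, mu, logvar = model(x)
        # Per-sample reconstruction error; high values suggest the input is off-distribution.
        return F.binary_cross_entropy(x_recon, x, reduction="none").sum(dim=1)

# scores = anomaly_score(model, batch)
# flagged = scores > threshold   # threshold calibrated on normal validation data
```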
VAE is also used in data augmentation (generating synthetic samples in data-scarce scenarios), image denoising, and reinforcement learning (as a world model to assist agent decision-making).
Limitations and Improvements
Posterior Collapse
Posterior collapse (also called KL vanishing or latent variable collapse) is a notorious problem in VAE training, especially in high-dimensional sequence modeling such as NLP. It refers to the encoder's approximate posterior $q_\phi(z|x)$ degenerating almost to the prior $p(z)$ and ignoring the input information, so that the latent variable $z$ becomes nearly useless for reconstruction. This manifests as the KL divergence approaching 0 during training, with the model ignoring the latent variables and the decoder relying entirely on its own high capacity to reconstruct the input.
The intuitive cause is that the decoder is too powerful or the training strategy is inappropriate. Once in this state, the model's latent variables carry almost no information, completely losing the purpose of representation learning.
Improvement strategies:
- KL annealing: Start the KL loss weight at 0 and gradually increase it, so the model first learns to reconstruct and only then gradually applies the prior constraint (see the sketch after this list).
- Free Bits: Set a minimum contribution threshold per KL term dimension, guaranteeing the encoder transmits at least some information in each latent variable dimension.
- Adjust network capacity: Limit the decoder's capacity (fewer layers, dropout) to force it to rely on the information in $z$.
- Improved objective functions: e.g., Info-VAE, which adds extra information-based constraints to the ELBO; VAEs with skip connections; Cyclic-VAE; and others.
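As referenced in the KL-annealing item above, a minimal sketch of a linear warm-up schedule; the schedule shape and the 10,000-step warm-up length are illustrative choices:

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """Linearly anneal the KL weight from 0 to 1 over the first warmup_steps updates."""
    return min(1.0, step / warmup_steps)

# In the training loop, the weighted loss becomes:
# loss = recon + kl_weight(step) * kl
```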
Development of VAE
β-VAE
β-VAE was proposed by Higgins et al. in 2017 for learning more disentangled latent representations. The core idea is simple: introduce a weight factor $\beta$ to adjust the weight of the KL divergence term:

$$\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + \beta\, D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
When $\beta = 1$, this recovers the standard VAE. Taking $\beta > 1$ means we place greater emphasis on making the encoder output comply with the prior. By forcing the model to reconstruct data under a stricter information bottleneck, β-VAE encourages each dimension of $z$ to learn an independent latent factor. Experiments show that when the data has reasonably independent generative factors, increasing $\beta$ can cause different coordinates of $z$ to correspond to different semantic factors, achieving unsupervised factor disentanglement. The trade-off is reduced reconstruction quality: an overly large $\beta$ may cause the model to sacrifice reconstruction detail to satisfy the prior constraint. β-VAE inspired a series of subsequent studies on quantifying and improving disentangled representations.
VAE-GAN
VAE-GAN, proposed by Larsen et al. in 2015, fuses VAE and GAN to leverage the strengths of both. In addition to the standard VAE architecture, a GAN discriminator is introduced to perform adversarial training on the reconstructed and generated samples.
Specifically, VAE-GAN retains the VAE encoder and decoder. A discriminator network receives real data $x$, reconstructed data $\hat{x}$, or purely generated data $\tilde{x}$ (decoded from $z$ sampled from the prior), and is trained to distinguish real from generated. The decoder is simultaneously trained to fool the discriminator.
Benefits:
- Sharper generated samples: The discriminator pushes the decoder to generate sharper, more detailed images — significantly improving the blurriness issue of vanilla VAE outputs.
- More stable adversarial training: Because the discriminator sees both reconstructed and purely sampled outputs, and the encoder provides a structured starting point for the decoder, training is more stable than vanilla GAN.
- Preserves inference capability: Unlike pure GAN, VAE-GAN retains the encoder, so the model can still infer latent variables from given data, maintaining representation learning capability.
VAE-GAN's main drawback is greater complexity and more hyperparameters. But when successfully trained, it combines both worlds' advantages: generation quality close to GAN with VAE's inference capability and training stability.
Summary
The Variational Autoencoder (VAE), since its proposal, has developed rich theoretical and practical results. Fundamentally, VAE combines probabilistic graphical models with deep learning, introduces variational inference to approximate the posterior, and cleverly implements differentiable sampling through reparameterization — enabling gradient descent to train deep models with stochastic units. Its loss consists of reconstruction error and KL divergence, learning the data distribution while regularizing the latent space structure, making generation possible. VAE has been widely applied to generation of images, text, and more, as well as feature learning and semi-supervised tasks. Many improvements have been proposed to enhance VAE: addressing posterior collapse to fully utilize latent variables, obtaining more disentangled representations via β-VAE, improving generation quality by combining with GAN, introducing discrete latents with VQ-VAE, and more. Although compared to newer generative models, VAE has limitations such as blurry generated samples, its efficient inference mechanism, stable training process, and clear probabilistic interpretation ensure it continues to play an important role in deep generative modeling.