This is Part 1 of the Flow Matching Series
Introduction
How do modern generative models generate photorealistic images, videos, or protein structures? At their core, they solve a deceptively simple problem: transform random noise into samples from a target data distribution. However, among the many blog posts and learning resources, few seem to provide depth while connecting the dots between normalizing flows, continuous normalizing flows, and flow matching. The goal of this series is to build up the theory and technical implementation of these methods from their foundations. I also hope to educate myself on these topics as we go along.
In this first post, I want to establish the mathematical foundations rigorously. We’ll start with a concrete example (planar flows), see why they need to be stacked, discover this leads naturally to a differential equation, and understand why the continuous version works better.
Generative Modeling as Sampling
Suppose we want to generate images of cats. Any image can be represented as a vector:
$$x \in \mathbb{R}^d$$
where $d$ is the dimensionality (e.g., for an RGB image of height $H$ and width $W$, $d = 3HW$). Not all vectors in $\mathbb{R}^d$ correspond to "good" images of cats. There exists an unknown distribution $p_{\text{data}}$ that assigns high probability to realistic cat images and low probability to noise.
Generative modeling is learning to sample from this target distribution:
$$x \sim p_{\text{data}}$$
Here we face two fundamental challenges:
- We don't have a formula for $p_{\text{data}}$, only samples (e.g., a dataset of cat images)
- Even if we knew $p_{\text{data}}$, sampling from it directly might be intractable
Transport Map Approach
Start with a simple distribution we can sample from:
$$z \sim p_0 = \mathcal{N}(0, I)$$
Then learn a transformation $T: \mathbb{R}^d \to \mathbb{R}^d$ such that:
$$x = T(z) \sim p_{\text{data}}$$
We denote the distribution of $T(z)$ as the pushforward: $T_\# p_0$.
Figure 1: The transport map approach. We learn a transformation $T$ that maps samples from a simple base distribution $p_0$ (e.g., Gaussian) to the complex target distribution $p_{\text{data}}$. The pushforward $T_\# p_0$ represents the distribution of $x = T(z)$ when $z \sim p_0$.
Generating samples from the target distribution then becomes:
- Sample $z \sim p_0$
- Deterministically compute $x = T(z)$
- Output $x$ (a sample from the learned distribution)
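As a minimal sketch of this three-step recipe, here is a hand-picked transport map in NumPy (a learned $T$ would play the same role for a complex target; the affine map below is an illustrative assumption):

```python
import numpy as np

# Toy transport map: T(z) = 2z + 3 pushes the base N(0, 1) forward to
# N(3, 4). A learned T plays the same role for a complex target.
rng = np.random.default_rng(0)

def T(z):
    return 2.0 * z + 3.0

z = rng.standard_normal(100_000)   # 1. sample from the base distribution
x = T(z)                           # 2. deterministically transform
print(x.mean(), x.std())           # 3. x is a sample from the pushforward
```

The sample mean and standard deviation come out near 3 and 2, confirming the pushforward is $\mathcal{N}(3, 4)$.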
For this to work, we need to answer:
- What form should $T$ take? (Architecture)
- How do we train $T$? (Objective function)
- How do we evaluate the model density $p_\theta(x)$? (For computing likelihoods)
How Do We Train This Transformation?
To learn $T$, we need an objective function. The most natural choice is maximum likelihood (Dinh et al., 2015; Rezende & Mohamed, 2015): make the model distribution match the data distribution by maximizing:
$$\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]$$
where $p_\theta$ is the density induced by the learned transformation $T_\theta$, parameterized by $\theta$.
But if $x = T_\theta(z)$ where $z \sim p_0$, how do we compute $p_\theta(x)$?
This requires the change of variables formula.
Intuition: Transformations Change Densities
Consider a simple 1D example: $z \sim \mathcal{N}(0, 1)$ and $x = T(z) = 2z$ (stretching by a factor of 2).
- The transformation spreads samples apart
- If samples spread apart, probability density must decrease to keep the total probability equal to 1
- Specifically: $p_x(x) = \frac{1}{2}\, p_z\!\left(\frac{x}{2}\right)$
The factor $\frac{1}{2}$ corrects for the volume change. In general dimensions, the Jacobian determinant performs this correction.
Figure 2: Intuition for change of variables in 1D. When $z$ is stretched by a factor of 2 ($x = 2z$), the density must decrease by a factor of 2 to preserve total probability mass. The interval $[a, b]$ becomes $[2a, 2b]$, so the height must be halved.
Change of Variables
Theorem (Change of Variables): If $z \sim p_0$ and $x = T(z)$ where $T$ is a diffeomorphism (a smooth bijection with differentiable inverse $T^{-1}$), then:
$$p_x(x) = p_0\!\left(T^{-1}(x)\right) \left|\det J_{T^{-1}}(x)\right|$$
where $J_T$ is the Jacobian matrix:
$$[J_T(z)]_{ij} = \frac{\partial T_i(z)}{\partial z_j}$$
Equivalently, using $J_T$ instead of $J_{T^{-1}}$:
$$p_x(x) = p_0(z)\, \left|\det J_T(z)\right|^{-1}, \qquad z = T^{-1}(x) \tag{6}$$
The determinant measures how $T$ locally scales volumes:
- If $\left|\det J_T(z)\right| > 1$: $T$ expands space near $z$, so densities must decrease
- If $\left|\det J_T(z)\right| < 1$: $T$ contracts space near $z$, so densities must increase
This ensures probability mass is conserved: $\int p_x(x)\, dx = 1$.
For maximum likelihood, we work with log-densities. Taking the logarithm of Equation 6:
$$\log p_x(x) = \log p_0\!\left(T^{-1}(x)\right) - \log\left|\det J_T\!\left(T^{-1}(x)\right)\right| \tag{8}$$
Given data $\{x^{(i)}\}_{i=1}^N$, we can train with maximum likelihood via:
$$\max_\theta \; \sum_{i=1}^N \left[\log p_0\!\left(T_\theta^{-1}(x^{(i)})\right) - \log\left|\det J_{T_\theta}\!\left(T_\theta^{-1}(x^{(i)})\right)\right|\right] \tag{9}$$
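Before worrying about architectures, we can sanity-check the change of variables numerically on the 1D stretching example, where $p_x(x) = \frac{1}{2} p_z(x/2)$:

```python
import numpy as np

# Change of variables for T(z) = 2z: p_x(x) = p_z(x/2) * 1/2.
# The |det| correction (here the factor 1/2) keeps the total mass at 1.
def p_z(z):                      # standard normal density
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def p_x(x):                      # density after pushing through T(z) = 2z
    return 0.5 * p_z(x / 2.0)

xs, dx = np.linspace(-12, 12, 48_001, retstep=True)
mass = np.sum(p_x(xs)) * dx      # Riemann-sum approximation of the integral
print(mass)                      # approximately 1.0
```

Dropping the $\frac{1}{2}$ factor would make the total mass come out to 2, which is exactly the bookkeeping error the Jacobian term prevents.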
In practice, optimizing the objective in Equation 9 with a neural-network-parameterized $T_\theta$ imposes the following constraints:
- Invertibility: $T_\theta^{-1}$ needs to map data back to latent space
- Efficient inverse: Computing $T_\theta^{-1}(x)$ must be tractable
- Efficient Jacobian: Computing $\log\left|\det J_{T_\theta}\right|$ must be tractable
Normalizing Flows: Making Jacobians Tractable
For general neural networks, the Jacobian is a $d \times d$ matrix. Computing its determinant naively requires $O(d^3)$ operations (e.g., via LU decomposition). For images, where $d$ easily reaches the hundreds of thousands, this becomes completely intractable.
Dinh et al. (2015) and Rezende & Mohamed (2015) introduced normalizing flows with special architectures where the Jacobian determinant is cheap to compute.
Strategy: Triangular Jacobians
If $J_T$ is triangular (upper or lower), then:
$$\det J_T = \prod_{i=1}^d [J_T]_{ii} \tag{10}$$
Computing Equation 10 requires only $O(d)$ operations. This is a fundamental property of determinants of triangular matrices.
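A quick numerical illustration, with a random lower-triangular matrix standing in for $J_T$:

```python
import numpy as np

# For a triangular Jacobian, det is the product of the diagonal, so the
# log-det is a sum of scalar logs: O(d) instead of O(d^3).
rng = np.random.default_rng(0)
d = 5
J = np.tril(rng.standard_normal((d, d)))       # lower-triangular "Jacobian"
np.fill_diagonal(J, np.abs(np.diag(J)) + 0.5)  # keep the diagonal positive

logdet_cheap = np.sum(np.log(np.diag(J)))      # O(d)
sign, logdet_full = np.linalg.slogdet(J)       # O(d^3) dense reference
print(np.isclose(logdet_cheap, logdet_full))   # True
```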
Affine Coupling Layers (RealNVP)
Idea: Make the Jacobian triangular so $\log\left|\det J\right| = \sum_i \log\left|[J]_{ii}\right|$.
Dinh et al. (2017) introduced the affine coupling layer architecture. Split the input $x = [x_a, x_b]$ where $x_a, x_b \in \mathbb{R}^{d/2}$:
$$y_a = x_a, \qquad y_b = x_b \odot \exp\!\left(s(x_a)\right) + t(x_a)$$
where $s, t: \mathbb{R}^{d/2} \to \mathbb{R}^{d/2}$ are unconstrained neural networks, and $\odot$ is element-wise multiplication.
Why this works:
- Explicit inverse: $x_a = y_a$, $\; x_b = \left(y_b - t(y_a)\right) \odot \exp\!\left(-s(y_a)\right)$
- Triangular Jacobian: $J = \begin{pmatrix} I & 0 \\ \frac{\partial y_b}{\partial x_a} & \mathrm{diag}\!\left(\exp(s(x_a))\right) \end{pmatrix}$
- Cheap log-det using Equation 10: $\log\left|\det J\right| = \sum_i s(x_a)_i$
Now we just sum the network outputs, with no matrix operations. However, a single coupling layer leaves $x_a$ unchanged (half the dimensions are untransformed).
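Here is a minimal sketch of one coupling layer in NumPy, with tiny tanh/linear stand-ins for the networks $s$ and $t$ (the stand-in architectures are an assumption for illustration only):

```python
import numpy as np

# One affine coupling layer on x in R^4, split as x = [x_a, x_b] with
# x_a, x_b in R^2. s and t are toy stand-ins for the real networks.
rng = np.random.default_rng(0)
W_s = rng.standard_normal((2, 2)) * 0.5
W_t = rng.standard_normal((2, 2)) * 0.5

def s(x_a):                      # "log-scale network"
    return np.tanh(x_a @ W_s)

def t(x_a):                      # "shift network"
    return x_a @ W_t

def forward(x):
    x_a, x_b = x[:2], x[2:]
    y_b = x_b * np.exp(s(x_a)) + t(x_a)
    log_det = np.sum(s(x_a))     # log|det J| = sum of log-scales, O(d)
    return np.concatenate([x_a, y_b]), log_det

def inverse(y):
    y_a, y_b = y[:2], y[2:]
    x_b = (y_b - t(y_a)) * np.exp(-s(y_a))
    return np.concatenate([y_a, x_b])

x = rng.standard_normal(4)
y, log_det = forward(x)
print(np.allclose(inverse(y), x))   # exact inverse, no iterative solver
```

Note that $s$ and $t$ are never inverted themselves; invertibility comes entirely from the coupling structure.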
To transform all dimensions, we must stack multiple layers with alternating masks:
Layer 1: transform dimensions [2,4,6,...], freeze [1,3,5,...]
Layer 2: transform dimensions [1,3,5,...], freeze [2,4,6,...]
Layer 3: transform dimensions [2,4,6,...], freeze [1,3,5,...]
...
The dilemma:
- Few layers → limited expressiveness
- Many layers → expensive computation, harder to train
Can we do better?
Residual Flows: Toward Continuous Transformations
Motivation: Full-Rank Updates
Coupling layers have triangular Jacobians by construction. What if we want full-rank Jacobians (where every output can depend on every input)? Taking inspiration from ResNets (He et al., 2016), Chen et al. (2019) introduced residual flows based on residual connections:
$$f(x) = x + g_\theta(x) \tag{15}$$
where $g_\theta$ is an unconstrained neural network.
Advantages:
- The Jacobian is $I + J_{g_\theta}(x)$: a full matrix with no imposed triangular structure
- More expressive than coupling layers
But how do we ensure invertibility?
Banach Fixed Point Theorem (Behrmann et al., 2019):
If $\mathrm{Lip}(g_\theta) < 1$ (Lipschitz constant less than 1), then $f$ as defined in Equation 15 is a bijection.
Proof sketch:
- For invertibility, we need $f(x_1) = f(x_2) \Rightarrow x_1 = x_2$
- Assume $f(x_1) = f(x_2)$: then $x_1 - x_2 = g_\theta(x_2) - g_\theta(x_1)$
- If $x_1 \neq x_2$ and $\mathrm{Lip}(g_\theta) < 1$, we get $\|x_1 - x_2\| = \|g_\theta(x_2) - g_\theta(x_1)\| < \|x_1 - x_2\|$, a contradiction
- Therefore $x_1 = x_2$ (injectivity)
In practice we enforce the Lipschitz constraint by spectral normalization of weight matrices (Miyato et al., 2018) and bounded activation functions.
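A sketch of why the contraction condition buys invertibility in practice: with $\mathrm{Lip}(g) < 1$, the inverse of $f(x) = x + g(x)$ can be computed by the fixed-point iteration $x \leftarrow y - g(x)$. The toy $g$ below is built to be contractive by hand; spectral normalization would enforce this for a real network:

```python
import numpy as np

# Toy contractive residual block: Lip(g) <= 0.9 * ||A||_2 < 1, so
# f(x) = x + g(x) is invertible and the inverse is the unique fixed
# point of x -> y - g(x) (Banach fixed-point iteration).
A = np.array([[0.3, -0.2],
              [0.1,  0.25]])        # small entries => spectral norm < 1

def g(x):
    return 0.9 * np.tanh(A @ x)

def f(x):
    return x + g(x)

y = f(np.array([1.0, -2.0]))        # a point we want to invert
x = np.zeros(2)
for _ in range(100):                # iterate x <- y - g(x) to convergence
    x = y - g(x)
print(np.allclose(f(x), y))         # we recovered the preimage
```

Each iteration shrinks the error by at least the contraction factor, so convergence is geometric.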
Planar Flows
Before introducing the general case, let's look at a specific type of residual flow: planar flows (Rezende & Mohamed, 2015). A planar flow has the form:
$$f(x) = x + u\, h\!\left(w^\top x + b\right) \tag{16}$$
where the transformation is constant on hyperplanes $w^\top x = \text{const}$ and:
- $u, w \in \mathbb{R}^d$ are weight vectors
- $b \in \mathbb{R}$ is a bias
- $h$ is a smooth activation (e.g., $\tanh$)
Intuition: This transformation "pushes" points in the direction of $u$, with the magnitude determined by how the point projects onto $w$.
Figure 3: Geometry of a planar flow (Equation 16). Points are pushed in the direction of the vector $u$, with magnitude controlled by their projection onto the normal vector $w$. The transformation is constant along hyperplanes orthogonal to $w$.
Using the matrix determinant lemma, we obtain a cheap-to-compute log determinant of the Jacobian:
$$\log\left|\det J_f(x)\right| = \log\left|1 + h'\!\left(w^\top x + b\right)\, u^\top w\right|$$
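We can check the matrix determinant lemma numerically for a small planar flow (random $u$, $w$, $b$, with $h = \tanh$):

```python
import numpy as np

# Planar flow f(x) = x + u * tanh(w.x + b). Matrix determinant lemma:
# det(I + h'(w.x+b) u w^T) = 1 + h'(w.x+b) * (u . w)  -- O(d) to evaluate.
rng = np.random.default_rng(1)
d = 4
u, w, b = rng.standard_normal(d), rng.standard_normal(d), 0.3
x = rng.standard_normal(d)

hp = 1.0 - np.tanh(w @ x + b) ** 2            # h'(w.x + b) for h = tanh
logdet_cheap = np.log(np.abs(1.0 + hp * (u @ w)))

J = np.eye(d) + hp * np.outer(u, w)           # full Jacobian, for checking
_, logdet_full = np.linalg.slogdet(J)         # O(d^3) dense reference
print(np.isclose(logdet_cheap, logdet_full))  # True
```

The rank-one structure of the update is what collapses the $d \times d$ determinant to a scalar expression.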
From Discrete Layers to Continuous Dynamics
Now consider a composition of $L$ residual flow layers:
$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1$$
but instead of having each layer make a "full-sized" update, let's have each layer make a small update:
$$f_k(x) = x + \frac{1}{L}\, g_k(x)$$
where $g_k$ is a neural network ($g_k(x) = u\, h(w^\top x + b)$ for planar flows). Each layer takes a step of size $1/L$. As $L$ increases, each individual step becomes smaller and more steps are taken, but the total "distance traveled" stays roughly constant.
Figure 4: From discrete layers to continuous dynamics. As we increase the number of layers $L$ and decrease the step size $\Delta t = 1/L$, the discrete sequence of transformations converges to a continuous trajectory governed by an ODE.
The recurrence relation: Starting from $x_0$ and applying layers sequentially:
$$x_1 = x_0 + \frac{1}{L}\, g_1(x_0) \tag{20a}$$
$$x_2 = x_1 + \frac{1}{L}\, g_2(x_1) \tag{20b}$$
$$x_{k+1} = x_k + \frac{1}{L}\, g_{k+1}(x_k) \tag{20c}$$
Rearranging Equation 20c:
$$\frac{x_{k+1} - x_k}{1/L} = g_{k+1}(x_k)$$
This is exactly the forward Euler discretization. If we associate time $t_k = k/L$ with step $k$, then $\Delta t = 1/L$ is the time step, and, writing $v(x, t_k) := g_{k+1}(x)$ for the time-indexed vector field, we have:
$$\frac{x_{k+1} - x_k}{\Delta t} = v(x_k, t_k)$$
which approximates the differential equation:
$$\frac{dx(t)}{dt} = v(x(t), t) \tag{23}$$
More explicitly, we are discretizing the time interval $[0, 1]$ into $L + 1$ points:
$$t_k = \frac{k}{L}, \qquad k = 0, 1, \ldots, L$$
At each time step, we compute:
$$x_{k+1} = x_k + \Delta t \; v(x_k, t_k)$$
As $L \to \infty$ (and correspondingly $\Delta t \to 0$), the discrete sequence converges to a continuous trajectory $x(t)$ for $t \in [0, 1]$ that satisfies Equation 23.
The final transformation is given by integrating the ODE:
$$x(1) = x(0) + \int_0^1 v(x(t), t)\, dt$$
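The correspondence is easy to verify numerically. For the field $v(x, t) = -x$ (an assumption chosen so the exact solution $x(t) = x(0)e^{-t}$ is known), stacking $L$ residual layers with step $1/L$ is exactly $L$ forward Euler steps, and the error shrinks as $L$ grows:

```python
import math

# Each residual layer x <- x + (1/L) * v(x) is one forward Euler step of
# dx/dt = v(x). For v(x) = -x the exact flow is x(t) = x0 * exp(-t).
def integrate(x0, L):
    x, dt = x0, 1.0 / L
    for _ in range(L):
        x = x + dt * (-x)          # one "residual layer" of step size 1/L
    return x

exact = math.exp(-1.0)             # x(1) for x0 = 1
errors = [abs(integrate(1.0, L) - exact) for L in (10, 100, 1000)]
print(errors)                      # error shrinks roughly like 1/L
```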
Chen et al. (2018) introduced this framework as Continuous Normalizing Flows (CNF), with Grathwohl et al. (2019) demonstrating practical training methods.
Continuous Normalizing Flows
A CNF is defined by a vector field $v_\theta: \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$:
$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad x_0 \sim p_0 \tag{26}$$
The solution at time $t$, denoted $\psi_t(x_0)$, is the flow map:
$$\psi_t(x_0) = x_0 + \int_0^t v_\theta(x_s, s)\, ds \tag{27}$$
Generation is then the process of sampling $x_0 \sim p_0$ and solving the ODE in Equation 26 to get $x_1 = \psi_1(x_0)$.
There are three related perspectives on CNFs:
- Flow map $\psi_t$: Maps initial conditions to solutions
- Probability path $p_t$: How the density evolves over time
- Velocity field $v_\theta$: The "instructions" for how particles move
Figure 5: Three perspectives on continuous normalizing flows. The flow map $\psi_t$ gives particle trajectories, the probability path $p_t$ describes how densities evolve, and the velocity field $v_\theta$ specifies how particles move. These are related by the continuity equation and ODE solving.
To see these three perspectives in action, consider a 1D example transforming a Gaussian distribution into a bimodal distribution:

Figure 6: Three synchronized views of a 1D continuous normalizing flow (Equation 26). Top panel: The probability density $p_t$ evolves from a single Gaussian peak at $t = 0$ to two separate modes at $t = 1$. Middle panel: The velocity field $v_\theta$ provides the "instructions": orange arrows (positive velocity) push particles right, blue arrows (negative velocity) push left. Notice how particles near the center split toward the two modes. Bottom panel: Particle samples show actual data points following these velocity instructions. The three perspectives $\psi_t$, $p_t$, and $v_\theta$ are three ways of viewing a single underlying continuous transformation.
The same concept can be visualized in two dimensions:

Figure 7: Synchronized views of a 2D continuous normalizing flow transforming a centered Gaussian into a bimodal distribution.
Continuous Change of Variables
Motivation: From Discrete to Continuous
In discrete normalizing flows, we tracked density changes using the change of variables formula (Equation 8):
$$\log p_x(x) = \log p_0\!\left(T^{-1}(x)\right) - \log\left|\det J_T\!\left(T^{-1}(x)\right)\right|$$
For CNFs, we need the continuous analog. The key questions are:
- How does the density field $p_t(x)$ evolve as particles flow through space?
- How do we compute $\log p_1(x_1)$ given $\log p_0(x_0)$ along a trajectory?
The answer involves two key results: the continuity equation and the instantaneous change of variables formula.
The Continuity Equation
Physical Intuition: Imagine dye particles flowing through water. If particles converge at a point, the concentration (density) increases there. If they spread out, the density decreases. The continuity equation formalizes this conservation of mass principle.
Theorem (Continuity Equation (Evans, 2010)): If particles move according to the ODE (Equation 26), then the density $p_t$ evolves according to:
$$\frac{\partial p_t(x)}{\partial t} + \nabla \cdot \left(p_t(x)\, v_\theta(x, t)\right) = 0 \tag{28}$$
This is called the continuity equation (or transport equation):
- $\frac{\partial p_t}{\partial t}$: How density changes over time at a fixed location
- $\nabla \cdot (p_t v_\theta)$: The divergence of the probability flux (net outflow of probability mass from $x$)
- The equation says that at any location, the rate of density change plus the net outflow must equal zero (mass is conserved)
- Equivalently: $\frac{\partial p_t}{\partial t} = -\nabla \cdot (p_t v_\theta)$, so density increases where flux converges (negative divergence) and decreases where flux diverges (positive divergence)
Proof Sketch:
We represent the density using the flow map (Equation 27):
$$p_t(x) = \mathbb{E}_{x_0 \sim p_0}\!\left[\delta\!\left(x - \psi_t(x_0)\right)\right]$$
Taking the time derivative:
$$\frac{\partial p_t(x)}{\partial t} = \mathbb{E}_{x_0}\!\left[\frac{\partial}{\partial t}\, \delta\!\left(x - \psi_t(x_0)\right)\right] = \mathbb{E}_{x_0}\!\left[-\nabla_x \delta\!\left(x - \psi_t(x_0)\right) \cdot \dot{\psi}_t(x_0)\right] = \mathbb{E}_{x_0}\!\left[-\nabla_x \delta\!\left(x - \psi_t(x_0)\right) \cdot v_\theta\!\left(\psi_t(x_0), t\right)\right]$$
The second equality uses the chain rule for derivatives of delta functions. The third uses $\dot{\psi}_t(x_0) = v_\theta(\psi_t(x_0), t)$ from the ODE definition (Equation 26).
Now substitute $y = \psi_t(x_0)$ (a change of variables, with $y$ distributed according to $p_t$):
$$\frac{\partial p_t(x)}{\partial t} = -\int \nabla_x \delta(x - y) \cdot v_\theta(y, t)\, p_t(y)\, dy$$
Integration by parts (boundary terms vanish for rapidly decaying densities) gives:
$$\frac{\partial p_t(x)}{\partial t} = -\nabla_x \cdot \left(p_t(x)\, v_\theta(x, t)\right)$$
which is Equation 28. □
Tracking Density Along a Trajectory
The continuity equation (Equation 28) describes how the entire density field evolves. But for a specific particle following the ODE trajectory $x_t = \psi_t(x_0)$, we want to know: how does $\log p_t(x_t)$ change?
We need to apply the chain rule:
$$\frac{d}{dt} \log p_t(x_t) = \frac{\partial \log p_t(x)}{\partial t}\bigg|_{x = x_t} + \nabla_x \log p_t(x_t) \cdot \frac{dx_t}{dt} \tag{30}$$
The first term is the change due to time, the second is the change due to the particle's movement through space.
From the continuity equation (Equation 28), we can derive:
$$\frac{\partial \log p_t}{\partial t} = -\nabla \cdot v_\theta - \nabla_x \log p_t \cdot v_\theta \tag{31}$$
The last equality uses the product rule: $\nabla \cdot (p_t v_\theta) = p_t\, (\nabla \cdot v_\theta) + v_\theta \cdot \nabla_x p_t$.
Now substituting Equation 31 into our chain rule expression (Equation 30), and using $\frac{dx_t}{dt} = v_\theta(x_t, t)$:
$$\frac{d}{dt} \log p_t(x_t) = -\nabla \cdot v_\theta - \nabla_x \log p_t \cdot v_\theta + \nabla_x \log p_t \cdot v_\theta$$
The middle terms cancel, giving us the Instantaneous Change of Variables Formula:
$$\frac{d}{dt} \log p_t(x_t) = -\nabla \cdot v_\theta(x_t, t) \tag{32}$$
where the divergence operator is defined as:
$$\nabla \cdot v = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i}$$
Physical Intuition:
The divergence $\nabla \cdot v_\theta(x, t)$ measures the net outflow of the vector field at point $x$ and time $t$:
- $\nabla \cdot v_\theta > 0$: net outflow (the flow is diverging/expanding), so density must decrease
- $\nabla \cdot v_\theta < 0$: net inflow (the flow is converging/contracting), so density must increase
- $\nabla \cdot v_\theta = 0$: no net flow, density stays constant
The negative sign in Equation 32 enforces conservation of probability mass.
The quantity $p_t v_\theta$ represents probability flux: how much probability mass flows through point $x$ at time $t$. When the flux diverges ($\nabla \cdot (p_t v_\theta) > 0$), probability flows out, so density decreases ($\frac{\partial p_t}{\partial t} < 0$). The negative sign in the continuity equation makes this relationship consistent.
This is the continuous analog of how Jacobian determinants work in discrete flows (Equation 8): expansion decreases density, contraction increases it.
Figure 8: The divergence controls density changes. When the flow has net outflow ($\nabla \cdot v > 0$), density decreases. When it has net inflow ($\nabla \cdot v < 0$), density increases. This is the continuous analog of the Jacobian determinant in discrete flows.
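The instantaneous formula can be checked in closed form for the linear field $v(x, t) = ax$ (a toy assumption so everything is analytic): $\nabla \cdot v = a$, so $\log p_t(x_t) = \log p_0(x_0) - at$, which matches the exact Gaussian pushforward:

```python
import numpy as np

# v(x, t) = a*x has div v = a, so integrating the instantaneous change of
# variables gives log p_t(x_t) = log p_0(x_0) - a*t. Exact check: if
# x_0 ~ N(0, 1), then x_t = x_0 * e^{a t} ~ N(0, e^{2 a t}).
a, t = 0.7, 1.3
x0 = 0.5
xt = x0 * np.exp(a * t)                      # ODE solution of dx/dt = a*x

log_p0 = -0.5 * x0**2 - 0.5 * np.log(2 * np.pi)
log_pt_formula = log_p0 - a * t              # integral of -div v over [0, t]

var_t = np.exp(2 * a * t)                    # exact pushforward variance
log_pt_exact = -0.5 * xt**2 / var_t - 0.5 * np.log(2 * np.pi * var_t)
print(np.isclose(log_pt_formula, log_pt_exact))   # True
```

Positive $a$ expands space, and the log-density along the trajectory drops linearly in $t$, exactly as the divergence predicts.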
The Continuous Change of Variables Formula
Finally, integrating the instantaneous formula (Equation 32) from $t = 0$ to $t = 1$ along a trajectory:
$$\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\, dt \tag{34}$$
This is the continuous analog of the discrete formula (Equation 8):
- The initial log-density $\log p_0(x_0)$ corresponds to $\log p_0\!\left(T^{-1}(x)\right)$
- The sum of log determinants becomes an integral of divergences
- The discrete layers dissolve into a continuous flow
Equation 34 is fundamental to CNFs (Chen et al., 2018):
- Training: We can compute $\log p_1(x)$ for data points by solving the ODE and integrating the divergence
- Generation: We sample $x_0 \sim p_0$ and solve the ODE forward to get $x_1$
Training CNFs
Before flow matching revolutionized CNF training, the standard approach was maximum likelihood estimation using the continuous change of variables formula. While powerful in theory, this approach faces significant computational challenges.
Maximum Likelihood Training
The training objective maximizes the log-likelihood of data samples under the model:
$$\max_\theta \; \mathbb{E}_{x_1 \sim p_{\text{data}}}\!\left[\log p_1(x_1)\right]$$
Using Equation 34, computing $\log p_1(x_1)$ for a data point requires:
- Backward ODE solve: Starting from $x_1$ at time $t = 1$, solve the ODE backward to find the corresponding latent point $x_0$ at $t = 0$
- Divergence integration: Track $\int \nabla \cdot v_\theta(x_t, t)\, dt$ along the trajectory
This leads to an augmented ODE system in which we simultaneously track the position and the log-density:
$$\frac{d}{dt} \begin{bmatrix} x_t \\ \log p_t(x_t) \end{bmatrix} = \begin{bmatrix} v_\theta(x_t, t) \\ -\nabla \cdot v_\theta(x_t, t) \end{bmatrix}$$
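A minimal sketch of integrating the augmented system with forward Euler, again using the toy linear field $v(x) = ax$ so the answer is checkable ($x_T = x_0 e^{aT}$ and $\Delta \log p = -aT$); a real CNF would replace $v$ with a network and Euler with an adaptive solver:

```python
import numpy as np

# Forward Euler on the augmented state (x_t, log p_t(x_t)) for the toy
# field v(x) = a*x, where div v = a.
a, T_final, L = 0.5, 1.0, 10_000
dt = T_final / L
x, delta_logp = 0.8, 0.0
for _ in range(L):
    x = x + dt * (a * x)         # dx/dt = v(x)
    delta_logp -= dt * a         # d(log p)/dt = -div v
print(x, delta_logp)             # x -> 0.8 * e^0.5, delta_logp -> -0.5
```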
Computational Bottlenecks
Two major computational challenges arise:
1. Divergence Computation: Computing $\nabla \cdot v_\theta$ naively requires $d$ separate backward passes through the network, one per dimension. Grathwohl et al. (2019) used Hutchinson's trace estimator to reduce this to a single backward pass per sample via unbiased Monte Carlo estimation:
$$\nabla \cdot v_\theta = \operatorname{tr}\!\left(J_{v_\theta}\right) = \mathbb{E}_{\epsilon}\!\left[\epsilon^\top J_{v_\theta}\, \epsilon\right], \qquad \mathbb{E}[\epsilon] = 0, \; \operatorname{Cov}(\epsilon) = I$$
where the expectation can be approximated with a single random sample $\epsilon$.
2. Memory for Backpropagation: Standard backpropagation through ODE solvers requires storing all intermediate states, leading to $O(S)$ memory where $S$ is the number of solver steps. Chen et al. (2018) introduced the adjoint method, which reduces memory to $O(1)$ by computing gradients through an auxiliary backward solve, trading speed for memory efficiency.
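The trace estimator from point 1 is easy to sanity-check numerically with a fixed matrix standing in for the network Jacobian:

```python
import numpy as np

# Hutchinson's estimator: tr(J) = E[eps^T J eps] for any eps with zero
# mean and identity covariance -- one vector-Jacobian product per sample
# instead of d separate backward passes. J is a random stand-in here.
rng = np.random.default_rng(0)
d = 6
J = rng.standard_normal((d, d))                 # "Jacobian" of the network

n = 200_000
eps = rng.choice([-1.0, 1.0], size=(n, d))      # Rademacher probe vectors
est = np.einsum('ni,ij,nj->n', eps, J, eps).mean()
print(est, np.trace(J))                         # estimate is near the trace
```

In training, a single probe per data point suffices because the estimator is unbiased and the noise averages out over the batch and over iterations.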
Why This Is Slow
Despite these optimizations, training CNFs via maximum likelihood remains computationally expensive:
- Each training iteration requires solving the ODE backward (typically 50-100 function evaluations)
- The divergence must be computed at every ODE step
- Gradient computation requires additional backward passes
For high-dimensional problems (images, videos), training can take orders of magnitude longer than discrete normalizing flows or other generative models. This motivated the development of flow matching, which we’ll explore in the next post.
Summary
We’ve built up the complete theory of Continuous Normalizing Flows:
- CNFs transform distributions continuously via ODEs (Equation 26): $\frac{dx_t}{dt} = v_\theta(x_t, t)$.
- Three equivalent perspectives: flow map $\psi_t$, probability path $p_t$, and velocity field $v_\theta$.
- Continuity equation (Equation 28): $\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_\theta) = 0$ describes how the density field evolves in space and time.
- Instantaneous change of variables (Equation 32): $\frac{d}{dt} \log p_t(x_t) = -\nabla \cdot v_\theta(x_t, t)$.
- Continuous change of variables (Equation 34): $\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\, dt$.
- Traditional maximum likelihood training requires expensive ODE solving and divergence computation.
- Traditional maximum likelihood training requires expensive ODE solving and divergence computation.
In the next post, we’ll explore flow matching, a breakthrough approach that trains CNFs without solving ODEs during training, making them practical for large-scale applications.
References
[1] Laurent Dinh, David Krueger, and Yoshua Bengio. “NICE: Non-linear Independent Components Estimation.” ICLR Workshop, 2015. arXiv:1410.8516
[2] Danilo Rezende and Shakir Mohamed. “Variational Inference with Normalizing Flows.” ICML, 2015. arXiv:1505.05770
[3] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” ICLR, 2017. arXiv:1605.08803
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR, 2016. arXiv:1512.03385
[5] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. “Invertible Residual Networks.” ICML, 2019. arXiv:1811.00995
[6] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. “Residual Flows for Invertible Generative Modeling.” NeurIPS, 2019. arXiv:1906.02735
[7] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. “Spectral Normalization for Generative Adversarial Networks.” ICLR, 2018. arXiv:1802.05957
[8] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. “Neural Ordinary Differential Equations.” NeurIPS, 2018. arXiv:1806.07366
[9] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. “FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.” ICLR, 2019. arXiv:1810.01367
[10] Lawrence C. Evans. “Partial Differential Equations.” American Mathematical Society, 2010.
Additional Resources
For further reading on flow matching and related topics, I recommend:
Michael Albergo and Eric Vanden-Eijnden. “Building Normalizing Flows with Stochastic Interpolants.” ICLR, 2023. arXiv:2209.15571
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. “Flow Matching for Generative Modeling.” ICLR, 2023. arXiv:2210.02747
Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. “Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport.” TMLR, 2024. arXiv:2302.00482
Blog posts and resources that inspired this series:
- Conditional Flow Matching
- Flow Matching Guide
- Peter Holderrieth and Ezra Erives. “Introduction to Flow Matching and Diffusion Models.” 2025. https://diffusion.csail.mit.edu/