This is Part 1 of the Flow Matching Series

Introduction

How do modern generative models generate photorealistic images, videos, or protein structures? At their core, they solve a deceptively simple problem: transform random noise into samples from a target data distribution. However, of the many blog posts and learning resources on the subject, few provide depth while connecting the dots between normalizing flows, continuous normalizing flows, and flow matching. The goal of this series is to build up the theory and technical implementation of these methods from the foundations. I also hope to educate myself on these topics as we go along.

In this first post, I want to establish the mathematical foundations rigorously. We’ll start with a concrete example (planar flows), see why such flows need to be stacked, discover that stacking leads naturally to a differential equation, and understand why the continuous version works better.

Generative Modeling as Sampling

Suppose we want to generate images of cats. Any image can be represented as a vector:

$$z \in \mathbb{R}^d \tag{1}$$

where $d$ is the dimensionality (e.g., for a $64 \times 64$ RGB image, $d = 64 \times 64 \times 3 = 12{,}288$). Not all vectors in $\mathbb{R}^d$ correspond to "good" images of cats. There exists an unknown distribution $p_{\text{data}}(z)$ that assigns high probability to realistic cat images and low probability to noise.

Generative modeling is learning to sample from this target distribution:

$$z \sim p_{\text{data}}(z) \tag{2}$$

Here we face two fundamental challenges:

  1. We don't have a formula for $p_{\text{data}}$, only samples $\{z_1, \ldots, z_N\}$ (e.g., sample images of cats)
  2. Even if we knew $p_{\text{data}}$, sampling might be intractable

Transport Map Approach

Start with a simple distribution we can sample from:

$$x \sim p_0(x) = \mathcal{N}(0, I_d) \tag{3}$$

Then learn a transformation $T: \mathbb{R}^d \to \mathbb{R}^d$ such that:

$$z = T(x) \implies z \sim p_{\text{data}}(z) \tag{4}$$

We denote the distribution of $T(x)$ as the pushforward $T_\# p_0$.

Figure 1: The transport map approach. We learn a transformation $T$ that maps samples from a simple base distribution $p_0$ (e.g., Gaussian) to the complex target distribution $p_{\text{data}}$. The pushforward operation $T_\# p_0$ represents the distribution of $T(x)$ when $x \sim p_0$.

Generating samples from the target distribution then becomes:

  1. Sample $x \sim \mathcal{N}(0, I)$
  2. Deterministically compute $z = T(x)$
  3. Output $z$ (a sample from the learned distribution)

For this to work, we need to answer:

  1. What form should $T$ take? (Architecture)
  2. How do we train $T$? (Objective function)
  3. How do we evaluate the density of the learned distribution? (For computing likelihoods)

How Do We Train This Transformation?

To learn TT, we need an objective function. The most natural choice is maximum likelihood (Dinh et al., 2015; Rezende & Mohamed, 2015): make the model distribution match the data distribution by maximizing:

$$\max_\theta \mathbb{E}_{z \sim p_{\text{data}}} [\log p_\theta(z)] \tag{5}$$

where $p_\theta(z)$ is the density induced by the learned transformation $T_\theta$, parameterized by $\theta$.

But if $z = T_\theta(x)$ where $x \sim \mathcal{N}(0, I)$, how do we compute $p_\theta(z)$?

This requires the change of variables formula.

Intuition: Transformations Change Densities

Consider a simple 1D example: $X \sim \mathcal{N}(0, 1)$ and $Z = 2X$ (stretching by a factor of 2).

  • The transformation spreads samples apart
  • If samples spread apart, the probability density must decrease to keep the total probability equal to 1
  • Specifically: $p_Z(z) = p_X(z/2) \cdot \frac{1}{2}$

The factor $1/2$ corrects for the volume change. In higher dimensions, the Jacobian determinant plays the same volume-correcting role.

Figure 2: Intuition for change of variables in 1D. When stretched by a factor of 2 ($Z = 2X$), the density must decrease by a factor of 2 to preserve total probability mass. The interval $\Delta x$ becomes $2\Delta x$, so the height must be halved.
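Before stating the general theorem, the 1D claim is easy to verify numerically: $Z = 2X$ is exactly $\mathcal{N}(0, 2^2)$, whose density matches $p_X(z/2) \cdot \frac{1}{2}$. A minimal NumPy check (the helper name `normal_pdf` is mine):

```python
import numpy as np

def normal_pdf(x, std=1.0):
    # Density of N(0, std^2), written out so the check is self-contained
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Z = 2X with X ~ N(0, 1). Change of variables: p_Z(z) = p_X(z/2) * 1/2.
# Z is also exactly N(0, 2^2), so we can compare against that density directly.
z = np.linspace(-4.0, 4.0, 9)
p_z_change_of_vars = normal_pdf(z / 2) * 0.5
p_z_exact = normal_pdf(z, std=2.0)

assert np.allclose(p_z_change_of_vars, p_z_exact)
```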

Change of Variables

Theorem (Change of Variables): If $x \sim p_X(x)$ and $z = T(x)$ where $T$ is a diffeomorphism (a smooth bijection with differentiable inverse $T^{-1}$), then:

$$p_Z(z) = p_X(T^{-1}(z)) \left|\det J_{T^{-1}}(z)\right| \tag{6}$$

where $J_{T^{-1}}(z) \in \mathbb{R}^{d \times d}$ is the Jacobian matrix: $[J_{T^{-1}}]_{ij} = \frac{\partial [T^{-1}]_i}{\partial z_j}$

Equivalently, using $T$ instead of $T^{-1}$:

$$p_Z(T(x)) = p_X(x) \left|\det J_T(x)\right|^{-1} \tag{7}$$

The determinant $\left|\det J_T(x)\right|$ measures how $T$ locally scales volumes:

  • If $|\det J_T(x)| > 1$: $T$ expands space near $x$, so densities must decrease
  • If $|\det J_T(x)| < 1$: $T$ contracts space near $x$, so densities must increase

This ensures probability mass is conserved: $\int p_Z(z)\, dz = \int p_X(x)\, dx = 1$.

For maximum likelihood, we work with log-densities. Taking the logarithm of Equation 6:

$$\log p_Z(z) = \log p_X(T^{-1}(z)) + \log \left|\det J_{T^{-1}}(z)\right| \tag{8}$$

Given data $\{z_1, \ldots, z_N\} \sim p_{\text{data}}$, we can train with maximum likelihood via:

$$\max_\theta \sum_{i=1}^N \log p_\theta(z_i) = \max_\theta \sum_{i=1}^N \left[ \log p_0(T_\theta^{-1}(z_i)) + \log \left|\det J_{T_\theta^{-1}}(z_i)\right| \right] \tag{9}$$

In practice, computing the objective in Equation 9 with a neural-network-parameterized $T_\theta$ imposes the following constraints:

  • Invertibility: $T^{-1}$ needs to map data back to latent space
  • Efficient inverse: Computing $T^{-1}(z)$ must be tractable
  • Efficient Jacobian: Computing $\log \left|\det J_{T^{-1}}(z)\right|$ must be tractable

Normalizing Flows: Making Jacobians Tractable

For general neural networks, the Jacobian is a $d \times d$ matrix, and computing its determinant naively requires $O(d^3)$ operations (e.g., via LU decomposition). For images with $d = 12{,}288$, this becomes completely intractable.

Dinh et al. (2015) and Rezende & Mohamed (2015) introduced normalizing flows with special architectures where the Jacobian determinant is cheap to compute.

Strategy: Triangular Jacobians

If $J_T$ is triangular (upper or lower), then:

$$\det J_T = \prod_{i=1}^d [J_T]_{ii} \tag{10}$$

Computing Equation 10 requires only $O(d)$ operations. This is a fundamental property of determinants of triangular matrices.

Affine Coupling Layers (RealNVP)

Idea: Make the Jacobian triangular so that $\det J = \prod_i J_{ii}$.

Dinh et al. (2017) introduced the affine coupling layer architecture: split the input as $x = [x_1, x_2]$ with $x_1, x_2 \in \mathbb{R}^{d/2}$:

$$\begin{align} z_1 &= x_1 \tag{11a}\\ z_2 &= x_2 \odot \exp(s(x_1)) + t(x_1) \tag{11b} \end{align}$$

where $s, t : \mathbb{R}^{d/2} \to \mathbb{R}^{d/2}$ are unconstrained neural networks, and $\odot$ denotes element-wise multiplication.

Why this works:

  1. Explicit inverse:
$$x_1 = z_1, \quad x_2 = (z_2 - t(z_1)) \odot \exp(-s(z_1)) \tag{12}$$
  2. Triangular Jacobian:
$$J_T = \begin{bmatrix} I_{d/2} & 0 \\ \frac{\partial z_2}{\partial x_1} & \text{diag}(\exp(s(x_1))) \end{bmatrix} \tag{13}$$
  3. Cheap log-det using Equation 10:
$$\log |\det J_T| = \sum_{i=1}^{d/2} s(x_1)_i \tag{14}$$

Now we just sum the network outputs, with no matrix operations required. However, a single coupling layer leaves $z_1 = x_1$ unchanged (half the dimensions are untransformed).

To transform all dimensions, we must stack multiple layers with alternating masks:

Layer 1: transform dimensions [2,4,6,...], freeze [1,3,5,...]
Layer 2: transform dimensions [1,3,5,...], freeze [2,4,6,...]
Layer 3: transform dimensions [2,4,6,...], freeze [1,3,5,...]
...
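Equations 11–14 translate almost line for line into code. Below is a minimal NumPy sketch of a single affine coupling layer; the fixed linear maps standing in for the networks $s$ and $t$ are placeholders of my own, not the RealNVP architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Toy stand-ins for the networks s and t: fixed maps R^{d/2} -> R^{d/2}.
# Invertibility never requires inverting s or t, so any function works here.
Ws = rng.normal(size=(d // 2, d // 2))
Wt = rng.normal(size=(d // 2, d // 2))
s = lambda x1: np.tanh(x1 @ Ws.T)   # bounded scale for numerical stability
t = lambda x1: x1 @ Wt.T

def coupling_forward(x):
    x1, x2 = x[: d // 2], x[d // 2 :]
    z1 = x1                                   # Eq. 11a: pass-through half
    z2 = x2 * np.exp(s(x1)) + t(x1)           # Eq. 11b: affine transform
    log_det = np.sum(s(x1))                   # Eq. 14: cheap log-determinant
    return np.concatenate([z1, z2]), log_det

def coupling_inverse(z):
    z1, z2 = z[: d // 2], z[d // 2 :]
    x1 = z1
    x2 = (z2 - t(z1)) * np.exp(-s(z1))        # Eq. 12: explicit inverse
    return np.concatenate([x1, x2])

x = rng.normal(size=d)
z, log_det = coupling_forward(x)
assert np.allclose(coupling_inverse(z), x)    # exact invertibility
assert np.isfinite(log_det)
```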

The dilemma:

  • Few layers $\rightarrow$ limited expressiveness
  • Many layers $\rightarrow$ expensive computation, harder to train

Can we do better?

Residual Flows: Toward Continuous Transformations

Motivation: Full-Rank Updates

Coupling layers have triangular Jacobians by construction. What if we want full-rank Jacobians, where every output can depend on every input? Taking inspiration from ResNets (He et al., 2016), Chen et al. (2019) introduced residual flows based on residual connections:

$$T(x) = x + u(x) \tag{15}$$

where $u: \mathbb{R}^d \to \mathbb{R}^d$ is an unconstrained neural network.

Advantages:

  • The Jacobian is $J_T(x) = I + J_u(x)$, which is dense in general: no triangular structure is imposed
  • More expressive than coupling layers

But how do we ensure invertibility?

Banach Fixed Point Theorem (Behrmann et al., 2019):

If $\|u\|_{\text{Lip}} < 1$ (Lipschitz constant less than 1), then $T(x) = x + u(x)$ as defined in Equation 15 is a bijection.

Proof sketch:

  • For invertibility, we need $T(x_1) = T(x_2) \implies x_1 = x_2$
  • Assume $T(x_1) = T(x_2)$: then $\|x_1 - x_2\| = \|T(x_1) - u(x_1) - T(x_2) + u(x_2)\| = \|u(x_1) - u(x_2)\| \leq L\|x_1 - x_2\|$
  • If $L < 1$ and $x_1 \neq x_2$, dividing by $\|x_1 - x_2\|$ gives $1 \leq L < 1$, a contradiction
  • Therefore $x_1 = x_2$ (injectivity)

In practice, the Lipschitz constraint is enforced via spectral normalization of the weight matrices (Miyato et al., 2018) together with bounded activation functions.
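The contraction argument also yields a practical inversion scheme: iterate $x \leftarrow z - u(x)$, which converges to the unique preimage. A NumPy sketch with a toy contractive block of my own (spectral normalization applied directly to the weight matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Toy residual block u(x) = tanh(W x). tanh is 1-Lipschitz, so after scaling
# W to spectral norm 0.9 the block has Lipschitz constant at most 0.9 < 1.
W = rng.normal(size=(d, d))
W = 0.9 * W / np.linalg.norm(W, 2)       # spectral normalization
u = lambda x: np.tanh(W @ x)

T = lambda x: x + u(x)                   # residual flow layer (Eq. 15)

def invert(z, n_iter=500):
    # Fixed-point iteration x <- z - u(x); a contraction, so it converges
    # geometrically to the unique x with T(x) = z.
    x = z.copy()
    for _ in range(n_iter):
        x = z - u(x)
    return x

z = T(rng.normal(size=d))
x = invert(z)
assert np.allclose(T(x), z, atol=1e-8)
```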

Planar Flows

Before introducing the general case, let's look at a specific type of residual flow: planar flows (Rezende & Mohamed, 2015). A planar flow has the form:

$$T(x) = x + \sigma(w^T x + b)\, a \tag{16}$$

where the transformation is constant on hyperplanes $\{w^T x = c\}$ and:

  • $w, a \in \mathbb{R}^d$ are weight vectors
  • $b \in \mathbb{R}$ is a bias
  • $\sigma: \mathbb{R} \to \mathbb{R}$ is a smooth activation (e.g., $\tanh$)

Intuition: This transformation "pushes" points in the direction of $a$, with the magnitude determined by how the point projects onto $w$.

Figure 3: Geometry of a planar flow (Equation 16). Points are pushed in the direction of vector $a$, with magnitude controlled by their projection onto the normal vector $w$. The transformation is constant along hyperplanes orthogonal to $w$.

Using the matrix determinant lemma, we obtain a cheap-to-compute log-determinant of the Jacobian:

$$\log |\det J_T(x)| = \log \left|1 + \sigma'(w^T x + b)\, w^T a\right| \tag{17}$$
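Equation 17 can be sanity-checked against a finite-difference Jacobian. A NumPy sketch with arbitrary parameter values of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
w, a = rng.normal(size=d), rng.normal(size=d)
b = 0.3

sigma = np.tanh
dsigma = lambda h: 1.0 - np.tanh(h) ** 2      # derivative of tanh

def planar(x):
    return x + sigma(w @ x + b) * a           # Eq. 16

def log_det_analytic(x):
    # Matrix determinant lemma: det(I + c a w^T) = 1 + c w^T a  (Eq. 17)
    return np.log(np.abs(1.0 + dsigma(w @ x + b) * (w @ a)))

# Central finite differences: column j of J is dT/dx_j
x = rng.normal(size=d)
eps = 1e-6
J = np.stack([(planar(x + eps * e) - planar(x - eps * e)) / (2 * eps)
              for e in np.eye(d)], axis=1)

assert np.allclose(np.log(np.abs(np.linalg.det(J))),
                   log_det_analytic(x), atol=1e-5)
```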

From Discrete Layers to Continuous Dynamics

Now consider a composition of $K$ residual flow layers:

$$T = T_K \circ T_{K-1} \circ \cdots \circ T_1 \tag{18}$$

but instead of having each layer make a “full-sized” update, let’s have each layer make a small update:

$$T_k(x) = x + \frac{1}{K} u_k(x) \tag{19}$$

where $u_k : \mathbb{R}^d \to \mathbb{R}^d$ is a neural network ($u_k(x) = \sigma(w_k^T x + b_k)\, a_k$ for planar flows). Each layer takes a step of size $O(1/K)$. As $K$ increases, each individual step becomes smaller and more steps are taken, but the total "distance traveled" stays roughly constant.

Figure 4: From discrete layers to continuous dynamics. As we increase the number of layers $K$ and decrease the step size $1/K$, the discrete sequence of transformations converges to a continuous trajectory governed by an ODE.

The recurrence relation: starting from $x_0$ and applying layers sequentially:

$$\begin{align} x_1 &= x_0 + \frac{1}{K} u_1(x_0) \tag{20a}\\ x_2 &= x_1 + \frac{1}{K} u_2(x_1) \tag{20b}\\ &\vdots \nonumber\\ x_k &= x_{k-1} + \frac{1}{K} u_k(x_{k-1}) \tag{20c} \end{align}$$

Rearranging Equation 20c:

$$\frac{x_k - x_{k-1}}{1/K} = u_k(x_{k-1}) \tag{21}$$

This is exactly a forward Euler discretization. If we associate time $t_k = k/K$ with step $k$, then $\Delta t = 1/K$ is the time step, and we have:

$$\frac{x_k - x_{k-1}}{\Delta t} \approx \frac{dx}{dt}\bigg|_{t=t_{k-1}} \tag{22}$$

which approximates the differential equation:

$$\frac{dx(t)}{dt} = u(x(t), t) \tag{23}$$

More explicitly, we are discretizing the time interval $[0, 1]$ into $K$ steps:

$$0 = t_0 < t_1 < \cdots < t_K = 1, \quad \text{where } t_k = \frac{k}{K}$$

At each time step, we compute:

$$x_k = x_{k-1} + \underbrace{(t_k - t_{k-1})}_{\Delta t = 1/K} \cdot u_k(x_{k-1})$$

As $K \to \infty$ (and correspondingly $\Delta t = 1/K \to 0$), the discrete sequence $x_0, x_1, \ldots, x_K$ converges to a continuous trajectory $x(t)$ for $t \in [0, 1]$ that satisfies Equation 23:

$$\frac{dx_t}{dt} = u(x_t, t), \quad x_0 \sim p_{\text{init}} \tag{24}$$

The final transformation is given by integrating the ODE:

$$z = x_1 = x_0 + \int_0^1 u(x_t, t)\, dt \tag{25}$$
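To see this convergence concretely, take the linear field $u(x, t) = -x$, whose exact flow is $x(t) = x_0 e^{-t}$. Composing $K$ residual layers with step $1/K$ is exactly forward Euler, and the endpoint error shrinks as $K$ grows. A minimal NumPy sketch:

```python
import numpy as np

# Vector field with a known flow: u(x, t) = -x gives x(t) = x0 * exp(-t).
u = lambda x, t: -x

def euler_flow(x0, K):
    # K residual layers T_k(x) = x + (1/K) u(x, t_k): forward Euler on [0, 1]
    x, dt = x0, 1.0 / K
    for k in range(K):
        x = x + dt * u(x, k * dt)
    return x

x0 = np.array([1.0, -2.0])
exact = x0 * np.exp(-1.0)

err_10 = np.max(np.abs(euler_flow(x0, 10) - exact))
err_1000 = np.max(np.abs(euler_flow(x0, 1000) - exact))
assert err_1000 < err_10 / 50     # error shrinks roughly linearly in 1/K
```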

Chen et al. (2018) introduced this framework as Continuous Normalizing Flows (CNF), with Grathwohl et al. (2019) demonstrating practical training methods.

Continuous Normalizing Flows

A CNF is defined by a vector field $u_\theta: \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$:

$$\frac{dx_t}{dt} = u_\theta(x_t, t), \quad x_0 \sim p_{\text{init}} = \mathcal{N}(0, I) \tag{26}$$

The solution at time $t$, denoted $\psi_t(x_0)$, is the flow map:

$$\psi_t(x_0) = x_0 + \int_0^t u_\theta(\psi_s(x_0), s)\, ds \tag{27}$$

Generation is then the process of sampling $x_0 \sim \mathcal{N}(0, I)$ and solving the ODE in Equation 26 to get $z = x_1 = \psi_1(x_0)$.

There are three related perspectives on CNFs:

  1. Flow map $\psi_t$: Maps initial conditions to solutions.
  2. Probability path $p_t$: How the density evolves over time.
  3. Velocity field $u_\theta$: The "instructions" for how particles move.

Figure 5: Three perspectives on continuous normalizing flows. The flow map $\psi_t$ gives particle trajectories, the probability path $p_t$ describes how densities evolve, and the velocity field $u(x,t)$ specifies how particles move. These are related by the continuity equation and ODE solving.

To see these three perspectives in action, consider a 1D example transforming a Gaussian distribution into a bimodal distribution:

Figure 6: Three synchronized views of a 1D continuous normalizing flow (Equation 26). Top panel: The probability density $p_t(x)$ evolves from a single Gaussian peak at $t=0$ to two separate modes at $t=1$. Middle panel: The velocity field $u_\theta(x,t)$ provides the "instructions": orange arrows (positive velocity) push particles right, blue arrows (negative velocity) push left. Notice how particles near $x=0$ split toward $x \approx \pm 1.5$. Bottom panel: Particle samples $x_t \sim p_t$ show actual data points following these velocity instructions. The three perspectives $p_t$, $u_\theta$, and $\psi_t$ are three ways of viewing a single underlying continuous transformation.

The same concept can be visualized in two dimensions:

Figure 7: Synchronized views of a 2D continuous normalizing flow transforming a centered Gaussian into a bimodal distribution.

Continuous Change of Variables

Motivation: From Discrete to Continuous

In discrete normalizing flows, we tracked density changes using the change of variables formula (Equation 8):

$$\log p_Z(z) = \log p_X(x) + \log |\det J_T(x)|$$

For CNFs, we need the continuous analog. The key questions are:

  1. How does the density field $p_t(x)$ evolve as particles flow through space?
  2. How do we compute $\log p_1(x_1)$ given $\log p_0(x_0)$ along a trajectory?

The answer involves two key results: the continuity equation and the instantaneous change of variables formula.

The Continuity Equation

Physical Intuition: Imagine dye particles flowing through water. If particles converge at a point, the concentration (density) increases there. If they spread out, the density decreases. The continuity equation formalizes this conservation of mass principle.

Theorem (Continuity Equation; Evans, 2010): If particles move according to the ODE $\frac{dx_t}{dt} = u_\theta(x_t, t)$ (Equation 26), then the density $p_t(x)$ evolves according to:

$$\frac{\partial p_t}{\partial t}(x) + \text{div}(p_t u_\theta)(x, t) = 0 \tag{28}$$

This is called the continuity equation (or transport equation):

  • $\frac{\partial p_t}{\partial t}(x)$: How the density changes over time at a fixed location $x$
  • $\text{div}(p_t u_\theta)(x, t)$: The divergence of the probability flux (net outflow of probability mass from $x$)
  • The equation says that, at any location, the rate of density change plus the net outflow must equal zero (mass is conserved)
  • Equivalently: $\frac{\partial p_t}{\partial t} = -\text{div}(p_t u_\theta)$, so density increases where flux converges (negative divergence) and decreases where flux diverges (positive divergence)

Proof Sketch:

We represent the density using the flow map (Equation 27):

$$p_t(x) = \int \delta(x - \psi_t(x_0))\, p_0(x_0)\, dx_0$$

Taking the time derivative:

$$\begin{align} \frac{\partial p_t}{\partial t}(x) &= \int \frac{\partial}{\partial t}\delta(x - \psi_t(x_0))\, p_0(x_0)\, dx_0 \tag{29a}\\ &= -\int \nabla_x \delta(x - \psi_t(x_0)) \cdot \frac{\partial \psi_t}{\partial t}(x_0)\, p_0(x_0)\, dx_0 \tag{29b}\\ &= -\int \nabla_x \delta(x - \psi_t(x_0)) \cdot u_\theta(\psi_t(x_0), t)\, p_0(x_0)\, dx_0 \tag{29c} \end{align}$$

The second line uses the chain rule for derivatives of delta functions. The third line uses $\frac{\partial \psi_t}{\partial t}(x_0) = u_\theta(\psi_t(x_0), t)$ from the ODE definition (Equation 26).

Now substitute $y = \psi_t(x_0)$ (a change of variables with $dy = |\det J_{\psi_t}(x_0)|\, dx_0$):

$$\frac{\partial p_t}{\partial t}(x) = -\int \nabla_x \delta(x - y) \cdot u_\theta(y, t)\, p_t(y)\, dy$$

Integration by parts (using $\int f \nabla g = -\int g \nabla f$ for rapidly decaying functions):

$$\frac{\partial p_t}{\partial t}(x) = -\nabla_x \cdot \int \delta(x - y)\, u_\theta(y, t)\, p_t(y)\, dy = -\text{div}(p_t u_\theta)(x, t)$$

which gives us Equation 28. □

Tracking Density Along a Trajectory

The continuity equation (Equation 28) describes how the entire density field evolves. But for a specific particle following the ODE trajectory $x_t$, we want to know: how does $\log p_t(x_t)$ change?

We need to apply the chain rule:

$$\frac{d}{dt} \log p_t(x_t) = \frac{\partial \log p_t}{\partial t}(x_t) + \nabla \log p_t(x_t) \cdot \frac{dx_t}{dt} \tag{30}$$

The first term is the change due to time, the second is the change due to the particle’s movement through space.

From the continuity equation (Equation 28), $\frac{\partial p_t}{\partial t} + \text{div}(p_t u_\theta) = 0$, we can derive:

$$\frac{\partial \log p_t}{\partial t} = \frac{1}{p_t}\frac{\partial p_t}{\partial t} = -\frac{1}{p_t}\text{div}(p_t u_\theta) = -\text{div}(u_\theta) - \frac{\nabla p_t}{p_t} \cdot u_\theta \tag{31}$$

The last equality uses the product rule: $\text{div}(p_t u_\theta) = \nabla p_t \cdot u_\theta + p_t\, \text{div}(u_\theta)$.

Now substituting Equation 31 into our chain rule expression (Equation 30):

$$\begin{align} \frac{d}{dt} \log p_t(x_t) &= -\text{div}(u_\theta)(x_t, t) - \nabla \log p_t(x_t) \cdot u_\theta(x_t, t) + \nabla \log p_t(x_t) \cdot \frac{dx_t}{dt}\\ &= -\text{div}(u_\theta)(x_t, t) - \nabla \log p_t(x_t) \cdot u_\theta(x_t, t) + \nabla \log p_t(x_t) \cdot u_\theta(x_t, t)\\ &= -\text{div}(u_\theta)(x_t, t) \end{align}$$

The middle terms cancel, giving us the Instantaneous Change of Variables Formula:

$$\frac{d}{dt} \log p_t(x_t) = -\text{tr}\left(\frac{\partial u_\theta}{\partial x}(x_t, t)\right) = -\text{div}(u_\theta)(x_t, t) \tag{32}$$

where the divergence operator is defined as:

$$\text{div}(u)(x, t) = \sum_{i=1}^d \frac{\partial u_i}{\partial x_i}(x, t) \tag{33}$$

Physical Intuition:

The divergence $\text{div}(u)(x, t)$ measures the net outflow of the vector field $u$ at point $x$ and time $t$:

  • $\text{div}(u)(x, t) > 0$: net outflow (the flow is diverging/expanding), so density must decrease
  • $\text{div}(u)(x, t) < 0$: net inflow (the flow is converging/contracting), so density must increase
  • $\text{div}(u)(x, t) = 0$: no net flow, so density stays constant

The negative sign in Equation 32 enforces conservation of probability mass:

$$\text{rate of change of log-density} = -(\text{net outflow})$$

The quantity $p_t(x) u_\theta(x, t)$ represents probability flux: how much probability mass flows through point $x$ at time $t$. When the flux diverges ($\text{div}(p_t u_\theta) > 0$), probability flows out, so density decreases ($\frac{\partial p_t}{\partial t} < 0$). The negative sign in the continuity equation makes this relationship consistent.

This is the continuous analog of how Jacobian determinants work in discrete flows (Equation 8): expansion decreases density, contraction increases it.

Figure 8: The divergence $\text{div}(u)$ controls density changes. When the flow has net outflow ($\text{div}(u) > 0$), density decreases. When it has net inflow ($\text{div}(u) < 0$), density increases. This is the continuous analog of the Jacobian determinant in discrete flows.

The Continuous Change of Variables Formula

Finally, integrating the instantaneous formula (Equation 32) from t=0t=0 to t=1t=1 along a trajectory:

$$\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \text{div}(u_\theta)(x_t, t)\, dt \tag{34}$$

This is the continuous analog of the discrete formula (Equation 8), $\log p_Z(z) = \log p_X(x) + \log |\det J_T(x)|$:

  • The initial log-density $\log p_0(x_0)$ corresponds to $\log p_X(x)$
  • The sum of log determinants becomes an integral of divergences
  • The discrete layers dissolve into a continuous flow

Equation 34 is fundamental to CNFs (Chen et al., 2018):

  • Training: We can compute $\log p_1(x_1)$ by solving the ODE and integrating the divergence
  • Generation: We sample $x_0 \sim p_0$ and solve the ODE forward to get $x_1 \sim p_1$

Training CNFs

Before flow matching revolutionized CNF training, the standard approach was maximum likelihood estimation using the continuous change of variables formula. While powerful in theory, this approach faces significant computational challenges.

Maximum Likelihood Training

The training objective maximizes the log-likelihood of data samples under the model:

$$\max_\theta \mathbb{E}_{z \sim p_{\text{data}}}[\log p_1(z)]$$

Using Equation 34, computing $\log p_1(z)$ for a data point $z$ requires:

  1. Backward ODE solve: Starting from $z$ at time $t=1$, solve the ODE backward to find the corresponding latent point $x_0$ at $t=0$
  2. Divergence integration: Track $\int_0^1 \text{div}(u_\theta)(x_t, t)\, dt$ along the trajectory

This leads to the augmented ODE system where we simultaneously track position and log-density:

$$\frac{d}{dt}\begin{bmatrix} x_t \\ \log p_t(x_t) \end{bmatrix} = \begin{bmatrix} u_\theta(x_t, t) \\ -\text{div}(u_\theta)(x_t, t) \end{bmatrix}$$
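The augmented system can be integrated with the same forward Euler scheme. The sketch below uses a diagonal linear field $u(x, t) = a \odot x$, chosen because both the endpoint and the log-density are then known in closed form; the specific field and step count are illustrative choices of mine:

```python
import numpy as np

# Diagonal linear field u(x, t) = a * x, so div(u) = sum(a) is constant and
# Eq. 34 gives log p1(x1) = log p0(x0) - sum(a) exactly.
a = np.array([0.5, -0.3])
u = lambda x, t: a * x
div_u = lambda x, t: np.sum(a)

def log_normal(x, std):
    # Log-density of a diagonal Gaussian N(0, diag(std^2))
    return np.sum(-0.5 * (x / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi))

# Solve d/dt [x, log p] = [u(x,t), -div(u)(x,t)] with forward Euler
x0 = np.array([1.0, 2.0])
x = x0.copy()
log_p = log_normal(x0, std=np.ones(2))        # log p0(x0) under N(0, I)
K = 20000
dt = 1.0 / K
for k in range(K):
    x, log_p = x + dt * u(x, k * dt), log_p - dt * div_u(x, k * dt)

# Analytic check: x1 = exp(a) * x0, distributed as N(0, diag(exp(2a)))
assert np.allclose(x, np.exp(a) * x0, atol=1e-3)
assert np.allclose(log_p, log_normal(x, std=np.exp(a)), atol=1e-3)
```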

Computational Bottlenecks

Two major computational challenges arise:

1. Divergence Computation: Computing $\text{div}(u_\theta) = \sum_{i=1}^d \frac{\partial u_i}{\partial x_i}$ naively requires $d$ backward passes through the network, costing $O(d^2)$ operations. Grathwohl et al. (2019) introduced Hutchinson's trace estimator to reduce this to $O(d)$ using unbiased Monte Carlo estimation:

$$\text{div}(u_\theta)(x, t) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\epsilon^T J_{u_\theta}(x, t)\, \epsilon\right]$$

where the expectation can be approximated with a single random sample.
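A quick NumPy demonstration on a toy vector field whose Jacobian (and hence exact divergence) is available in closed form; in a real CNF each probe $\epsilon^T J \epsilon$ would be computed with a vector-Jacobian product rather than by materializing $J$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6

# Toy field u(x) = tanh(A x), with Jacobian J = diag(1 - tanh^2(Ax)) A
A = rng.normal(size=(d, d))
x = rng.normal(size=d)
J = (1.0 - np.tanh(A @ x) ** 2)[:, None] * A
exact_div = np.trace(J)

# Hutchinson: tr(J) = E[eps^T J eps] for eps with identity covariance
n_samples = 200_000
eps = rng.normal(size=(n_samples, d))
estimates = np.einsum('ni,ij,nj->n', eps, J, eps)
assert abs(estimates.mean() - exact_div) < 0.2
```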

2. Memory for Backpropagation: Standard backpropagation through ODE solvers requires storing all intermediate states, leading to $O(K \cdot d)$ memory where $K$ is the number of solver steps. Chen et al. (2018) introduced the adjoint method, which reduces memory to $O(d)$ by computing gradients through an auxiliary backward solve, trading speed for memory efficiency.

Why This Is Slow

Despite these optimizations, training CNFs via maximum likelihood remains computationally expensive:

  • Each training iteration requires solving the ODE backward (typically 50-100 function evaluations)
  • The divergence must be computed at every ODE step
  • Gradient computation requires additional backward passes

For high-dimensional problems (images, videos), training can take orders of magnitude longer than discrete normalizing flows or other generative models. This motivated the development of flow matching, which we’ll explore in the next post.

Summary

We’ve built up the complete theory of Continuous Normalizing Flows:

  1. CNFs transform distributions continuously via ODEs (Equation 26): $\frac{dx_t}{dt} = u_\theta(x_t, t)$.
  2. Three equivalent perspectives: flow map $\psi_t$, probability path $p_t$, and velocity field $u_\theta$.
  3. Continuity equation (Equation 28): Describes how the density field evolves in space and time.
  4. Instantaneous change of variables (Equation 32): $\frac{d}{dt} \log p_t(x_t) = -\text{div}(u_\theta)(x_t, t)$.
  5. Continuous change of variables (Equation 34): $\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \text{div}(u_\theta)(x_t, t)\, dt$.
  6. Traditional maximum likelihood training requires expensive ODE solving and divergence computation.
  6. Traditional maximum likelihood training requires expensive ODE solving and divergence computation.

In the next post, we’ll explore flow matching, a breakthrough approach that trains CNFs without solving ODEs during training, making them practical for large-scale applications.

References

[1] Laurent Dinh, David Krueger, and Yoshua Bengio. “NICE: Non-linear Independent Components Estimation.” ICLR Workshop, 2015. arXiv:1410.8516

[2] Danilo Rezende and Shakir Mohamed. “Variational Inference with Normalizing Flows.” ICML, 2015. arXiv:1505.05770

[3] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” ICLR, 2017. arXiv:1605.08803

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR, 2016. arXiv:1512.03385

[5] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. “Invertible Residual Networks.” ICML, 2019. arXiv:1811.00995

[6] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. “Residual Flows for Invertible Generative Modeling.” NeurIPS, 2019. arXiv:1906.02735

[7] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. “Spectral Normalization for Generative Adversarial Networks.” ICLR, 2018. arXiv:1802.05957

[8] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. “Neural Ordinary Differential Equations.” NeurIPS, 2018. arXiv:1806.07366

[9] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. “FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.” ICLR, 2019. arXiv:1810.01367

[10] Lawrence C. Evans. “Partial Differential Equations.” American Mathematical Society, 2010.

Additional Resources

For further reading on flow matching and related topics, I recommend:

  • Michael Albergo and Eric Vanden-Eijnden. “Building Normalizing Flows with Stochastic Interpolants.” ICLR, 2023. arXiv:2209.15571

  • Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. “Flow Matching for Generative Modeling.” ICLR, 2023. arXiv:2210.02747

  • Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. “Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport.” TMLR, 2024. arXiv:2302.00482

  • Blog posts and resources that inspired this series: