This is Part 1 of the Flow Matching Series

Introduction

How do modern generative models generate photorealistic images, videos, or protein structures? At their core, they solve a deceptively simple problem: transform random noise into samples from a target data distribution. However, of the many blog posts and learning resources on the subject, few provide depth while connecting the dots between normalizing flows, continuous normalizing flows, and flow matching. The goal of this series is to build up the theory and technical implementation of these methods from the foundations. I also hope to educate myself on these topics as we go along.

In this first post, I want to establish the mathematical foundations rigorously. We’ll start with a concrete example (planar flows), see why such flows need to be stacked, discover that stacking leads naturally to a differential equation, and understand why the continuous version works better.

Generative Modeling as Sampling

Suppose we want to generate images of cats. Any image can be represented as a vector:

$$z \in \mathbb{R}^d \tag{1}$$

where $d$ is the dimensionality (e.g., for a $64 \times 64$ RGB image, $d = 64 \times 64 \times 3 = 12{,}288$). Not all vectors in $\mathbb{R}^d$ correspond to "good" images of cats. There exists an unknown distribution $p_{\text{data}}(z)$ that assigns high probability to realistic cat images and low probability to noise.

Generative modeling is learning to sample from this target distribution:

$$z \sim p_{\text{data}}(z) \tag{2}$$

Here we face two fundamental challenges:

  1. We don't have a formula for $p_{\text{data}}$, only samples $\{z_1, \ldots, z_N\}$ (e.g., sample images of cats)
  2. Even if we knew $p_{\text{data}}$, sampling might be intractable

Transport Map Approach

Start with a simple distribution we can sample from:

$$x \sim p_0(x) = \mathcal{N}(0, I_d) \tag{3}$$

Then learn a transformation $T: \mathbb{R}^d \to \mathbb{R}^d$ such that:

$$z = T(x) \implies z \sim p_{\text{data}}(z) \tag{4}$$

We denote the distribution of $T(x)$ as the pushforward $T_\# p_0$.

Figure 1: The transport map approach. We learn a transformation $T$ that maps samples from a simple base distribution $p_0$ (e.g., Gaussian) to the complex target distribution $p_{\text{data}}$. The pushforward operation $T_\# p_0$ represents the distribution of $T(x)$ when $x \sim p_0$.

Generating samples from the target distribution then becomes:

  1. Sample $x \sim \mathcal{N}(0, I)$
  2. Deterministically compute $z = T(x)$
  3. Output $z$ (a sample from the learned distribution)

For this to work, we need to answer:

  1. What form should $T$ take? (Architecture)
  2. How do we train $T$? (Objective function)
  3. How do we evaluate the density of the learned distribution? (For computing likelihoods)

How Do We Train This Transformation?

To learn TT, we need an objective function. The most natural choice is maximum likelihood (Dinh et al., 2015; Rezende & Mohamed, 2015): make the model distribution match the data distribution by maximizing:

$$\max_\theta \mathbb{E}_{z \sim p_{\text{data}}} [\log p_\theta(z)] \tag{5}$$

where $p_\theta(z)$ is the density induced by the learned transformation $T_\theta$, parameterized by $\theta$.

But if $z = T_\theta(x)$ where $x \sim \mathcal{N}(0, I)$, how do we compute $p_\theta(z)$?

This requires the change of variables formula.

Intuition: Transformations Change Densities

Consider a simple 1D example: $X \sim \mathcal{N}(0, 1)$ and $Z = 2X$ (stretching by a factor of 2).

  • The transformation spreads samples apart
  • If samples spread apart, the probability density must decrease to keep the total probability equal to 1
  • Specifically: $p_Z(z) = p_X(z/2) \cdot \frac{1}{2}$

The factor $1/2$ corrects for the volume change. In higher dimensions, the Jacobian determinant plays the same volume-correcting role.

Figure 2: Intuition for change of variables in 1D. When stretched by a factor of 2 ($Z = 2X$), the density must decrease by a factor of 2 to preserve total probability mass. The interval $\Delta x$ becomes $2\Delta x$, so the height must be halved.
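Before stating the general theorem, the 1D claim is easy to verify numerically: $Z = 2X$ is exactly $\mathcal{N}(0, 2^2)$, whose density matches $p_X(z/2) \cdot \frac{1}{2}$. A minimal NumPy check (the helper name `normal_pdf` is mine):

```python
import numpy as np

def normal_pdf(x, std=1.0):
    # Density of N(0, std^2), written out so the check is self-contained
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Z = 2X with X ~ N(0, 1). Change of variables: p_Z(z) = p_X(z/2) * 1/2.
# Z is also exactly N(0, 2^2), so we can compare against that density directly.
z = np.linspace(-4.0, 4.0, 9)
p_z_change_of_vars = normal_pdf(z / 2) * 0.5
p_z_exact = normal_pdf(z, std=2.0)

assert np.allclose(p_z_change_of_vars, p_z_exact)
```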

Change of Variables

Theorem (Change of Variables): If $x \sim p_X(x)$ and $z = T(x)$ where $T$ is a diffeomorphism (a smooth bijection with differentiable inverse $T^{-1}$), then:

$$p_Z(z) = p_X(T^{-1}(z)) \left|\det J_{T^{-1}}(z)\right| \tag{6}$$

where $J_{T^{-1}}(z) \in \mathbb{R}^{d \times d}$ is the Jacobian matrix: $[J_{T^{-1}}]_{ij} = \frac{\partial [T^{-1}]_i}{\partial z_j}$

Equivalently, using $T$ instead of $T^{-1}$:

$$p_Z(T(x)) = p_X(x) \left|\det J_T(x)\right|^{-1} \tag{7}$$

The determinant $\left|\det J_T(x)\right|$ measures how $T$ locally scales volumes:

  • If $|\det J_T(x)| > 1$: $T$ expands space near $x$, so densities must decrease
  • If $|\det J_T(x)| < 1$: $T$ contracts space near $x$, so densities must increase

This ensures probability mass is conserved: $\int p_Z(z)\, dz = \int p_X(x)\, dx = 1$.

For maximum likelihood, we work with log-densities. Taking the logarithm of Equation 6:

$$\log p_Z(z) = \log p_X(T^{-1}(z)) + \log \left|\det J_{T^{-1}}(z)\right| \tag{8}$$

Given data $\{z_1, \ldots, z_N\} \sim p_{\text{data}}$, we can train with maximum likelihood via:

$$\max_\theta \sum_{i=1}^N \log p_\theta(z_i) = \max_\theta \sum_{i=1}^N \left[ \log p_0(T_\theta^{-1}(z_i)) + \log \left|\det J_{T_\theta^{-1}}(z_i)\right| \right] \tag{9}$$

In practice, computing the objective in Equation 9 with a neural-network-parameterized $T_\theta$ imposes the following constraints:

  • Invertibility: $T^{-1}$ needs to map data back to latent space
  • Efficient inverse: Computing $T^{-1}(z)$ must be tractable
  • Efficient Jacobian: Computing $\log \left|\det J_{T^{-1}}(z)\right|$ must be tractable

Normalizing Flows: Making Jacobians Tractable

For general neural networks, the Jacobian is a $d \times d$ matrix, and computing its determinant naively requires $O(d^3)$ operations (e.g., via LU decomposition). For images with $d = 12{,}288$, this becomes completely intractable.

Dinh et al. (2015) and Rezende & Mohamed (2015) introduced normalizing flows with special architectures where the Jacobian determinant is cheap to compute.

Strategy: Triangular Jacobians

If $J_T$ is triangular (upper or lower), then:

$$\det J_T = \prod_{i=1}^d [J_T]_{ii} \tag{10}$$

Computing Equation 10 requires only $O(d)$ operations. This is a fundamental property of determinants of triangular matrices.

Affine Coupling Layers (RealNVP)

Idea: Make the Jacobian triangular so that $\det J = \prod_i J_{ii}$.

Dinh et al. (2017) introduced the affine coupling layer architecture: split the input as $x = [x_1, x_2]$ with $x_1, x_2 \in \mathbb{R}^{d/2}$:

$$\begin{align} z_1 &= x_1 \tag{11a}\\ z_2 &= x_2 \odot \exp(s(x_1)) + t(x_1) \tag{11b} \end{align}$$

where $s, t : \mathbb{R}^{d/2} \to \mathbb{R}^{d/2}$ are unconstrained neural networks, and $\odot$ denotes element-wise multiplication.

Why this works:

  1. Explicit inverse:
$$x_1 = z_1, \quad x_2 = (z_2 - t(z_1)) \odot \exp(-s(z_1)) \tag{12}$$
  2. Triangular Jacobian:
$$J_T = \begin{bmatrix} I_{d/2} & 0 \\ \frac{\partial z_2}{\partial x_1} & \text{diag}(\exp(s(x_1))) \end{bmatrix} \tag{13}$$
  3. Cheap log-det using Equation 10:
$$\log |\det J_T| = \sum_{i=1}^{d/2} s(x_1)_i \tag{14}$$

Now we just sum the network outputs, with no matrix operations required. However, a single coupling layer leaves $z_1 = x_1$ unchanged (half the dimensions are untransformed).

To transform all dimensions, we must stack multiple layers with alternating masks:

Layer 1: transform dimensions [2,4,6,...], freeze [1,3,5,...]
Layer 2: transform dimensions [1,3,5,...], freeze [2,4,6,...]
Layer 3: transform dimensions [2,4,6,...], freeze [1,3,5,...]
...
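Equations 11–14 translate almost line for line into code. Below is a minimal NumPy sketch of a single affine coupling layer; the fixed linear maps standing in for the networks $s$ and $t$ are placeholders of my own, not the RealNVP architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Toy stand-ins for the networks s and t: fixed maps R^{d/2} -> R^{d/2}.
# Invertibility never requires inverting s or t, so any function works here.
Ws = rng.normal(size=(d // 2, d // 2))
Wt = rng.normal(size=(d // 2, d // 2))
s = lambda x1: np.tanh(x1 @ Ws.T)   # bounded scale for numerical stability
t = lambda x1: x1 @ Wt.T

def coupling_forward(x):
    x1, x2 = x[: d // 2], x[d // 2 :]
    z1 = x1                                   # Eq. 11a: pass-through half
    z2 = x2 * np.exp(s(x1)) + t(x1)           # Eq. 11b: affine transform
    log_det = np.sum(s(x1))                   # Eq. 14: cheap log-determinant
    return np.concatenate([z1, z2]), log_det

def coupling_inverse(z):
    z1, z2 = z[: d // 2], z[d // 2 :]
    x1 = z1
    x2 = (z2 - t(z1)) * np.exp(-s(z1))        # Eq. 12: explicit inverse
    return np.concatenate([x1, x2])

x = rng.normal(size=d)
z, log_det = coupling_forward(x)
assert np.allclose(coupling_inverse(z), x)    # exact invertibility
assert np.isfinite(log_det)
```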

The dilemma:

  • Few layers $\rightarrow$ limited expressiveness
  • Many layers $\rightarrow$ expensive computation, harder to train

Can we do better?

Residual Flows: Toward Continuous Transformations

Motivation: Full-Rank Updates

Coupling layers have triangular Jacobians by construction. What if we want full-rank Jacobians, where every output can depend on every input? Taking inspiration from ResNets (He et al., 2016), Chen et al. (2019) introduced residual flows based on residual connections:

$$T(x) = x + u(x) \tag{15}$$

where $u: \mathbb{R}^d \to \mathbb{R}^d$ is an unconstrained neural network.

Advantages:

  • The Jacobian is $J_T(x) = I + J_u(x)$, which is dense in general: no triangular structure is imposed
  • More expressive than coupling layers

But how do we ensure invertibility?

Banach Fixed Point Theorem (Behrmann et al., 2019):

If $\|u\|_{\text{Lip}} < 1$ (Lipschitz constant less than 1), then $T(x) = x + u(x)$ as defined in Equation 15 is a bijection.

Proof sketch:

  • For invertibility, we need $T(x_1) = T(x_2) \implies x_1 = x_2$
  • Assume $T(x_1) = T(x_2)$: then $\|x_1 - x_2\| = \|T(x_1) - u(x_1) - T(x_2) + u(x_2)\| = \|u(x_1) - u(x_2)\| \leq L\|x_1 - x_2\|$
  • If $L < 1$ and $x_1 \neq x_2$, dividing by $\|x_1 - x_2\|$ gives $1 \leq L < 1$, a contradiction
  • Therefore $x_1 = x_2$ (injectivity)

In practice, the Lipschitz constraint is enforced via spectral normalization of the weight matrices (Miyato et al., 2018) together with bounded activation functions.
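The contraction argument also yields a practical inversion scheme: iterate $x \leftarrow z - u(x)$, which converges to the unique preimage. A NumPy sketch with a toy contractive block of my own (spectral normalization applied directly to the weight matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Toy residual block u(x) = tanh(W x). tanh is 1-Lipschitz, so after scaling
# W to spectral norm 0.9 the block has Lipschitz constant at most 0.9 < 1.
W = rng.normal(size=(d, d))
W = 0.9 * W / np.linalg.norm(W, 2)       # spectral normalization
u = lambda x: np.tanh(W @ x)

T = lambda x: x + u(x)                   # residual flow layer (Eq. 15)

def invert(z, n_iter=500):
    # Fixed-point iteration x <- z - u(x); a contraction, so it converges
    # geometrically to the unique x with T(x) = z.
    x = z.copy()
    for _ in range(n_iter):
        x = z - u(x)
    return x

z = T(rng.normal(size=d))
x = invert(z)
assert np.allclose(T(x), z, atol=1e-8)
```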

Planar Flows

Before introducing the general case, let's look at a specific type of residual flow: planar flows (Rezende & Mohamed, 2015). A planar flow has the form:

$$T(x) = x + \sigma(w^T x + b)\, a \tag{16}$$

where the transformation is constant on hyperplanes $\{w^T x = c\}$ and:

  • $w, a \in \mathbb{R}^d$ are weight vectors
  • $b \in \mathbb{R}$ is a bias
  • $\sigma: \mathbb{R} \to \mathbb{R}$ is a smooth activation (e.g., $\tanh$)

Intuition: This transformation "pushes" points in the direction of $a$, with the magnitude determined by how the point projects onto $w$.

Figure 3: Geometry of a planar flow (Equation 16). Points are pushed in the direction of vector $a$, with magnitude controlled by their projection onto the normal vector $w$. The transformation is constant along hyperplanes orthogonal to $w$.

Using the matrix determinant lemma, we obtain a cheap-to-compute log-determinant of the Jacobian:

$$\log |\det J_T(x)| = \log \left|1 + \sigma'(w^T x + b)\, w^T a\right| \tag{17}$$
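Equation 17 can be sanity-checked against a finite-difference Jacobian. A NumPy sketch with arbitrary parameter values of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
w, a = rng.normal(size=d), rng.normal(size=d)
b = 0.3

sigma = np.tanh
dsigma = lambda h: 1.0 - np.tanh(h) ** 2      # derivative of tanh

def planar(x):
    return x + sigma(w @ x + b) * a           # Eq. 16

def log_det_analytic(x):
    # Matrix determinant lemma: det(I + c a w^T) = 1 + c w^T a  (Eq. 17)
    return np.log(np.abs(1.0 + dsigma(w @ x + b) * (w @ a)))

# Central finite differences: column j of J is dT/dx_j
x = rng.normal(size=d)
eps = 1e-6
J = np.stack([(planar(x + eps * e) - planar(x - eps * e)) / (2 * eps)
              for e in np.eye(d)], axis=1)

assert np.allclose(np.log(np.abs(np.linalg.det(J))),
                   log_det_analytic(x), atol=1e-5)
```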

From Discrete Layers to Continuous Dynamics

Now consider a composition of $K$ residual flow layers:

$$T = T_K \circ T_{K-1} \circ \cdots \circ T_1 \tag{18}$$

but instead of having each layer make a “full-sized” update, let’s have each layer make a small update:

$$T_k(x) = x + \frac{1}{K} u_k(x) \tag{19}$$

where $u_k : \mathbb{R}^d \to \mathbb{R}^d$ is a neural network ($u_k(x) = \sigma(w_k^T x + b_k)\, a_k$ for planar flows). Each layer takes a step of size $O(1/K)$. As $K$ increases, each individual step becomes smaller and more steps are taken, but the total "distance traveled" stays roughly constant.

Figure 4: From discrete layers to continuous dynamics. As we increase the number of layers $K$ and decrease the step size $1/K$, the discrete sequence of transformations converges to a continuous trajectory governed by an ODE.

The recurrence relation: starting from $x_0$ and applying layers sequentially:

$$\begin{align} x_1 &= x_0 + \frac{1}{K} u_1(x_0) \tag{20a}\\ x_2 &= x_1 + \frac{1}{K} u_2(x_1) \tag{20b}\\ &\vdots \nonumber\\ x_k &= x_{k-1} + \frac{1}{K} u_k(x_{k-1}) \tag{20c} \end{align}$$

Rearranging Equation 20c:

$$\frac{x_k - x_{k-1}}{1/K} = u_k(x_{k-1}) \tag{21}$$

This is exactly a forward Euler discretization. If we associate time $t_k = k/K$ with step $k$, then $\Delta t = 1/K$ is the time step, and we have:

$$\frac{x_k - x_{k-1}}{\Delta t} \approx \frac{dx}{dt}\bigg|_{t=t_{k-1}} \tag{22}$$

which approximates the differential equation:

$$\frac{dx(t)}{dt} = u(x(t), t) \tag{23}$$

More explicitly, we are discretizing the time interval $[0, 1]$ into $K$ steps:

$$0 = t_0 < t_1 < \cdots < t_K = 1, \quad \text{where } t_k = \frac{k}{K}$$

At each time step, we compute:

$$x_k = x_{k-1} + \underbrace{(t_k - t_{k-1})}_{\Delta t = 1/K} \cdot u_k(x_{k-1})$$

As $K \to \infty$ (and correspondingly $\Delta t = 1/K \to 0$), the discrete sequence $x_0, x_1, \ldots, x_K$ converges to a continuous trajectory $x(t)$ for $t \in [0, 1]$ that satisfies Equation 23:

$$\frac{dx_t}{dt} = u(x_t, t), \quad x_0 \sim p_{\text{init}} \tag{24}$$

The final transformation is given by integrating the ODE:

$$z = x_1 = x_0 + \int_0^1 u(x_t, t)\, dt \tag{25}$$
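To see this convergence concretely, take the linear field $u(x, t) = -x$, whose exact flow is $x(t) = x_0 e^{-t}$. Composing $K$ residual layers with step $1/K$ is exactly forward Euler, and the endpoint error shrinks as $K$ grows. A minimal NumPy sketch:

```python
import numpy as np

# Vector field with a known flow: u(x, t) = -x gives x(t) = x0 * exp(-t).
u = lambda x, t: -x

def euler_flow(x0, K):
    # K residual layers T_k(x) = x + (1/K) u(x, t_k): forward Euler on [0, 1]
    x, dt = x0, 1.0 / K
    for k in range(K):
        x = x + dt * u(x, k * dt)
    return x

x0 = np.array([1.0, -2.0])
exact = x0 * np.exp(-1.0)

err_10 = np.max(np.abs(euler_flow(x0, 10) - exact))
err_1000 = np.max(np.abs(euler_flow(x0, 1000) - exact))
assert err_1000 < err_10 / 50     # error shrinks roughly linearly in 1/K
```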

Chen et al. (2018) introduced this framework as Continuous Normalizing Flows (CNF), with Grathwohl et al. (2019) demonstrating practical training methods.

Continuous Normalizing Flows

A CNF is defined by a vector field $u_\theta: \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$:

$$\frac{dx_t}{dt} = u_\theta(x_t, t), \quad x_0 \sim p_{\text{init}} = \mathcal{N}(0, I) \tag{26}$$

The solution at time $t$, denoted $\psi_t(x_0)$, is the flow map:

$$\psi_t(x_0) = x_0 + \int_0^t u_\theta(\psi_s(x_0), s)\, ds \tag{27}$$

Generation is then the process of sampling $x_0 \sim \mathcal{N}(0, I)$ and solving the ODE in Equation 26 to get $z = x_1 = \psi_1(x_0)$.

There are three related perspectives on CNFs:

  1. Flow map $\psi_t$: Maps initial conditions to solutions.
  2. Probability path $p_t$: How the density evolves over time.
  3. Velocity field $u_\theta$: The "instructions" for how particles move.

Figure 5: Three perspectives on continuous normalizing flows. The flow map $\psi_t$ gives particle trajectories, the probability path $p_t$ describes how densities evolve, and the velocity field $u(x,t)$ specifies how particles move. These are related by the continuity equation and ODE solving.

To see these three perspectives in action, consider a 1D example transforming a Gaussian distribution into a bimodal distribution:

Figure 6: Three synchronized views of a 1D continuous normalizing flow (Equation 26). Top panel: The probability density $p_t(x)$ evolves from a single Gaussian peak at $t=0$ to two separate modes at $t=1$. Middle panel: The velocity field $u_\theta(x,t)$ provides the "instructions": orange arrows (positive velocity) push particles right, blue arrows (negative velocity) push left. Notice how particles near $x=0$ split toward $x \approx \pm 1.5$. Bottom panel: Particle samples $x_t \sim p_t$ show actual data points following these velocity instructions. The three perspectives $p_t$, $u_\theta$, and $\psi_t$ are three ways of viewing a single underlying continuous transformation.

The same concept can be visualized in two dimensions:

Figure 7: Synchronized views of a 2D continuous normalizing flow transforming a centered Gaussian into a bimodal distribution.

Continuous Change of Variables

Motivation: From Discrete to Continuous

In discrete normalizing flows, we tracked density changes using the change of variables formula (Equation 8):

$$\log p_Z(z) = \log p_X(x) + \log |\det J_T(x)|$$

For CNFs, we need the continuous analog. The key questions are:

  1. How does the density field $p_t(x)$ evolve as particles flow through space?
  2. How do we compute $\log p_1(x_1)$ given $\log p_0(x_0)$ along a trajectory?

The answer involves two key results: the continuity equation and the instantaneous change of variables formula.

The Continuity Equation

Physical Intuition: Imagine dye particles flowing through water. If particles converge at a point, the concentration (density) increases there. If they spread out, the density decreases. The continuity equation formalizes this conservation of mass principle.

Theorem (Continuity Equation; Evans, 2010): If particles move according to the ODE $\frac{dx_t}{dt} = u_\theta(x_t, t)$ (Equation 26), then the density $p_t(x)$ evolves according to:

$$\frac{\partial p_t}{\partial t}(x) + \text{div}(p_t u_\theta)(x, t) = 0 \tag{28}$$

This is called the continuity equation (or transport equation):

  • $\frac{\partial p_t}{\partial t}(x)$: How the density changes over time at a fixed location $x$
  • $\text{div}(p_t u_\theta)(x, t)$: The divergence of the probability flux (net outflow of probability mass from $x$)
  • The equation says that, at any location, the rate of density change plus the net outflow must equal zero (mass is conserved)
  • Equivalently: $\frac{\partial p_t}{\partial t} = -\text{div}(p_t u_\theta)$, so density increases where flux converges (negative divergence) and decreases where flux diverges (positive divergence)

Proof Sketch:

We represent the density using the flow map (Equation 27):

$$p_t(x) = \int \delta(x - \psi_t(x_0))\, p_0(x_0)\, dx_0$$

Taking the time derivative:

$$\begin{align} \frac{\partial p_t}{\partial t}(x) &= \int \frac{\partial}{\partial t}\delta(x - \psi_t(x_0))\, p_0(x_0)\, dx_0 \tag{29a}\\ &= -\int \nabla_x \delta(x - \psi_t(x_0)) \cdot \frac{\partial \psi_t}{\partial t}(x_0)\, p_0(x_0)\, dx_0 \tag{29b}\\ &= -\int \nabla_x \delta(x - \psi_t(x_0)) \cdot u_\theta(\psi_t(x_0), t)\, p_0(x_0)\, dx_0 \tag{29c} \end{align}$$

The second line uses the chain rule for derivatives of delta functions. The third line uses $\frac{\partial \psi_t}{\partial t}(x_0) = u_\theta(\psi_t(x_0), t)$ from the ODE definition (Equation 26).

Now substitute $y = \psi_t(x_0)$ (a change of variables with $dy = |\det J_{\psi_t}(x_0)|\, dx_0$):

$$\frac{\partial p_t}{\partial t}(x) = -\int \nabla_x \delta(x - y) \cdot u_\theta(y, t)\, p_t(y)\, dy$$

Integration by parts (using $\int f \nabla g = -\int g \nabla f$ for rapidly decaying functions):

$$\frac{\partial p_t}{\partial t}(x) = -\nabla_x \cdot \int \delta(x - y)\, u_\theta(y, t)\, p_t(y)\, dy = -\text{div}(p_t u_\theta)(x, t)$$

which gives us Equation 28. □

Tracking Density Along a Trajectory

The continuity equation (Equation 28) describes how the entire density field evolves. But for a specific particle following the ODE trajectory $x_t$, we want to know: how does $\log p_t(x_t)$ change?

We need to apply the chain rule:

$$\frac{d}{dt} \log p_t(x_t) = \frac{\partial \log p_t}{\partial t}(x_t) + \nabla \log p_t(x_t) \cdot \frac{dx_t}{dt} \tag{30}$$

The first term is the change due to time, the second is the change due to the particle’s movement through space.

From the continuity equation (Equation 28), $\frac{\partial p_t}{\partial t} + \text{div}(p_t u_\theta) = 0$, we can derive:

$$\frac{\partial \log p_t}{\partial t} = \frac{1}{p_t}\frac{\partial p_t}{\partial t} = -\frac{1}{p_t}\text{div}(p_t u_\theta) = -\text{div}(u_\theta) - \frac{\nabla p_t}{p_t} \cdot u_\theta \tag{31}$$

The last equality uses the product rule: $\text{div}(p_t u_\theta) = \nabla p_t \cdot u_\theta + p_t\, \text{div}(u_\theta)$.

Now substituting Equation 31 into our chain rule expression (Equation 30):

$$\begin{align} \frac{d}{dt} \log p_t(x_t) &= -\text{div}(u_\theta)(x_t, t) - \nabla \log p_t(x_t) \cdot u_\theta(x_t, t) + \nabla \log p_t(x_t) \cdot \frac{dx_t}{dt}\\ &= -\text{div}(u_\theta)(x_t, t) - \nabla \log p_t(x_t) \cdot u_\theta(x_t, t) + \nabla \log p_t(x_t) \cdot u_\theta(x_t, t)\\ &= -\text{div}(u_\theta)(x_t, t) \end{align}$$

The middle terms cancel, giving us the Instantaneous Change of Variables Formula:

$$\frac{d}{dt} \log p_t(x_t) = -\text{tr}\left(\frac{\partial u_\theta}{\partial x}(x_t, t)\right) = -\text{div}(u_\theta)(x_t, t) \tag{32}$$

where the divergence operator is defined as:

$$\text{div}(u)(x, t) = \sum_{i=1}^d \frac{\partial u_i}{\partial x_i}(x, t) \tag{33}$$

Physical Intuition:

The divergence $\text{div}(u)(x, t)$ measures the net outflow of the vector field $u$ at point $x$ and time $t$:

  • $\text{div}(u)(x, t) > 0$: net outflow (the flow is diverging/expanding), so density must decrease
  • $\text{div}(u)(x, t) < 0$: net inflow (the flow is converging/contracting), so density must increase
  • $\text{div}(u)(x, t) = 0$: no net flow, so density stays constant

The negative sign in Equation 32 enforces conservation of probability mass:

$$\text{rate of change of log-density} = -(\text{net outflow})$$

The quantity $p_t(x) u_\theta(x, t)$ represents probability flux: how much probability mass flows through point $x$ at time $t$. When the flux diverges ($\text{div}(p_t u_\theta) > 0$), probability flows out, so density decreases ($\frac{\partial p_t}{\partial t} < 0$). The negative sign in the continuity equation makes this relationship consistent.

This is the continuous analog of how Jacobian determinants work in discrete flows (Equation 8): expansion decreases density, contraction increases it.

Figure 8: The divergence $\text{div}(u)$ controls density changes. When the flow has net outflow ($\text{div}(u) > 0$), density decreases. When it has net inflow ($\text{div}(u) < 0$), density increases. This is the continuous analog of the Jacobian determinant in discrete flows.

The Continuous Change of Variables Formula

Finally, integrating the instantaneous formula (Equation 32) from t=0t=0 to t=1t=1 along a trajectory:

$$\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \text{div}(u_\theta)(x_t, t)\, dt \tag{34}$$

This is the continuous analog of the discrete formula (Equation 8), $\log p_Z(z) = \log p_X(x) + \log |\det J_T(x)|$:

  • The initial log-density $\log p_0(x_0)$ corresponds to $\log p_X(x)$
  • The sum of log determinants becomes an integral of divergences
  • The discrete layers dissolve into a continuous flow

Equation 34 is fundamental to CNFs (Chen et al., 2018):

  • Training: We can compute $\log p_1(x_1)$ by solving the ODE and integrating the divergence
  • Generation: We sample $x_0 \sim p_0$ and solve the ODE forward to get $x_1 \sim p_1$

Training CNFs

Before flow matching revolutionized CNF training, the standard approach was maximum likelihood estimation using the continuous change of variables formula. While powerful in theory, this approach faces significant computational challenges.

Maximum Likelihood Training

The training objective maximizes the log-likelihood of data samples under the model:

$$\max_\theta \mathbb{E}_{z \sim p_{\text{data}}}[\log p_1(z)]$$

Using Equation 34, computing $\log p_1(z)$ for a data point $z$ requires:

  1. Backward ODE solve: Starting from $z$ at time $t=1$, solve the ODE backward to find the corresponding latent point $x_0$ at $t=0$
  2. Divergence integration: Track $\int_0^1 \text{div}(u_\theta)(x_t, t)\, dt$ along the trajectory

This leads to the augmented ODE system where we simultaneously track position and log-density:

$$\frac{d}{dt}\begin{bmatrix} x_t \\ \log p_t(x_t) \end{bmatrix} = \begin{bmatrix} u_\theta(x_t, t) \\ -\text{div}(u_\theta)(x_t, t) \end{bmatrix}$$
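The augmented system can be integrated with the same forward Euler scheme. The sketch below uses a diagonal linear field $u(x, t) = a \odot x$, chosen because both the endpoint and the log-density are then known in closed form; the specific field and step count are illustrative choices of mine:

```python
import numpy as np

# Diagonal linear field u(x, t) = a * x, so div(u) = sum(a) is constant and
# Eq. 34 gives log p1(x1) = log p0(x0) - sum(a) exactly.
a = np.array([0.5, -0.3])
u = lambda x, t: a * x
div_u = lambda x, t: np.sum(a)

def log_normal(x, std):
    # Log-density of a diagonal Gaussian N(0, diag(std^2))
    return np.sum(-0.5 * (x / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi))

# Solve d/dt [x, log p] = [u(x,t), -div(u)(x,t)] with forward Euler
x0 = np.array([1.0, 2.0])
x = x0.copy()
log_p = log_normal(x0, std=np.ones(2))        # log p0(x0) under N(0, I)
K = 20000
dt = 1.0 / K
for k in range(K):
    x, log_p = x + dt * u(x, k * dt), log_p - dt * div_u(x, k * dt)

# Analytic check: x1 = exp(a) * x0, distributed as N(0, diag(exp(2a)))
assert np.allclose(x, np.exp(a) * x0, atol=1e-3)
assert np.allclose(log_p, log_normal(x, std=np.exp(a)), atol=1e-3)
```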

Computational Bottlenecks

Two major computational challenges arise:

1. Divergence Computation: Computing $\text{div}(u_\theta) = \sum_{i=1}^d \frac{\partial u_i}{\partial x_i}$ naively requires $d$ backward passes through the network, costing $O(d^2)$ operations. Grathwohl et al. (2019) introduced Hutchinson's trace estimator to reduce this to $O(d)$ using unbiased Monte Carlo estimation:

$$\text{div}(u_\theta)(x, t) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\epsilon^T J_{u_\theta}(x, t)\, \epsilon\right]$$

where the expectation can be approximated with a single random sample.
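A quick NumPy demonstration on a toy vector field whose Jacobian (and hence exact divergence) is available in closed form; in a real CNF each probe $\epsilon^T J \epsilon$ would be computed with a vector-Jacobian product rather than by materializing $J$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6

# Toy field u(x) = tanh(A x), with Jacobian J = diag(1 - tanh^2(Ax)) A
A = rng.normal(size=(d, d))
x = rng.normal(size=d)
J = (1.0 - np.tanh(A @ x) ** 2)[:, None] * A
exact_div = np.trace(J)

# Hutchinson: tr(J) = E[eps^T J eps] for eps with identity covariance
n_samples = 200_000
eps = rng.normal(size=(n_samples, d))
estimates = np.einsum('ni,ij,nj->n', eps, J, eps)
assert abs(estimates.mean() - exact_div) < 0.2
```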

2. Memory for Backpropagation: Standard backpropagation through ODE solvers requires storing all intermediate states, leading to $O(K \cdot d)$ memory where $K$ is the number of solver steps. Chen et al. (2018) introduced the adjoint method, which reduces memory to $O(d)$ by computing gradients through an auxiliary backward solve, trading speed for memory efficiency.

Why This Is Slow

Despite these optimizations, training CNFs via maximum likelihood remains computationally expensive:

  • Each training iteration requires solving the ODE backward (typically 50-100 function evaluations)
  • The divergence must be computed at every ODE step
  • Gradient computation requires additional backward passes

For high-dimensional problems (images, videos), training can take orders of magnitude longer than discrete normalizing flows or other generative models. This motivated the development of flow matching, which we’ll explore in the next post.

Summary

We’ve built up the complete theory of Continuous Normalizing Flows:

  1. CNFs transform distributions continuously via ODEs (Equation 26): $\frac{dx_t}{dt} = u_\theta(x_t, t)$.
  2. Three equivalent perspectives: flow map $\psi_t$, probability path $p_t$, and velocity field $u_\theta$.
  3. Continuity equation (Equation 28): Describes how the density field evolves in space and time.
  4. Instantaneous change of variables (Equation 32): $\frac{d}{dt} \log p_t(x_t) = -\text{div}(u_\theta)(x_t, t)$.
  5. Continuous change of variables (Equation 34): $\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \text{div}(u_\theta)(x_t, t)\, dt$.
  6. Traditional maximum likelihood training requires expensive ODE solving and divergence computation.
  6. Traditional maximum likelihood training requires expensive ODE solving and divergence computation.

In the next post, we’ll explore flow matching, a breakthrough approach that trains CNFs without solving ODEs during training, making them practical for large-scale applications.

References

[1] Laurent Dinh, David Krueger, and Yoshua Bengio. “NICE: Non-linear Independent Components Estimation.” ICLR Workshop, 2015. arXiv:1410.8516

[2] Danilo Rezende and Shakir Mohamed. “Variational Inference with Normalizing Flows.” ICML, 2015. arXiv:1505.05770

[3] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” ICLR, 2017. arXiv:1605.08803

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR, 2016. arXiv:1512.03385

[5] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. “Invertible Residual Networks.” ICML, 2019. arXiv:1811.00995

[6] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. “Residual Flows for Invertible Generative Modeling.” NeurIPS, 2019. arXiv:1906.02735

[7] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. “Spectral Normalization for Generative Adversarial Networks.” ICLR, 2018. arXiv:1802.05957

[8] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. “Neural Ordinary Differential Equations.” NeurIPS, 2018. arXiv:1806.07366

[9] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. “FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.” ICLR, 2019. arXiv:1810.01367

[10] Lawrence C. Evans. “Partial Differential Equations.” American Mathematical Society, 2010.

Additional Resources

For further reading on flow matching and related topics, I recommend:

  • Michael Albergo and Eric Vanden-Eijnden. “Building Normalizing Flows with Stochastic Interpolants.” ICLR, 2023. arXiv:2209.15571

  • Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. “Flow Matching for Generative Modeling.” ICLR, 2023. arXiv:2210.02747

  • Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. “Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport.” TMLR, 2024. arXiv:2302.00482

  • Blog posts and resources that inspired this series: