Flow matching and diffusion models are two popular frameworks in generative modeling. Despite seeming similar, there is some confusion in the community about their exact connection. In this post, we aim to clear up this confusion and show that diffusion models and Gaussian flow matching are the same, although different model specifications can lead to different network outputs and sampling schedules. This is great news: it means you can use the two frameworks interchangeably.
Flow matching has gained popularity recently, due to the simplicity of its formulation and the “straightness” of its induced sampling trajectories. This raises the commonly asked question:
"Which is better, diffusion or flow matching?"
As we will see, diffusion models and flow matching are equivalent (for the common special case that the source distribution used with flow matching corresponds to a Gaussian), so there is no single answer to this question. In particular, we will show how to convert one formalism to another. But why does this equivalence matter? Well, it allows you to mix and match techniques developed from the two frameworks. For example, after training a flow matching model, you can use either a stochastic or deterministic sampling method (contrary to the common belief that flow matching is always deterministic).
We will focus on the most commonly used flow matching formalism with the optimal transport path.
Check this Google Colab for code used to produce plots and animations in this post.
We start with a quick overview of the two frameworks.
A diffusion process gradually destroys an observed datapoint \(\bf{x}\) (such as an image) over time \(t\), by mixing the data with Gaussian noise. The noisy data at time \(t\) is given by a forward process: \(\begin{equation} {\bf z}_t = \alpha_t {\bf x} + \sigma_t {\boldsymbol \epsilon}, \;\mathrm{where} \; {\boldsymbol \epsilon} \sim \mathcal{N}(0, {\bf I}). \label{eq:forward} \end{equation}\) \(\alpha_t\) and \(\sigma_t\) define the noise schedule. A noise schedule is called variance-preserving if \(\alpha_t^2 + \sigma_t^2 = 1\). The noise schedule is designed such that \({\bf z}_0\) is close to the clean data, and \({\bf z}_1\) is close to Gaussian noise.
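As a concrete illustration, here is a minimal NumPy sketch of the forward process in Equation (1). The flow matching schedule \(\alpha_t = 1-t, \sigma_t = t\) and a variance-preserving cosine schedule are used as assumed examples; the function and variable names are ours, not from the accompanying Colab.

```python
import numpy as np

def forward_process(x, t, rng, schedule="fm"):
    """Sample z_t = alpha_t * x + sigma_t * eps for a batch of datapoints x.

    schedule="fm":  alpha_t = 1 - t, sigma_t = t (flow matching schedule)
    schedule="cos": alpha_t = cos(pi t / 2), sigma_t = sin(pi t / 2) (variance preserving)
    """
    if schedule == "fm":
        alpha_t, sigma_t = 1.0 - t, t
    else:
        alpha_t, sigma_t = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)
    eps = rng.standard_normal(x.shape)
    return alpha_t * x + sigma_t * eps, eps

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2))            # toy 2D "data"
z_t, eps = forward_process(x, 0.5, rng)    # noisy data at t = 0.5
```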
To generate new samples, we can “reverse” the forward process: We initialize the sample \({\bf z}_1\) from a standard Gaussian. Given the sample \({\bf z}_t\) at time step \(t\), we predict what the clean sample might look like with a neural network (a.k.a. denoiser model) \(\hat{\bf x} = \hat{\bf x}({\bf z}_t; t)\), and then we project it back to a lower noise level \(s\) with the same forward transformation:
\(\begin{eqnarray}
{\bf z}_{s} &=& \alpha_{s} \hat{\bf x} + \sigma_{s} \hat{\boldsymbol \epsilon},\\
\end{eqnarray}\)
where \(\hat{\boldsymbol \epsilon} = ({\bf z}_t - \alpha_t \hat{\bf x}) / \sigma_t\).
(Alternatively we can train a neural network to predict the noise \(\hat{\boldsymbol \epsilon}\).)
We keep alternating between predicting the clean data, and projecting it back to a lower noise level until we get the clean sample.
This is the DDIM sampler.
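Here is a minimal sketch of this alternating procedure. The denoiser `x_pred(z, t)` and the schedule functions `alpha(t)`, `sigma(t)` are placeholders for your trained model and chosen schedule, not a specific library API.

```python
import numpy as np

def ddim_sample(x_pred, alpha, sigma, n_steps=100, dim=2, rng=None):
    """DDIM sampling: alternate between predicting the clean data and projecting
    it back to a lower noise level, z_s = alpha_s * x_hat + sigma_s * eps_hat."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(dim)                 # z_1 ~ N(0, I)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):            # step from t to s < t
        x_hat = x_pred(z, t)
        eps_hat = (z - alpha(t) * x_hat) / sigma(t)
        z = alpha(s) * x_hat + sigma(s) * eps_hat
    return z

# Example: toy denoiser that always predicts the origin, flow matching schedule.
sample = ddim_sample(lambda z, t: np.zeros_like(z), lambda t: 1.0 - t, lambda t: t)
```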
In Flow Matching, we view the forward process as a linear interpolation between the data \({\bf x}\) and a noise term \(\boldsymbol \epsilon\): \(\begin{eqnarray} {\bf z}_t = (1-t) {\bf x} + t {\boldsymbol \epsilon}.\\ \end{eqnarray}\)
This corresponds to the diffusion forward process if the noise is Gaussian (a.k.a. Gaussian flow matching) and we use the schedule \(\alpha_t = 1-t, \sigma_t = t\).
Using simple algebra, we can derive that \({\bf z}_t = {\bf z}_{s} + {\bf u} \cdot (t - s)\) for \(s < t\), where \({\bf u} = {\boldsymbol \epsilon} - {\bf x}\) is the “velocity”, “flow”, or “vector field”. Hence, to sample \({\bf z}_s\) given \({\bf z}_t\), we reverse time and replace the vector field with our best guess at time \(t\): \(\hat{\bf u} = \hat{\bf u}({\bf z}_t; t) = \hat{\boldsymbol \epsilon} - \hat{\bf x}\), represented by a neural network, to get
\[\begin{eqnarray} {\bf z}_{s} = {\bf z}_t + \hat{\bf u} \cdot (s - t).\\ \label{eq:flow_update} \end{eqnarray}\]Initializing the sample \({\bf z}_1\) from a standard Gaussian, we keep getting \({\bf z}_s\) at a lower noise level than \({\bf z}_t\), until we obtain the clean sample.
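The corresponding sampling loop, again as a sketch with a placeholder vector field network `u_pred(z, t)`:

```python
import numpy as np

def fm_euler_sample(u_pred, n_steps=100, dim=2, rng=None):
    """Euler sampler for flow matching: z_s = z_t + u_hat * (s - t),
    starting from z_1 ~ N(0, I) and stepping t -> s with s < t."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(dim)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        z = z + u_pred(z, t) * (s - t)
    return z

# Example: vector field u_hat = eps_hat - x_hat for a toy denoiser that always
# predicts the origin, so u_hat = z / t under alpha_t = 1 - t, sigma_t = t.
sample = fm_euler_sample(lambda z, t: z / t)
```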
So far, we can already discern the similar essences in the two frameworks:
1. Same forward process, if we assume that one end of flow matching is Gaussian, and the noise schedule of the diffusion model is in a particular form.
2. "Similar" sampling processes: both follow an iterative update that involves a guess of the clean data at the current time step. (Spoiler: below we will show they are exactly the same!)
It is commonly thought that the two frameworks differ in how they generate samples: Flow matching sampling is deterministic with “straight” paths, while diffusion model sampling is stochastic and follows “curved paths”. Below, we clarify this misconception. We will focus on deterministic sampling first, since it is simpler, and will discuss the stochastic case later on.
Imagine you want to use your trained denoiser model to transform random noise into a datapoint. Recall that the DDIM update is given by \({\bf z}_{s} = \alpha_{s} \hat{\bf x} + \sigma_{s} \hat{\boldsymbol \epsilon}\). Interestingly, by rearranging terms, the update can be written in the following form for several choices of network output and reparametrization:
\[\begin{equation} \tilde{\bf z}_{s} = \tilde{\bf z}_{t} + \mathrm{Network \; output} \cdot (\eta_s - \eta_t) \end{equation}\]

| Network Output | Reparametrization |
| --- | --- |
| \(\hat{\bf x}\)-prediction | \(\tilde{\bf z}_t = {\bf z}_t / \sigma_t\) and \(\eta_t = {\alpha_t}/{\sigma_t}\) |
| \(\hat{\boldsymbol \epsilon}\)-prediction | \(\tilde{\bf z}_t = {\bf z}_t / \alpha_t\) and \(\eta_t = {\sigma_t}/{\alpha_t}\) |
| \(\hat{\bf u}\)-flow matching vector field | \(\tilde{\bf z}_t = {\bf z}_t/(\alpha_t + \sigma_t)\) and \(\eta_t = {\sigma_t}/(\alpha_t + \sigma_t)\) |
Remember the flow matching update in Equation (4)? This should look similar. If we set the network output as \(\hat{\bf u}\) in the last line and let \(\alpha_t = 1- t\), \(\sigma_t = t\), we have \(\tilde{\bf z}_t = {\bf z}_t\) and \(\eta_t = t\), which is the flow matching update! More formally, the flow matching update is the discretized Euler integration of the sampling ODE (i.e., \(\mathrm{d}{\bf z}_t = \hat{\bf u} \mathrm{d}t\)), and with the flow matching noise schedule,
Diffusion with DDIM sampler == Flow matching sampler (Euler).
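To make the equivalence concrete, here is a small numerical check (with an arbitrary stand-in for the denoiser) that a single DDIM step and a single flow matching Euler step coincide under \(\alpha_t = 1-t, \sigma_t = t\):

```python
import numpy as np

rng = np.random.default_rng(0)

def x_pred(z, t):
    # Stand-in for a trained denoiser: a fixed affine map (any function works
    # for checking the algebraic identity below).
    return 0.3 * z + 1.0

t, s = 0.7, 0.6                     # one step from t to s < t
alpha = lambda t: 1.0 - t           # flow matching schedule
sigma = lambda t: t

z_t = rng.standard_normal(2)
x_hat = x_pred(z_t, t)
eps_hat = (z_t - alpha(t) * x_hat) / sigma(t)

# DDIM update
z_s_ddim = alpha(s) * x_hat + sigma(s) * eps_hat
# Flow matching Euler update with u_hat = eps_hat - x_hat
z_s_fm = z_t + (eps_hat - x_hat) * (s - t)

assert np.allclose(z_s_ddim, z_s_fm)   # identical up to floating point
```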
Some other comments on the DDIM sampler:
The DDIM sampler analytically integrates the reparametrized sampling ODE if the network output is constant over time. Of course the network prediction is not constant, but this means the inaccuracy of the DDIM sampler comes only from approximating the intractable integral of the network output (unlike the Euler sampler of the probability flow ODE, which incurs additional discretization error even for a constant network output).
The DDIM sampler is invariant to a linear scaling applied to the noise schedule \(\alpha_t\) and \(\sigma_t\), as scaling does not affect \(\tilde{\bf z}_t\) and \(\eta_t\). This is not true for other samplers e.g. Euler sampler of the probability flow ODE.
To validate the second claim, we present results obtained using several noise schedules, each of which is the flow matching schedule (\(\alpha_t = 1-t, \sigma_t = t\)) multiplied by a different scaling factor. Feel free to change the slider below the figure. At the left end, the scaling factor is \(1\), which is exactly the flow matching schedule (FM), while at the right end, the scaling factor is \(1/\sqrt{(1-t)^2 + t^2}\), which corresponds to a variance-preserving schedule (VP). We see that DDIM (and the flow matching sampler) always gives the same final data samples, regardless of the scaling of the schedule. The paths bend in different ways because we plot \({\bf z}_t\) (rather than \(\tilde{\bf z}_t\)), which is scale-dependent along the path. For the Euler sampler of the probability flow ODE, the scaling makes a real difference: both the paths and the final samples change.
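The scale-invariance claim can also be checked numerically. The sketch below uses a toy 1D Gaussian data distribution, for which the optimal denoiser \(\mathbb{E}[{\bf x} \mid {\bf z}_t]\) is available in closed form, and compares DDIM under the flow matching schedule with DDIM under the rescaled (variance-preserving) schedule:

```python
import numpy as np

def ddim(z1, alpha, sigma, m=2.0, std=0.5, n_steps=100):
    """DDIM for 1D data x ~ N(m, std^2), using the closed-form optimal denoiser
    E[x | z_t] as a stand-in for a trained network."""
    z = z1
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, u in zip(ts[:-1], ts[1:]):
        a_t, s_t = alpha(t), sigma(t)
        x_hat = m + a_t * std**2 / (a_t**2 * std**2 + s_t**2) * (z - a_t * m)
        eps_hat = (z - a_t * x_hat) / s_t
        z = alpha(u) * x_hat + sigma(u) * eps_hat
    return z

# Flow matching schedule vs. the same schedule rescaled to be variance-preserving.
c = lambda t: 1.0 / np.sqrt((1 - t) ** 2 + t ** 2)
z1 = np.random.default_rng(0).standard_normal()
z0_fm = ddim(z1, lambda t: 1 - t, lambda t: t)
z0_vp = ddim(z1, lambda t: c(t) * (1 - t), lambda t: c(t) * t)
assert np.isclose(z0_fm, z0_vp)   # same final sample, despite different schedules
```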
Wait a second! People often say flow matching results in straight paths, but in the above figure, the sampling trajectories look curved.
Well, first, why do they say that? If the model were perfectly confident about the data point it is moving towards, the path from noise to data would be a straight line under the flow matching noise schedule. Straight-line ODEs are ideal because they incur no integration error whatsoever. Unfortunately, the predictions are not for a single point; they average over a larger distribution, and flowing straight to a point != flowing straight to a distribution.
In the interactive graph below, you can change the variance of the data distribution on the right hand side by the slider. Note how the variance preserving schedule is better (straighter paths) for wide distributions, while the flow matching schedule works better for narrow distributions.
Finding such straight paths for real-life datasets like images is of course much less straightforward. But the conclusion remains the same: The optimal integration method depends on the data distribution.
Two important takeaways from deterministic sampling:
1. Equivalence in samplers: DDIM is equivalent to the flow matching sampler, and is invariant to a linear scaling to the noise schedule.
2. Straightness misnomer: The flow matching schedule only yields straight paths for a model that predicts a single point. For realistic distributions, other schedules can give straighter sampling paths.
Diffusion models are trained by fitting the network output (e.g. \(\hat{\boldsymbol \epsilon}\) or \(\hat{\bf x}\)) with a weighted mean squared error (MSE) loss over noise levels, where the weighting is typically expressed as a function of the log signal-to-noise ratio \(\lambda_t = \log(\alpha_t^2 / \sigma_t^2)\).
Flow matching also fits in this training objective. Recall the conditional flow matching (CFM) objective: \(\begin{equation} \mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, {\boldsymbol \epsilon} \sim \mathcal{N}(0, {\bf I})} \left[ \lVert \hat{\bf u}({\bf z}_t; t) - {\bf u} \rVert_2^2 \right], \;\mathrm{where} \; {\bf u} = {\boldsymbol \epsilon} - {\bf x}. \end{equation}\)
Since \(\hat{\bf u}\) can be expressed as a linear combination of \(\hat{\boldsymbol \epsilon}\) and \({\bf z}_t\), the CFM training objective can be rewritten as mean squared error on \({\boldsymbol \epsilon}\) with a specific weighting.
Below we summarize several network outputs proposed in the literature, including a few versions used by diffusion models and the one used by flow matching. They can be derived from each other given the current noisy data \({\bf z}_t\). The training objective may be defined with respect to the MSE of any of these network outputs; from the perspective of the training objective, they all amount to an additional weighting in front of the \({\boldsymbol \epsilon}\)-MSE that can be absorbed into the weighting function.
| Network Output | Formulation | MSE on Network Output |
| --- | --- | --- |
| \(\hat{\boldsymbol \epsilon}\)-prediction | \(\hat{\boldsymbol \epsilon}\) | \(\lVert\hat{\boldsymbol{\epsilon}} - \boldsymbol{\epsilon}\rVert_2^2\) |
| \(\hat{\bf x}\)-prediction | \(\hat{\bf x} = ({\bf z}_t - \sigma_t \hat{\boldsymbol \epsilon}) / \alpha_t\) | \(\lVert\hat{\bf x} - {\bf x}\rVert_2^2 = e^{-\lambda} \lVert\hat{\boldsymbol \epsilon} - {\boldsymbol \epsilon}\rVert_2^2\) |
| \(\hat{\bf v}\)-prediction | \(\hat{\bf v} = \alpha_t \hat{\boldsymbol{\epsilon}} - \sigma_t \hat{\bf x}\) | \(\lVert\hat{\bf v} - {\bf v}\rVert_2^2 = \alpha_t^2(e^{-\lambda} + 1)^2 \lVert\hat{\boldsymbol \epsilon} - {\boldsymbol \epsilon}\rVert_2^2\) |
| \(\hat{\bf u}\)-flow matching vector field | \(\hat{\bf u} = \hat{\boldsymbol{\epsilon}} - \hat{\bf x}\) | \(\lVert\hat{\bf u} - {\bf u}\rVert_2^2 = (e^{-\lambda / 2} + 1)^2 \lVert\hat{\boldsymbol \epsilon} - {\boldsymbol \epsilon}\rVert_2^2\) |
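As a quick sanity check of the conversions and the weighting identities in the table, here is a small NumPy sketch (the cosine schedule and the error added to \(\hat{\boldsymbol \epsilon}\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.3
alpha_t, sigma_t = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)   # e.g. cosine schedule
lam = np.log(alpha_t**2 / sigma_t**2)                              # log signal-to-noise ratio

x = rng.standard_normal(5)
eps = rng.standard_normal(5)
z_t = alpha_t * x + sigma_t * eps

# A hypothetical network output (epsilon-prediction) with some error.
eps_hat = eps + 0.1 * rng.standard_normal(5)

# All other parametrizations follow from eps_hat and z_t.
x_hat = (z_t - sigma_t * eps_hat) / alpha_t
v_hat = alpha_t * eps_hat - sigma_t * x_hat
u_hat = eps_hat - x_hat

v = alpha_t * eps - sigma_t * x
u = eps - x

mse = lambda a, b: np.sum((a - b) ** 2)
assert np.isclose(mse(x_hat, x), np.exp(-lam) * mse(eps_hat, eps))
assert np.isclose(mse(v_hat, v), alpha_t**2 * (np.exp(-lam) + 1) ** 2 * mse(eps_hat, eps))
assert np.isclose(mse(u_hat, u), (np.exp(-lam / 2) + 1) ** 2 * mse(eps_hat, eps))
```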
In practice, however, the model output might make a difference. For example,
\(\hat{\boldsymbol \epsilon}\)-prediction can be problematic at high noise levels, because any error in \(\hat{\boldsymbol \epsilon}\) gets amplified in \(\hat{\bf x} = ({\bf z}_t - \sigma_t \hat{\boldsymbol \epsilon}) / \alpha_t\) as \(\alpha_t\) approaches 0, so small prediction errors can produce a large loss under some weightings.
For a similar reason, \(\hat{\bf x}\)-prediction is problematic at low noise levels: \({\bf x}\) as a target is not informative when the added noise is small, and the error gets amplified in \(\hat{\boldsymbol \epsilon}\).
Therefore, a heuristic is to choose a network output that is a combination of \(\hat{\bf x}\)- and \(\hat{\boldsymbol \epsilon}\)-predictions, which applies to the \(\hat{\bf v}\)-prediction and the flow matching vector field \(\hat{\bf u}\).
The weighting function is the most important part of the loss. It balances the importance of high-frequency and low-frequency components in perceptual data such as images, videos, and audio.
Flow matching weighting == diffusion weighting of ${\bf v}$-MSE loss + cosine noise schedule.
That is, the conditional flow matching objective above is the same as a commonly used setting in diffusion models! See Appendix D.2-3 of the referenced paper for a detailed derivation.
The flow matching weighting (i.e., the \({\bf v}\)-MSE + cosine schedule weighting) decreases exponentially as \(\lambda\) increases. Empirically, we also find an interesting connection to the weighting used by Stable Diffusion 3.
We discuss the training noise schedule last, as it is arguably the least important of the three components for training: with the loss written as a weighted MSE over noise levels \(\lambda\), changing the training noise schedule mainly changes how noise levels are sampled during training, which affects the variance of the Monte Carlo estimate of the loss (and hence training efficiency) rather than the objective itself.
A few takeaways for training of diffusion models / flow matching:
1. Equivalence in weightings: The weighting function is important for training, as it balances the importance of different frequency components of perceptual data. Flow matching weightings coincidentally match commonly used diffusion training weightings in the literature.
2. Insignificance of training noise schedule: The noise schedule is far less important to the training objective, but can affect the training efficiency.
3. Difference in network outputs: The network output proposed by flow matching is new and nicely balances $\hat{\bf x}$- and $\hat{\boldsymbol \epsilon}$-prediction, similar to $\hat{\bf v}$-prediction.
In this section, we discuss different kinds of samplers in more detail.
The Reflow operation in flow matching connects noise and data points in a straight line.
One can obtain these (data, noise) pairs by running a deterministic sampler from noise.
A model can then be trained to directly predict the data given the noise, avoiding the need for iterative sampling.
In the diffusion literature, the same approach was one of the first distillation techniques.
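Here is a minimal sketch of the pair-generation and regression steps, with a toy deterministic map standing in for a trained deterministic sampler (all names are illustrative, and a neural network would replace the linear fit in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_sampler(z1):
    """Stand-in for a trained deterministic sampler (e.g. the DDIM loop sketched
    earlier); here just a fixed smooth map so the sketch runs end to end."""
    return np.tanh(z1) * 2.0

# 1. Generate (noise, data) pairs by running the deterministic sampler from noise.
noise = rng.standard_normal((1024, 2))
data = np.stack([deterministic_sampler(z1) for z1 in noise])

# 2. Fit a one-step predictor g(z1) ~ data, here by least squares purely for
#    illustration of the regression target.
W, *_ = np.linalg.lstsq(noise, data, rcond=None)
one_step = noise @ W
print(np.mean((one_step - data) ** 2))   # distillation / Reflow regression loss
```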
So far we have only discussed deterministic samplers for diffusion models and flow matching. An alternative is to use a stochastic sampler such as the DDPM sampler.
Performing one DDPM sampling step going from $\lambda_t$ to $\lambda_t + \Delta\lambda$ is exactly equivalent to performing one DDIM sampling step to $\lambda_t + 2\Delta\lambda$, and then renoising to $\lambda_t + \Delta\lambda$ by doing forward diffusion. That is, the renoising by doing forward diffusion reverses exactly half the progress made by DDIM. To see this, let’s take a look at a 2D example. Starting from the same mixture of Gaussians distribution, we can take either a small DDIM sampling step with the sign of the update reversed (left), or a small forward diffusion step (right):
For individual samples, these updates behave quite differently: the reversed DDIM update consistently pushes each sample away from the modes of the distribution, while the diffusion update is entirely random. However, when aggregating all samples, the resulting distributions after the updates are identical. Consequently, if we perform a DDIM sampling step (without reversing the sign) followed by a forward diffusion step, the overall distribution remains unchanged from the one prior to these updates.
The fraction of the DDIM step to undo by renoising is a hyperparameter which we are free to choose (i.e., it does not have to be exactly half of the DDIM step), and which has been called the level of churn.
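Below is a sketch of one such churned step, assuming a cosine noise schedule and a placeholder denoiser `x_pred`. The step performs a DDIM update that overshoots the target noise level in log-SNR space and then renoises back by forward diffusion; `churn = 0` recovers DDIM and `churn = 0.5` corresponds to the DDPM step described above (the parameterization and names are ours):

```python
import numpy as np

# Cosine schedule and its log-SNR lambda = 2 log(alpha / sigma), with inverse.
alpha = lambda t: np.cos(np.pi * t / 2)
sigma = lambda t: np.sin(np.pi * t / 2)
lam = lambda t: 2 * np.log(alpha(t) / sigma(t))
t_of_lam = lambda l: (2 / np.pi) * np.arctan(np.exp(-l / 2))

def churn_step(z, t, s, x_pred, churn, rng):
    """DDIM step from t that overshoots the target s in log-SNR space, followed
    by renoising (forward diffusion) back to s."""
    # Overshoot so that a `churn` fraction of the DDIM progress is undone.
    lam_d = lam(t) + (lam(s) - lam(t)) / (1 - churn)
    d = t_of_lam(lam_d)
    # DDIM step t -> d.
    x_hat = x_pred(z, t)
    eps_hat = (z - alpha(t) * x_hat) / sigma(t)
    z_d = alpha(d) * x_hat + sigma(d) * eps_hat
    # Renoise d -> s with fresh Gaussian noise (forward diffusion).
    ratio = alpha(s) / alpha(d)
    noise_std = np.sqrt(max(sigma(s) ** 2 - ratio ** 2 * sigma(d) ** 2, 0.0))
    return ratio * z_d + noise_std * rng.standard_normal(z.shape)

# Example: one step from t = 0.8 to s = 0.7 with churn 0.5 (DDPM-like), using a
# toy denoiser that always predicts the origin.
rng = np.random.default_rng(0)
z = rng.standard_normal(2)
z = churn_step(z, 0.8, 0.7, lambda z, t: np.zeros_like(z), churn=0.5, rng=rng)
```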
Here we ran different samplers for 100 sampling steps using a cosine noise schedule and \(\hat{\bf v}\)-prediction.
We’ve seen the practical equivalence between diffusion models and flow matching algorithms. Here, we formally describe the equivalence of the forward and sampling processes using SDEs and ODEs, for theoretical completeness.
The forward process of diffusion models, which gradually destroys data over time, can be described by the following stochastic differential equation (SDE):
\[\begin{equation} \mathrm{d} {\bf z}_t = f_t {\bf z}_t \mathrm{d} t + g_t \mathrm{d} {\bf z} , \end{equation}\]where \(\mathrm{d} {\bf z}\) is an infinitesimal Gaussian (formally, a Brownian motion). $f_t$ and $g_t$ determine the noise schedule. The generative process is given by the reverse of the forward process, whose formula is
\[\begin{equation} \mathrm{d} {\bf z}_t = \left( f_t {\bf z}_t - \frac{1+ \eta_t^2}{2}g_t^2 \nabla \log p_t({\bf z_t}) \right) \mathrm{d} t + \eta_t g_t \mathrm{d} {\bf z} , \end{equation}\]where $\nabla \log p_t$ is the score of the forward process.
Note that we have introduced an additional parameter $\eta_t$ which controls the amount of stochasticity at inference time. This is related to the churn parameter introduced before. When discretizing the backward process we recover DDIM in the case $\eta_t = 0$ and DDPM in the case $\eta_t = 1$.
The interpolation between \({\bf x}\) and \({\boldsymbol \epsilon}\) in flow matching can be described by the following ordinary differential equation (ODE):
\[\begin{equation} \mathrm{d}{\bf z}_t = {\bf u}_t \mathrm{d}t. \end{equation}\]Assuming the interpolation is \({\bf z}_t = \alpha_t {\bf x} + \sigma_t {\boldsymbol \epsilon}\), then \({\bf u}_t = \dot{\alpha}_t {\bf x} + \dot{\sigma}_t {\boldsymbol \epsilon}\).
The generative process simply reverses the ODE in time, replacing \({\bf u}_t\) by its conditional expectation given \({\bf z}_t\). This is a special case of stochastic interpolants, which more generally allow adding stochasticity to the generative process:
\(\begin{equation} \mathrm{d} {\bf z}_t = ({\bf u}_t - \frac{1}{2} \varepsilon_t^2 \nabla \log p_t({\bf z_t})) \mathrm{d} t + \varepsilon_t \mathrm{d} {\bf z}, \end{equation}\) where \(\varepsilon_t\) controls the amount of stochasticity at inference time.
Both frameworks are defined by three hyperparameters respectively: $f_t, g_t, \eta_t$ for diffusion, and $\alpha_t, \sigma_t, \varepsilon_t$ for flow matching. We can show the equivalence by deriving one set of hyperparameters from the other. From diffusion to flow matching:
\[\alpha_t = \exp\left(\int_0^t f_s \mathrm{d}s\right) , \quad \sigma_t = \left(\int_0^t g_s^2 \exp\left(2\int_s^t f_u \mathrm{d}u\right) \mathrm{d} s\right)^{1/2} , \quad \varepsilon_t = \eta_t g_t .\]From flow matching to diffusion:
\[f_t = \partial_t \log(\alpha_t) , \quad g_t^2 = 2 \alpha_t \sigma_t \partial_t (\sigma_t / \alpha_t) , \quad \eta_t = \varepsilon_t / (2 \alpha_t \sigma_t \partial_t (\sigma_t / \alpha_t))^{1/2} .\]In summary, aside from training considerations and sampler selection, diffusion and Gaussian flow matching exhibit no fundamental differences.
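As a quick numerical sanity check of these conversion formulas, the sketch below converts the flow matching schedule \(\alpha_t = 1-t, \sigma_t = t\) to \(f_t, g_t\) analytically and then recovers \(\alpha_t, \sigma_t\) by quadrature:

```python
import numpy as np
from scipy.integrate import quad

# Flow matching schedule alpha_t = 1 - t, sigma_t = t expressed as an SDE.
f = lambda t: -1.0 / (1.0 - t)         # f_t = d/dt log(alpha_t)
g2 = lambda t: 2.0 * t / (1.0 - t)     # g_t^2 = 2 alpha_t sigma_t d/dt(sigma_t / alpha_t)

def alpha_from_f(t):
    return np.exp(quad(f, 0.0, t)[0])

def sigma_from_fg(t):
    integrand = lambda s: g2(s) * np.exp(2.0 * quad(f, s, t)[0])
    return np.sqrt(quad(integrand, 0.0, t)[0])

t = 0.4
print(alpha_from_f(t), 1 - t)    # ~0.6 vs. 0.6
print(sigma_from_fg(t), t)       # ~0.4 vs. 0.4
```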
If you’ve read this far, hopefully we’ve convinced you that diffusion models and Gaussian flow matching are equivalent. However, we highlight two new model specifications that Gaussian flow matching brings to the field:

1. Network output: the vector field \(\hat{\bf u}\), which nicely balances \(\hat{\bf x}\)- and \(\hat{\boldsymbol \epsilon}\)-prediction.
2. Sampling noise schedule: \(\alpha_t = 1-t, \sigma_t = t\), which together with DDIM / Euler sampling yields the flow matching sampler.
It would be interesting to investigate the importance of these two model specifications empirically in different real-world applications, which we leave to future work. It is also an exciting research area to apply flow matching to more general cases where the source distribution is non-Gaussian, e.g. for more structured data such as proteins.
Thanks to our colleagues at Google DeepMind for fruitful discussions. In particular, thanks to Sander Dieleman, Ben Poole and Aleksander Hołyński.