Flow Matching for Generative Modelling

The problem of generative modelling is the following:

Given a way to sample from a (source) distribution $p$, how do we generate samples from a (target) distribution $q$?

Let the samples be $d$-dimensional real vectors. A flow model samples from $q$ via a flow, a function $\psi: [0, 1] \times \mathbb{R}^d \to \mathbb{R}^d$ such that $X_t = \psi_t(X_0) \sim p_t$ for a source sample $X_0 \sim p_0 = p$, with $X_1 \sim p_1 = q$. We assume that $\psi_t$ is a diffeomorphism (a bijective differentiable map with differentiable inverse) for all $t\in(0,1)$. The flow can be obtained from a velocity field $u:[0,1] \times \mathbb{R}^d\to \mathbb{R}^d$ by solving the following ODE: \(\begin{aligned} \psi_0(x) &= x, \quad \frac{d \psi_t(x)}{d t} = u_t(\psi_t(x)), ~ t > 0. \end{aligned}\)
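In practice, sampling from the model amounts to numerically integrating this ODE. Below is a minimal sketch (not from the original text) using a simple Euler scheme in PyTorch; the velocity field `u` is a placeholder callable.

```python
import torch

def sample_flow(u, x0, n_steps=100):
    """Approximate X_1 = psi_1(X_0) by Euler-integrating dx/dt = u_t(x) from t=0 to t=1.

    u  : callable (t: float, x: Tensor[b, d]) -> Tensor[b, d], velocity field (placeholder)
    x0 : Tensor[b, d], samples from the source distribution p
    """
    x = x0.clone()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * u(t, x)  # Euler step: psi_{t+dt}(x0) ~= psi_t(x0) + dt * u_t(psi_t(x0))
    return x                  # approximate samples from p_1 = q

# Toy usage: a constant velocity field that translates the source by one unit.
x1 = sample_flow(lambda t, x: torch.ones_like(x), torch.randn(4, 2))
```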

Theorem: Under mild regularity conditions (e.g., $u$ locally Lipschitz in $x$), the velocity field $u$ generates a flow $\psi$ that exists and is locally unique.

Probability Paths and Continuity Equation

A velocity field $u$ is said to generate the probability path $p_\bullet = (p_t)_{t\in[0,1]}$ if $X_t = \psi_t(X_0) \sim p_t$ for all $t\in[0,1)$; in this case, the pair $(u_t, p_t)$ must satisfy the continuity equation:

\(\frac{d p_t(x)}{d t} + \mathrm{div} (p_t u_t) (x) = 0,\) where $\mathrm{div}(v)(x) \triangleq \sum_{i=1}^d \partial_{x^i} v^i(x) = \mathrm{trace}(\partial_x v(x))$.
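As a quick sanity check (a simple example, not from the original): for a constant velocity field $u_t(x) = a$, the translated path $p_t(x) = p_0(x - ta)$ satisfies the continuity equation, since

\[
\frac{d p_t(x)}{dt} = -\langle a, \nabla p_0(x - ta)\rangle, \qquad \mathrm{div}(p_t u_t)(x) = \sum_{i=1}^d a^i\, \partial_{x^i} p_0(x - ta) = \langle a, \nabla p_0(x - ta)\rangle,
\]

and the two terms cancel.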

Likelihood Computation and Naive Training

The flow model allows for exact likelihood computation by the Instantaneous Change of Variables:

\[\begin{aligned} \frac{d }{dt} \log p_t(\psi_t(x)) = -\mathrm{div}(u_t)(\psi_t(x)). \end{aligned}\]

Therefore, for a source sample $X \sim p_0$, the log-likelihood of the generated sample $\psi_1(X) \sim p_1$ is given by:

\[\begin{aligned} \log p_1(\psi_1(X)) = \log p_0(\psi_0(X)) - \int_0^1 \mathrm{div}(u_s)(\psi_s(X))\, ds. \end{aligned}\]

The trace of a matrix can be estimated by random projections (Hutchinson's trace estimator), which yields an unbiased estimate of the divergence: $\mathrm{div}(u_t)(\psi_t(x)) = \mathbb{E}_{Z \sim \mathcal{N}(0, I)}[Z^T (\partial_x u_t)(\psi_t(x)) Z]$.
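A minimal sketch of this trace estimator (assuming a PyTorch velocity field at a fixed time $t$; function names are illustrative, not from the original):

```python
import torch

def divergence_hutchinson(u_t, x, n_probes=1):
    """Unbiased estimate of div(u_t)(x) = trace(d u_t / d x) via random projections.

    u_t : callable (x: Tensor[b, d]) -> Tensor[b, d], velocity field at a fixed time t
    x   : Tensor[b, d], points at which the divergence is estimated
    """
    x = x.detach().requires_grad_(True)
    out = u_t(x)
    est = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_probes):
        z = torch.randn_like(x)  # probe Z ~ N(0, I)
        # grad of <u_t(x), z> w.r.t. x is J^T z; then <z, J^T z> has expectation trace(J)
        (jt_z,) = torch.autograd.grad((out * z).sum(), x, retain_graph=True)
        est = est + (jt_z * z).sum(dim=1)
    return est / n_probes
```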

Let $u_t^\theta$ be a network with learnable parameters $\theta$, and let $p_1^\theta$ be the distribution of $X_1 = \psi_1^\theta(X_0)$ for $X_0 \sim p_0 = p$, where the flow $\psi^\theta$ is generated by $u_t^\theta$. Then, one can train the network by minimizing the loss: \(\mathcal{L}(\theta) = D_{\mathrm{KL}}(q, p_1^\theta) = -\mathbb{E}_{Y \sim q}[\log p_1^\theta(Y)] + \mathrm{constant}.\) Evaluating this loss requires simulating the ODEs for the flow and for the log-likelihood, which is costly. Alternatively, one can train the network with the Flow Matching loss described in the next section.
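A sketch of how this simulation-based log-likelihood could be computed, integrating the flow ODE backwards from a data point while accumulating the divergence (this reuses the hypothetical helpers above and assumes a standard Gaussian source $p = \mathcal{N}(0, I)$):

```python
import math
import torch

def log_likelihood(u, y, n_steps=100):
    """Estimate log p_1^theta(y) by integrating dx/dt = u_t(x) backwards from t=1 to t=0
    while accumulating the divergence (instantaneous change of variables).

    u : callable (t: float, x: Tensor[b, d]) -> Tensor[b, d], learned velocity field
    y : Tensor[b, d], data samples whose log-likelihood is evaluated
    """
    x, dt = y.clone(), 1.0 / n_steps
    div_int = torch.zeros(y.shape[0], device=y.device)
    for k in reversed(range(n_steps)):
        t = (k + 1) * dt
        div_int = div_int + dt * divergence_hutchinson(lambda z: u(t, z), x)
        x = x - dt * u(t, x)  # Euler step backwards in time
    # log p_0 of the back-mapped point, for a standard Gaussian source p = N(0, I)
    log_p0 = -0.5 * (x ** 2).sum(dim=1) - 0.5 * y.shape[1] * math.log(2 * math.pi)
    return log_p0 - div_int   # log p_1(y) = log p_0(psi_1^{-1}(y)) - integral of div(u_s)
```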

Training by Flow Matching (FM)

The objective is the following:

Find $u_t^\theta$ that generates $p_t$, with $p_0 = p$ and $p_1 = q$.

To achieve this, FM minimizes the regression loss on the velocity field:

\(\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t, X_t \sim p_t}[D(u_t(X_t), u_t^\theta(X_t))],\) where $D$ is some dissimilarity measure, for example, the squared $\ell_2$-distance, $D(v_0, v_1) = \lVert v_0-v_1\rVert _2^2$. The question now becomes:

How do we get $p_t, u_t$ from samples of the target $q = p_1$?

Consider the coupling $(X_0, X_1) \sim \pi_{0,1}(X_0, X_1)$ between $X_0 \sim p_0 = p$ (source) and $X_1 \sim q$ (target). We construct the probability path $p_t$ (and, later, the velocity field $u_t$) by a conditional strategy. Given a sample $X_1 = x_1$ from the target distribution $q$, we specify a conditional probability path $p_{t\mid 1}(\cdot \mid x_1)$ and recover the marginal $p_t$ by averaging over the target: \(p_t(x) = \int p_{t\mid 1 }(x \mid x_1) q(x_1)\, dx_1.\) We require the path $p_\bullet$ to satisfy the boundary conditions, \(p_0 = p, p_1 = q,\) i.e., the path $p_\bullet$ interpolates between the source $p_0$ and the target $q$ as $t$ goes from $0$ to $1$, which can be obtained by enforcing: \(p_{0 \mid 1}(x\mid x_1) = \pi_{0\mid 1}(x \mid x_1) = \frac{\pi_{0,1}(x, x_1)}{q(x_1)}, \quad p_{1\mid 1}(x\mid x_1) = \delta_{x_1}(x).\) For independent coupling, the first equality becomes $p_{0\mid 1}(x\mid x_1) = p(x)$. A popular choice of conditional path $p_{t\mid 1}(\cdot \mid x_1)$ is the Gaussian conditional path given by: \(p_{t\mid 1}(\cdot \mid x_1) = \mathcal{N}(\cdot \mid tx_1, (1-t)^2 I) \to \delta_{x_1}(\cdot) ~\mathrm{as}~ t\to 1.\)
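A minimal sketch of sampling from this Gaussian conditional path (assuming independent coupling with a standard Gaussian source; names are illustrative):

```python
import torch

def sample_conditional_path(x1, t):
    """Draw X_t ~ p_{t|1}(. | x1) = N(t * x1, (1 - t)^2 I) by reparameterization.

    x1 : Tensor[b, d], samples from the target q
    t  : float or Tensor[b, 1], time(s) in [0, 1]
    """
    x0 = torch.randn_like(x1)       # X_0 ~ N(0, I), independent coupling
    return t * x1 + (1.0 - t) * x0  # mean t * x1, standard deviation (1 - t)
```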

Conditional Velocity Field

The conditional velocity field $u_{t\mid 1}( \cdot \mid x_1)$ generates a conditional probability path $p_{t \mid 1}(\cdot \mid x_1)$. The marginal velocity field $u_t$ is an average of the conditional ones over the target samples, as given by: \(u_t(x) = \int u_{t\mid 1}(x \mid x_1) p_{1\mid t}(x_1\mid x) d x_1.\) Using Bayes' theorem, we can write $p_{1\mid t}$ in terms of known quantities: \(p_{1\mid t}(x_1 \mid x) = \frac{ p_{t\mid 1}(x \mid x_1) q(x_1)}{p_t(x)}.\) Rewriting the first equation, we get \(u_t(x) = \mathbb{E}_{X_1 \sim p_{1\mid t}(\cdot \mid x)}[u_{t\mid 1}(\tilde{X}_t \mid X_1) \mid \tilde{X}_t = x],\) which reveals the following interpretation of the marginal velocity field:

The marginal velocity field $u_t(x)$ is the posterior average of the conditional velocity fields $u_{t\mid 1}(\tilde{X}_t = x \mid X_1)$ over the samples $X_1$ drawn from the conditional target distribution $p_{1\mid t}(\cdot \mid x)$.

Theorem: Under mild assumptions, if $u_{t\mid Z}(\cdot \mid z)$ generates the conditional probability path $p_{t\mid Z}(\cdot \mid z)$ for every value $z$ of a conditioning variable $Z$ (e.g., $Z = X_1$ as above), then the marginal velocity field $u_t$ generates the marginal probability path $p_t$.
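For example (a standard computation, not spelled out in the original), the Gaussian conditional path above is generated by the conditional flow $\psi_{t\mid 1}(x_0 \mid x_1) = t x_1 + (1-t) x_0$ applied to $X_0 \sim \mathcal{N}(0, I)$. Differentiating along a trajectory and substituting $x_0 = \frac{x - t x_1}{1-t}$ at the current point $x$ gives the conditional velocity field

\[
u_{t\mid 1}(x \mid x_1) = \frac{d}{dt}\psi_{t\mid 1}(x_0 \mid x_1)\Big|_{x_0 = \frac{x - t x_1}{1-t}} = x_1 - \frac{x - t x_1}{1 - t} = \frac{x_1 - x}{1 - t}.
\]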

Bregman Divergences and Conditional Flow Matching (CFM)

The problem with learning $u_t$ directly is that evaluating it requires marginalizing over the entire target distribution $q$. A Bregman divergence is a dissimilarity measure that allows unbiased gradients of the FM loss to be computed from the conditional velocity fields $u_{t\mid Z}$ alone. It is defined as \(D(v, w) = \Phi(v) - (\Phi(w)+\langle \nabla \Phi(w), v-w \rangle ),\) for a strictly convex differentiable function $\Phi$, so $D(v, w) \ge 0$ with equality iff $v = w$. A useful property is that the gradient in the second argument is affine in the first argument: \(\nabla_{w} D(\lambda v_0 + (1-\lambda) v_1, w) = \lambda \nabla_{w} D(v_0, w) + (1-\lambda) \nabla_{w} D(v_1, w), \quad \lambda \in [0,1],\) which yields, for any random vector $Y\in \mathbb{R}^d$, \(\nabla_{w} D(\mathbb{E}[Y], w) = \mathbb{E}[\nabla_{w} D(Y, w)].\)
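The most common instance is the squared Euclidean distance: taking $\Phi(v) = \lVert v\rVert_2^2$ gives

\[
D(v, w) = \lVert v\rVert_2^2 - \lVert w\rVert_2^2 - \langle 2w, v - w\rangle = \lVert v - w\rVert_2^2, \qquad \nabla_w D(v, w) = 2(w - v),
\]

and the gradient is visibly affine in the first argument $v$.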

We can define the Conditional Flow Matching (CFM) loss using a Bregman divergence as: \(\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t, Z, X_t\sim p_{t\mid Z}(\cdot \mid Z)}[D(u_{t\mid Z}(X_t\mid Z), u_t^\theta(X_t))].\) It can be shown that the gradients of the FM and CFM losses coincide, so minimizing the CFM loss is equivalent to minimizing the FM loss.
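A minimal CFM training-loop sketch for the Gaussian conditional path above, with the squared $\ell_2$ Bregman divergence and the linear schedule $\alpha_t = t$, $\sigma_t = 1-t$ (the network, optimizer, and toy target below are illustrative placeholders, not from the original):

```python
import torch
import torch.nn as nn

d = 2
model = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))  # u_t^theta([x, t])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def cfm_step(x1):
    """One CFM gradient step on a batch of target samples x1 (here Z = X_1)."""
    t = torch.rand(x1.shape[0], 1)           # t ~ U[0, 1]
    x0 = torch.randn_like(x1)                # X_0 ~ N(0, I), independent coupling
    xt = t * x1 + (1.0 - t) * x0             # X_t ~ p_{t|1}(. | x1) = N(t x1, (1-t)^2 I)
    target = x1 - x0                         # conditional velocity: alpha'_t x1 + sigma'_t x0
    pred = model(torch.cat([xt, t], dim=1))  # u_t^theta(X_t)
    loss = ((pred - target) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: the target q is a Gaussian centred at (3, 3).
for _ in range(1000):
    cfm_step(torch.randn(256, d) + 3.0)
```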

Affine Flow and Gaussian Probability Paths

Consider the case of an affine conditional flow, i.e., $x_t = \psi_{t \mid 1}(x \mid x_1) = \alpha_t x_1 + \sigma_t x$, where $x$ is the source sample and the pair $(\alpha_t, \sigma_t)$ is called the schedule. Then the conditional velocity field becomes:

\[u_{t\mid 1}(x \mid x_1) = \frac{d}{dt}\psi_{t\mid 1}(x \mid x_1) = \dot{\alpha}_t x_1 + \dot{\sigma}_t x.\]

Note that we require $\psi_{0\mid 1}(x \mid x_1) = x$ and $\psi_{1\mid 1}(x \mid x_1) = x_1$ for any $x_1$, and so we set $\alpha_0 = \sigma_1 = 0$ and $\sigma_0 = \alpha_1 = 1$, along with $\dot{\alpha}_t > 0$ and $\dot{\sigma}_t < 0$.

The marginal velocity field is the minimizer of the CFM loss with the squared $\ell_2$ divergence, \(\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t, X_0, X_1}[\lVert u_t^\theta(X_t)-u_{t\mid 1}(X_0 \mid X_1)\rVert _2^2],\) and hence is given by the conditional expectation: \(u_t(x) = \mathbb{E}_{X_0, X_1}[ u_{t\mid 1}(X_0 \mid X_1) \mid X_t = x] = \mathbb{E}[\dot{\alpha}_t X_1 + \dot{\sigma}_t X_0 \mid X_t = x] = \dot{\alpha}_t \hat{x}_{1\mid t}(x) + \dot{\sigma}_t \hat{x}_{0\mid t}(x).\) Further, since $X_t = \alpha_t X_1 + \sigma_t X_0$, we have \(\begin{aligned} X_0 = \frac{X_t-\alpha_t X_1}{\sigma_t}, \quad X_1 = \frac{X_t-\sigma_t X_0}{\alpha_t}. \end{aligned}\) Thus, the marginal velocity can also be written in terms of $\hat{x}_{0\mid t}$ alone: \(\begin{aligned} u_t(x) = \mathbb{E}\Big[\dot{\alpha}_t\Big(\frac{X_t-\sigma_t X_0}{\alpha_t}\Big) + \dot{\sigma}_t X_0 \;\Big|\; X_t = x\Big] = \frac{\dot{\alpha}_t}{\alpha_t}x + \Big(\dot{\sigma}_t - \frac{\dot{\alpha}_t\sigma_t}{\alpha_t}\Big)\hat{x}_{0\mid t}(x). \end{aligned}\)
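For instance, for the linear schedule $\alpha_t = t$, $\sigma_t = 1 - t$ (the Gaussian conditional path above with $X_0 \sim \mathcal{N}(0, I)$), we have $\dot{\alpha}_t = 1$ and $\dot{\sigma}_t = -1$, so the marginal velocity is simply the expected displacement from source to target:

\[
u_t(x) = \hat{x}_{1\mid t}(x) - \hat{x}_{0\mid t}(x).
\]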
