Notes on Denoising Diffusion Implicit Models

Jianfeng Wang, July 9, 2023

Forward pass

In DDPM, the forward pass is a Markov chain; in DDIM, it is not.

Given an observation \(\mathbf{x}_0\) following an unknown distribution \(q(\mathbf{x}_0)\), and a final timestep \(T\) with the Gaussian distribution

\[q(\mathbf{x}_{T} | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1 - \bar{\alpha}_T) \mathbf{I}),\]

we define the following

\[q(\mathbf{x}_{1:T} | \mathbf{x}_0) = q(\mathbf{x}_T | \mathbf{x}_0) \prod_{t = 2}^{T} q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0).\]

Note that the DDIM paper uses \(\alpha_{t}\) to denote what the DDPM paper calls \(\bar{\alpha}_{t}\); we use \(\bar{\alpha}_t\) here to stay consistent with DDPM. The process is: 1) start with \(\mathbf{x}_0\); 2) draw \(\mathbf{x}_T\) from \(q(\mathbf{x}_T | \mathbf{x}_0)\); 3) draw \(\mathbf{x}_{T - 1}\) from \(q(\mathbf{x}_{T - 1} | \mathbf{x}_{T}, \mathbf{x}_0)\); 4) draw \(\mathbf{x}_{T - 2}\) from \(q(\mathbf{x}_{T - 2} | \mathbf{x}_{T - 1}, \mathbf{x}_0)\); and so on, until we draw \(\mathbf{x}_1\) from \(q(\mathbf{x}_1 | \mathbf{x}_2, \mathbf{x}_{0})\). The distribution \(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0)\) is defined as

\[q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N} ( \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 } { \sqrt{1 - \bar{\alpha}_{t}} }, \sigma_{t}^{2} \mathbf{I} ).\]

This definition leads to the following

\[q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N} ( \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I} ).\]

Let’s verify this by induction. The equation holds for \(t = T\) by definition. Suppose it holds for some \(t \le T\); we then only need to show it holds for \(t - 1\), i.e., to calculate \(q(\mathbf{x}_{t - 1} | \mathbf{x}_0)\). Since the equation holds for \(q(\mathbf{x}_t | \mathbf{x}_0)\), we can write

\[\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon_t\]

where \(\epsilon_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\). Given \(\mathbf{x}_0\) and \(\mathbf{x}_t\), we can draw \(\mathbf{x}_{t - 1}\) through \(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0)\):

\[\mathbf{x}_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 } { \sqrt{1 - \bar{\alpha}_{t}} } + \sigma_{t} \epsilon_{t - 1}.\]

By substituting \(\mathbf{x}_t\), we have

\[\mathbf{x}_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \epsilon_{t} + \sigma_{t} \epsilon_{t - 1}.\]

Thus, \(q(\mathbf{x}_{t - 1} | \mathbf{x}_{0})\) is also Gaussian. Since \(\epsilon_{t}\) and \(\epsilon_{t - 1}\) are independent, the mean is \(\sqrt{\bar{\alpha}_{t - 1}}\mathbf{x}_0\) and the variance is \((1 - \bar{\alpha}_{t - 1} - \sigma_{t}^2) + \sigma_{t}^{2} = 1 - \bar{\alpha}_{t - 1}\), which completes the induction.

An interesting property is that the marginal \(q(\mathbf{x}_{t}|\mathbf{x}_{0})\) does not depend on \(\sigma_t\) at all.
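
Both facts can be checked numerically. The sketch below is a minimal NumPy simulation, assuming a hypothetical linear \(\beta\) schedule and a scalar \(\mathbf{x}_0\): it runs the forward process with an arbitrary valid \(\sigma_t\) and confirms that every \(\mathbf{x}_t\) has mean \(\sqrt{\bar{\alpha}_t}\mathbf{x}_0\) and variance \(1 - \bar{\alpha}_t\).

```python
import numpy as np

# A hypothetical linear beta schedule; alpha_bar[t - 1] stores \bar{alpha}_t.
T = 50
betas = np.linspace(1e-4, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)

x0, n = 2.0, 200_000                         # scalar x_0 and number of Monte Carlo samples
rng = np.random.default_rng(0)

# Draw x_T from q(x_T | x_0).
x = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(n)

# Draw x_{t-1} from q(x_{t-1} | x_t, x_0) for t = T, ..., 2 and check the marginal of x_{t-1}.
for t in range(T, 1, -1):
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    sigma = 0.5 * np.sqrt(1 - ab_prev)       # any sigma_t with sigma_t^2 <= 1 - alpha_bar_{t-1} works
    mean = (np.sqrt(ab_prev) * x0
            + np.sqrt(1 - ab_prev - sigma**2) * (x - np.sqrt(ab_t) * x0) / np.sqrt(1 - ab_t))
    x = mean + sigma * rng.standard_normal(n)
    assert abs(x.mean() - np.sqrt(ab_prev) * x0) < 0.02   # mean ~ sqrt(alpha_bar_{t-1}) * x_0
    assert abs(x.var() - (1 - ab_prev)) < 0.02            # variance ~ 1 - alpha_bar_{t-1}
```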

Backward pass for a generative process

Recall the form of \(q(\mathbf{x}_T | \mathbf{x}_0)\): if we design \(\bar{\alpha}_{T}\) to be close to 0, it is close to a standard Gaussian distribution. Thus, we can draw a sample from a standard Gaussian as an approximation of \(\mathbf{x}_T\). For any \(t\), \(\mathbf{x}_t\) is obtained from \(\mathbf{x}_0\) and a noise sample \(\epsilon_{t}\), and a generative network \(\epsilon_{\theta}(\mathbf{x}_t)\) is designed to learn this noise. Note that the network should also depend on the timestep \(t\), which we omit from the notation for simplicity.

With the predicted noise, we can estimate \(\mathbf{x}_0\) as

\[\hat{\mathbf{x}}_0(\mathbf{x}_t) = \frac { \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\epsilon_{\theta}(\mathbf{x}_t) } { \sqrt{\bar{\alpha}_{t}} }.\]

Then, we can draw another sample \(\mathbf{x}_{t - 1}\) from \(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \hat{\mathbf{x}}_0)\). This process repeats until we reach \(\mathbf{x}_{1}\). Finally, \(\mathbf{x}_0\) is estimated by \(\hat{\mathbf{x}}_0(\mathbf{x}_1)\).
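
Putting the pieces together, here is a minimal NumPy sketch of this generative loop. The names `eps_model` (the trained noise predictor \(\epsilon_{\theta}(\mathbf{x}_t, t)\)), `alpha_bar` (the \(\bar{\alpha}_t\) schedule), and `sigmas` (the chosen \(\sigma_t\) values) are placeholders of mine, not APIs from the papers' code.

```python
import numpy as np

def generate(eps_model, alpha_bar, sigmas, shape, rng):
    """Generative process: x_T ~ N(0, I), then repeatedly draw x_{t-1} ~ q(x_{t-1} | x_t, x0_hat)."""
    T = len(alpha_bar)                       # alpha_bar[t - 1] stores \bar{alpha}_t, sigmas[t - 1] stores sigma_t
    x = rng.standard_normal(shape)           # approximate x_T with a standard Gaussian sample
    for t in range(T, 0, -1):
        ab_t = alpha_bar[t - 1]
        eps = eps_model(x, t)                                     # predicted noise
        x0_hat = (x - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)    # estimate of x_0
        if t == 1:
            return x0_hat                                         # final output: x0_hat(x_1)
        ab_prev, sigma = alpha_bar[t - 2], sigmas[t - 1]
        # Draw x_{t-1} from q(x_{t-1} | x_t, x0_hat); the middle factor equals eps by construction.
        x = (np.sqrt(ab_prev) * x0_hat
             + np.sqrt(1 - ab_prev - sigma**2) * (x - np.sqrt(ab_t) * x0_hat) / np.sqrt(1 - ab_t)
             + sigma * rng.standard_normal(shape))
```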

The key is to train \(\epsilon_{\theta}(\mathbf{x}_t)\), which follows exactly the same procedure as in DDPM. Thus, DDIM and DDPM share the same training process; the difference lies only in inference.
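
For reference, one Monte Carlo term of that shared objective (DDPM's simple \(\epsilon\)-prediction loss) can be sketched as below; `eps_model` is again a placeholder, and in practice the gradient of this loss would be computed with an autodiff framework.

```python
import numpy as np

def training_loss(eps_model, x0, alpha_bar, rng):
    """One sample of the epsilon-prediction objective shared by DDPM and DDIM."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))                       # uniform timestep in {1, ..., T}
    eps = rng.standard_normal(x0.shape)                   # target noise
    ab_t = alpha_bar[t - 1]
    x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps    # x_t drawn from q(x_t | x_0)
    return np.mean((eps_model(x_t, t) - eps) ** 2)        # || eps - eps_theta(x_t, t) ||^2
```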

\(\sigma_{t} = 0\) for DDIM

When \(\sigma_{t} = 0\) for all \(t\), the process is called DDIM. That is, in the forward pass, we draw \(\mathbf{x}_T\) from \(\mathbf{x}_0\) with some random noise \(\epsilon_{T}\); all other \(\mathbf{x}_t\) are then obtained deterministically rather than randomly, since \(\sigma_t\) is always 0.

In the backward generative process, we first draw a random sample from a standard Gaussian as an approximation of \(\mathbf{x}_{T}\). All subsequent samples are then also obtained deterministically.
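
In terms of the `generate` sketch above, this simply means passing all-zero \(\sigma_t\), so the only randomness is the initial draw of \(\mathbf{x}_T\):

```python
# Reuses generate, eps_model, alpha_bar, and T from the earlier sketch.
rng = np.random.default_rng(0)
sample = generate(eps_model, alpha_bar, sigmas=np.zeros(T), shape=(16,), rng=rng)
```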

Appropriate \(\sigma_{t}\) for DDPM

When \(\sigma_{t}\) is larger than 0, we introduce some noise in both the forward pass and the backward pass. One special case is when

\[\sigma_{t} = \sqrt{ \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} } (1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t - 1}}) }.\]

Recalling DDPM’s inference process, we know that both DDPM and DDIM derive \(\mathbf{x}_{t - 1}\) from \(\hat{\mathbf{x}}_0\) and \(\mathbf{x}_t\). With this choice, the DDIM variance is

\[\sigma_t^2 = \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} } (1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t - 1}}) = \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} }\beta_{t}.\]

Recall that in DDPM, \(\bar{\alpha}_{t} = \prod_{i = 1}^{t} \alpha_{i} = \prod_{i = 1}^{t} (1 - \beta_{i})\), so \(1 - \bar{\alpha}_{t} / \bar{\alpha}_{t - 1} = 1 - \alpha_{t} = \beta_{t}\). This variance is exactly the same as in Eqn. (7) of the DDPM paper [1].
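
This identity is straightforward to verify numerically for any \(\beta\) schedule; the snippet below uses a hypothetical linear one.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)            # hypothetical DDPM-style beta schedule
alpha_bar = np.cumprod(1.0 - betas)
ab_t, ab_prev = alpha_bar[1:], alpha_bar[:-1]    # \bar{alpha}_t and \bar{alpha}_{t-1} for t >= 2

var_ddim = (1 - ab_prev) / (1 - ab_t) * (1 - ab_t / ab_prev)   # the special sigma_t^2
var_ddpm = (1 - ab_prev) / (1 - ab_t) * betas[1:]              # DDPM's posterior variance
assert np.allclose(var_ddim, var_ddpm)
```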

To derive the mean, let’s first calculate

\[\begin{aligned} 1 - \bar{\alpha}_{t - 1} - \sigma_{t}^2 & = 1 - \bar{\alpha}_{t - 1} - \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} } \frac { \bar{\alpha}_{t - 1} - \bar{\alpha}_{t} } { \bar{\alpha}_{t - 1} } \\ & = \frac { (1 - \bar{\alpha}_t - \bar{\alpha}_{t - 1} + \bar{\alpha}_t\bar{\alpha}_{t - 1}) \bar{\alpha}_{t - 1} - (1 - \bar{\alpha}_{t - 1}) (\bar{\alpha}_{t - 1} - \bar{\alpha}_{t}) } { (1 - \bar{\alpha}_t) \bar{\alpha}_{t - 1} } \\ & = \frac { - 2 \bar{\alpha}_{t} \bar{\alpha}_{t - 1} + \bar{\alpha}_{t} \bar{\alpha}_{t - 1}^{2} + \bar{\alpha}_t } { (1 - \bar{\alpha}_t) \bar{\alpha}_{t - 1} } \\ & = \frac { \bar{\alpha}_t (1 - \bar{\alpha}_{t - 1})^2 } { (1 - \bar{\alpha}_t) \bar{\alpha}_{t - 1} } \end{aligned}\]

Then, the mean is

\[\begin{aligned} & \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 } { \sqrt{1 - \bar{\alpha}_{t}} } \\ = & ( \sqrt{\bar{\alpha}_{t - 1}} - \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \sqrt{\bar{\alpha}_t} } { \sqrt{1 - \bar{\alpha}_t} } ) \mathbf{x}_0 + \frac { \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} } { \sqrt{1 - \bar{\alpha}_{t}} }\mathbf{x}_t \\ = & ( \sqrt{\bar{\alpha}_{t - 1}} - \frac { \sqrt{\bar{\alpha}_t} (1 - \bar{\alpha}_{t - 1}) } { \sqrt{1 - \bar{\alpha}_t} \sqrt{\bar{\alpha}_{t - 1}} } \frac { \sqrt{\bar{\alpha}_t} } { \sqrt{1 - \bar{\alpha}_t} } ) \mathbf{x}_0 + \frac { \sqrt{\bar{\alpha}_t} (1 - \bar{\alpha}_{t - 1}) } { \sqrt{1 - \bar{\alpha}_{t}} \sqrt{1 - \bar{\alpha}_t} \sqrt{\bar{\alpha}_{t - 1}} }\mathbf{x}_t \\ = & ( \sqrt{\bar{\alpha}_{t - 1}} - \frac { \bar{\alpha}_t (1 - \bar{\alpha}_{t - 1}) } { (1 - \bar{\alpha}_t) \sqrt{\bar{\alpha}_{t - 1}} } ) \mathbf{x}_0 + \frac { \sqrt{\bar{\alpha}_t} (1 - \bar{\alpha}_{t - 1}) } { (1 - \bar{\alpha}_{t}) \sqrt{\bar{\alpha}_{t - 1}} }\mathbf{x}_t \\ = & \frac { \bar{\alpha}_{t - 1} - \bar{\alpha}_{t} } { (1 - \bar{\alpha}_t) \sqrt{\bar{\alpha}_{t - 1}} } \mathbf{x}_0 + \frac { \sqrt{\alpha_t} (1 - \bar{\alpha}_{t - 1}) } { 1 - \bar{\alpha}_{t} }\mathbf{x}_t \\ = & \frac { \sqrt{\bar{\alpha}_{t - 1}} \beta_{t} } { 1 - \bar{\alpha}_t } \mathbf{x}_0 + \frac { \sqrt{\alpha_t} (1 - \bar{\alpha}_{t - 1}) } { 1 - \bar{\alpha}_{t} }\mathbf{x}_t \end{aligned}\]

This is exactly the same as the mean in Eqn. (7) of the DDPM paper. Thus, the inference process is exactly the same as that of DDPM.
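
The algebra above can also be spot-checked numerically: with this special \(\sigma_t\), the coefficients of \(\mathbf{x}_0\) and \(\mathbf{x}_t\) in the DDIM posterior mean match those of DDPM's Eqn. (7). A minimal check, again with a hypothetical linear \(\beta\) schedule:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
ab_t, ab_prev, beta_t, alpha_t = alpha_bar[1:], alpha_bar[:-1], betas[1:], alphas[1:]

sigma2 = (1 - ab_prev) / (1 - ab_t) * beta_t      # the special sigma_t^2

# DDIM posterior mean written as c0 * x_0 + ct * x_t.
c0_ddim = np.sqrt(ab_prev) - np.sqrt(1 - ab_prev - sigma2) * np.sqrt(ab_t) / np.sqrt(1 - ab_t)
ct_ddim = np.sqrt(1 - ab_prev - sigma2) / np.sqrt(1 - ab_t)

# DDPM Eqn (7) posterior mean coefficients.
c0_ddpm = np.sqrt(ab_prev) * beta_t / (1 - ab_t)
ct_ddpm = np.sqrt(alpha_t) * (1 - ab_prev) / (1 - ab_t)

assert np.allclose(c0_ddim, c0_ddpm) and np.allclose(ct_ddim, ct_ddpm)
```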

References

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.

[2] Song, Jiaming, Chenlin Meng, and Stefano Ermon. “Denoising diffusion implicit models.” arXiv preprint arXiv:2010.02502 (2020).