Notes on Denoising Diffusion Probabilistic Models

Jianfeng Wang, July 3, 2023

Forward pass

Let \(\mathbf{x}_0\) be the observed sample and follow an unknown distribution \(q(\mathbf{x}_0)\). The forward pass is to disturb the data by shrinking the mean and adding extra noise. That is

\[q(\mathbf{x}_{t}|\mathbf{x}_{t - 1}) \sim \mathcal{N}(\sqrt{1 - \beta_t}\mathbf{x}_{t - 1}, \beta_{t}\mathbf{I}), t = 1,\cdots, N.\]

We can also write \(\mathbf{x}_{t}\) as follows.

\[\mathbf{x}_{t} = \sqrt{1 - \beta_{t}} \mathbf{x}_{t - 1} + \sqrt{\beta_{t}} \epsilon_t,\]

where $\epsilon_t$ follows the normal Gaussian distribution \(\mathcal{N}(\mathbf{0}, \mathbf{I})\). Next, let’s derive the probability distribution of $\mathbf{x}_{t}$, given $\mathbf{x}_0$. Similarly, we have

\[\mathbf{x}_{t-1} = \sqrt{1 - \beta_{t - 1}} \mathbf{x}_{t - 2} + \sqrt{\beta_{t - 1}} \epsilon_{t - 1} \\ \vdots \\ \mathbf{x}_{1} = \sqrt{1 - \beta_{1}} \mathbf{x}_{0} + \sqrt{\beta_{1}} \epsilon_{1} \\\]

Each equation is multiplied by a factor such that the multiplied left side is equal to the one on the right of the previous equation. Thus, we have

\[\mathbf{x}_{t} = \prod_{i = 1}^{t} \sqrt{1 - \beta_{i}} \mathbf{x}_0 + f(\epsilon_1, \cdots, \epsilon_t) \\ f(\epsilon_1, \cdots, \epsilon_t) = \sqrt{\beta_{t}} \epsilon_t + \cdots + \prod_{i=2}^{t} \sqrt{1 - \beta_i} \sqrt{\beta_1} \epsilon_1\]

Considering $\epsilon_i$ are i.i.d, we can simply calculate the mean and variance of $f$. As the mean of each $\epsilon_i$ is 0, the mean of $f$ is also 0, as the mean of the sum of multiple random variable equals the sum of the mean of each variable. The variance of $f$ is also the sum of the variance of each variable when each variable is independently distributed. That is,

\[\beta_{t} + (1 - \beta_{t}) \beta_{t - 1} + \cdots + \prod_{i=2}^{t} (1 - \beta_i) \beta_1 = \beta_{t} + \sum_{k = 2}^{t} \prod_{i = k}^{t}(1 - \beta_{i}) \beta_{k - 1}\]

Replacing $\alpha_i = 1 - \beta_i$, we can rewrite the variance as

\[1 - \alpha_t + \sum_{k = 2}^{t} \prod_{i = k}^{t} \alpha_{i} (1 - \alpha_{k - 1}) = 1 - \alpha_t + \sum_{k = 2}^{t} ( \prod_{i = k}^{t} \alpha_{i} - \prod_{i = k - 1}^{t} \alpha_{i} ) = 1 - \prod_{i = 1}^{t} \alpha_{i}.\]

Thus, we have

\[q(\mathbf{x}_{t} | \mathbf{x}_{0}) \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1 - \bar{\alpha}_{t})\mathbf{I}) \\ \bar{\alpha}_{t} = \prod_{i = 1}^{t} \alpha_{i}, \alpha_i = 1 - \beta_i\]

Another way to derive \(q(\mathbf{x}_t | \mathbf{x}_0)\) is by induction. When \(t = 1\), it is straightfward to have the result. Let’s say we have \(q(\mathbf{x}_{t - 1} | \mathbf{x}_0) = \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1}} \epsilon_{t - 1}\). Considering the definition of \(q(\mathbf{x}_t | \mathbf{x}_{t - 1})\), we have

\[\begin{aligned} \mathbf{x}_t & = \sqrt{1 - \beta_{t}} \mathbf{x}_{t - 1} + \sqrt{\beta_{t}} \epsilon_{t} \\ & = \sqrt{1 - \beta_{t}} ( \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1}} \epsilon_{t - 1} ) + \sqrt{\beta_{t}} \epsilon_{t} \\ & = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \beta_{t}} \sqrt{1 - \bar{\alpha}_{t - 1}} \epsilon_{t - 1} + \sqrt{\beta_{t}} \epsilon_t \end{aligned}\]

The right two items are the sum of two independent Gaussian variables. The mean is 0, and the variance is

\[\alpha_t (1 - \bar{\alpha}_{t - 1}) + \beta_{t} = 1 - \bar{\alpha}_t\]

which concludes the results.

Backward pass

The backward is modeled as a Markov chain. $\mathbf{x}_{T}$ follows the standard Gaussian distribution.

\[p(\mathbf{x}_{t-1}|\mathbf{x}_{t})\sim\mathcal{N}(x_{t-1};\mathbf{\mu}_{\theta}(\mathbf{x}_t,t),\mathbf{\Sigma}_{\theta}(\mathbf{x}_t,t))\]

The parameter $\theta$ denotes all the learnable parameters to predict the mean and the variance. Given a fixed $\theta$, we can find a marginal probability distribution for \(\mathbf{x}_{0}\), starting from \(\mathbf{x}_{T}\).

Training objective

The goal is to find \(\theta\) to maximize \(p(\mathbf{x}_{0})\) when \(\mathbf{x}_0\) follows \(q(\mathbf{x}_0)\).

Before moving forward, let’s first derive the evidence lower bound. If $f()$ is a convex function, we have $E[f(x)] \ge fE[x]$ based on Jensen’s inequality. Then,

\[\log p(x) = \log \int p(x, z) dz = \log \int p(x, z) \frac{q(z)}{q(z)} dz = \log E_{z\sim q} \frac{p(x, z)}{q(z)} \ge E_{z\sim q}\log \frac{p(x, z)}{q(z)}\]

where \(q()\) can be any probability function. The right side is called the evidence lower bound as the left side is the probability of the observation, which is called the evidence.

As \(q()\) can be any function, we can change it to \(q(\mathbf{z}|\mathbf{x})\). Also, let \(\mathbf{z}\) be \(\mathbf{x}_{1:T}\). Then, we have

\[\begin{aligned} L = -E[\log p(\mathbf{x}_0)] \le -E [\log \frac{p(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}] & = - E \log \frac{ p(\mathbf{x}_T) \prod_{t=1}^{T} p(\mathbf{x}_{t - 1}|\mathbf{x}_{t}) }{ \prod_{t = 1}^{T} q(x_t|x_{t - 1}) } \\ & = E\left[ -\log p(\mathbf{x}_{T}) - \sum_{t = 1}^{T} \log \frac{ p(\mathbf{x}_{t - 1}|\mathbf{x}_{t}) }{ q(\mathbf{x}_t|\mathbf{x}_{t - 1}) } \right] \end{aligned}\]

This corresponds to Eqn (1) of Reference [2] or Eqn (3) of Reference [1].

Considering \(q(\mathbf{x}_t)\) is a Markov process, we have

\[q(\mathbf{x}_t | \mathbf{x}_{t - 1}) = q(\mathbf{x}_t | \mathbf{x}_{t - 1}, \mathbf{x}_{0}) = \frac{ q(\mathbf{x}_t, \mathbf{x}_{t - 1} | \mathbf{x}_{0}) }{ q(\mathbf{x}_{t - 1} | \mathbf{x}_0) } = \frac{ q(\mathbf{x}_{t} | \mathbf{x}_{0}) q(\mathbf{x}_{t - 1} | \mathbf{x}_{t}, \mathbf{x}_0) }{ q(\mathbf{x}_{t - 1} | \mathbf{x}_0) }, t > 1\]

Thus,

\[\begin{aligned} L & = E\left[ - \log p(\mathbf{x}_T) - \sum_{t > 1} \log \frac{ p(\mathbf{x}_{t - 1} | \mathbf{x}_t) }{ q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0) } \frac{ q(\mathbf{x}_t | \mathbf{x}_0) }{ q(\mathbf{x}_{t - 1} | \mathbf{x}_0) } - \log \frac{ p(\mathbf{x}_0 | \mathbf{x}_1) }{ q(\mathbf{x}_1 | \mathbf{x}_0) } \right] \\ & = E\left[ - \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_{T} | \mathbf{x}_0)} - \sum_{t > 1} \log \frac{ p(\mathbf{x}_{t - 1} | \mathbf{x}_t) }{ q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0) } - \log p(\mathbf{x}_0 | \mathbf{x}_1) \right] \\ & = E\left[ \underbrace{KL(q(\mathbf{x}_T | \mathbf{x}_0) \| p(\mathbf{x}_T))}_{L_{T}} + \sum_{t > 1} \underbrace{KL(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0) \| p(\mathbf{x}_{t - 1} | \mathbf{x}_t))}_{L_{t - 1}} - \underbrace{\log p(\mathbf{x}_0 | \mathbf{x}_1)}_{L_{0}} \right] \end{aligned}\]

This corresponds to Eqn (5) of Reference [1]. The loss in $L_{T}$ contains no learnable parameters, recalling that all learnable parameters are $\theta$. To calculate $L_{t - 1}$, let’s first calculate $q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0)$.

\[q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0) = \frac { q(\mathbf{x}_{t - 1}, \mathbf{x}_t | \mathbf{x}_0) } { q(\mathbf{x}_t | \mathbf{x}_0) } = \frac { q(\mathbf{x}_{t - 1} | \mathbf{x}_0) q(\mathbf{x}_t | \mathbf{x}_{t - 1}, \mathbf{x}_0) } { q(\mathbf{x}_t | \mathbf{x}_0) } = \frac { q(\mathbf{x}_{t - 1} | \mathbf{x}_0) q(\mathbf{x}_t | \mathbf{x}_{t - 1}) } { q(\mathbf{x}_t | \mathbf{x}_0) }\]

The last equation is from the Markov property, i.e. \(q(\mathbf{x}_t | \mathbf{x}_{t - 1}, \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_{t - 1})\). The denorminator contains no $\mathbf{x}_{t - 1}$, and the numerator has 2 Gaussian distribution. Thus, the result must also follow the Gaussian distribution. We only need to calculate the exponential part (excluding \(-1/2\)). That is,

\[\begin{aligned} & (\mathbf{x}_{t - 1} - \sqrt{\bar{\alpha}_{t - 1}}\mathbf{x}_0)^{'} (1 - \bar{\alpha}_{t - 1})^{-1} (\mathbf{x}_{t - 1} - \sqrt{\bar{\alpha}_{t - 1}}\mathbf{x}_0) + (\mathbf{x}_{t} - \sqrt{\alpha_{t}}\mathbf{x}_{t - 1})^{'} \beta_{t}^{-1} (\mathbf{x}_{t} - \sqrt{\alpha_{t}}\mathbf{x}_{t - 1}) \\ = & ( (1 - \bar{\alpha}_{t - 1})^{-1} + \alpha_t / \beta_{t} ) \|\mathbf{x}_{t - 1}\|^2 - 2 ( \sqrt{\bar{\alpha}_{t - 1}}\mathbf{x}_0^{'} / (1 - \bar{\alpha}_{t - 1}) + \sqrt{\alpha_t} / \beta_{t} \mathbf{x}_{t}^{'} ) \mathbf{x}_{t - 1} + C \end{aligned}\]

Thus, we can conclude the mean and the convariance are

\[\begin{aligned} \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) & = \frac { \sqrt{\bar{\alpha}_{t - 1}}\mathbf{x}_0^{'} / (1 - \bar{\alpha}_{t - 1}) + \sqrt{\alpha_t} / \beta_{t} \mathbf{x}_{t} } { (1 - \bar{\alpha}_{t - 1})^{-1} + \alpha_t / \beta_{t} } = \frac { \beta_t \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{\alpha_{t}} (1 - \bar{\alpha}_{t - 1}) \mathbf{x}_t } { 1 - \bar{\alpha}_{t} } \\ \tilde{\beta}_{t} & = ((1 - \bar{\alpha}_{t - 1})^{-1} + \alpha_t / \beta_t)^{-1} = \frac{ \beta_{t} (1 - \bar{\alpha}_{t - 1}) }{ 1 - \bar{\alpha}_t } \end{aligned}\]

This corresponds to Eqn (6) (7) of Reference [1].

Now let’s get back to the loss of $L_{t - 1}$, which is a KL loss between two Gaussian distribution. The first distribution is \(r(\mathbf{x}) = q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0)\), which is constant (does not depend on the learnable parameters). The second distribution is \(s_\theta(\mathbf{x}) = p(\mathbf{x}_{t - 1} | \mathbf{x}_t) = \mathcal{N}( \mathbf{x}_{t - 1}; \mu_{\theta}(\mathbf{x}_t, t), \Sigma_{\theta}(\mathbf{x}_t, t) )\). We assume \(\Sigma_{\theta}(\mathbf{x}_t, t) = \sigma_{t}^{2}\mathbf{I}\) following Reference [1]. The parameter \(\sigma_t\) is also fixed, not learnable.

\[\begin{aligned} L_{t - 1} & = KL(r(\mathbf{x}) \| s_\theta(\mathbf{x})) \\ & = \int r(\mathbf{x}) \log \frac{r(\mathbf{x})}{s_\theta(\mathbf{x})} d \mathbf{x} \\ & = -\int r(\mathbf{x}) \log s_\theta(\mathbf{x}) d \mathbf{x} + C_1 \\ & = -\int r(\mathbf{x}) \log \exp\left( - \frac{1}{2\sigma_{t}^{2}}\|\mathbf{x} - \mu_{\theta}(\mathbf{x}_t, t)\|^2 \right) d \mathbf{x} + C_2 \\ & = -\int r(\mathbf{x}) \log \exp\left( - \frac{1}{2\sigma_{t}^{2}}\| \mathbf{x} - \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) + \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \mu_{\theta}(\mathbf{x}_t, t)\|^2 \right) d \mathbf{x} + C_2 \\ & = -\int r(\mathbf{x}) \left( - \frac{1}{2\sigma_{t}^{2}} \| \mathbf{x} - \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) + \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \mu_{\theta}(\mathbf{x}_t, t)\|^2 \right) d \mathbf{x} + C_2 \\ & = -\int r(\mathbf{x}) \left( - \frac{1}{2\sigma_{t}^{2}} \| \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \mu_{\theta}(\mathbf{x}_t, t)\|^2 \right) d \mathbf{x} + C_3 \\ & = \frac{1}{2\sigma_{t}^{2}} \| \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \mu_{\theta}(\mathbf{x}_t, t)\|^2 + C_3 \end{aligned}\]

This corresponds to Eqn (8) of Reference [1]. With \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), (\(\epsilon \sim \mathcal{N}(0, \mathbf{I})\)), we have

\[\begin{aligned} \mathbf{x}_0 &= \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\epsilon) \\ \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) &= \frac { \beta_t \sqrt{\bar{\alpha}_{t - 1}} } { 1 - \bar{\alpha}_{t} } \mathbf{x}_0 + \frac {\sqrt{\alpha_{t}} (1 - \bar{\alpha}_{t - 1})} {1 - \bar{\alpha}_{t}} \mathbf{x}_t \\ & = \frac { \beta_t \sqrt{\bar{\alpha}_{t - 1}} } { 1 - \bar{\alpha}_{t} }\frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\epsilon) + \frac {\sqrt{\alpha_{t}} (1 - \bar{\alpha}_{t - 1})} {1 - \bar{\alpha}_{t}} \mathbf{x}_t \\ & = \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}\sqrt{\alpha_t}} \epsilon \\ & = \frac{1}{\sqrt{\alpha_t}} ( \mathbf{x}_t - \frac{\beta_{t}}{\sqrt{1 - \bar{\alpha}_t}} \epsilon ) \end{aligned}\]

Although we need to learn \(\mu_\theta(\mathbf{x}_t, t)\), we can reparameterize it with a learnable \(\epsilon_\theta(\mathbf{x}_t, t)\) as follows:

\[\mu_{\theta}(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} ( \mathbf{x}_t - \frac{\beta_{t}}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t, t) )\]

This is Eqn (11) of Reference [1]. Then, we have the loss \(L_{t - 1}\) as follows (the constant value is removed)

\[\begin{aligned} L_{t - 1} & = \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t(1 - \bar{\alpha}_t)} \|\epsilon - \epsilon_{\theta}(\mathbf{x}_t, t)\|^2 \\ \mathbf{x}_t & = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \end{aligned}\]

The $t$ is larger than 1, which means it is not applicable for the case of $t = 1$. The loss $L_{0}$ is specifically for the case of $t = 1$, which is not in a similar form of $L_{t - 1}$. However, in the paper of Reference [1], the author suggests to simplify the training objectives by the following

\[\begin{aligned} L_\text{simple} & = E\|\epsilon - \epsilon_{\theta}(\mathbf{x}_t, t)\|^2 \\ \mathbf{x}_t & = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \end{aligned}\]

In the expectation of \(L_\text{simple}\), the timestep $t$ can start from 1, rather than from 2. The coefficient is also removed, which means a weighted sum of the original $L_{t - 1}$.

Thus, in summary, during training, we 1) randomly sample a timestep \(t\), 2) sample a noise \(\epsilon\), 3) calculate $\mathbf{x}_t$, 4) predict the noise based on our network \(\epsilon_{\theta}(\mathbf{x}_t, t)\), 5) calculate the difference as the loss function.

During inference, we can simply get $\mu_{\theta}$ based on $\epsilon_{\theta}$ to predict $\mathbf{x}_{t - 1}$ from $\mathbf{x}_t$, starting from \(\mathbf{x}_{T}\).

References

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.

[2] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S., 2015, June. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (pp. 2256-2265). PMLR.