Classifier Guided Diffusion
Preface
Previously we covered OpenAI's DDPM and IDDPM (see the earlier posts "DDPM原理与代码剖析" and "IDDPM原理和代码剖析"), as well as Stanford's DDIM (see "DDIM原理及代码(Denoising diffusion implicit models)"). This time we look at another OpenAI work, a classic in which diffusion models surpass GANs: Diffusion Models Beat GANs on Image Synthesis.
github: https://github.com/openai/guided-diffusion
This post mainly references "66、Classifier Guided Diffusion条件扩散模型论文与PyTorch代码详细解读".
The code discussed here is based mainly on the codebase accompanying the IDDPM paper (see "IDDPM原理和代码剖析").
(Placeholder note: the code-analysis part is not yet complete.)
Theory
Preliminaries
(1) The authors first ran extensive ablation experiments on unconditional diffusion models, drew conclusions from them, and used those conclusions to design the architecture.
(2) A straightforward way to build a conditional diffusion model is to embed the label and add it to the time embedding, but this alone does not work very well. The paper therefore adds classifier guidance on top (the conventional conditional-generation approach above is kept, not discarded).
Concretely, the gradient of a classifier with respect to the noisy image $X_t$ is used to steer the model's sampling toward images of the desired class $y$.
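Written out (Algorithms 1 and 2 of the paper, with guidance scale $s$): DDPM-style sampling shifts the predicted mean, while DDIM-style sampling shifts the predicted noise,
$\hat{\mu}(X_t) = \mu_{\theta}(X_t) + s\,\Sigma_{\theta}(X_t)\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)$
$\hat{\epsilon}(X_t) = \epsilon_{\theta}(X_t) - \sqrt{1-\overline{\alpha}_t}\, s\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)$
These two rules are exactly what condition_mean and condition_score implement in the code below.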
Introduction
(1) Diffusion models are likelihood-based models.
(2) The model borrows the variance-range prediction from Improved DDPM (the $v$ in the formula below):
$\Sigma_{\theta}(X_t, t) = \exp\big(v\log\beta_t + (1-v)\log\widetilde{\beta}_t\big)$
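For reference, the codebase maps the raw network output (in $[-1, 1]$) to a log-variance interpolated between $\log\widetilde{\beta}_t$ and $\log\beta_t$. A small self-contained sketch of that mapping (the helper name is mine; in the repo this happens inline in p_mean_variance under the LEARNED_RANGE model-variance type):
import torch as th

def learned_range_variance(model_var_values, log_beta_t, log_beta_tilde_t):
    """Map the raw output in [-1, 1] to a variance between beta_tilde_t and beta_t."""
    frac = (model_var_values + 1) / 2                        # rescale to [0, 1]; this is v
    log_var = frac * log_beta_t + (1 - frac) * log_beta_tilde_t
    return th.exp(log_var)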
(3) Modified UNet architecture (a configuration sketch follows the list below):
We explore the following architectural changes:
• Increasing depth versus width, holding model size relatively constant.
• Increasing the number of attention heads.
• Using attention at 32×32, 16×16, and 8×8 resolutions rather than only at 16×16.
• Using the BigGAN residual block for upsampling and downsampling the activations.
• Rescaling residual connections with $\frac{1}{\sqrt{2}}$, following [60, 27, 28].
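For reference, these choices correspond to arguments of create_model in guided_diffusion/script_util.py. A rough sketch of a matching configuration (the values are illustrative, not the paper's exact settings; double-check argument names against your version of the repo):
from guided_diffusion.script_util import create_model

# Illustrative settings reflecting the listed architectural changes
model = create_model(
    image_size=128,
    num_channels=256,                 # width
    num_res_blocks=2,                 # depth
    learn_sigma=True,                 # also predict the variance interpolation v
    class_cond=True,                  # label embedding added to the time embedding
    attention_resolutions="32,16,8",  # attention at 32x32, 16x16 and 8x8
    num_heads=4,                      # more attention heads
    use_scale_shift_norm=True,        # AdaGN-style scale/shift from the embeddings
    resblock_updown=True,             # BigGAN-style residual up/downsampling
)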
Adaptive Group Normalization
The time embedding and the label embedding are used to produce $y_s$ and $y_b$:
$\mathrm{AdaGN}(h, y) = y_s\,\mathrm{GroupNorm}(h) + y_b$
(This part is described in Appendix H of the paper, pp. 25-26.)
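In the repo this corresponds to the use_scale_shift_norm branch of ResBlock in guided_diffusion/unet.py. A minimal self-contained sketch of the idea (the module name AdaGN and its layout here are mine, not the repo's):
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive GroupNorm: scale y_s and shift y_b come from the time/label embedding."""
    def __init__(self, emb_dim, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # produces y_s and y_b

    def forward(self, h, emb):
        # h: [B, C, H, W] feature map; emb: [B, emb_dim] (time + class embedding)
        y_s, y_b = self.proj(emb).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]                   # broadcast over H, W
        y_b = y_b[:, :, None, None]
        # The repo actually uses (1 + scale), i.e. GroupNorm(h) * (1 + y_s) + y_b
        return self.norm(h) * y_s + y_b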
Code
Despite all the derivation above, the code is largely the same as IDDPM's; here we only go over the parts that differ.
p_sample
guided_diffusion/gaussian_diffusion.py
def p_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
):
"""
Sample x_{t-1} from the model at the given timestep.
:param cond_fn: if not None, this is a gradient function that acts
similarly to the model.
"""
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
noise = th.randn_like(x)
nonzero_mask = (
(t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
) # no noise when t == 0
if cond_fn is not None:
out["mean"] = self.condition_mean(
cond_fn, out, x, t, model_kwargs=model_kwargs
)
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
return {"sample": sample, "pred_xstart": out["pred_xstart"]}
Comparing with the IDDPM version, the only new step here is:
if cond_fn is not None:
out["mean"] = self.condition_mean(
cond_fn, out, x, t, model_kwargs=model_kwargs
)
condition_mean
guided_diffusion/gaussian_diffusion.py
def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute the mean for the previous step, given a function cond_fn that
computes the gradient of a conditional log probability with respect to
x. In particular, cond_fn computes grad(log(p(y|x))), and we want to
condition on y.
This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
"""
gradient = cond_fn(x, self._scale_timesteps(t), **model_kwargs)
new_mean = (
p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
)
return new_mean
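This is the DDPM-style guidance rule from Algorithm 1 of the paper: the predicted mean is shifted along the classifier gradient, scaled by the model variance,
$\mu_{new} = \mu_{\theta}(X_t) + \Sigma_{\theta}(X_t)\,\big(s\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)\big)$
where the guidance scale $s$ is already folded into the value returned by cond_fn, shown next.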
cond_fn
scripts/classifier_sample.py
This function should return $s \times \nabla_{X_t}\log p_{\phi}(y\mid X_t)$, where $s$ is args.classifier_scale.
def cond_fn(x, t, y=None):
assert y is not None
with th.enable_grad():
x_in = x.detach().requires_grad_(True)
logits = classifier(x_in, t)
log_probs = F.log_softmax(logits, dim=-1)
selected = log_probs[range(len(logits)), y.view(-1)]
return th.autograd.grad(selected.sum(), x_in)[0] * args.classifier_scale
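For context, in scripts/classifier_sample.py this cond_fn (together with a model_fn wrapper around the UNet) is simply handed to the sampling loop. A simplified sketch of that call (it follows the script, but details may differ slightly between versions):
model_kwargs = {"y": classes}  # randomly sampled class labels, one per image in the batch
sample_fn = (
    diffusion.p_sample_loop if not args.use_ddim else diffusion.ddim_sample_loop
)
sample = sample_fn(
    model_fn,                        # wraps the UNet; feeds y when class-conditional
    (args.batch_size, 3, args.image_size, args.image_size),
    clip_denoised=args.clip_denoised,
    model_kwargs=model_kwargs,
    cond_fn=cond_fn,                 # the scaled classifier gradient defined above
    device=dist_util.dev(),
)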
ddim_sample
This is the DDIM sampling method, which was covered in the earlier post "IDDPM原理和代码剖析"; please refer to that if anything is unclear. Here we only go over the main change.
def ddim_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
eta=0.0,
):
"""
Sample x_{t-1} from the model using DDIM.
Same usage as p_sample().
"""
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
if cond_fn is not None:
out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)
# Usually our model outputs epsilon, but we re-derive it
# in case we used x_start or x_prev prediction.
eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
sigma = (
eta
* th.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar))
* th.sqrt(1 - alpha_bar / alpha_bar_prev)
)
# Equation 12.
noise = th.randn_like(x)
mean_pred = (
out["pred_xstart"] * th.sqrt(alpha_bar_prev)
+ th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps
)
nonzero_mask = (
(t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
) # no noise when t == 0
sample = mean_pred + nonzero_mask * sigma * noise
return {"sample": sample, "pred_xstart": out["pred_xstart"]}
Again, compared with plain DDIM sampling, the new step is:
if cond_fn is not None:
    out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)
condition_score
def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute what the p_mean_variance output would have been, should the
model's score function be conditioned by cond_fn.
See condition_mean() for details on cond_fn.
Unlike condition_mean(), this instead uses the conditioning strategy
from Song et al (2020).
"""
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
x, self._scale_timesteps(t), **model_kwargs
)
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(
x_start=out["pred_xstart"], x_t=x, t=t
)
return out
Here, alpha_bar is $\overline{\alpha}_t$:
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
eps is $\epsilon_{\theta}(X_t) - \sqrt{1-\overline{\alpha}_t}\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)$, where the $\nabla_{X_t}\log p_{\phi}(y\mid X_t)$ factor is what cond_fn returns (already multiplied by the guidance scale $s$):
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
x, self._scale_timesteps(t), **model_kwargs
)
The remaining steps follow the original DDIM formulas. Looking at the code, however, the mean recomputed inside condition_score is actually the DDPM posterior mean:
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
(Note that ddim_sample itself only uses out["pred_xstart"] and the re-derived eps, so this recomputed mean does not enter the DDIM update above.)
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(
x_start=out["pred_xstart"], x_t=x, t=t
)
The mean returned by q_posterior_mean_variance is computed as
posterior_mean = (
_extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
+ _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
)
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
posterior_mean_coef1 is $\frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}$, and posterior_mean_coef2 is $\frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}$:
self.posterior_mean_coef1 = (
betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)
self.posterior_mean_coef2 = (
(1.0 - self.alphas_cumprod_prev)
* np.sqrt(alphas)
/ (1.0 - self.alphas_cumprod)
)
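These coefficients are just the standard DDPM posterior $q(X_{t-1}\mid X_t, X_0)$ written in code. For reference:
$q(X_{t-1}\mid X_t, X_0) = \mathcal{N}\!\big(X_{t-1};\ \widetilde{\mu}(X_t, X_0),\ \widetilde{\beta}_t I\big), \qquad \widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\,\beta_t$
Since $\beta_t = 1 - \alpha_t$, the first coefficient $\frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}$ is exactly betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod) above.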