Classifier Guided Diffusion
Preface
Previously we covered OpenAI's DDPM and IDDPM (see the earlier posts "DDPM原理与代码剖析" and "IDDPM原理和代码剖析"), as well as Stanford's DDIM (see "DDIM原理及代码(Denoising diffusion implicit models)"). This time we look at another OpenAI work, a classic in which diffusion models surpass GANs: Diffusion Models Beat GANs on Image Synthesis.
github: https://github.com/openai/guided-diffusion
This post mainly references "66、Classifier Guided Diffusion条件扩散模型论文与PyTorch代码详细解读".
The code discussed here is based mainly on the codebase accompanying the IDDPM paper (see "IDDPM原理和代码剖析").
(Placeholder note: the code-analysis part is not yet complete.)
Theory
Preliminaries
(1) The authors first ran extensive ablation experiments on unconditional diffusion models, drew conclusions from them, and used those conclusions to design the architecture.
(2) A straightforward way to build a conditional diffusion model is to embed the label and add it to the time embedding, but this alone does not work very well. The paper therefore adds classifier guidance on top (the conventional conditional-generation approach above is kept, not discarded).
Concretely, the gradient of a classifier with respect to the noisy image $X_t$ is used to steer the model's sampling toward images of the desired class $y$.
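Written out (Algorithms 1 and 2 of the paper, with guidance scale $s$): DDPM-style sampling shifts the predicted mean, while DDIM-style sampling shifts the predicted noise,
$\hat{\mu}(X_t) = \mu_{\theta}(X_t) + s\,\Sigma_{\theta}(X_t)\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)$
$\hat{\epsilon}(X_t) = \epsilon_{\theta}(X_t) - \sqrt{1-\overline{\alpha}_t}\, s\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)$
These two rules are exactly what condition_mean and condition_score implement in the code below.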
Introduction
(1) Diffusion models are likelihood-based models.
(2) The model borrows the variance-range prediction from Improved DDPM (the $v$ in the formula below):
$\Sigma_{\theta}(X_t, t) = \exp\big(v\log\beta_t + (1-v)\log\widetilde{\beta}_t\big)$
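For reference, the codebase maps the raw network output (in $[-1, 1]$) to a log-variance interpolated between $\log\widetilde{\beta}_t$ and $\log\beta_t$. A small self-contained sketch of that mapping (the helper name is mine; in the repo this happens inline in p_mean_variance under the LEARNED_RANGE model-variance type):
import torch as th

def learned_range_variance(model_var_values, log_beta_t, log_beta_tilde_t):
    """Map the raw output in [-1, 1] to a variance between beta_tilde_t and beta_t."""
    frac = (model_var_values + 1) / 2                        # rescale to [0, 1]; this is v
    log_var = frac * log_beta_t + (1 - frac) * log_beta_tilde_t
    return th.exp(log_var)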
(3) Modified UNet architecture (a configuration sketch follows the list below):
We explore the following architectural changes:
• Increasing depth versus width, holding model size relatively constant.
• Increasing the number of attention heads.
• Using attention at 32×32, 16×16, and 8×8 resolutions rather than only at 16×16.
• Using the BigGAN residual block for upsampling and downsampling the activations.
• Rescaling residual connections with $\frac{1}{\sqrt{2}}$, following [60, 27, 28].
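For reference, these choices correspond to arguments of create_model in guided_diffusion/script_util.py. A rough sketch of a matching configuration (the values are illustrative, not the paper's exact settings; double-check argument names against your version of the repo):
from guided_diffusion.script_util import create_model

# Illustrative settings reflecting the listed architectural changes
model = create_model(
    image_size=128,
    num_channels=256,                 # width
    num_res_blocks=2,                 # depth
    learn_sigma=True,                 # also predict the variance interpolation v
    class_cond=True,                  # label embedding added to the time embedding
    attention_resolutions="32,16,8",  # attention at 32x32, 16x16 and 8x8
    num_heads=4,                      # more attention heads
    use_scale_shift_norm=True,        # AdaGN-style scale/shift from the embeddings
    resblock_updown=True,             # BigGAN-style residual up/downsampling
)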
Adaptive Group Normalization
The time embedding and the label embedding are used to produce $y_s$ and $y_b$:
$\mathrm{AdaGN}(h, y) = y_s\,\mathrm{GroupNorm}(h) + y_b$
(This part is described in Appendix H of the paper, pp. 25-26.)
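In the repo this corresponds to the use_scale_shift_norm branch of ResBlock in guided_diffusion/unet.py. A minimal self-contained sketch of the idea (the module name AdaGN and its layout here are mine, not the repo's):
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive GroupNorm: scale y_s and shift y_b come from the time/label embedding."""
    def __init__(self, emb_dim, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # produces y_s and y_b

    def forward(self, h, emb):
        # h: [B, C, H, W] feature map; emb: [B, emb_dim] (time + class embedding)
        y_s, y_b = self.proj(emb).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]                   # broadcast over H, W
        y_b = y_b[:, :, None, None]
        # The repo actually uses (1 + scale), i.e. GroupNorm(h) * (1 + y_s) + y_b
        return self.norm(h) * y_s + y_b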
Code
Despite all the derivation above, the code is largely the same as IDDPM's; here we only go over the parts that differ.
p_sample
guided_diffusion/gaussian_diffusion.py
def p_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
):
"""
Sample x_{t-1} from the model at the given timestep.
:param cond_fn: if not None, this is a gradient function that acts
similarly to the model.
"""
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
noise = th.randn_like(x)
nonzero_mask = (
(t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
) # no noise when t == 0
if cond_fn is not None:
out["mean"] = self.condition_mean(
cond_fn, out, x, t, model_kwargs=model_kwargs
)
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
return {"sample": sample, "pred_xstart": out["pred_xstart"]}
Comparing with the IDDPM version, the only new step here is:
if cond_fn is not None:
out["mean"] = self.condition_mean(
cond_fn, out, x, t, model_kwargs=model_kwargs
)
condition_mean
guided_diffusion/gaussian_diffusion.py
def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute the mean for the previous step, given a function cond_fn that
computes the gradient of a conditional log probability with respect to
x. In particular, cond_fn computes grad(log(p(y|x))), and we want to
condition on y.
This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
"""
gradient = cond_fn(x, self._scale_timesteps(t), **model_kwargs)
new_mean = (
p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
)
return new_mean
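This is the DDPM-style guidance rule from Algorithm 1 of the paper: the predicted mean is shifted along the classifier gradient, scaled by the model variance,
$\mu_{new} = \mu_{\theta}(X_t) + \Sigma_{\theta}(X_t)\,\big(s\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)\big)$
where the guidance scale $s$ is already folded into the value returned by cond_fn, shown next.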
cond_fn
scripts/classifier_sample.py
This function should return $s \times \nabla_{X_t}\log p_{\phi}(y\mid X_t)$, where $s$ is args.classifier_scale.
def cond_fn(x, t, y=None):
assert y is not None
with th.enable_grad():
x_in = x.detach().requires_grad_(True)
logits = classifier(x_in, t)
log_probs = F.log_softmax(logits, dim=-1)
selected = log_probs[range(len(logits)), y.view(-1)]
return th.autograd.grad(selected.sum(), x_in)[0] * args.classifier_scale
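For context, in scripts/classifier_sample.py this cond_fn (together with a model_fn wrapper around the UNet) is simply handed to the sampling loop. A simplified sketch of that call (it follows the script, but details may differ slightly between versions):
model_kwargs = {"y": classes}  # randomly sampled class labels, one per image in the batch
sample_fn = (
    diffusion.p_sample_loop if not args.use_ddim else diffusion.ddim_sample_loop
)
sample = sample_fn(
    model_fn,                        # wraps the UNet; feeds y when class-conditional
    (args.batch_size, 3, args.image_size, args.image_size),
    clip_denoised=args.clip_denoised,
    model_kwargs=model_kwargs,
    cond_fn=cond_fn,                 # the scaled classifier gradient defined above
    device=dist_util.dev(),
)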
ddim_sample
This is the DDIM sampling method, which was covered in the earlier post "IDDPM原理和代码剖析"; please refer to that if anything is unclear. Here we only go over the main change.
def ddim_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
eta=0.0,
):
"""
Sample x_{t-1} from the model using DDIM.
Same usage as p_sample().
"""
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
if cond_fn is not None:
out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)
# Usually our model outputs epsilon, but we re-derive it
# in case we used x_start or x_prev prediction.
eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
sigma = (
eta
* th.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar))
* th.sqrt(1 - alpha_bar / alpha_bar_prev)
)
# Equation 12.
noise = th.randn_like(x)
mean_pred = (
out["pred_xstart"] * th.sqrt(alpha_bar_prev)
+ th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps
)
nonzero_mask = (
(t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
) # no noise when t == 0
sample = mean_pred + nonzero_mask * sigma * noise
return {"sample": sample, "pred_xstart": out["pred_xstart"]}
Again, compared with plain DDIM sampling, the new step is:
if cond_fn is not None:
    out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)
condition_score
def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute what the p_mean_variance output would have been, should the
model's score function be conditioned by cond_fn.
See condition_mean() for details on cond_fn.
Unlike condition_mean(), this instead uses the conditioning strategy
from Song et al (2020).
"""
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
x, self._scale_timesteps(t), **model_kwargs
)
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(
x_start=out["pred_xstart"], x_t=x, t=t
)
return out
Here, alpha_bar is $\overline{\alpha}_t$:
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
eps is $\epsilon_{\theta}(X_t) - \sqrt{1-\overline{\alpha}_t}\,\nabla_{X_t}\log p_{\phi}(y\mid X_t)$, where the $\nabla_{X_t}\log p_{\phi}(y\mid X_t)$ factor is what cond_fn returns (already multiplied by the guidance scale $s$):
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
x, self._scale_timesteps(t), **model_kwargs
)
The remaining steps follow the original DDIM formulas. Looking at the code, however, the mean recomputed inside condition_score is actually the DDPM posterior mean:
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
(Note that ddim_sample itself only uses out["pred_xstart"] and the re-derived eps, so this recomputed mean does not enter the DDIM update above.)
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(
x_start=out["pred_xstart"], x_t=x, t=t
)
The mean returned by q_posterior_mean_variance is computed as
posterior_mean = (
_extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
+ _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
)
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
posterior_mean_coef1 is $\frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}$, and posterior_mean_coef2 is $\frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}$:
self.posterior_mean_coef1 = (
betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)
self.posterior_mean_coef2 = (
(1.0 - self.alphas_cumprod_prev)
* np.sqrt(alphas)
/ (1.0 - self.alphas_cumprod)
)
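These coefficients are just the standard DDPM posterior $q(X_{t-1}\mid X_t, X_0)$ written in code. For reference:
$q(X_{t-1}\mid X_t, X_0) = \mathcal{N}\!\big(X_{t-1};\ \widetilde{\mu}(X_t, X_0),\ \widetilde{\beta}_t I\big), \qquad \widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\,\beta_t$
Since $\beta_t = 1 - \alpha_t$, the first coefficient $\frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}$ is exactly betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod) above.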