This warning came up while Xiaohu was training with DDP (DistributedDataParallel); the fix is to append .contiguous() to the tensors produced by certain operations.

Environment

Python 3.10 + PyTorch 2.0

The original warning

/home/xiaohu/anaconda3/envs/smor/lib/python3.10/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [3, 24, 1, 1], strides() = [24, 1, 24, 24]
bucket_view.sizes() = [3, 24, 1, 1], strides() = [24, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:320.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Solution

Add .contiguous() after rearrange, transpose, repeat, and einsum. For example:

x = rearrange(x, 'b h w (p1 p2 c) -> b (h p1) (w p2) c', p1=self.dim_scale, p2=self.dim_scale, c=C//self.dim_scale).contiguous()

x_hwwh = torch.stack([x.view(B, -1, L), torch.transpose(x, dim0=2, dim1=3).contiguous().view(B, -1, L)], dim=1).view(B, 2, -1, L)

x_dbl = torch.einsum("b k d l, k c d -> b k c l", xs.view(B, K, -1, L), self.x_proj_weight).contiguous()
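Inside a model, the .contiguous() call belongs in forward(), right at the operation that breaks contiguity. A minimal sketch, assuming a hypothetical module where torch.transpose stands in for rearrange/einsum (the class and layer names are illustrative, not from the original code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Illustrative module: a transpose inside forward breaks contiguity."""
    def __init__(self, dim=24):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, C, L)
        # Without .contiguous(), the gradient flowing back through the
        # transpose can carry non-default strides, which triggers the
        # DDP bucket-view warning during gradient all-reduce.
        x = x.transpose(1, 2).contiguous()   # (B, L, C), forced row-major
        return self.proj(x)

model = Block()
out = model(torch.randn(2, 24, 7))
out.sum().backward()                         # gradients flow normally
```

In a real run the module would be wrapped in torch.nn.parallel.DistributedDataParallel; the forward-pass fix is the same either way.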

References

Grad strides do not match bucket view strides
