
Conversation

@zasdfgbnm (Collaborator) commented Aug 24, 2020

[bc-breaking note] Previously, when there were multiple max/min/median elements with the same value, the gradient propagated only to the first element with that value; now the gradient is evenly distributed between all such elements. This yields the minimum-norm subgradient.
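A minimal before/after sketch of the change (assumed illustration, not part of the PR diff):

import torch

x = torch.tensor([1., 3., 3., 2.], requires_grad=True)
x.max().backward()
print(x.grad)
# new behavior: tensor([0.0000, 0.5000, 0.5000, 0.0000])  -- gradient split evenly among the ties
# old behavior: the whole gradient went to a single tied element, e.g. tensor([0., 1., 0., 0.])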
cc: @ngimel @mruberry

  return grad_input;
Tensor evenly_dispatch_backward(Tensor grad, const Tensor & input, const Tensor & value) {
  auto mask = (input == value);
  auto count = mask.sum(input.scalar_type());
Collaborator

For tensors with more than 2^24 elements that all happen to be the max/min/median, this is going to be inaccurate, so maybe it's better to leave it in int64 or double, depending on which is faster?
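A quick illustrative check of the fp32 concern (assumed example): float32 cannot represent integers above 2^24 exactly, so a float32 count of that size stops incrementing.

import torch

n = float(2**24)
print(torch.tensor(n, dtype=torch.float32) + 1 == n)  # tensor(True): 2^24 + 1 rounds back to 2^24 in fp32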

Collaborator Author

It is a scalar, so I don't think it makes any difference between int64 and fp64. Let's just use the default (int64).

Tensor evenly_dispatch_backward(Tensor grad, const Tensor & input, const Tensor & value) {
  auto mask = (input == value);
  auto count = mask.sum(input.scalar_type());
  return at::zeros_like(input).masked_fill_(mask, grad / count);
Collaborator

I wonder if it would be faster to do mask.to(input.scalar_type()) * (grad / count) here?
Also, it might be worth special-casing count == 1, where we could do something more efficient than masked_fill_().

Collaborator

Good point, and we don't even need mask.to, because TensorIterator would take care of it, and on CUDA the type promotion will be implicit. grad / count has to be converted to input.scalar_type() for this to work, though.
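For illustration of the type-promotion point (assumed example, default promotion rules): a bool mask multiplied by a float scalar already produces a float result, so no explicit mask.to(...) is needed.

import torch

mask = torch.tensor([True, False, True])
g = torch.tensor(0.5)
print((mask * g).dtype, mask * g)  # torch.float32 tensor([0.5000, 0.0000, 0.5000])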

Collaborator Author

Whether to choose * or masked_fill_ depends on CPU vs. CUDA:

import torch
for device in ['cpu', 'cuda']:
    t = torch.randint(100, (1024, 1024, 64), device=device, dtype=torch.float)
    s = t.sum()
    m = t.max()
    mask = (t == m)
    go = t.new_tensor(1.)
    torch.cuda.synchronize()
    print(device)
    %timeit mask * (go / s); torch.cuda.synchronize() if device == 'cuda' else None
    %timeit torch.zeros_like(t).masked_fill_(mask, go / s); torch.cuda.synchronize() if device == 'cuda' else None
cpu
44.5 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
25.2 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
848 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

so there should be some logic along the lines of:

if (cuda) {
  *
} else {
  masked_fill
}

and I don't think it should have a separate case for count == 1, because the only difference is grad / count vs grad, and grad is a scalar tensor; scalar_tensor.item() is not faster than scalar_tensor / another_scalar_tensor.
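A Python sketch of that device-dependent choice (the actual PR implements this in C++; the function name below is illustrative only):

import torch

def evenly_distribute_backward(grad, input, value):
    mask = (input == value)
    count = mask.sum()
    if input.is_cuda:
        # elementwise multiply is faster on CUDA; type promotion handles the bool mask
        return mask * (grad / count)
    # masked_fill_ is faster on CPU
    return torch.zeros_like(input).masked_fill_(mask, grad / count)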

@zasdfgbnm (Collaborator Author)

Should be ready now. See my replies to reviews.

@ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Aug 25, 2020
@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zasdfgbnm mentioned this pull request on Aug 25, 2020
@zasdfgbnm deleted the grad-to-all branch on August 25, 2020 23:42
@facebook-github-bot (Contributor)

@ngimel merged this pull request in 348e78b.

@ngimel added the module: bc-breaking label (related to a BC-breaking change) on Aug 26, 2020
@michaelklachko

@zasdfgbnm @albanD is this also the case for kthvalue op?

@albanD (Collaborator) commented Oct 30, 2020

kthvalue is different, as it returns indices (just like max(dim=)), so to be consistent these ops only return gradients for the index that was chosen during the forward.
So no, I don't think this applies to the current kthvalue function.
If we had a version of kthvalue that did a full reduction and did not return indices, then yes, that would apply.
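A small check of the current kthvalue behavior, for illustration (assumed example):

import torch

x = torch.tensor([3., 3., 2.], requires_grad=True)
v, idx = x.kthvalue(3)  # kthvalue returns (values, indices), like max(dim=...)
v.backward()
print(idx, x.grad)      # the whole gradient lands on the single index chosen in the forward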

@michaelklachko

I see. What was the reason to fix this issue? Are there examples of where the old behavior could lead to training instability? And what should I do if I have to backprop through kthvalue during training and I want to spread the gradients evenly?

@albanD (Collaborator) commented Oct 30, 2020

Hi,

Instability is a strong word, but the old behavior could lead to unexpected results, in particular because the value that was getting all the gradient was chosen in a non-deterministic way and could change across devices.

But also, from a more principled point of view: when computing subgradients, we prefer the one with minimum norm (as it is always a descent direction), which in this case is the even distribution across all inputs that realize the value.
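A small worked check of the minimum-norm claim, as an illustration: for f(x) = max(x) at x = (3, 3, 2), the subgradients are g = (a, 1 - a, 0) with 0 <= a <= 1, and ||g||^2 = a^2 + (1 - a)^2 is smallest at a = 0.5, i.e. the even split.

import torch

a = torch.linspace(0.0, 1.0, 5)
print(torch.stack([a, a**2 + (1 - a)**2], dim=1))  # squared norm is minimized at a = 0.5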

To do the same thing with kthvalue I guess you will need to do it yourself:

import torch
from torch import autograd

# No dim can be specified and only a full reduction is done
class MyKthvalue(autograd.Function):
    @staticmethod
    def forward(ctx, inp, k):
        res = inp.contiguous().view(-1).kthvalue(k).values
        ctx.save_for_backward(inp, res)
        return res

    @staticmethod
    def backward(ctx, gO):
        inp, res = ctx.saved_tensors
        mask = (inp == res)
        count = mask.sum()
        # distribute the incoming gradient evenly across all entries equal to the k-th value
        gO = gO / count
        return mask * gO, None


a = torch.randint(0, 10, (4, 4, 4), dtype=torch.float, requires_grad=True)
k = 3

MyKthvalue.apply(a, k).sum().backward()

print(a)
print(a.grad)

@chanshing

Does this change ReLU and related layers that depend on the max function? If so, what are the implications?

@zasdfgbnm (Collaborator Author)

@chanshing ReLU is not changed
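For illustration (assumed example): ReLU is an elementwise op with its own derivative rather than a max reduction, so the change above does not touch it; at exactly 0, PyTorch's relu backward returns 0.

import torch

x = torch.tensor([-1., 0., 2.], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.])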

pytorchmergebot pushed a commit that referenced this pull request Jun 15, 2025
Fixes #155048

The behavior of `min` and `max` was changed in #43519. The note about gradient behavior in the torch.amin and torch.amax docs is updated to reflect this change:

New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values when there are multiple input elements with the same minimum or maximum value.`

cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: #155071
Approved by: https://github.com/soulitzer

Labels: Merged · module: bc-breaking · open source · triaged
