
Conversation

@vishwakftw
Contributor

Defined the subgradient of std when result = 0 to be inf.

  1. Added tests in test_autograd
  2. New method std_backward() for computing backward in Functions.cpp

Fixes #4320.
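For reference, the failure mode from #4320 can be reproduced with a snippet like the one below (a minimal sketch, not necessarily the issue's exact repro):

import torch

# std of a constant tensor is 0; backpropagating through it yields NaN
# gradients (0/0 inside the backward formula), which is what this PR addresses.
x = torch.ones(7, dtype=torch.double, requires_grad=True)
torch.std(x).backward()
print(x.grad)
# tensor([nan, nan, nan, nan, nan, nan, nan], dtype=torch.float64)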

@ssnl
Collaborator

ssnl commented Jul 7, 2018

:( this adds 2 extra ops

@vishwakftw
Contributor Author

vishwakftw commented Jul 7, 2018

I can make it one extra op by modifying the entry in derivatives.yaml alone:


- name: std(Tensor self, bool unbiased)
  # current formulation in this PR (new std_backward in Functions.cpp):
  self: std_backward(grad, self, result, unbiased)
  # proposed single-expression alternative:
  self: var_backward(grad / (2 * result), self, unbiased).masked_fill_(result == 0., INFINITY)
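For context, the inline expression relies on the chain rule std = sqrt(var), i.e. d std/dx = (d var/dx) / (2 * std) wherever std is nonzero. A quick numerical check of that identity (my own sketch, not part of the PR):

import torch

# Compare the gradient of std with the gradient of var rescaled by 1 / (2 * std).
x = torch.randn(5, dtype=torch.double, requires_grad=True)

s = x.std(unbiased=True)
s.backward()
grad_std = x.grad.clone()

x.grad.zero_()
x.var(unbiased=True).backward()
grad_var = x.grad.clone()

print(torch.allclose(grad_std, grad_var / (2 * s.detach())))
# True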


@ezyang
Contributor

ezyang commented Jul 9, 2018

@vishwakftw I'm perhaps not the best core dev to ask about this, since I have always leaned toward correctness over perf, but it might help appease fears about slowness to run a quick benchmark before and after to see what the perf impact is.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl
Collaborator

ssnl commented Jul 9, 2018

While we are at it, why do we choose inf? If I understand correctly, any value in [-inf, inf] should be fine, right?

@vishwakftw
Contributor Author

This is the time difference:

# Without masked_fill_(result == 0.0, INFINITY)
81 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# With masked_fill_(result == 0.0, INFINITY)
128 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# With recent version (optimized masked_fill)
109 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Regarding the use of inf instead of any other value in [-inf, inf]: I did this to match the behaviour of sqrt.
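For the record, here is the sqrt behaviour being matched (my own snippet, not part of the PR):

import torch

# sqrt's backward is grad / (2 * sqrt(x)), so at x == 0 autograd returns inf.
x = torch.zeros(1, dtype=torch.double, requires_grad=True)
torch.sqrt(x).sum().backward()
print(x.grad)
# tensor([inf], dtype=torch.float64)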

Member

@colesbury colesbury left a comment


inf is not a good choice for the gradient of std() when std() is 0.

You can derive a formula using the limit definition of a derivative and the formula for standard deviation. It's going to depend on the number of elements N.

@colesbury
Member

I think the formula should be:

unbiased: 1/sqrt(n)
biased: sqrt(n-1)/n

You can verify this numerically:

import math
import torch

N = 7
eps = 1e-9
x = torch.zeros(N, dtype=torch.double)
x[0] = eps
print(float(torch.std(x) / eps))
print(float(1/math.sqrt(N)))
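The same check works for the biased case (my own extension of the snippet above, using the sqrt(n-1)/n formula):

import math
import torch

# One-sided numerical derivative of the biased std at a constant input.
N = 7
eps = 1e-9
x = torch.zeros(N, dtype=torch.double)
x[0] = eps
print(float(torch.std(x, unbiased=False) / eps))
print(math.sqrt(N - 1) / N)
# both print ~0.3499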

@vishwakftw
Contributor Author

vishwakftw commented Jul 12, 2018

@colesbury Thanks for the advice. I computed the derivative using the definition and it came out to be sqrt((n-1)/(n*N)), where N = n-1 for unbiased and N = n otherwise.

@ssnl
Collaborator

ssnl commented Jul 12, 2018 via email

@ssnl
Collaborator

ssnl commented Jul 12, 2018 via email

@vishwakftw
Contributor Author

It won't be negative, but I think it'll be higher if you take x - delta instead of x + delta.

@ssnl
Collaborator

ssnl commented Jul 12, 2018 via email

@vishwakftw
Contributor Author

@pytorchbot retest this please

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


Tensor std_backward(const Tensor & grad, const Tensor & self, Tensor result, bool unbiased) {
  Tensor result_zero_mask = (result == 0.);
  if (result_zero_mask.any().toCByte()) {


@ezyang
Contributor

ezyang commented Jul 14, 2018

@ssnl is the math good now?

@ssnl
Collaborator

ssnl commented Jul 15, 2018

The derivative is not the same if we take the limit from the x->0- direction. The change in the output is the same, but the change in the input is negated, so that limit is the negative of the one taken from the x->0+ direction. You can verify this using @colesbury's code above but with eps=-1e-9. So essentially any value in the range [-1/sqrt(N), 1/sqrt(N)] is a valid subgradient.

Since the usage is mostly gradient-based optimization like GD, it makes sense to take one of these two extremes, but it's unclear to me which we should choose. (nan really isn't a bad choice in this sense.) I think if we choose one, we should state the choice in the doc.
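To make the sign flip concrete, here is @colesbury's check rerun with a negative eps (my own variation):

import math
import torch

# Approaching from the negative side: the one-sided limit is -1/sqrt(N).
N = 7
eps = -1e-9
x = torch.zeros(N, dtype=torch.double)
x[0] = eps
print(float(torch.std(x) / eps))
print(-1 / math.sqrt(N))
# both print ~ -0.3780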

@ezyang
Contributor

ezyang commented Jul 17, 2018

@vishwakftw?

@vishwakftw
Contributor Author

I thought we were waiting for @colesbury's call. If not, I am sorry; I will send in the changes as soon as I can.

@ezyang
Contributor

ezyang commented Jul 17, 2018

No, I'm sorry; I wasn't sure what we were waiting on. If you like I can bug him about it tomorrow.

@vishwakftw
Contributor Author

@ezyang Here are the timings you were curious about:

With CUDA
With unconditional masked_fill_ in std_backward() for tensor of size 1000 x 1000 x 10 filled with 1s
33.6 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With conditional masked_fill_ in std_backward() for tensor of size 1000 x 1000 x 10 filled with 1s
33.7 ms ± 275 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With unconditional masked_fill_ in std_backward() for tensor of size 1000 x 1000 x 10 filled using randn
31.2 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With conditional masked_fill_ in std_backward() for tensor of size 1000 x 1000 x 10 filled using randn
30.8 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The conditional masked_fill_ is the current implementation.
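A rough sketch of how such timings could be reproduced (the exact benchmark script is not in the thread; the tensor shape and fill follow the descriptions above):

import torch

x = torch.ones(1000, 1000, 10, device="cuda", requires_grad=True)
y = x.std()

def backward_step():
    x.grad = None
    y.backward(retain_graph=True)
    torch.cuda.synchronize()  # make the timing reflect the queued GPU work

# In IPython: %timeit backward_step()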

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ailzhang
Contributor

@pytorchbot retest this please

@vishwakftw
Contributor Author

Is this good to go?

@colesbury
Member

I'm a bit wary about this change. It looks like it introduces a CUDA synchronization point in result_zero_mask.any().toCByte() (as ezyang pointed out).

Synchronization points may not slow down an individual call, but they make it harder to hide kernel launch latency and CPU computation in a bigger system.
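Roughly, the concern is a pattern like the following (a sketch of the shape of the code, not the PR's exact implementation): reading the boolean reduction back to the host blocks the CPU until all queued kernels finish.

import torch

if torch.cuda.is_available():
    result = torch.randn(1000, 1000, device="cuda")
    mask = (result == 0.)
    # bool() copies the reduction result to the host, forcing a CUDA sync.
    if bool(mask.any()):
        result.masked_fill_(mask, float("inf"))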

@vishwakftw
Contributor Author

Are you suggesting that the masked_fill_ be done unconditionally?

@ezyang
Contributor

ezyang commented Aug 13, 2018

I hate to say it, but maybe we should just write the fused kernel that does the masked fill unconditionally. Then we don't have to worry about the extra kernel launch overhead.

@YannDubs

YannDubs commented Sep 24, 2018

@vishwakftw @colesbury I'm not sure the derivatives you computed are correct. It seems to me that the formulas you have were derived by adding eps to only one dimension, but what we really want is to add an eps to each dimension. Indeed, we only get NaNs when all the xi's are equal (you don't have 0/0 in the other case).

I.e., what you have:

import torch

N = 7
eps = 1e-9
x0 = torch.zeros(N, dtype=torch.double, requires_grad=True)
epsis = torch.zeros(N, dtype=torch.double)
epsis[0] = eps
x = x0 + epsis
y = torch.std(x)
y.backward()
print(x0.grad)
# tensor([ 0.3780, -0.0630, -0.0630, -0.0630, -0.0630, -0.0630, -0.0630], dtype=torch.float64)

What I think we want to solve:

import torch

N = 7
eps = 1e-9
x0 = torch.zeros(N, dtype=torch.double, requires_grad=True)
x = x0 + eps
y = torch.std(x)
y.backward()
print(x0.grad)
# tensor([nan, nan, nan, nan, nan, nan, nan], dtype=torch.float64)

For the second case, you have to add eps to each xi in the definition of the derivative. The limit should give 0.

Numerical check:

import torch

N = 7
eps = 1e-9
x = torch.zeros(N, dtype=torch.double)
print(float(torch.std(x + eps) / eps))
# 0.0

I hope I understood the issue correctly.

@vishwakftw
Contributor Author

This is stale, closing for now.

@vishwakftw vishwakftw closed this Nov 14, 2018
@vishwakftw vishwakftw deleted the std-gradient-fix branch November 14, 2018 03:20
tony2037 pushed a commit to tony2037/attention-is-all-you-need-pytorch that referenced this pull request May 27, 2020

According to "Attention Is All You Need", the formula applied to each sublayer is supposed to be LayerNorm(x + Sublayer(x)).

As mentioned in [issue#142](jadore801120#142), this implementation results from consideration of the problem in PyTorch (see [pytorch issue#4320](pytorch/pytorch#4320)), which has been fixed in [Fix standard deviation gradient](pytorch/pytorch#9238).