Fix standard deviation gradient #9238
Conversation
1. Added tests in test_autograd
2. New method std_backward() for computing the backward in Functions.cpp
:( this adds 2 extra ops

I can make it one extra op alone, by modifying the entry in tools/autograd/derivatives.yaml:
- name: std(Tensor self, bool unbiased)
  self: std_backward(grad, self, result, unbiased)
  self: var_backward(grad / (2 * result), self, unbiased).masked_fill_(result == 0., INFINITY)
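For intuition, here is a small numerical sketch (editorial, not code from the PR) of the chain rule that the var_backward(grad / (2 * result), ...) entry encodes: since std(x) = sqrt(var(x)), the gradient passed into var_backward is the incoming gradient rescaled by 1 / (2 * std(x)), so away from std == 0 the two routes agree.

```python
import torch

# std(x) = sqrt(var(x))  =>  d std / dx = (1 / (2 * std(x))) * d var / dx
x = torch.randn(6, dtype=torch.double, requires_grad=True)
grad_out = torch.randn((), dtype=torch.double)

(g_std,) = torch.autograd.grad(torch.std(x), x, grad_out)
(g_via_var,) = torch.autograd.grad(
    torch.var(x), x, grad_out / (2 * torch.std(x).detach())
)

print(torch.allclose(g_std, g_via_var))  # True whenever std(x) != 0
```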
@vishwakftw I'm perhaps not the best core dev to ask about this, since I have always leaned on the side of correctness versus perf, but maybe it would help appease fears about slowness to do a quick benchmark before and after to see what the perf impact is.
facebook-github-bot left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
while we are at it, why do we choose …

This is the time difference: … Regarding use of …
colesbury left a comment
inf is not a good choice for the gradient of std() when std() is 0.
You can derive a formula using the limit definition of a derivative and the formula for standard deviation. It's going to depend on the number of elements N.
I think the formula should be: unbiased: 1/sqrt(n); biased: sqrt(n-1)/n. You can verify this numerically.
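For reference, the one-sided limit behind these numbers can be worked out explicitly (an editorial sketch: perturb one coordinate of the all-zeros vector by ε > 0 and let ε → 0):

$$
x = \varepsilon e_1 \in \mathbb{R}^n,\qquad
\bar{x} = \frac{\varepsilon}{n},\qquad
\sum_{i=1}^{n} (x_i - \bar{x})^2
= \varepsilon^2\frac{(n-1)^2}{n^2} + (n-1)\frac{\varepsilon^2}{n^2}
= \frac{(n-1)\,\varepsilon^2}{n}.
$$

$$
\text{unbiased: } \mathrm{std}(x) = \sqrt{\frac{(n-1)\,\varepsilon^2}{n\,(n-1)}} = \frac{\varepsilon}{\sqrt{n}}
\;\Longrightarrow\; \lim_{\varepsilon \to 0^{+}} \frac{\mathrm{std}(x) - 0}{\varepsilon} = \frac{1}{\sqrt{n}},
$$

$$
\text{biased: } \mathrm{std}(x) = \sqrt{\frac{(n-1)\,\varepsilon^2}{n^2}} = \frac{\sqrt{n-1}}{n}\,\varepsilon
\;\Longrightarrow\; \lim_{\varepsilon \to 0^{+}} \frac{\mathrm{std}(x) - 0}{\varepsilon} = \frac{\sqrt{n-1}}{n}.
$$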
@colesbury Thanks for the advice. I computed the derivative using the definition and it came out to be sqrt((n-1)/(nN)), where N = n-1 for unbiased and n otherwise.
@colesbury then you can apply the same logic and say that they should be the negated ones of what you have.
…On Thu, Jul 12, 2018 at 02:04 Sam Gross ***@***.***> wrote:
I think the formula should be:
unbiased: 1/sqrt(n)
biased: sqrt(n-1)/n
You can verify this numerically
import math
import torch

N = 7
eps = 1e-9
x = torch.zeros(N, dtype=torch.double)
x[0] = eps
print(float(torch.std(x) / eps))
print(float(1/math.sqrt(N)))
I think it depends on which direction you take the limit.
…On Thu, Jul 12, 2018 at 02:07 Vishwak Srinivasan ***@***.***> wrote:
@colesbury <https://github.com/colesbury> Thanks for the advice. I
computed the derivative using the definition and it came out to be
sqrt((n-1)/(nN)), where N = n-1 for unbiased and n otherwise.
It won't be negative, but I think it'll be higher if you take x - delta instead of x + delta.
Hmmm isn’t stddev symmetrical at origin?
…On Thu, Jul 12, 2018 at 02:10 Vishwak Srinivasan ***@***.***> wrote:
It won't be negative, but I think it'll be higher if you take x - delta
instead of x + delta.
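To make the direction question concrete, here is a small editorial check (not from the thread): the std value is the same whether the single coordinate is perturbed by +eps or -eps, so only the sign of the one-sided difference quotient flips.

```python
import math
import torch

N = 7
eps = 1e-9
x = torch.zeros(N, dtype=torch.double)
x_plus, x_minus = x.clone(), x.clone()
x_plus[0] = eps    # approach from x + delta
x_minus[0] = -eps  # approach from x - delta

print(float((torch.std(x_plus) - torch.std(x)) / eps))     # ~ +1/sqrt(N)
print(float((torch.std(x_minus) - torch.std(x)) / (-eps)))  # ~ -1/sqrt(N)
print(1 / math.sqrt(N))                                     # 0.3779...
```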
Force-pushed from ce9b578 to a7dca96.
@pytorchbot retest this please
facebook-github-bot left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Tensor std_backward(const Tensor & grad, const Tensor & self, Tensor result, bool unbiased) {
  Tensor result_zero_mask = (result == 0.);  // positions where std() came out exactly zero
  if (result_zero_mask.any().toCByte()) {    // .any().toCByte() reads the flag back to the CPU
@ssnl is the math good now?
Derivative is not the same if we take the limit from x + delta as from x - delta. Since the usage is mostly gradient-based optimization like GD, it makes sense to take one of these two extremes, but it's unclear to me which we should choose.
I thought we were waiting for @colesbury's call. If not, I am sorry, I will send in the changes as soon as I can.
No, I'm sorry; I wasn't sure what we were waiting on. If you like I can bug him about it tomorrow.
@ezyang Here are the timings you were curious about:
facebook-github-bot left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
Is this good to go?
I'm a bit wary about this change. It looks like it introduces a CUDA synchronization point in result_zero_mask.any().toCByte() (as ezyang pointed out). Synchronization points may not slow down an individual call, but they make it harder to hide kernel launch latency and CPU computation in a bigger system.
Are you suggesting that the …
I hate to say it, but maybe we should just write the fused kernel that does the masked fill unconditionally. Then we don't have to worry about the extra kernel launch overhead.
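As an illustration of "do the masked fill unconditionally", here is a minimal Python sketch (editorial; std_backward_sketch is a made-up name, and the real fix lives in the C++/ATen code above). It computes the formula everywhere and then overwrites the std == 0 positions without first checking result_zero_mask.any(), so no device-to-host read is needed. The fill value (inf) mirrors the earlier derivatives.yaml line and is a separate question from the synchronization one.

```python
import torch

def std_backward_sketch(grad, x, result, unbiased=True):
    """Hypothetical gradient of std(x) w.r.t. x, written to avoid a host sync.

    d std / d x_i = (x_i - mean) / (N * std), with N = n-1 (unbiased) or n.
    Where std == 0 this is 0/0, so those entries are overwritten unconditionally
    instead of branching on result_zero_mask.any().
    """
    n = x.numel()
    N = n - 1 if unbiased else n
    grad_input = grad * (x - x.mean()) / (N * result)
    return grad_input.masked_fill(result == 0, float("inf"))

x = torch.zeros(5, dtype=torch.double)
print(std_backward_sketch(torch.ones((), dtype=torch.double), x, torch.std(x)))
# tensor([inf, inf, inf, inf, inf], dtype=torch.float64)
```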
@vishwakftw @colesbury I'm not sure the derivatives you computed are correct. It seems to me that the formulas you have were derived by adding eps to only a single element: what you have is the one-sided derivative along a single coordinate, whereas what I think we want is the derivative when eps is added to each x_i in the definition of the derivative. That limit should give 0. A numerical check is below. I hope I understood the issue correctly.
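A quick editorial sketch of that numerical check: adding the same eps to every element leaves std unchanged (std is shift-invariant), so the difference quotient is exactly 0.

```python
import torch

N = 7
eps = 1e-9
x = torch.zeros(N, dtype=torch.double)
# Shift every element by the same eps: std does not change, so the quotient is 0.
print(float((torch.std(x + eps) - torch.std(x)) / eps))  # 0.0
```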
This is stale, closing for now. |
According to "Attention Is All You Need", the formula applied to each sublayer is supposed to be LayerNorm(x + Sublayer(x)). As mentioned in [issue#142](jadore801120#142), this implementation results from the consideration of a problem in PyTorch (see [pytorch issue#4320](pytorch/pytorch#4320)), which has been fixed in [Fix standard deviation gradient](pytorch/pytorch#9238).
Defined the subgradient of std when result = 0 to be inf. Fixes #4320.