fix max/min on cuda in presence of NaN (fixes #6996) #7052
Conversation
Diff excerpt under review:

```cpp
inline __device__ T operator()(T a, T b) const {
  return THCNumerics<T>::lt(a, b) ? a : b;
  // a != a means a == NaN
  return (THCNumerics<T>::lt(a, b) ||
```
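To make the intent of that change concrete (the diff is cut off above, so the exact condition is my reading of the comment): `a != a` is true only for NaN, so OR-ing that check into the comparison lets a NaN win the reduction. A minimal Python sketch of the same idea, not the actual kernel code:

```python
def nan_aware_min(a, b):
    # Pick a when a < b, or when a is NaN (a != a is only true for NaN).
    # If b is NaN instead, a < b is false and b (the NaN) is returned,
    # so a NaN anywhere propagates through the reduction.
    return a if (a < b or a != a) else b

print(nan_aware_min(2.0, 5.0))           # 2.0
print(nan_aware_min(float('nan'), 5.0))  # nan
print(nan_aware_min(5.0, float('nan')))  # nan
```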
Diff excerpt under review (test code):

```python
actual = f(a.cuda()).cpu()
expected = f(a).cpu()
self.assertEqual(torch.isnan(actual), torch.isnan(expected), 'nans for {}'.format(name))
self.assertEqual(actual[~torch.isnan(actual)],
```
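For anyone who wants to reproduce the check outside the test suite, here is a standalone version of the same pattern; the tensor, shapes, and reductions are made up for illustration (requires a CUDA device), and this is not the PR's exact test code:

```python
import torch

a = torch.randn(10, 10)
a[0, 0] = float('nan')  # plant a NaN in the input

for name, f in [('min', lambda x: x.min(0)[0]), ('max', lambda x: x.max(0)[0])]:
    actual = f(a.cuda()).cpu()
    expected = f(a)
    # NaN locations must agree between the CUDA and CPU results ...
    assert torch.isnan(actual).equal(torch.isnan(expected)), name
    # ... and the remaining (non-NaN) values must match.
    assert torch.equal(actual[~torch.isnan(actual)],
                       expected[~torch.isnan(expected)]), name
```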
Will do. Thank you ngimel!
Thank you ngimel and zou3519!
Looks correct. @t-vi would you mind running some quick perf numbers just to characterize what the effect of the extra neq test is?
My conclusion would be: measurable in isolation (on the order of a 10%-20% slowdown), negligible in any other context. I ran this several times (to avoid initialization overhead, since the first run takes longer than the others). I lifted this from somewhere in the pytorch/benchmark repository - if you have a better methodology, I'd be happy to apply it.
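For reference, a minimal timing sketch along the lines of what could be used here; this is not the script from pytorch/benchmark, and the tensor size and iteration count are arbitrary:

```python
import time
import torch

def bench(fn, x, iters=100):
    fn(x)                      # warm-up run so CUDA init / first-launch cost is excluded
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()   # wait for the kernels before stopping the clock
    return (time.time() - start) / iters

x = torch.randn(1 << 22, device='cuda')
for name, fn in [('min', torch.min), ('max', torch.max)]:
    print(name, bench(fn, x))
```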
apaszke left a comment:
Just wanted to clarify our strategy here. Do we want to match the CPU operators? If so, that's not the right way to go, because the way they are implemented is slightly different. Actually, I think the CPU ops mix both the condition I linked and the one you put here. Can you please make them all consistent?
The benchmarks look like noise to me, so that's ok.
To me, the main difference seems to be the break out of the loop - this is needed in the CPU implementation because otherwise the NaN sitting in the theMax variable would cause the next comparison to favour value again.
Stopping the calculation early might be a bit more efficient, but only when there are NaNs, so not really a significant win.
Also, we wanted a CUDA implementation that doesn't need comments to explain what's going on. :)
That said, I might be missing something...
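To spell out the point about the break, here is a rough Python rendering of how I read the CPU max loop; the real code is TH C macros, and `the_max`, `value`, and the `!(value <= theMax)`-style comparison are my paraphrase rather than the literal source:

```python
import math

def cpu_style_max(values):
    the_max = values[0]
    for value in values[1:]:
        # 'not (value <= the_max)' is true when value is larger OR when either
        # operand is NaN, so a NaN input gets written into the_max ...
        if not (value <= the_max):
            the_max = value
            # ... and without this break, the next iteration's comparison against
            # NaN would again be true and overwrite the NaN with an ordinary value.
            if math.isnan(value):
                break
    return the_max

print(cpu_style_max([1.0, float('nan'), 3.0]))  # nan
```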
But what's the actual fix in this PR then? If we want to have kernels return NaNs when there are NaNs, then why don't we treat CPU kernels that way?
Ok, never mind my comment. I did the math again and it seems to be ok.
(Sorry, the second mail doesn't seem to have reached the issue log... :( )
…7052) Thank you ngimel and zou3519!
This adds tests for NaN to the min and max kernels, which should now return NaNs in the right places.