Fix corner case with torch.multinomial #9960
Conversation
In the shortcut for n_sample=1, when category 0 has 0 weight, we should not map the (uniform) sample 0 to category 0. The conversion uniform->multinomial was apparently written to work on a (0,1] range (like curand uses), but PyTorch uses a [0,1) range. Fixes: pytorch#4858. Thank you, Roy Fejgin for reporting.
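The interval mismatch can be sketched with stdlib `bisect` as a model of the bin search (illustrative only, not the actual CUDA code):

```python
import bisect

# Cumulative weights for a distribution whose first category has zero weight:
cum = [0.0, 1.0]   # weights [0.0, 1.0]

# curand draws u in (0, 1], which matches bins of the form (prev, cur]:
# u == 0.0 can never occur there. PyTorch draws u in [0, 1), so u == 0.0
# is possible, and a (0, 1]-style search maps it to zero-weight category 0:
u = 0.0
old_style = bisect.bisect_left(cum, u)    # 0: the zero-weight category
new_style = bisect.bisect_right(cum, u)   # 1: half-open bins skip it
print(old_style, new_style)               # 0 1
```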
|
@t-vi: Thanks for the fix! I've tried it and can confirm that with the fix I am no longer able to reproduce the problem. This is using my own test case, which was previously able to trigger the problem reliably (regardless of the particular random seed). While I'm not able to fully review the fix from an algorithmic point of view, I do suggest adding a comment in the code explaining why we are substituting the uniform samples with (1.0 - uniform_samples), since -- to me -- it's rather non-obvious and depends on what curand does in a different part of the codebase and on implicit assumptions in the code below it.
|
Fair enough, I've changed the fix to substitute 0 for 1 (mirroring the move from curand -> PyTorch uniform) and added a comment. Quite likely, one might rewrite the function a bit to work with [0,1) when it is migrated to ATen native or so.
|
Thanks @t-vi. Do you think the second test in this line still makes sense after this fix? It seems like that second condition can now never be true. Also, if the code was originally written to work with samples in the interval (0,1], I wonder why it's testing for sample==0?
|
Ha. That was exactly the test I had missed. Thanks Roy!
Make the intervals for the binning half-open, [p, p_next). Also remove the check for sample==0 that would short-circuit the bin search and give a wrong sample. Thank you, Roy Fejgin, for the report and the discussion!
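A stdlib sketch of the half-open binning scheme described in the commit, with `bisect_right` standing in for the kernel's bin search:

```python
import bisect

def bin_index(u, cum):
    """Map u in [0, 1) to a bin via half-open intervals [cum[i-1], cum[i]).

    No special case for u == 0.0 is needed: the ordinary search already
    skips leading zero-weight bins.
    """
    return bisect.bisect_right(cum, u)

cum = [0.0, 0.25, 0.25, 1.0]    # weights [0, 0.25, 0, 0.75]
print(bin_index(0.0, cum))      # 1: leading zero-weight bin skipped
print(bin_index(0.25, cum))     # 3: boundary sample goes right ('ge', not 'gt')
print(bin_index(0.2, cum))      # 1
```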
|
hmm, unfortunately my script is still able to catch some zero-probability events being sampled. But now they are not at the 0th index anymore. Maybe we are peeling an onion here...
|
Thanks! You could just capture and save the RNG state to a temporary variable (torch.cuda.get_rng_state) and output it after detecting failure.
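The suggested pattern, sketched with the stdlib `random` module standing in for `torch.cuda.get_rng_state` / `torch.cuda.set_rng_state` (the failure condition here is a made-up placeholder):

```python
import random

def run_until_failure(max_iters=100_000):
    """Snapshot the RNG state before each trial so a failing draw can be
    replayed exactly. Stdlib `random` stands in for the torch.cuda
    get_rng_state/set_rng_state pair."""
    for _ in range(max_iters):
        state = random.getstate()        # capture *before* sampling
        u = random.random()
        if u < 1e-3:                     # placeholder for "bug detected"
            return state, u
    return None, None

random.seed(0)
state, u = run_until_failure()
if state is not None:
    random.setstate(state)               # rewind ...
    assert random.random() == u          # ... and replay the failing draw
```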
|
Here are the probabilities. Will post a script shortly.
|
Attaching RNG state.
|
Here is my script. I have not had a chance to create an absolutely minimal test case, but if you run this you should see the problem when it gets to iteration 61678, which takes about 52 seconds on my GPU (1080 Ti).
|
Attaching a new version of the test case and output. This version has slightly clearer printouts.
|
The strange thing is that now, the zero-probability bins getting selected are not bin zero, nor bins near any bin that has non-zero probability. For example, the PDF can look like this:
|
I added some tracing in the CUDA code. In the case I caught, the uniform sample equals 0.99999994. Then many buckets (to the right of the last non-zero-probability bin) ended up meeting the condition. All of them had curBucket == 1.0 and prevBucket == 0.99999994. Then many threads end up writing out the result in this line, and I think that maybe the last of the threads to make that update wins. Now I guess the question is why curBucket does not equal prevBucket even though all the nearby bins have zero probability.
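The described situation can be modeled in a few lines of Python (the cumdist values below are hypothetical, assuming the kernel's per-thread condition is `prevBucket <= sample < curBucket`):

```python
def matching_buckets(sample, cumdist):
    """Return every bucket i whose interval [cumdist[i-1], cumdist[i])
    contains the sample -- in the kernel, roughly one thread per bucket."""
    hits = []
    for i in range(1, len(cumdist)):
        prev_bucket, cur_bucket = cumdist[i - 1], cumdist[i]
        if prev_bucket <= sample < cur_bucket:
            hits.append(i)
    return hits

# Hypothetical rounded prefix sums over weights [0.5, 0.5, 0, 0, 0]:
# with differing summation orders per element, the zero-weight tail need
# not be constant, so adjacent boundaries disagree by one float32 ulp.
cumdist = [0.5, 0.99999994, 1.0, 0.99999994, 1.0]
print(matching_buckets(0.99999994, cumdist))  # [2, 4]: zero-weight buckets match
```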
|
Oops, I just realized you must have wanted the RNG state before the bug occurs so that you could regenerate the uniform samples that trigger the problem. What I posted was the RNG state after the bug occurred; sorry, I misunderstood. I'm not at my desk now to re-capture that data, but hopefully the test case I posted and the uniform sample value are enough to analyze / reproduce this.
|
I hope to look into it later today.
|
|
So I took the liberty of condensing your example to the following: (and it really fails with e5c5ae3; I had just realized I had another branch when I first posted this).
|
So interestingly, when I use the identical kernel with a caller that calls torch.rand and then the kernel from a PyTorch extension, the error does not seem to occur. I'll try to extract the random part from the actual call.
|
It's starting to look like a floating point accuracy issue that the code isn't resilient to. I'm seeing cases where curBucket should be equal to prevBucket (because the bucket itself has zero probability), but they are slightly different, with curBucket being 1.0 and prevBucket being a little less than 1.0. If the sample falls between those values, the bucket gets selected.
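The one-ulp gap behind this is easy to verify with a float32 round-trip through the stdlib `struct` module:

```python
import struct

def f32(x):
    """Round a Python float to the nearest float32."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 0.99999994 is the largest float32 strictly below 1.0 (i.e. 1 - 2**-24),
# a single rounding step away from 1.0:
below_one = f32(0.99999994)
assert below_one != 1.0
assert f32(below_one + 2**-25) == 1.0
```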
|
@t-vi: interesting finding about the memory allocation, btw. Yeah, that product does look strange. Though it was probably allocating too much memory, which is better than the opposite... :)
|
Here's a log showing an example where bin 1054 was selected. All bins from 1028 onwards had zero probability. Hmm, at least with this example I think the problem would not have happened if the condition was 'gt' as it originally was. Let's think about whether that check fundamentally makes more sense or just happens to work better in this case.
|
Maybe if we want to ensure zero-probability events are not selected, in a way that is robust to floating point accuracy problems, the code should explicitly check whether the distribution value for the selected bin is zero.
|
So I think the root cause is the inclusive prefix sum (mentioned in the comment). Here, the summation order is totally different for adjacent elements, which might lead to the observed non-monotonicity and other effects.
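A small float32 sketch of the effect: the same values summed in two different orders (serial vs. pairwise, as a parallel prefix sum might combine them) round to different results.

```python
import struct
from functools import reduce

def f32(x):
    """Round a Python float to the nearest float32."""
    return struct.unpack('f', struct.pack('f', x))[0]

def f32_add(a, b):
    return f32(a + b)

vals = [1.0, 2**-24, 2**-24, 2**-24]

# Left-to-right, as a serial cumsum would accumulate:
seq = reduce(f32_add, vals)        # every tiny addend rounds away

# Pairwise tree order, as a parallel scan would combine:
pair = f32_add(f32_add(vals[0], vals[1]), f32_add(vals[2], vals[3]))

print(seq == pair)   # False: the two orders disagree by one ulp
```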
|
|
So to me it looks like this test case captures the essence of @rfejgin's latest example and works as expected with the latest commit. If @rfejgin is reasonably happy with this, I'll add the above as a test case and remove the 'not ready yet' label.
|
Thanks Thomas. I agree with you on the root cause. I also like the idea of explicitly checking if the distribution value equals zero, and was having similar thoughts after I posted. The explicit check avoids the need to tune a threshold (or one per type of T), which I've actually tried and found fragile. So your fix looks good to me! One minor comment is that I would consider putting the memory allocation fix in a different commit. In the unlikely case that that change causes instability, it would be good to be able to separate the two changes.
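A sketch of the guard being discussed (illustrative Python, not the actual kernel; the cumdist values are hypothetical):

```python
def select_bucket(u, cumdist, weights):
    """Pick bucket i with cumdist[i-1] <= u < cumdist[i], but only if
    weights[i] > 0: the explicit zero-probability check discussed above."""
    for i in range(len(weights)):
        prev = cumdist[i - 1] if i > 0 else 0.0
        if prev <= u < cumdist[i] and weights[i] > 0:
            return i
    return None   # no zero-weight bucket ever claims the sample

weights = [0.5, 0.5, 0.0, 0.0, 0.0]

# Non-monotonic rounded prefix sums (as in the observed bug) no longer
# select a zero-weight bucket:
broken = [0.5, 0.99999994, 1.0, 1.0, 1.0]
print(select_bucket(0.99999994, broken, weights))   # None, not bucket 2

# A well-formed cumdist selects the last non-zero-weight bucket:
good = [0.5, 1.0, 1.0, 1.0, 1.0]
print(select_bucket(0.99999994, good, weights))     # 1
```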
|
OK. My impression is that PRs are heavy-handed enough that tacking the shared memory size fix onto this one is reasonable, but I'll gladly remove it if the reviewer prefers.
|
@t-vi: it's really up to you - this is my first pytorch PR review, so I'm not super familiar with the conventions - just a suggestion. On another topic: now that we're explicitly handling the zero-probability case, do you think that the change in the condition from gt to ge is still necessary? https://github.com/pytorch/pytorch/pull/9960/files#diff-a5d44fe2c50c7cc4616e927154fb7e30R259
|
You could argue that 0 is exceedingly rare, but I think it is good practice to cover the half-open interval [0,1) with equally half-open bins [a,b), which is what the ge achieves (and the gt seems to stem from the time when the uniform random number was in fact in (0,1]).
|
Ah, I see the logic now. Makes sense.
|
I just wanted to mention that my availability will be very limited in the next two weeks. But it looks like we're pretty close to resolving this -- nice :).
|
I think it's ready for review and possibly merging.
|
facebook-github-bot
left a comment
ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: In the shortcut for n_sample=1, when category 0 has 0 weight, we should not map the (uniform) sample 0 to category 0. The conversion uniform->multinomial was apparently written to work on a (0,1] range (like curand uses), but PyTorch uses a [0,1) range. Fixes: #4858. Thank you, Roy Fejgin for reporting. Pull Request resolved: pytorch/pytorch#9960 Reviewed By: soumith Differential Revision: D9341793 Pulled By: ailzhang fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf