Vectorize normal_ #4312
Conversation
apaszke left a comment:
This looks good, I only have two minor comments.
My only concern is that it doesn't just vectorize the function; it will also make it return different results on AVX2 and non-AVX2 platforms for the same random seed. I'm not sure how much we care about cross-platform reproducibility, since that's very hard and constraining, but it's worth noting. normal is quite important because it's used for initializing weights. @soumith thoughts?
Oh, also it would be nice to evaluate some of the real examples (e.g. the word language model from our repo), just to make sure we won't get a severe downclocking penalty because of AVX2.
Thanks for the comments. About to board a flight, so will address them in a bit. Just a note: the non-AVX2 code can be rewritten to lay out the numbers in the same order as the AVX2 version (8 at a time, interleaved). Something like https://gist.github.com/goldsborough/75ee1802110eda71517cc33ea3c59a88. Then they would be the same (on my benchmarks this "unrolled" version is actually quite a bit faster than a simple loop).
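For context, a rough sketch of that interleaved scalar layout in plain C (not the gist's exact code; uniform_sample and the 16-at-a-time grouping are illustrative assumptions): the 8 cosine-branch outputs are written first and the 8 sine-branch outputs second, matching what an 8-wide AVX2 pass would produce.

```c
#include <math.h>

/* Illustrative sketch only (not the PR's code): one scalar Box-Muller step
 * that fills 16 outputs per call, laid out the way an 8-wide AVX2 version
 * would produce them: out[0..7] from the cosine branch, out[8..15] from the
 * sine branch. uniform_sample is a hypothetical stand-in for the real RNG. */
static void normal_fill16_interleaved(float out[16], float mean, float stddev,
                                      float (*uniform_sample)(void *state),
                                      void *state) {
  const float two_pi = 6.283185307179586f;
  float u1[8], u2[8];
  for (int i = 0; i < 8; i++) {
    float u = uniform_sample(state);
    u1[i] = (u < 1e-12f) ? 1e-12f : u; /* keep log() finite */
    u2[i] = uniform_sample(state);
  }
  for (int i = 0; i < 8; i++) {
    float radius = sqrtf(-2.0f * logf(u1[i]));
    float theta  = two_pi * u2[i];
    out[i]     = radius * cosf(theta) * stddev + mean; /* first 8 outputs  */
    out[i + 8] = radius * sinf(theta) * stddev + mean; /* second 8 outputs */
  }
}
```

The intent, as described above, is that both paths then consume the uniform stream in the same order and produce the same numbers for a given seed.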
zdevito left a comment:
This looks great!
Changing the serial version to mimic the AVX one seems like a good idea, since it increases the performance of the serial version and also makes it produce the same results as the AVX one.
soumith left a comment:
Looks pretty good; needs some runtime dispatch changes (see comments inline).
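(The inline review comments are hidden above, but the general shape of such a runtime dispatch is roughly the following. This is a generic illustration, not the actual TH dispatch mechanism; normal_fill_avx2 and normal_fill_scalar are hypothetical names.)

```c
#include <stddef.h>

/* Hypothetical kernel entry points, for illustration only. */
void normal_fill_avx2(float *out, size_t n, float mean, float stddev);
void normal_fill_scalar(float *out, size_t n, float mean, float stddev);

/* Pick the AVX2 kernel only when the running CPU actually reports AVX2
 * support (building with -mavx2 alone is not enough on machines without it),
 * and only for buffers large enough for the 16-at-a-time scheme. */
void normal_fill_dispatch(float *out, size_t n, float mean, float stddev) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  if (n >= 16 && __builtin_cpu_supports("avx2")) {
    normal_fill_avx2(out, n, mean, stddev);
    return;
  }
#endif
  normal_fill_scalar(out, n, mean, stddev);
}
```

The real code would also key the choice off element type and tensor contiguity, per the PR description.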
Changes

Latest 2 commits make the following changes:

Benchmarks

This time compiling with GCC 7.

Microbenchmarks: 10,000 x 10,000 (float/AVX): 3.3s -> 0.48s (6.875x speedup). float/AVX here means it's using the vectorized version; double/scalar means it's the interleaved scalar function that I added, since the vectorized version is only called for floats.

Imagenet startup times: VGG19: 5.61s -> 1.62s (3.5x speedup)

(Please re-review the code @colesbury @zdevito @soumith)

Happy new year 🎆 🎉
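For reference, a microbenchmark of this shape can be timed with a harness roughly like the following (my own self-contained illustration, not the script behind the numbers above; a naive rand()-based Box-Muller fill stands in for the real routine):

```c
#define _POSIX_C_SOURCE 199309L
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fill buf with standard-normal floats via a plain scalar Box-Muller loop;
 * rand() stands in for the real RNG, and n is assumed to be even. */
static void fill_normal(float *buf, size_t n) {
  for (size_t i = 0; i + 1 < n; i += 2) {
    float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f); /* u1 > 0 */
    float u2 = (float)rand() / (float)RAND_MAX;
    float r  = sqrtf(-2.0f * logf(u1));
    float t  = 6.283185307179586f * u2;
    buf[i]     = r * cosf(t);
    buf[i + 1] = r * sinf(t);
  }
}

int main(void) {
  const size_t n = 10000UL * 10000UL; /* 10,000 x 10,000 floats */
  float *buf = malloc(n * sizeof(float));
  if (!buf) return 1;

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  fill_normal(buf, n);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double s = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("filled %zu floats in %.3f s\n", n, s);
  free(buf);
  return 0;
}
```

Compiling with something like gcc -O2 bench.c -lm and averaging a few runs is usually enough to see differences of this magnitude.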
@pytorchbot add to whitelist
apaszke left a comment:
LGTM! I think there's one small bug though
Should be ready to merge, but the builds are failing now (probably because the seed we used previously is unlucky after this change).
Ok, how do we resolve the build failure? It says something about not being able to "get pull request builder trigger".
Oh, this one looks like a CI failure, but the CUDA jobs manage to build and then fail at test time.
@pytorchbot retest this please
Force-pushed from 8a719ff to b983dcd
No luck with random seeds for the cuDNN builds. I will need a GPU machine to figure out what's wrong locally. Or is there any smarter way of solving these random failures other than finding a lucky seed?
Alright, I looked into it and it seems that the test that's failing now is particularly flaky when using half (63 failures / 1000 trials). Reducing the scale of the values used to test it makes the absolute errors smaller, and it succeeded 10000 times now. Here's the patch (just add the / 2):

--- a/test/test_nn.py
+++ b/test/test_nn.py
@@ -2132,9 +2132,9 @@ class TestNN(NNTestCase):
                 continue
             for depth_multiplier in [1, 2]:
                 m = nn.Conv2d(2, 2 * depth_multiplier, kernel_size=3, groups=2).type(tp)
-                i = Variable(torch.randn(2, 2, 6, 6).type(tp), requires_grad=True)
+                i = Variable(torch.randn(2, 2, 6, 6).type(tp) / 2, requires_grad=True)
                 output = m(i)
-                grad_output = torch.randn(2, 2 * depth_multiplier, 4, 4).type(tp)
+                grad_output = torch.randn(2, 2 * depth_multiplier, 4, 4).type(tp) / 2
                 output.backward(grad_output)
                 offset = 1 * depth_multiplier
That worked, thanks Adam! Now one of the builds got stuck after a segfault in the dataloader tests. Seems flaky too.
😍
There's already another PR open that fixes the data loader thing. The test had a race condition that has only started to appear recently.
Thanks Peter!
Not sure if this warrants a new bug report, but today while trying to build PyTorch on Windows, I ran into the following error:
PyTorch is known to have high start-up times, in large part due to slow tensor initialization/filling when calling normal_. Torch's normal function uses the Box-Muller transform to produce Gaussian-distributed floats from uniform floats. I did some benchmarks and found that generating normal numbers took around 5 times longer than generating only uniform floats, suggesting that the current normal sampling code was the bottleneck.

This PR addresses that by introducing a vectorized version of the Box-Muller transform that essentially does the same thing, but for 8 values at a time. This version is called only for floats, only if we have AVX2, only if there are more than 16 values (due to the implementation) and only if the tensor is contiguous. However, this should cover something like 90%-95% of the real-world cases where it's currently slow.
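To make the 8-at-a-time idea concrete, here is a rough sketch of one vectorized Box-Muller step (my own illustration, not the PR's actual AVX2.c code; it assumes helpers like log256_ps and sincos256_ps from an avx_mathfun.h-style header):

```c
#include <immintrin.h>
#include "avx_mathfun.h" /* assumed to provide log256_ps and sincos256_ps */

/* Sketch of one 8-wide Box-Muller step: 16 uniforms in (0, 1] become 16
 * normal floats. u1/u2 each hold 8 uniforms; out receives 16 floats, with
 * the cosine-branch results in out[0..7] and the sine-branch in out[8..15]. */
static void box_muller_step_avx2(const float *u1, const float *u2,
                                 float mean, float stddev, float *out) {
  const __m256 two_pi    = _mm256_set1_ps(6.283185307179586f);
  const __m256 minus_two = _mm256_set1_ps(-2.0f);
  const __m256 mu        = _mm256_set1_ps(mean);
  const __m256 sigma     = _mm256_set1_ps(stddev);

  __m256 a = _mm256_loadu_ps(u1);
  __m256 b = _mm256_loadu_ps(u2);

  /* radius = sqrt(-2 * log(u1)), theta = 2*pi*u2 */
  __m256 radius = _mm256_sqrt_ps(_mm256_mul_ps(minus_two, log256_ps(a)));
  __m256 theta  = _mm256_mul_ps(two_pi, b);

  __m256 sin_t, cos_t;
  sincos256_ps(theta, &sin_t, &cos_t);

  _mm256_storeu_ps(out,
      _mm256_add_ps(_mm256_mul_ps(_mm256_mul_ps(radius, cos_t), sigma), mu));
  _mm256_storeu_ps(out + 8,
      _mm256_add_ps(_mm256_mul_ps(_mm256_mul_ps(radius, sin_t), sigma), mu));
}
```

Each step consumes 16 uniforms and produces 16 normals, which is why the fast path only kicks in for tensors with more than 16 elements.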
My initial, small-scale benchmarks show a 5x-6x speed-up:
Before:
After:
Which looks pretty good. I will see how this affects loading a model like imagenet or similar.
CC @zdevito