
Conversation

@goldsborough
Contributor

@goldsborough goldsborough commented Dec 22, 2017

PyTorch is known to have high start-up times, in large part due to slow tensor initialization/filling when calling normal_. Torch's normal function uses the Box-Muller transform to produce Gaussian-distributed floats from uniform floats. In some benchmarks I found that generating normal samples took around 5 times longer than generating only uniform floats, suggesting that the current normal sampling code is the bottleneck.

This PR addresses this by introducing a vectorized version of the Box-Muller transform that does essentially the same thing, but for 8 values at a time. The vectorized version is called only for floats, only if AVX2 is available, only if there are more than 16 values (due to the implementation), and only if the tensor is contiguous. Still, this should cover 90%-95% of the real-world cases where normal_ is currently slow.
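For reference, the Box-Muller transform maps each pair of uniform samples u1, u2 in (0, 1] to two independent standard-normal samples via radius = sqrt(-2 ln u1) and angle = 2*pi*u2. Below is a minimal scalar sketch in C of what such a fill looks like; it is illustrative only, and the function name and the rand()-based uniform source are placeholders, not the actual TH code.

#include <math.h>
#include <stdlib.h>

/* Illustrative scalar Box-Muller fill (not the actual TH implementation):
 * fills `data` with `n` samples from N(mean, stddev^2), two at a time. */
static void normal_fill_scalar(float *data, long n, float mean, float stddev) {
  const float two_pi = 6.28318530718f;
  for (long i = 0; i < n; i += 2) {
    /* Uniforms in (0, 1]; a real implementation would draw from the
     * generator's own uniform source, not rand(). */
    float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    float radius = sqrtf(-2.0f * logf(u1));
    float theta = two_pi * u2;
    data[i] = radius * cosf(theta) * stddev + mean;
    if (i + 1 < n) {
      data[i + 1] = radius * sinf(theta) * stddev + mean;
    }
  }
}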

My initial, small-scale benchmarks show a 5x-6x speed-up:

Before:

In [1]: import torch
In [2]: x = torch.Tensor(10000, 10000)
In [3]: %time _ = x.normal_(0, 1)
CPU times: user 3.45 s, sys: 111 ms, total: 3.57 s
Wall time: 3.57 s

After:

In [1]: import torch
In [2]: x = torch.Tensor(10000, 10000)
In [3]: %time _ = x.normal_(0, 1)
CPU times: user 611 ms, sys: 1.07 ms, total: 612 ms
Wall time: 613 ms

This looks pretty good. I will also see how this affects loading an ImageNet model or similar.

CC @zdevito

Contributor

@apaszke apaszke left a comment

This looks good, I only have two minor comments.

My only concern is that this doesn't just vectorize the function; it also makes it return different results on AVX2 and non-AVX2 platforms for the same random seed. I'm not sure how much we care about cross-platform reproducibility, since that's very hard and constraining, but it's worth noting. normal is quite important because it's used for initializing weights. @soumith thoughts?


@apaszke
Contributor

apaszke commented Dec 22, 2017

Oh, it would also be nice to evaluate some real examples (e.g. the word language model from our repo), just to make sure we won't get a severe downclocking penalty because of AVX2.

@goldsborough
Contributor Author

goldsborough commented Dec 22, 2017

Thanks for the comments. I'm about to board a flight, so I will address them in a bit. Just a note: the non-AVX2 code can be rewritten to lay out the numbers in the same order as the AVX2 version (8 at a time, interleaved). Something like https://gist.github.com/goldsborough/75ee1802110eda71517cc33ea3c59a88. Then the two would produce the same results (and in my benchmarks this "unrolled" version is actually quite a bit faster than a simple loop); see the sketch below.
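For illustration, here is a sketch of what such an interleaved scalar fill could look like, assuming each 8-wide AVX2 iteration writes the 8 cosine halves followed by the 8 sine halves (16 outputs per iteration). The names and the uniform() callback are placeholders, not the gist or the actual TH code.

#include <math.h>

/* Sketch of a scalar fill that mimics an 8-wide layout: for each block of 16
 * outputs, the cosine halves of 8 Box-Muller pairs go to slots 0-7 and the
 * sine halves to slots 8-15. Assumes n is a multiple of 16 and uniform()
 * returns values in (0, 1]. */
static void normal_fill_interleaved(float *data, long n, float (*uniform)(void),
                                    float mean, float stddev) {
  const float two_pi = 6.28318530718f;
  for (long i = 0; i < n; i += 16) {
    float u1[8], u2[8];
    for (int j = 0; j < 8; ++j) {
      u1[j] = uniform();
      u2[j] = uniform();
    }
    for (int j = 0; j < 8; ++j) {
      float radius = sqrtf(-2.0f * logf(u1[j]));
      float theta = two_pi * u2[j];
      data[i + j] = radius * cosf(theta) * stddev + mean;
      data[i + 8 + j] = radius * sinf(theta) * stddev + mean;
    }
  }
}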

Contributor

@zdevito zdevito left a comment

This looks great!

Changing the serial version to mimic the AVX one seems like a good idea, since it improves the performance of the serial version and also makes its output identical to the AVX one.


Contributor

@soumith soumith left a comment

Looks pretty good; needs some runtime dispatch changes (see inline comments).


@goldsborough
Contributor Author

goldsborough commented Jan 1, 2018

Changes

The latest two commits make the following changes:

  • Actually use the mean and standard deviation (the variables were not being used before, so only unit Gaussian samples were generated).
  • Added a scalar normal_fill function that interleaves values just like the vectorized code, so there is no difference in generated samples between AVX and non-AVX platforms. This function is also around 1.5x faster than the old version, so non-AVX platforms get a speedup too (for contiguous tensors with at least 16 values).
  • Implemented all the vector dispatch logic.
  • Using int64_t for the size.
  • Using THAssert instead of assert.
  • The AVX code uses _mm256_loadu_ps instead of _mm256_load_ps, i.e. unaligned loads, so that it also works for misaligned data.
  • Using an explicit FMA _mm256_fmadd_ps instruction instead of multiply + add (clang doesn't do this automatically, GCC does; it's less code anyway, and FMA is available wherever AVX2 is on Intel and AMD chips, so it should be fine). Marginally faster. See the sketch after this list.
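As a rough sketch of the last two points (unaligned load plus a fused multiply-add), assuming 8 standard-normal floats have already been computed into z: this is illustrative only, not the actual AVX2.c code, and needs to be compiled with AVX2/FMA enabled (e.g. -mavx2 -mfma).

#include <immintrin.h>

/* Scale and shift 8 standard-normal floats at once: data[i] = z[i] * stddev + mean.
 * _mm256_loadu_ps/_mm256_storeu_ps tolerate unaligned pointers, and
 * _mm256_fmadd_ps fuses the multiply and add into one instruction. */
static void scale_shift_8(float *data, const float *z, float mean, float stddev) {
  __m256 zv = _mm256_loadu_ps(z);
  __m256 scaled = _mm256_fmadd_ps(zv, _mm256_set1_ps(stddev), _mm256_set1_ps(mean));
  _mm256_storeu_ps(data, scaled);
}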

Benchmarks

This time compiling with GCC 7.

Microbenchmarks

10,000 x 10,000 (float/AVX): 3.3s -> 0.48s (6.875x speedup)
1,000 x 1,000 (float/AVX): 35ms -> 4.9ms (7.1x speedup)
10,000 x 10,000 (double/scalar): 3.2s -> 2.1s (1.5x speedup)

float/AVX here means the vectorized version is used; double/scalar means the new interleaved scalar function is used, since the vectorized version is only called for floats.

Imagenet Startup Times

VGG19: 5.61s -> 1.62s (3.5x speedup)
ResNet101: 2.4s -> 0.65s (3.7x speedup)
ResNet50: 1.21s -> 0.46s (2.6x speedup)

(Please re-review the code @colesbury @zdevito @soumith )

Happy new year 🎆 🎉

@soumith
Contributor

soumith commented Jan 1, 2018

@pytorchbot add to whitelist

Contributor

@apaszke apaszke left a comment

LGTM! I think there's one small bug though


@apaszke
Contributor

apaszke commented Jan 2, 2018

This should be ready to merge, but the builds are failing now (probably because the random seed we used previously has become unlucky after this change).

@goldsborough
Contributor Author

goldsborough commented Jan 2, 2018

Ok, how do we resolve the build failure? It says something about not being able to "get pull request builder trigger".

@apaszke
Contributor

apaszke commented Jan 2, 2018

Oh, this one looks like a CI failure, but the CUDA jobs do manage to build and then fail at test time.

@zdevito
Contributor

zdevito commented Jan 2, 2018

@pytorchbot retest this please

@goldsborough goldsborough force-pushed the master branch 4 times, most recently from 8a719ff to b983dcd on January 3, 2018 08:42
@goldsborough
Contributor Author

No luck with random seeds for the cuDNN builds. I will need a GPU machine to figure out what's wrong locally. Or is there a smarter way of solving these random failures than finding a lucky seed?

@apaszke
Contributor

apaszke commented Jan 3, 2018

Alright, I looked into it and it seems that the test that's failing now is particularly flaky when using half (63 failures / 1000 trials). Reducing the scale of the values used to test it makes the absolute errors smaller, and with that change it succeeded 10,000 times. Here's the patch (just add / 2 in two places):

--- a/test/test_nn.py                                                                    
+++ b/test/test_nn.py                                                                    
@@ -2132,9 +2132,9 @@ class TestNN(NNTestCase):                                          
                 continue                                                                
             for depth_multiplier in [1, 2]:                                             
                 m = nn.Conv2d(2, 2 * depth_multiplier, kernel_size=3, groups=2).type(tp)
-                i = Variable(torch.randn(2, 2, 6, 6).type(tp), requires_grad=True)      
+                i = Variable(torch.randn(2, 2, 6, 6).type(tp) / 2, requires_grad=True)  
                 output = m(i)                                                           
-                grad_output = torch.randn(2, 2 * depth_multiplier, 4, 4).type(tp)       
+                grad_output = torch.randn(2, 2 * depth_multiplier, 4, 4).type(tp) / 2   
                 output.backward(grad_output)                                            
                                                                                         
                 offset = 1 * depth_multiplier                                                                                      

@goldsborough
Contributor Author

That worked, thanks Adam! Now one of the builds got stuck after a segfault in the dataloader tests. Seems flaky too.

@goldsborough
Contributor Author

😍

@apaszke
Contributor

apaszke commented Jan 3, 2018

There's already another PR open that fixes the data loader issue. The test had a race condition that only started to appear recently.

@apaszke apaszke merged commit 77c792e into pytorch:master Jan 3, 2018
@apaszke
Contributor

apaszke commented Jan 3, 2018

Thanks Peter!

yf225 pushed a commit to yf225/pytorch that referenced this pull request Jan 4, 2018
@yf225 yf225 mentioned this pull request Jan 4, 2018
ezyang pushed a commit that referenced this pull request Jan 4, 2018
@aluo-x

aluo-x commented Jan 25, 2018

Not sure if this warrants a new bug report, but today, while trying to build PyTorch on Windows, I ran into the following error:

"C:\optimae\pytorch\torch\lib\build\ATen\INSTALL.vcxproj" (default target) (1) ->
"C:\optimae\pytorch\torch\lib\build\ATen\ALL_BUILD.vcxproj" (default target) (3) ->
"C:\optimae\pytorch\torch\lib\build\ATen\src\ATen\ATen.vcxproj" (default target) (4) ->
(ClCompile target) ->
  C:\optimae\pytorch\aten\src\TH\vector\AVX2.c(60): error C2440: 'function': cannot convert from
'int' to '__m256' [C:\optimae\pytorch\torch\lib\build\ATen\src\ATen\ATen.vcxproj]

    8925 Warning(s)
    1 Error(s)

@soumith soumith added 0.3.1 and removed 0.3.1 labels Feb 4, 2018