Vectorize sigmoid #8612

cpuhrsch · 2018-06-18T19:03:55Z

This PR ports sigmoid to the vectorized code path, which also improves performance for non-contiguous tensors. Detailed timings will follow shortly.

cpuhrsch · 2018-06-19T22:22:47Z

Timings were collected using this benchmark.

The command run was:

OMP_NUM_THREADS=1 numactl --membind=0 --cpubind=0 taskset -c 0 python run.py --include CPUUnaryBench --benchmark-min-time 2 --benchmark-warmup-repetitions 2 --benchmark-repetitions 3

This is a single-core benchmark.

framework                                Torch   Torch Master     Ratio  Better
cont  trans function     dim mag
False False ('sigmoid',) 3   1        6.930478       6.567699  0.947655   False
                             3       13.235672      29.237036  2.208957    True
                             6     8351.410715   30095.457062  3.603638    True
                             7    85592.009994  308379.842767  3.602905    True
      True  ('sigmoid',) 3   1        7.514903       6.595705  0.877683   False
                             3       18.093501      29.376038  1.623568    True
                             6     8895.611246   30055.028340  3.378636    True
                             7    91487.099854  307583.449497  3.362042    True
True  False ('sigmoid',) 3   1        6.942165       6.119649  0.881519   False
                             3       12.475567       7.145404  0.572752   False
                             6     1647.585599    1244.441353  0.755312   False
                             7    16909.896146   13191.346432  0.780096   False
      True  ('sigmoid',) 3   1        7.536675       6.587368  0.874042   False
                             3       17.715659      28.759583  1.623399    True
                             6     8135.648723   29658.940342  3.645553    True
                             7    79345.416158  303789.017412  3.828690    True

There are significant gains for the non-contiguous cases (ratios of roughly 1.6 to 3.8) if the tensor is larger than 10 elements. However, there is a regression for the regular contiguous case (ratios of roughly 0.57 to 0.78 at the larger sizes). This needs to be resolved before the PR can be merged.
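The Ratio and Better columns can be reproduced from the two timing columns. The following is a minimal sketch, assuming Ratio is the Torch Master value divided by the Torch value and Better means the ratio exceeds 1. Both assumptions are inferred from the printed numbers, not from the benchmark source, and the helper name `speedup` is hypothetical.

```python
# A few rows copied from the table above:
# (cont, trans, mag, torch_time, master_time).
ROWS = [
    (False, False, 1, 6.930478, 6.567699),
    (False, False, 3, 13.235672, 29.237036),
    (True, False, 3, 12.475567, 7.145404),
    (True, False, 7, 16909.896146, 13191.346432),
]


def speedup(torch_time, master_time):
    """Return (ratio, better).

    Assumption: ratio = master / torch, and better = ratio > 1.
    """
    ratio = master_time / torch_time
    return ratio, ratio > 1.0


for cont, trans, mag, torch_time, master_time in ROWS:
    ratio, better = speedup(torch_time, master_time)
    print(f"cont={cont!s:5} trans={trans!s:5} mag={mag} "
          f"ratio={ratio:.6f} better={better}")
```

Under these assumptions the non-contiguous row at mag 3 comes out above 2, while the contiguous rows stay below 1, which matches the regression described above.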

cpuhrsch · 2018-06-29T21:45:41Z

We found and mitigated the perf issue and will treat it separately.

These are the new speedups:

                                                              time_mean
dtype         sizes                       strides
torch.float32 torch.Size([215, 215, 215]) (215, 46225, 1)     16.867260
                                          (3870, 832050, 18)   2.989488
                                          (46225, 215, 1)      1.039506
                                          (832050, 3870, 18)   2.960994
              torch.Size([99, 99, 99])    (176418, 1782, 18)   3.477831
                                          (1782, 176418, 18)   3.087606
                                          (9801, 99, 1)        1.037980
                                          (99, 9801, 1)       14.272740
torch.float64 torch.Size([215, 215, 215]) (215, 46225, 1)      4.597996
                                          (3870, 832050, 18)   2.413990
                                          (46225, 215, 1)      4.718199
                                          (832050, 3870, 18)   2.474521
              torch.Size([99, 99, 99])    (176418, 1782, 18)   2.544542
                                          (1782, 176418, 18)   2.373826
                                          (9801, 99, 1)        4.980543
                                          (99, 9801, 1)        4.441531
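The strides column shows which rows describe a contiguous layout. As a rough sketch, a row-major (C-contiguous) tensor has strides equal to the running product of its trailing sizes, so only `(46225, 215, 1)` and `(9801, 99, 1)` are contiguous here. The helper names below are hypothetical, and the check is simplified (it does not special-case size-1 dimensions).

```python
def contiguous_strides(sizes):
    """Strides (in elements) of a row-major tensor with the given sizes."""
    strides = []
    step = 1
    for size in reversed(sizes):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(sizes, strides):
    """True if strides match the row-major layout (simplified check)."""
    return tuple(strides) == contiguous_strides(sizes)


# Size/stride pairs copied from the table above.
CASES = [
    ((215, 215, 215), (215, 46225, 1)),
    ((215, 215, 215), (3870, 832050, 18)),
    ((215, 215, 215), (46225, 215, 1)),
    ((99, 99, 99), (9801, 99, 1)),
    ((99, 99, 99), (99, 9801, 1)),
]

for sizes, strides in CASES:
    label = "contiguous" if is_contiguous(sizes, strides) else "non-contiguous"
    print(sizes, strides, label)
```

In the float32 rows, the two contiguous layouts are the ones with values near 1.04, while the non-contiguous layouts range from about 3 to 17.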

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1641.835
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4400.92
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

Command

$ cset shield --exec -- taskset -c 20 python run.py --include CPUUnaryBench --benchmark-min-time 2 --benchmark-warmup-repetitions 2 --benchmark-repetitions 3 --benchmark-filter .*sigmoid.* --benchmark-out /tmp/2

Benchmark commit: d7b07460f401363888e6e5343eee7079b70374c8

facebook-github-bot

@cpuhrsch has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

cpuhrsch · 2018-07-03T02:15:33Z

ROCM build succeeded separately.

Summary: This PR ports the vectorization of sigmoid to also enable better performance for non-contiguous arrays. Detailed timings will follow shortly.
Pull Request resolved: pytorch/pytorch#8612
Reviewed By: ezyang
Differential Revision: D8712298
Pulled By: cpuhrsch
fbshipit-source-id: 01a3d06af8d04513edd024ab1d01a6b753fc6f6a

cpuhrsch requested review from apaszke, colesbury, ezyang, gchanan, soumith and zdevito as code owners June 18, 2018 19:03

cpuhrsch force-pushed the sig branch 3 times, most recently from b2e7f05 to 1c3926e Compare June 19, 2018 22:17

cpuhrsch force-pushed the sig branch 8 times, most recently from ef0bb33 to c48f626 Compare June 29, 2018 21:32

cpuhrsch force-pushed the sig branch from c48f626 to bd46ef0 Compare July 1, 2018 16:53

facebook-github-bot reviewed Jul 2, 2018

cpuhrsch force-pushed the sig branch 2 times, most recently from aa13038 to 0ac6d73 Compare July 2, 2018 04:39

facebook-github-bot reviewed Jul 2, 2018

colesbury approved these changes Jul 9, 2018

aten/src/ATen/cpu/vml.h Outdated

cpuhrsch force-pushed the sig branch 2 times, most recently from 5554acf to 774bf85 Compare July 9, 2018 20:57

Vectorize sigmoid

7776fbb

cpuhrsch force-pushed the sig branch from 774bf85 to 7776fbb Compare July 10, 2018 00:00

facebook-github-bot reviewed Jul 10, 2018

facebook-github-bot closed this in e9e47ce Jul 10, 2018

ezyang added the merged label Jun 26, 2019
