@cpuhrsch
Contributor

This PR ports the vectorization of sigmoid to also enable better performance for non-contiguous arrays. Detailed timings will follow shortly.

@cpuhrsch
Contributor Author

Timings were collected using this benchmark.

The command run was

OMP_NUM_THREADS=1 numactl --membind=0 --cpubind=0 taskset -c 0 python run.py --include CPUUnaryBench --benchmark-min-time 2 --benchmark-warmup-repetitions 2 --benchmark-repetitions 3

This is a single-core benchmark.

framework                                Torch   Torch Master     Ratio  Better
cont  trans function     dim mag
False False ('sigmoid',) 3   1        6.930478       6.567699  0.947655   False
                             3       13.235672      29.237036  2.208957    True
                             6     8351.410715   30095.457062  3.603638    True
                             7    85592.009994  308379.842767  3.602905    True
      True  ('sigmoid',) 3   1        7.514903       6.595705  0.877683   False
                             3       18.093501      29.376038  1.623568    True
                             6     8895.611246   30055.028340  3.378636    True
                             7    91487.099854  307583.449497  3.362042    True
True  False ('sigmoid',) 3   1        6.942165       6.119649  0.881519   False
                             3       12.475567       7.145404  0.572752   False
                             6     1647.585599    1244.441353  0.755312   False
                             7    16909.896146   13191.346432  0.780096   False
      True  ('sigmoid',) 3   1        7.536675       6.587368  0.874042   False
                             3       17.715659      28.759583  1.623399    True
                             6     8135.648723   29658.940342  3.645553    True
                             7    79345.416158  303789.017412  3.828690    True

There are significant gains for the non-contiguous cases once the tensor is larger than 10 elements: roughly 1.6x to 2.2x at mag 3 and 3.4x to 3.8x at mag 6 and 7. However, the regular contiguous case regresses (ratios between 0.57 and 0.88), and this needs to be resolved before the PR can be merged.
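
As a reading aid for the table above: the Ratio column matches the Torch Master time divided by the Torch time, and Better is True when that ratio exceeds 1. The sketch below recomputes two of the mag 3 rows. It assumes the Torch column is the build with this PR applied; the `ratio` helper and the row labels are illustrative names, not part of the benchmark suite.

```python
def ratio(torch_time, master_time):
    """Master time divided by PR-build time; values above 1 mean the PR build is faster."""
    return master_time / torch_time


# (Torch, Torch Master) timings copied from two mag 3 rows of the table.
rows = {
    "noncontig_mag3": (13.235672, 29.237036),  # cont=False, trans=False
    "contig_mag3": (12.475567, 7.145404),      # cont=True, trans=False
}

results = {name: ratio(t, m) for name, (t, m) in rows.items()}
for name, r in results.items():
    print(f"{name}: ratio={r:.6f} better={r > 1}")
```

Read this way, the non-contiguous row is a speedup of about 2.2x, while the contiguous row shows the PR build taking longer than master.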

@cpuhrsch cpuhrsch force-pushed the sig branch 8 times, most recently from ef0bb33 to c48f626 Compare June 29, 2018 21:32
@cpuhrsch
Contributor Author

cpuhrsch commented Jun 29, 2018

We found and mitigated the perf issue and will treat it separately.

These are the new speedups:

                                                              time_mean
dtype         sizes                       strides
torch.float32 torch.Size([215, 215, 215]) (215, 46225, 1)     16.867260
                                          (3870, 832050, 18)   2.989488
                                          (46225, 215, 1)      1.039506
                                          (832050, 3870, 18)   2.960994
              torch.Size([99, 99, 99])    (176418, 1782, 18)   3.477831
                                          (1782, 176418, 18)   3.087606
                                          (9801, 99, 1)        1.037980
                                          (99, 9801, 1)       14.272740
torch.float64 torch.Size([215, 215, 215]) (215, 46225, 1)      4.597996
                                          (3870, 832050, 18)   2.413990
                                          (46225, 215, 1)      4.718199
                                          (832050, 3870, 18)   2.474521
              torch.Size([99, 99, 99])    (176418, 1782, 18)   2.544542
                                          (1782, 176418, 18)   2.373826
                                          (9801, 99, 1)        4.980543
                                          (99, 9801, 1)        4.441531
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1641.835
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4400.92
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

Command

$ cset shield --exec -- taskset -c 20 python run.py --include CPUUnaryBench --benchmark-min-time 2 --benchmark-warmup-repetitions 2 --benchmark-repetitions 3 --benchmark-filter .*sigmoid.* --benchmark-out /tmp/2

Benchmark commit: d7b07460f401363888e6e5343eee7079b70374c8
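
For reading the strides column in the speedup table above: (9801, 99, 1) and (46225, 215, 1) are the dense row-major layouts for their sizes, and every other row is non-contiguous. How the benchmark constructs those layouts is not shown in this thread; a last stride of 18 is consistent with stepping through a larger buffer, and swapped leading strides are consistent with a transpose. A minimal sketch of the contiguity check, using hypothetical helper names and ignoring the special handling that size-1 dimensions get in real tensor libraries:

```python
def contiguous_strides(sizes):
    """Row-major (C-order) strides, in elements, for a dense tensor of the given sizes."""
    strides = []
    step = 1
    for size in reversed(sizes):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(sizes, strides):
    """True when the strides equal the dense row-major strides for these sizes."""
    return tuple(strides) == contiguous_strides(sizes)


sizes = (99, 99, 99)
table_strides = [
    (9801, 99, 1),
    (99, 9801, 1),
    (176418, 1782, 18),
    (1782, 176418, 18),
]
for strides in table_strides:
    print(strides, "contiguous" if is_contiguous(sizes, strides) else "non-contiguous")
```

Under this check, the two contiguous float32 rows are the ones with speedups near 1.04, which fits the earlier regression in the contiguous case having been mitigated.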

Contributor

@facebook-github-bot facebook-github-bot left a comment

@cpuhrsch has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@cpuhrsch cpuhrsch force-pushed the sig branch 2 times, most recently from aa13038 to 0ac6d73 Compare July 2, 2018 04:39
Contributor

@facebook-github-bot facebook-github-bot left a comment

@cpuhrsch has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@cpuhrsch
Contributor Author

cpuhrsch commented Jul 3, 2018

ROCM build succeeded separately.

@cpuhrsch cpuhrsch force-pushed the sig branch 2 times, most recently from 5554acf to 774bf85 Compare July 9, 2018 20:57
Contributor

@facebook-github-bot facebook-github-bot left a comment

@cpuhrsch has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 10, 2018
Summary:
This PR ports the vectorization of sigmoid to also enable better performance for non-contiguous arrays. Detailed timings will follow shortly.
Pull Request resolved: pytorch/pytorch#8612

Reviewed By: ezyang

Differential Revision: D8712298

Pulled By: cpuhrsch

fbshipit-source-id: 01a3d06af8d04513edd024ab1d01a6b753fc6f6a
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 13, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
@ezyang ezyang added the merged label Jun 26, 2019