
Conversation

@heitorschueroff
Contributor

@heitorschueroff commented Aug 19, 2020

Stack from ghstack:

This PR implements a version of max_pool2d that does not compute indices when they are not needed. It also makes some optimizations that will be carried over to other pooling functions in future PRs.
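For illustration only (this is not the ATen kernel), a plain-Python sketch of what max_pool2d computes when indices are skipped: each output cell keeps only a running maximum, with no argmax bookkeeping.

```python
import math

def max_pool2d_no_indices(x, kernel_size, stride=None, padding=0):
    """Single-channel 2D max pooling over a list-of-lists 'tensor'.

    Illustrative stand-in for the kernel described in this PR: because the
    caller does not need indices, the inner loop tracks only the max value.
    """
    k = kernel_size
    s = stride if stride is not None else kernel_size
    H, W = len(x), len(x[0])
    out_h = (H + 2 * padding - k) // s + 1
    out_w = (W + 2 * padding - k) // s + 1
    out = [[-math.inf] * out_w for _ in range(out_h)]
    for oh in range(out_h):
        for ow in range(out_w):
            for ih in range(oh * s - padding, oh * s - padding + k):
                for iw in range(ow * s - padding, ow * s - padding + k):
                    # Out-of-bounds positions act as -inf padding.
                    if 0 <= ih < H and 0 <= iw < W and x[ih][iw] > out[oh][ow]:
                        out[oh][ow] = x[ih][iw]  # value only; no index stored
    return out
```

With a 4x4 input and kernel_size=2 this collapses each non-overlapping 2x2 window to its maximum.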

Benchmarking:

Tensor Parameters

BATCH = 10
CHANNEL = 16
HEIGHT = 2048
WIDTH = 2048
DTYPE = torch.float32
DEVICE = "cpu"

Pooling Parameters

KERNEL_SIZE = 2
STRIDE = None
PADDING = 0
DILATION = 1
CEIL_MODE = False

Results (time in ms; factor vs. test_max_pool2d)

test_max_pool2d: 118.4793 (1.0)
test_mkldnn_max_pool2d: 360.2836 (3.04)
test_max_pool2d_with_indices: 626.9831 (5.29)
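The test_* timings come from the author's benchmark harness, which is not shown here. A minimal stand-in using only the standard library might look like the following; the workload lambda is a placeholder, and a real run would time torch.nn.functional.max_pool2d on the tensor described above.

```python
import timeit

def bench_ms(fn, reps=5):
    # Average wall-clock time per call, in milliseconds.
    fn()  # warm-up call, excluded from the measurement
    return timeit.timeit(fn, number=reps) / reps * 1e3

# Placeholder workload; the real harness would time e.g.
#   torch.nn.functional.max_pool2d(x, kernel_size=2)
# on a (10, 16, 2048, 2048) float32 CPU tensor.
elapsed = bench_ms(lambda: sum(range(100_000)))
```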

Discussion

The new implementation is on average 2 to 3 times faster than mkldnn and about 5x faster than with_indices. The original with_indices code parallelized only over batches and channels, so when those dimensions are small it cannot achieve full parallelism.

This algorithm also reduces duplicate comparisons in the case of overlapping kernel windows. For instance, if we change the pooling parameters above to:

KERNEL_SIZE = 4
STRIDE = 1
PADDING = 1
DILATION = 2
CEIL_MODE = True

Results (time in ms; factor vs. test_max_pool2d)

test_max_pool2d: 136.4228 (1.0)
test_mkldnn_max_pool2d: 608.4158 (4.46)
test_max_pool2d_with_indices: 1,230.1916 (9.02)
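The duplicate-comparison savings for overlapping windows can be seen in a 1D sketch (illustrative only; the actual kernel is C++ and 2D). With stride < kernel_size, a naive scan redoes O(k) comparisons per window, while a monotonic-deque scan pushes and pops each element at most once, sharing work between overlapping windows.

```python
from collections import deque

def pool1d_naive(x, k, stride=1):
    # Recomputes the max of each window from scratch: O(len(x) * k) comparisons.
    return [max(x[i:i + k]) for i in range(0, len(x) - k + 1, stride)]

def pool1d_monotonic(x, k, stride=1):
    # Monotonic deque of indices with decreasing values: each element enters
    # and leaves the deque at most once, so overlapping windows (stride < k)
    # share comparisons and total work is O(len(x)).
    out, q = [], deque()
    for i, v in enumerate(x):
        while q and x[q[-1]] <= v:
            q.pop()                  # dominated values can never be a max
        q.append(i)
        if q[0] <= i - k:
            q.popleft()              # front index fell out of the window
        if i >= k - 1 and (i - k + 1) % stride == 0:
            out.append(x[q[0]])      # front of the deque is the window max
    return out
```

For example, pool1d_monotonic([3, 1, 4, 1, 5, 9, 2, 6], 3) returns [4, 4, 5, 9, 9, 9], matching the naive version while doing far fewer comparisons.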

There is also an issue with the existing pooling implementations: they use nested at::parallel_for loops, but since at::parallel_for does not support nesting, only the outermost loop is actually parallelized.
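One common workaround (sketched here in Python for illustration; whether this PR adopts exactly this scheme is not stated above) is to collapse the nested loops into a single flat index range, hand that one range to a single at::parallel_for, and recover the original indices inside the body:

```python
def flat_parallel_for(outer, inner, body):
    # Stand-in for a single at::parallel_for over outer * inner iterations.
    # A real implementation would split range(outer * inner) into chunks for
    # worker threads; here we only show the index arithmetic.
    for flat in range(outer * inner):
        i, j = divmod(flat, inner)  # recover the two nested-loop indices
        body(i, j)

visited = []
flat_parallel_for(3, 4, lambda i, j: visited.append((i, j)))
# visited covers every (i, j) a nested 3 x 4 loop would touch
```

Flattening gives the scheduler outer * inner units of work instead of just outer, which is exactly what helps when batch and channel counts are small.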

Differential Revision: D23273406

closes #28733

heitorschueroff added a commit that referenced this pull request Aug 19, 2020
@heitorschueroff changed the title from "Draft version of max_pool2d without indices optimization" to "[WIP] max_pool2d without indices optimization" Aug 19, 2020
@heitorschueroff heitorschueroff marked this pull request as draft August 19, 2020 15:18
@dr-ci

dr-ci bot commented Aug 19, 2020

💊 CI failures summary and remediations

As of commit 42a5360 (more details on the Dr. CI page):



🕵️ 15 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build (1/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:02:55 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack'
Aug 26 21:02:47 1 warning generated. 
Aug 26 21:02:50 [1171/1550] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LinearAlgebra.cpp.o 
Aug 26 21:02:52 [1172/1550] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o 
Aug 26 21:02:54 [1173/1550] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL2d.cpp.o 
Aug 26 21:02:55 [1174/1550] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o 
Aug 26 21:02:55 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o  
-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -DCAFFE2_BUILD_MAIN_LIB -pthread -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp 
Aug 26 21:02:55 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:101:7: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:02:55   if (xnnpack::use_max_pool2d( 
Aug 26 21:02:55       ^ 
Aug 26 21:02:55 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:02:55     return xnnpack::max_pool2d( 
Aug 26 21:02:55            ^ 
Aug 26 21:02:55 2 errors generated. 
Aug 26 21:02:58 [1175/1550] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxUnpooling.cpp.o 
Aug 26 21:02:59 [1176/1550] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Memory.cpp.o 
Aug 26 21:02:59 ninja: build stopped: subcommand failed. 

See CircleCI build pytorch_windows_vs2019_py36_cpu_build (2/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

X -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\DepthwiseConvKernel.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\DepthwiseConvKernel.cpp.DEFAULT.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/MaxPooling.cpp.DEFAULT.cpp.obj  
GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: failure was caused by a read of a variable outside its lifetime
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: see usage of 'p'
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(113): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::max_pool2d_kernel<scalar_t>(scalar_t *,const scalar_t *const ,const int64_t,const int64_t,const at::native::PoolingParams &)' being compiled
        with
        [
            scalar_t=scalar_t
        ]
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_vulkan_x86_32_build (3/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:03:20 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack'
Aug 26 21:03:18 /var/lib/jenkins/workspace/aten/src/ATen/native/LossMultiLabelMargin.cpp:170:15: warning: unused variable 'c' [-Wunused-variable] 
Aug 26 21:03:18   CheckedFrom c = "multilabel_margin_loss_backward_out_frame"; 
Aug 26 21:03:18               ^ 
Aug 26 21:03:18 1 warning generated. 
Aug 26 21:03:20 [1177/1561] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o 
Aug 26 21:03:20 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o  
-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -DCAFFE2_BUILD_MAIN_LIB -pthread -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp 
Aug 26 21:03:20 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:101:7: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:03:20   if (xnnpack::use_max_pool2d( 
Aug 26 21:03:20       ^ 
Aug 26 21:03:20 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:03:20     return xnnpack::max_pool2d( 
Aug 26 21:03:20            ^ 
Aug 26 21:03:20 2 errors generated. 
Aug 26 21:03:22 [1178/1561] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL2d.cpp.o 
Aug 26 21:03:25 [1179/1561] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxUnpooling.cpp.o 
Aug 26 21:03:25 ninja: build stopped: subcommand failed. 

See CircleCI build binary_windows_libtorch_3_7_cpu_release_build (4/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
[1161/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\event.cc.obj 
[1162/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\operators\concat_split_op.cc.obj 
[1163/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\operators\conditional_op.cc.obj 
[1164/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\Activation.cpp.DEFAULT.cpp.obj 
[1165/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\module.cc.obj 
[1166/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\graph.cc.obj 
[1167/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\operators\communicator_op.cc.obj 
[1168/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/MaxPooling.cpp.DEFAULT.cpp.obj  
GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: failure was caused by a read of a variable outside its lifetime
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: see usage of 'p'
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(113): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::max_pool2d_kernel<scalar_t>(scalar_t *,const scalar_t *const ,const int64_t,const int64_t,const at::native::PoolingParams &)' being compiled
        with
        [
            scalar_t=scalar_t
        ]
[1169/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\export_c10_op_to_caffe2.cc.obj 
[1170/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\int8_serialization.cc.obj 
[1171/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\init.cc.obj 

See CircleCI build pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single (5/15)

Step: "pytorch android gradle custom build single architecture (for PR)" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:02:26 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack'
Aug 26 21:02:21 ../../aten/src/ATen/native/Math.h:371:9: warning: unused function 'abs_impl' [-Wunused-function] 
Aug 26 21:02:21 uint8_t abs_impl(uint8_t v) { 
Aug 26 21:02:21         ^ 
Aug 26 21:02:21 1 warning generated. 
Aug 26 21:02:26 [989/1375] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o 
Aug 26 21:02:26 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o  
-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -DCAFFE2_BUILD_MAIN_LIB -pthread -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp 
Aug 26 21:02:26 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:101:7: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:02:26   if (xnnpack::use_max_pool2d( 
Aug 26 21:02:26       ^ 
Aug 26 21:02:26 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:02:26     return xnnpack::max_pool2d( 
Aug 26 21:02:26            ^ 
Aug 26 21:02:26 2 errors generated. 
Aug 26 21:02:27 [990/1375] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossMultiLabelMargin.cpp.o 
Aug 26 21:02:27 /var/lib/jenkins/workspace/aten/src/ATen/native/LossMultiLabelMargin.cpp:170:15: warning: unused variable 'c' [-Wunused-variable] 
Aug 26 21:02:27   CheckedFrom c = "multilabel_margin_loss_backward_out_frame"; 
Aug 26 21:02:27               ^ 
Aug 26 21:02:27 1 warning generated. 
Aug 26 21:02:29 [991/1375] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o 
Aug 26 21:02:30 [992/1375] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LinearAlgebra.cpp.o 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_dynamic (6/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:01:14 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack'
Aug 26 21:01:14 [1208/1584] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o 
Aug 26 21:01:14 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o  
-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -DCAFFE2_BUILD_MAIN_LIB -pthread -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp 
Aug 26 21:01:14 In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:3: 
Aug 26 21:01:14 ../../../aten/src/ATen/native/Pool.h:59:17: warning: unused variable 'nOutputPlane' [-Wunused-variable] 
Aug 26 21:01:14   const int64_t nOutputPlane = nInputPlane; 
Aug 26 21:01:14                 ^ 
Aug 26 21:01:14 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:101:7: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:01:14   if (xnnpack::use_max_pool2d( 
Aug 26 21:01:14       ^ 
Aug 26 21:01:14 /var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack' 
Aug 26 21:01:14     return xnnpack::max_pool2d( 
Aug 26 21:01:14            ^ 
Aug 26 21:01:14 1 warning and 2 errors generated. 
Aug 26 21:01:16 [1209/1584] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Memory.cpp.o 
Aug 26 21:01:17 [1210/1584] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxUnpooling.cpp.o 
Aug 26 21:01:18 [1211/1584] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MetaTensor.cpp.o 
Aug 26 21:01:18 ninja: build stopped: subcommand failed. 
Aug 26 21:01:18 + sccache_epilogue 
Aug 26 21:01:18 + echo '=================== sccache compilation log ===================' 
Aug 26 21:01:18 + python /var/lib/jenkins/workspace/.jenkins/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (7/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Tmp/CheckSymbolExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n return ((int*)(&strtod_l))[argc];\n ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" }
Aug 26 20:55:05 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o] Error 1 
Aug 26 20:55:05 make[2]: *** Waiting for unfinished jobs.... 
Aug 26 20:55:06 make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2 
Aug 26 20:55:06 CMakeFiles/Makefile2:1020: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/all' failed 
Aug 26 20:55:06 make: *** [all] Error 2 
Aug 26 20:55:06 Makefile:138: recipe for target 'all' failed 
Aug 26 20:55:06 + sccache_epilogue 
Aug 26 20:55:06 + echo '=================== sccache compilation log ===================' 
Aug 26 20:55:06 + python /var/lib/jenkins/workspace/.jenkins/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 
Aug 26 20:55:06 =================== sccache compilation log =================== 
mp/CheckSymbolExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n   return ((int*)(&strtod_l))[argc];\n                   ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_default_libtorch/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" } 
Aug 26 20:55:06  
error: 'xnnpack' has not been declared\n   if (xnnpack::use_max_pool2d(\n       ^\n/var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: 'xnnpack' has not been declared\n     return xnnpack::max_pool2d(\n            ^\n" } 
Aug 26 20:55:06  
Aug 26 20:55:06 + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 
Aug 26 20:55:06 + sccache --show-stats 
Aug 26 20:55:06 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 26 20:55:06 Compile requests              1501 
Aug 26 20:55:06 Compile requests executed     1236 
Aug 26 20:55:06 Cache hits                    1231 
Aug 26 20:55:06 Cache misses                     0 

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_build (8/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\CopyKernel.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\CopyKernel.cpp.DEFAULT.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/MaxPooling.cpp.DEFAULT.cpp.obj  
GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: failure was caused by a read of a variable outside its lifetime
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: see usage of 'p'
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(113): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::max_pool2d_kernel<scalar_t>(scalar_t *,const scalar_t *const ,const int64_t,const int64_t,const at::native::PoolingParams &)' being compiled
        with
        [
            scalar_t=scalar_t
        ]
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (9/15)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:59:24 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Aug 26 21:59:24 Traceback (most recent call last): 
Aug 26 21:59:24   File "test/run_test.py", line 721, in <module> 
Aug 26 21:59:24     main() 
Aug 26 21:59:24   File "test/run_test.py", line 710, in main 
Aug 26 21:59:24     raise RuntimeError(err) 
Aug 26 21:59:24 RuntimeError: test_quantization failed! 
Aug 26 21:59:24 + cleanup 
Aug 26 21:59:24 + retcode=1 
Aug 26 21:59:24 + set +x 
Aug 26 21:59:24 =================== sccache compilation log =================== 
Aug 26 21:59:24 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Aug 26 21:59:24  
Aug 26 21:59:24 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 26 21:59:24 Compile requests                 65 
Aug 26 21:59:24 Compile requests executed        35 
Aug 26 21:59:24 Cache hits                       27 
Aug 26 21:59:24 Cache misses                      7 
Aug 26 21:59:24 Cache timeouts                    0 
Aug 26 21:59:24 Cache read errors                 0 
Aug 26 21:59:24 Forced recaches                   0 
Aug 26 21:59:24 Cache write errors                0 

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (10/15)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:21:22 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in
Aug 26 21:21:22     #7 0x5564847df7eb in PyEval_EvalCode /tmp/build/80754af9/python_1588903631989/work/Python/ceval.c:731 
Aug 26 21:21:22     #8 0x55648485fe73 in run_mod /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:1025 
Aug 26 21:21:22     #9 0x55648485ff0c in PyRun_StringFlags /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:949 
Aug 26 21:21:22     #10 0x55648485ff6e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:445 
Aug 26 21:21:22     #11 0x556484863d72 in run_command /tmp/build/80754af9/python_1588903631989/work/Modules/main.c:301 
Aug 26 21:21:22     #12 0x556484863d72 in Py_Main /tmp/build/80754af9/python_1588903631989/work/Modules/main.c:749 
Aug 26 21:21:22     #13 0x55648472df2d in main /tmp/build/80754af9/python_1588903631989/work/Programs/python.c:69 
Aug 26 21:21:22     #14 0x7fa0086dd83f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291 
Aug 26 21:21:22     #15 0x55648480d27e in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103 
Aug 26 21:21:22  
Aug 26 21:21:22 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in  
Aug 26 21:21:22 + retcode=1 
Aug 26 21:21:22 + set -e 
Aug 26 21:21:22 + return 1 
Aug 26 21:21:22 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]] 
Aug 26 21:21:22 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]] 
Aug 26 21:21:22 + '[' -n https://github.com/pytorch/pytorch/pull/43267 ']' 
Aug 26 21:21:22 ++ mktemp 
Aug 26 21:21:22 + DETERMINE_FROM=/tmp/tmp.6txZLPEpzc 
Aug 26 21:21:22 + file_diff_from_base /tmp/tmp.6txZLPEpzc 
Aug 26 21:21:22 + set +e 

See CircleCI build pytorch_windows_vs2019_py36_cuda11.0_build (11/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

R -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\core\init_omp.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\core\init_omp.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/MaxPooling.cpp.DEFAULT.cpp.obj  
GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: failure was caused by a read of a variable outside its lifetime
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: see usage of 'p'
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(113): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::max_pool2d_kernel<scalar_t>(scalar_t *,const scalar_t *const ,const int64_t,const int64_t,const at::native::PoolingParams &)' being compiled
        with
        [
            scalar_t=scalar_t
        ]
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28806 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (12/15)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 26 21:49:08 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Aug 26 21:49:08     raise RuntimeError(err) 
Aug 26 21:49:08 RuntimeError: test_quantization failed! 
Aug 26 21:49:08  
Aug 26 21:49:08 real	26m19.089s 
Aug 26 21:49:08 user	44m24.798s 
Aug 26 21:49:08 sys	2m32.915s 
Aug 26 21:49:08 + cleanup 
Aug 26 21:49:08 + retcode=1 
Aug 26 21:49:08 + set +x 
Aug 26 21:49:08 =================== sccache compilation log =================== 
Aug 26 21:49:08 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Aug 26 21:49:08  
Aug 26 21:49:08 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 26 21:49:08 Compile requests                 65 
Aug 26 21:49:08 Compile requests executed        35 
Aug 26 21:49:08 Cache hits                       27 
Aug 26 21:49:08 Cache misses                      7 
Aug 26 21:49:08 Cache timeouts                    0 
Aug 26 21:49:08 Cache read errors                 0 
Aug 26 21:49:08 Forced recaches                   0 
Aug 26 21:49:08 Cache write errors                0 

See CircleCI build pytorch_ios_11_2_1_x86_64_build (13/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Aug 26 20:55:16 /Users/distiller/project/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack'
Aug 26 20:55:10               ^ 
Aug 26 20:55:12 1 warning generated. 
Aug 26 20:55:12 [ 74%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o 
Aug 26 20:55:12 [ 74%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL2d.cpp.o 
Aug 26 20:55:13 2 warnings generated. 
Aug 26 20:55:13 [ 74%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o 
Aug 26 20:55:13 [ 74%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxUnpooling.cpp.o 
Aug 26 20:55:16 /Users/distiller/project/aten/src/ATen/native/MaxPooling.cpp:101:7: error: use of undeclared identifier 'xnnpack' 
Aug 26 20:55:16   if (xnnpack::use_max_pool2d( 
Aug 26 20:55:16       ^ 
Aug 26 20:55:16 /Users/distiller/project/aten/src/ATen/native/MaxPooling.cpp:103:12: error: use of undeclared identifier 'xnnpack' 
Aug 26 20:55:16     return xnnpack::max_pool2d( 
Aug 26 20:55:16            ^ 
Aug 26 20:55:17 2 errors generated. 
Aug 26 20:55:17 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o] Error 1 
Aug 26 20:55:17 make[2]: *** Waiting for unfinished jobs.... 
Aug 26 20:55:19 make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2 
Aug 26 20:55:19 make: *** [all] Error 2 

See CircleCI build binary_windows_libtorch_3_7_cpu_debug_build (14/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
[1155/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\blob_stats.cc.obj 
[1156/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\context.cc.obj 
[1157/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\event.cc.obj 
[1158/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\context_base.cc.obj 
[1159/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\Activation.cpp.DEFAULT.cpp.obj 
[1160/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\db.cc.obj 
[1161/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\init_denormals.cc.obj 
[1162/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/MaxPooling.cpp.DEFAULT.cpp.obj  
GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -openmp:experimental -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /fp:strict  /DCPU_CAPABILITY=DEFAULT /DCPU_CAPABILITY_DEFAULT /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp 
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): error C2131: expression did not evaluate to a constant
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: failure was caused by a read of a variable outside its lifetime
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(36): note: see usage of 'p'
aten\src\ATen\native\cpu\MaxPooling.cpp.DEFAULT.cpp(113): note: see reference to function template instantiation 'void at::native::`anonymous-namespace'::max_pool2d_kernel<scalar_t>(scalar_t *,const scalar_t *const ,const int64_t,const int64_t,const at::native::PoolingParams &)' being compiled
        with
        [
            scalar_t=scalar_t
        ]
[1163/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\graph.cc.obj 
[1164/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\init.cc.obj 
[1165/2287] Building CXX object caffe2\CMakeFiles\torch_cpu.dir\core\export_c10_op_to_caffe2.cc.obj 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_static (15/15)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

bolExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n return ((int*)(&strtod_l))[argc];\n ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" }
Aug 26 21:00:28 make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/MaxPooling.cpp.o] Error 1 
Aug 26 21:00:28 make[2]: *** Waiting for unfinished jobs.... 
Aug 26 21:00:33 CMakeFiles/Makefile2:1020: recipe for target 'caffe2/CMakeFiles/torch_cpu.dir/all' failed 
Aug 26 21:00:33 make[1]: *** [caffe2/CMakeFiles/torch_cpu.dir/all] Error 2 
Aug 26 21:00:33 make: *** [all] Error 2 
Aug 26 21:00:33 Makefile:138: recipe for target 'all' failed 
Aug 26 21:00:33 =================== sccache compilation log =================== 
Aug 26 21:00:33 + sccache_epilogue 
Aug 26 21:00:33 + echo '=================== sccache compilation log ===================' 
Aug 26 21:00:33 + python /var/lib/jenkins/workspace/.jenkins/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 
olExists.c: In function \'main\':\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: \'strtod_l\' undeclared (first use in this function)\n   return ((int*)(&strtod_l))[argc];\n                   ^\n/var/lib/jenkins/workspace/build_test_custom_build/build_custom_libtorch_static/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: note: each undeclared identifier is reported only once for each function it appears in\n" } 
Aug 26 21:00:33  
error: 'xnnpack' has not been declared
   if (xnnpack::use_max_pool2d(
       ^
/var/lib/jenkins/workspace/aten/src/ATen/native/MaxPooling.cpp:103:12: error: 'xnnpack' has not been declared
     return xnnpack::max_pool2d(
            ^
Aug 26 21:00:33  
Aug 26 21:00:33 + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 
Aug 26 21:00:33 + sccache --show-stats 
Aug 26 21:00:33 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 26 21:00:33 Compile requests              1477 
Aug 26 21:00:33 Compile requests executed     1224 
Aug 26 21:00:33 Cache hits                       1 
Aug 26 21:00:33 Cache misses                  1218 

❄️ 3 failures tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Aug 26 22:16:36 ConnectionResetError: [Errno 104] Connection reset by peer
Aug 26 22:16:36   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 456, in accept 
Aug 26 22:16:36     answer_challenge(c, self._authkey) 
Aug 26 22:16:36   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge 
Aug 26 22:16:36     message = connection.recv_bytes(256)         # reject large message 
Aug 26 22:16:36   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes 
Aug 26 22:16:36     buf = self._recv_bytes(maxlength) 
Aug 26 22:16:36   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes 
Aug 26 22:16:36     buf = self._recv(4) 
Aug 26 22:16:36   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv 
Aug 26 22:16:36     chunk = read(handle, remaining) 
Aug 26 22:16:36 ConnectionResetError: [Errno 104] Connection reset by peer 
Aug 26 22:16:36 /opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown 
Aug 26 22:16:36   len(cache)) 
Aug 26 22:16:39 Process ErrorTrackingProcess-152: 
Aug 26 22:16:39 Traceback (most recent call last): 
Aug 26 22:16:39   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap 
Aug 26 22:16:39     self.run() 
Aug 26 22:16:39   File "/var/lib/jenkins/workspace/test/test_dataloader.py", line 361, in run 
Aug 26 22:16:39     super(ErrorTrackingProcess, self).run() 
Aug 26 22:16:39   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run 
Aug 26 22:16:39     self._target(*self._args, **self._kwargs) 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (2/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Aug 26 21:56:53 hypothesis.errors.Flaky: Hypothesis test_max_pool2d_nhwc(self=, X=(array([[[[1., 1., 1.],
Aug 26 21:56:53   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/hypothesis/core.py", line 1116, in wrapped_test 
Aug 26 21:56:53     raise the_error_hypothesis_found 
Aug 26 21:56:53   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/hypothesis/core.py", line 1071, in wrapped_test 
Aug 26 21:56:53     state.run_engine() 
Aug 26 21:56:53   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/hypothesis/core.py", line 789, in run_engine 
Aug 26 21:56:53     info.__expected_traceback, 
Aug 26 21:56:53   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/hypothesis/core.py", line 656, in execute_once 
Aug 26 21:56:53     % (test.__name__, text_repr[0]) 
Aug 26 21:56:53   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/hypothesis/core.py", line 856, in __flaky 
Aug 26 21:56:53     raise Flaky(message) 
Aug 26 21:56:53 hypothesis.errors.Flaky: Hypothesis test_max_pool2d_nhwc(self=<quantization.test_quantized_op.TestQuantizedOps testMethod=test_max_pool2d_nhwc>, X=(array([[[[1., 1., 1.], 
Aug 26 21:56:53           [1., 1., 1.], 
Aug 26 21:56:53           [1., 1., 1.]]]], dtype=float32), (1.0, 0, torch.quint8)), kernel=3, stride=None, dilation=1, padding=0, ceil_mode=False) produces unreliable results: Falsified on the first call but did not on a subsequent one 
Aug 26 21:56:53  
Aug 26 21:56:53 ---------------------------------------------------------------------- 
Aug 26 21:56:53 Ran 331 tests in 517.519s 
Aug 26 21:56:53  
Aug 26 21:56:53 FAILED (errors=1, skipped=13) 
Aug 26 21:56:53  
Aug 26 21:56:53 Generating XML reports... 
Aug 26 21:56:53 Generated XML report: test-reports/dist-gloo/TEST-quantization.test_bias_correction.TestBiasCorrection-20200826214816.xml 

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_test (3/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Aug 26 21:53:25 hypothesis.errors.Flaky: Hypothesis test_max_pool2d_nhwc(self=, X=(array([[[[7., 7., 7., ..., 7., 7., 7.],
Aug 26 21:53:25   File "/var/lib/jenkins/.local/lib/python3.8/site-packages/hypothesis/core.py", line 1116, in wrapped_test 
Aug 26 21:53:25     raise the_error_hypothesis_found 
Aug 26 21:53:25   File "/var/lib/jenkins/.local/lib/python3.8/site-packages/hypothesis/core.py", line 1071, in wrapped_test 
Aug 26 21:53:25     state.run_engine() 
Aug 26 21:53:25   File "/var/lib/jenkins/.local/lib/python3.8/site-packages/hypothesis/core.py", line 783, in run_engine 
Aug 26 21:53:25     self.execute_once( 
Aug 26 21:53:25   File "/var/lib/jenkins/.local/lib/python3.8/site-packages/hypothesis/core.py", line 651, in execute_once 
Aug 26 21:53:25     self.__flaky( 
Aug 26 21:53:25   File "/var/lib/jenkins/.local/lib/python3.8/site-packages/hypothesis/core.py", line 856, in __flaky 
Aug 26 21:53:25     raise Flaky(message) 
Aug 26 21:53:25 hypothesis.errors.Flaky: Hypothesis test_max_pool2d_nhwc(self=<quantization.test_quantized_op.TestQuantizedOps testMethod=test_max_pool2d_nhwc>, X=(array([[[[7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           ..., 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.]], 
Aug 26 21:53:25   
Aug 26 21:53:25          [[7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.], 
Aug 26 21:53:25           [7., 7., 7., ..., 7., 7., 7.], 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Included benchmark file for reference. Will remove on final PR.

[ghstack-poisoned]
heitorschueroff added a commit that referenced this pull request Aug 19, 2020
Contributor

@glaringlee glaringlee left a comment
This is overall a great algo and easy to expand to 3d.
Please see my comments.

heitorschueroff added a commit that referenced this pull request Aug 21, 2020
ghstack-source-id: 25fa3da
Pull Request resolved: #43267
@glaringlee glaringlee changed the title from [WIP] max_pool2d without indices optimization to max_pool2d without indices optimization [CPU] Aug 21, 2020
Contributor

@glaringlee glaringlee left a comment
Some minor comments. Please rebase the code, and then I will approve it.

@heitorschueroff heitorschueroff marked this pull request as ready for review August 21, 2020 19:11
Contributor

@glaringlee glaringlee left a comment
@heitorschueroff
LGTM now, except for the std::max error.
Where is your benchmark?

This PR implements a version of max_pool2d that doesn't compute indices when it's not needed. It also makes some optimizations that will be carried over to other pooling functions in future PRs.

## Benchmarking:

#### Tensor Parameters
BATCH = 10
CHANNEL = 16
HEIGHT = 2048
WIDTH = 2048
DTYPE = torch.float32
DEVICE = "cpu"

#### Pooling Parameters
KERNEL_SIZE = 2
STRIDE = None
PADDING = 0
DILATION = 1
CEIL_MODE = False

#### Results (time in ms) (speedup factor)
test_max_pool2d: 110.0176 (1.0)
test_mkldnn_max_pool2d: 378.5602 (3.44)
test_max_pool2d_with_indices: 626.6335 (5.70)

## Discussion

The new implementation is on average 2~4 times faster than mkldnn and >5x faster than with_indices. The original with_indices code only parallelized over batches and channels, so when the batch and channel counts were small it could not achieve optimal parallelism.

This algorithm also reduces duplicate comparisons in the case of overlapping kernel windows. For instance, if we change the pooling parameters above to:

KERNEL_SIZE = 4
STRIDE = 2
PADDING = 2
DILATION = 1
CEIL_MODE = True

#### Results (time in ms) (speedup factor)
test_max_pool2d: 136.4228 (1.0)
test_mkldnn_max_pool2d: 608.4158 (4.46)
test_max_pool2d_with_indices: 1,230.1916 (9.02)

[ghstack-poisoned]
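For readers who want the operation being optimized spelled out, here is a minimal pure-Python sketch of single-channel max_pool2d without index tracking. This is an illustration of the semantics only, not the ATen implementation, and ceil_mode corner cases around padding are simplified:

```python
import math

def max_pool2d_naive(x, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False):
    """Naive single-channel 2D max pooling over a list-of-lists `x`.

    No argmax indices are tracked, mirroring the fast path this PR adds.
    Note: ceil_mode corner cases (windows starting entirely inside the
    padding) are not handled exactly as PyTorch does; semantic sketch only.
    """
    stride = stride if stride is not None else kernel_size
    h, w = len(x), len(x[0])

    def out_dim(dim):
        # floor((D + 2P - dilation*(K-1) - 1) / S) + 1, or ceil with ceil_mode
        eff = dim + 2 * padding - dilation * (kernel_size - 1) - 1
        return int((math.ceil if ceil_mode else math.floor)(eff / stride)) + 1

    oh, ow = out_dim(h), out_dim(w)
    out = [[-math.inf] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for ki in range(kernel_size):
                for kj in range(kernel_size):
                    ii = i * stride - padding + ki * dilation
                    jj = j * stride - padding + kj * dilation
                    if 0 <= ii < h and 0 <= jj < w:  # skip padded positions
                        out[i][j] = max(out[i][j], x[ii][jj])
    return out
```

With kernel_size=2 on a 4x4 ramp this yields the expected 2x2 block maxima.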
heitorschueroff added a commit that referenced this pull request Aug 21, 2020
ghstack-source-id: 8dcefd5
Pull Request resolved: #43267
heitorschueroff added a commit that referenced this pull request Aug 22, 2020
ghstack-source-id: 43ba50d
Pull Request resolved: #43267
Contributor

@glaringlee glaringlee left a comment
LGTM, approving. Please rebase and import to phabricator.

heitorschueroff added a commit that referenced this pull request Aug 24, 2020
ghstack-source-id: 3d2ec4b
Pull Request resolved: #43267
heitorschueroff added a commit that referenced this pull request Aug 24, 2020
ghstack-source-id: fbb5c17
Pull Request resolved: #43267
helper(10, 512, 31, 31, 3, stride=2)
helper(1, 129, 8, 8, 3, stride=2)

@onlyCUDA
Contributor
Why do you think this was onlyCUDA before? Isn't your test (on CPU) going to run the same thing twice and check that the results match? That's fine, I guess?

Collaborator
The original purpose of this onlyCUDA was to test the CUDA implementation against the CPU implementation as a reference. I would suggest that we keep onlyCUDA there; otherwise there would be a duplicate CPU-CPU comparison.

In order to have a purely CPU test, we can add a few hard-coded pooling inputs and expected results.

def helper(n, c, h, w, ks):
x = torch.randn(n, c, h, w, device='cuda', dtype=torch.float, requires_grad=True)
def helper(n, c, h, w, ks, requires_grad):
x = torch.randn(n, c, h, w, device=device, dtype=torch.float, requires_grad=requires_grad)
Contributor
not your code, but the line below -- does the detach actually do anything? I also think x.to('cpu', copy=True).requires_grad_ captures the intent more clearly.

Contributor Author
x.to('cpu', copy=True).requires_grad_() is returning None for some reason.

Contributor Author
Or rather, calling .grad on the returned Tensor gives None.

}
#endif
auto output_and_indices = at::max_pool2d_with_indices(
if (self.requires_grad() || self.device() != at::kCPU) {
Collaborator
What's up with the gradient check here? Maybe another TODO?

Contributor Author
If we require grad, then we need to compute indices for the backward pass.


y = pool(x)
ref_y = pool(ref_x)
pool = torch.nn.MaxPool2d(kernel_size=ks, return_indices=True)
Collaborator
Would you elaborate on this change? In particular:

  • In the original, return_indices was not set and it defaulted to False.
  • Doesn't your change only affect the return_indices = False codepath?
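To make the return_indices distinction concrete, here is a hypothetical pure-Python sketch of the extra bookkeeping the with_indices path pays for, with argmax positions flattened into the H*W input plane (in the style of MaxPool2d(return_indices=True); the function name and simplified parameters are illustrative, not PyTorch API):

```python
import math

def max_pool2d_with_indices_sketch(x, kernel_size, stride):
    """Single-channel max pooling that also returns argmax positions,
    flattened into the H*W input plane -- the extra bookkeeping that a
    return_indices=False fast path gets to skip entirely."""
    h, w = len(x), len(x[0])
    oh = (h - kernel_size) // stride + 1
    ow = (w - kernel_size) // stride + 1
    vals = [[-math.inf] * ow for _ in range(oh)]
    idxs = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for ki in range(kernel_size):
                for kj in range(kernel_size):
                    ii, jj = i * stride + ki, j * stride + kj
                    if x[ii][jj] > vals[i][j]:
                        vals[i][j] = x[ii][jj]
                        idxs[i][j] = ii * w + jj  # flat index into the plane
    return vals, idxs
```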

@glaringlee glaringlee changed the title from max_pool2d without indices optimization [CPU] to [WIP] max_pool2d without indices optimization [CPU] Aug 25, 2020
Contributor

@glaringlee glaringlee left a comment
@heitorschueroff
I put this back to [WIP] since you will add the with_indices part to this PR as well. Feel free to remove [WIP] once you are ready.

@mruberry
Copy link
Collaborator

Would you post your benchmark script? cc @ngimel for perf, too. Maybe a couple more sizes as a sanity check?

  • inception_v3: batch x 64 x 147 x 147
  • googlenet: batch x 64 x 112 x 112

Cross these with params for inception v3 (kernel size 3, stride 2), googlenet (kernel size 3, stride 2, ceil mode True), and ResNet (kernel size 3, stride, padding 1).

Are there tests that the other options to maxpool2d are working correctly? Like padding, ceil mode, stride, and dilation?
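The actual benchmark script was not posted in this thread. A minimal timing harness in the spirit of what is being asked might look like the following pure-Python sketch; the placeholder workloads stand in for calls to the real ops (e.g. torch.max_pool2d vs. the return_indices path on the shapes listed above):

```python
import timeit

def bench(fn, *args, repeats=5, number=1):
    """Return the best wall-clock time (seconds) over `repeats` runs."""
    return min(timeit.repeat(lambda: fn(*args), repeat=repeats, number=number))

# Placeholder 1D workloads standing in for the pooling variants under test;
# a real script would time the torch ops on e.g. batch x 64 x 147 x 147 inputs.
def pool_no_indices(x):
    # window maxima only (kernel 3, stride 2)
    return [max(x[i:i + 3]) for i in range(0, len(x) - 2, 2)]

def pool_with_indices(x):
    # also tracks the argmax position of each window
    return [max(range(i, i + 3), key=x.__getitem__) for i in range(0, len(x) - 2, 2)]

x = list(range(10_000))
t_fast = bench(pool_no_indices, x)
t_slow = bench(pool_with_indices, x)
print(f"no-indices: {t_fast:.6f}s  with-indices: {t_slow:.6f}s  ratio: {t_slow / t_fast:.2f}x")
```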

This PR implements a version of max_pool2d that doesn't compute indices when it's not needed. It also makes some optimizations that will be carried over to other pooling functions in future PRs.

## Benchmarking:

#### Tensor Parameters
BATCH = 10
CHANNEL = 16
HEIGHT = 2048
WIDTH = 2048
DTYPE = torch.float32
DEVICE = "cpu"

#### Pooling Parameters
KERNEL_SIZE = 2
STRIDE = None
PADDING = 0
DILATION = 1
CEIL_MODE = False

#### Results (time in ms) (speedup factor)
test_max_pool2d: 118.4793 (1.0)
test_mkldnn_max_pool2d: 360.2836 (3.04)
test_max_pool2d_with_indices: 626.9831 (5.29)

## Discussion

The new implementation is on average 2~3 times faster than mkldnn and 5x faster than with_indices. The original with_indices code only parallelized over batches and channels, so when the batch and channel counts were small it could not achieve optimal parallelism.

This algorithm also reduces duplicate comparisons in the case of overlapping kernel windows. For instance, if we change the pooling parameters above to:

KERNEL_SIZE = 4
STRIDE = 1
PADDING = 1
DILATION = 2
CEIL_MODE = True

#### Results (time in ms) (speedup factor)
test_max_pool2d: 136.4228 (1.0)
test_mkldnn_max_pool2d: 608.4158 (4.46)
test_max_pool2d_with_indices: 1,230.1916 (9.02)

There is also an issue with the existing pooling implementations: they use nested at::parallel_for loops, but only the outermost loop is actually parallelized because at::parallel_for does not support nesting.

Differential Revision: [D23273406](https://our.internmc.facebook.com/intern/diff/D23273406)

closes #28733

[ghstack-poisoned]
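The nested-parallelism point above can be illustrated outside of ATen: rather than nesting parallel loops, collapse the loop nest into one flat index space and parallelize over that. A hypothetical Python sketch, where threads merely stand in for at::parallel_for's worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

def flat_parallel_apply(work, batch, channels, rows, max_workers=4):
    """Parallelize over a single flattened index space instead of nesting
    parallel loops: each task decodes its (b, c, r) from one flat index,
    so all available parallelism is exposed to a single worker pool."""
    def task(flat):
        b, rem = divmod(flat, channels * rows)
        c, r = divmod(rem, rows)
        return work(b, c, r)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, matching the nested-loop order.
        return list(pool.map(task, range(batch * channels * rows)))

# Example: the "work" just records which slice it was given.
result = flat_parallel_apply(lambda b, c, r: (b, c, r), batch=2, channels=3, rows=4)
```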
heitorschueroff added a commit that referenced this pull request Aug 26, 2020
ghstack-source-id: cdfe7fd
Pull Request resolved: #43267
@heitorschueroff heitorschueroff marked this pull request as draft August 26, 2020 21:16
Collaborator

@xwang233 xwang233 left a comment
Besides your changes to test_max_pool2d, there is also another test, test_max_pool2d_indices. Would you mind combining the two tests, rather than turning the current test_max_pool2d into another duplicate "test_another_max_pool2d_indices"? Thanks!

pytorch/test/test_nn.py

Lines 9868 to 9889 in 42a5360

@onlyCUDA
def test_max_pool2d_indices(self, device):
    def helper(n, c, h, w, ks):
        if n is None:
            x = torch.randn(c, h, w, device='cuda', dtype=torch.float, requires_grad=True)
        else:
            x = torch.randn(n, c, h, w, device='cuda', dtype=torch.float, requires_grad=True)
        ref_x = x.detach().clone().cpu().requires_grad_()
        pool = torch.nn.MaxPool2d(kernel_size=ks, return_indices=True)
        y, idx = pool(x)
        ref_y, ref_idx = pool(ref_x)
        y.sum().backward()
        ref_y.sum().backward()
        self.assertEqual(y, ref_y)
        self.assertEqual(idx, ref_idx)  # assertEqual implicitly compares shape for tensors
        self.assertEqual(x.grad, ref_x.grad)

heitorschueroff added a commit that referenced this pull request Aug 27, 2020
This is part of a larger effort to refactor and optimize the pooling code. Previously I started working on MaxPool2d in #43267, but since it uses MaxPool1d as a subroutine, it made more sense to work on 1D first, get it tested and optimized, and then move up to 2D and then 3D.

TODO: I'll add some bigger tests and some early benchmarking code and results here.

[ghstack-poisoned]
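As background on why 1D comes first: for overlapping windows, a 1D max pool can avoid re-comparing shared elements by keeping a monotonically decreasing deque of candidate indices, yielding each window maximum in amortized O(1) per element. This is a sketch of that standard technique, not necessarily the exact algorithm used in the PR:

```python
from collections import deque

def max_pool1d_deque(x, kernel_size, stride=1):
    """1D max pooling that avoids duplicate comparisons across overlapping
    windows via a monotonically decreasing deque of indices."""
    out, dq = [], deque()  # x[dq[0]] is always the current window maximum
    for i, v in enumerate(x):
        while dq and x[dq[-1]] <= v:   # drop elements dominated by v
            dq.pop()
        dq.append(i)
        start = i - kernel_size + 1    # leftmost index of the current window
        if dq[0] < start:              # evict the index that left the window
            dq.popleft()
        if start >= 0 and start % stride == 0:
            out.append(x[dq[0]])
    return out
```

Each element is pushed and popped at most once, so the cost is independent of how much adjacent windows overlap.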
heitorschueroff added a commit that referenced this pull request Aug 28, 2020
heitorschueroff added a commit that referenced this pull request Aug 28, 2020
heitorschueroff added a commit that referenced this pull request Aug 29, 2020
heitorschueroff added a commit that referenced this pull request Aug 31, 2020
This is part of a larger effort to refactor and optimize the pooling code. Previously I started working on MaxPool2d in #43267, but since it uses MaxPool1d as a subroutine, it made more sense to work on 1D first, get it tested and optimized, and then move up to 2D and then 3D.

Below are some benchmarking results; the Python script I used is below the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark, make sure you have pytest-benchmark installed (`pip install pytest-benchmark`), then run: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest


def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)


@pytest.mark.benchmark(group="inception")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)


@pytest.mark.benchmark(group="googlenet")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)


@pytest.mark.benchmark(group="large batch size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)


@pytest.mark.benchmark(group="large channel size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)


@pytest.mark.benchmark(group="large width")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)


@pytest.mark.benchmark(group="multithreading")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```
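
The parameter tuples in the test ids map to `(kernel_size, stride, padding, dilation, ceil_mode)`. As a sanity check on what those cases do to the width-147 inception inputs, here is a small sketch of the MaxPool1d output-length formula (the formula is from the PyTorch docs; the helper name `pooled_length` is mine):

```python
import math

def pooled_length(l_in, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False):
    """Output length of MaxPool1d; stride defaults to kernel_size."""
    stride = stride or kernel_size
    numer = l_in + 2 * padding - dilation * (kernel_size - 1) - 1
    rnd = math.ceil if ceil_mode else math.floor
    l_out = rnd(numer / stride) + 1
    # With ceil_mode, the last window must still start inside the input
    # (or its left padding), never entirely in the right padding.
    if ceil_mode and (l_out - 1) * stride >= l_in + padding:
        l_out -= 1
    return l_out

# The three inception cases above, on width 147:
print(pooled_length(147, 3, 2, 0, 1, False))  # 73
print(pooled_length(147, 3, 2, 0, 1, True))   # 73
print(pooled_length(147, 3, 2, 1, 1, False))  # 74
```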

## Discussion

The new algorithm is on average 7x faster than the old one. Moreover, because the old algorithm parallelized poorly and made inefficient use of the cache, certain input parameters (such as a very large batch size) widen the gap well beyond that average: the large-batch benchmark above shows a roughly 40x speedup.
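
For reference, this pure-Python sketch (my own illustration, not the ATen code) shows what MaxPool1d computes for a single 1-D channel, including the per-comparison index bookkeeping that `return_indices=True` forces and that the new fast path skips:

```python
import math

def max_pool1d_ref(row, kernel_size, stride=None, padding=0, dilation=1,
                   return_indices=False):
    # Reference for one 1-D channel; padded positions act as -inf,
    # matching MaxPool1d semantics.
    stride = stride or kernel_size
    l_in = len(row)
    l_out = (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
    out, idx = [], []
    for i in range(l_out):
        start = i * stride - padding
        best_val, best_j = -math.inf, -1
        for k in range(kernel_size):
            j = start + k * dilation
            if 0 <= j < l_in and row[j] > best_val:
                # Tracking best_j is the extra work needed only
                # when return_indices=True.
                best_val, best_j = row[j], j
        out.append(best_val)
        idx.append(best_j)
    return (out, idx) if return_indices else out

print(max_pool1d_ref([1, 3, 2, 5, 4], 2))                       # [3, 5]
print(max_pool1d_ref([1, 3, 2, 5, 4], 2, return_indices=True))  # ([3, 5], [1, 3])
```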

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Sep 1, 2020
Summary:
Pull Request resolved: #43745

This is part of a larger effort to refactor and optimize the pooling code. I previously started on MaxPool2d in #43267, but since it uses MaxPool1d as a subroutine, it made more sense to optimize and test the 1D case first and then move up to 2D and 3D.

Below are some benchmarking results; the Python script I used follows the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark, make sure you have pytest-benchmark installed (`pip install pytest-benchmark`), then run: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest

def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)

@pytest.mark.benchmark(group="inception")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)

@pytest.mark.benchmark(group="googlenet")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)

@pytest.mark.benchmark(group="large batch size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)

@pytest.mark.benchmark(group="large channel size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)

@pytest.mark.benchmark(group="large width")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)

@pytest.mark.benchmark(group="multithreading")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```

## Discussion

The new algorithm is on average 7x faster than the old one. Moreover, because the old algorithm parallelized poorly and made inefficient use of the cache, certain input parameters (such as a very large batch size) widen the gap well beyond that average: the large-batch benchmark above shows a roughly 40x speedup.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23425348

Pulled By: heitorschueroff

fbshipit-source-id: 3fa3f9b8e71200da48424a95510124a83f50d7b2
@facebook-github-bot facebook-github-bot deleted the gh/heitorschueroff/5/head branch October 11, 2020 14:18
Successfully merging this pull request may close these issues.

max_pool2d always compute indices even when it's not required
