
Conversation

@lantiga (Contributor) commented Mar 14, 2018

Note: this is a re-submission of #5698, which had an issue with CI.

This PR addresses #5648. In particular, following the discussion there, it:

  • adds Catch as a submodule (https://github.com/catchorg/Catch2) in torch/aten/utils
  • ports all ATen tests to Catch
  • ports torch/csrc/jit/test_jit.cpp to Catch (libtorch only; the Python build is unaffected); a sketch of the Catch test style follows this list
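
For illustration, this is roughly what a test looks like in Catch's single-header style (a made-up example, not one of the actual ported tests):

#define CATCH_CONFIG_MAIN  // let Catch generate main() for the test binary
#include "catch.hpp"

#include <vector>

TEST_CASE("vectors grow and shrink", "[example]") {
  std::vector<int> v{1, 2, 3};

  SECTION("push_back adds an element") {
    v.push_back(4);
    REQUIRE(v.size() == 4);
  }

  SECTION("pop_back removes one") {  // each SECTION re-runs the setup above
    v.pop_back();
    REQUIRE(v.size() == 2);
  }
}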

@ezyang (Contributor) commented Mar 15, 2018

@pytorchbot retest this please

@ezyang (Contributor) commented Mar 15, 2018

It looks like the CI hook-up for libtorch is still not working. I strongly advise downloading the Docker image and testing off of that.
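
(For reference, reproducing a CI job locally looks roughly like the following; the registry host and image tag are placeholders, not the actual CI image coordinates:)

docker pull registry.example.com/pytorch-linux-xenial-py3-clang5-asan:latest
docker run -it --rm -v "$PWD":/workspace -w /workspace \
  registry.example.com/pytorch-linux-xenial-py3-clang5-asan:latest bash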

@lantiga (Contributor, Author) commented Mar 15, 2018

Yes, I’m doing that now. Sorry for the trial and error.

@ezyang (Contributor) commented Mar 15, 2018

I don't mind :) The advice is mostly for your sanity ;)

@lantiga (Contributor, Author) commented Mar 16, 2018

OK, it looks like the failures are genuine now ("Can only index tensors with integral scalars (got CPUDoubleType)"). The linking issues seem to be gone.
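
(For context on that message: index tensors must have an integral dtype, so a floating-point index has to be cast first. A minimal sketch, written against today's at:: factory functions rather than the exact 2018 API:)

#include <ATen/ATen.h>

int main() {
  at::Tensor t = at::rand({5});
  at::Tensor bad = at::arange(0, 3, at::kFloat);  // float indices are rejected
  at::Tensor idx = bad.to(at::kLong);             // cast to an integral dtype first
  at::Tensor picked = t.index_select(0, idx);     // OK: index tensor is integral
  return 0;
}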

@lantiga (Contributor, Author) commented Mar 16, 2018

I get a failure in a Python multiprocessing test on pr/pytorch-linux-xenial-cuda9-cudnn7-py2, independent of this PR. I'll rerun the tests.

On the other hand, I need a suggestion for the failure in pr/pytorch-linux-xenial-py3-clang5-asan.

Building CXX object contrib/meter/CMakeFiles/test-meter.dir/test/basic.cc.o
09:01:03 src/ATen/test/CMakeFiles/scalar_test.dir/build.make:62: recipe for target 'src/ATen/test/CMakeFiles/scalar_test.dir/scalar_test.cpp.o' failed
09:01:03 g++-5: internal compiler error: Killed (program cc1plus)

It looks like the compiler crashes when ASAN is enabled. Does this ring a bell, @ezyang?

@ezyang (Contributor) commented Mar 16, 2018

Yes, I know exactly what the ASAN problem is. Look at the bottom of build.sh here:

if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
  export ASAN_OPTIONS=detect_leaks=0:symbolize=1
  # Disable Valgrind tests in run_aten_tests.sh; otherwise
  # we'll be valgrind'ing an ASAN'ed binary!  ASANity.
  export VALGRIND=0

  sudo apt-get update
  sudo apt-get install clang-5.0

  export PATH="/usr/lib/llvm-5.0/bin:$PATH"

  # TODO: Figure out how to avoid hard-coding these paths
  LD_LIBRARY_PATH=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux \
    CC="sccache clang" \
    CXX="sccache clang++" \
    LDSHARED="clang --shared" \
    LDFLAGS="-stdlib=libstdc++" \
    CFLAGS="-fsanitize=address -shared-libasan" \
    NO_CUDA=1 \
    python setup.py install

  export LD_PRELOAD=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux/libclang_rt.asan-x86_64.so

else
  python setup.py install

fi

To ensure that the build smoke tests properly load the ASAN runtime, we are messing around with LD_PRELOAD. But your code is building with gcc, not clang. That's not right, and linking in clang's ASAN runtime is definitely bad juju.

I think the right thing to do is to make sure libtorch gets built with clang and with ASAN. That means figuring out how to apply the flags from the invocation above to the cmake invocation, and also doing the build BEFORE we export the LD_PRELOAD. Another way to make things safer would be to make the LD_PRELOAD opt-in per command and then manually reapply it where needed (I think it was necessary in all of the cases, but I'm not sure).
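
(A rough, untested sketch of what that could look like, with the build happening before the export; the source path is a placeholder:)

mkdir -p build && cd build
CC="sccache clang" \
CXX="sccache clang++" \
CFLAGS="-fsanitize=address -shared-libasan" \
CXXFLAGS="-fsanitize=address -shared-libasan" \
LDFLAGS="-stdlib=libstdc++" \
  cmake /path/to/libtorch-source
make -j"$(nproc)"

# only after the ASAN'ed binaries are built:
export LD_PRELOAD=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux/libclang_rt.asan-x86_64.so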

@ezyang (Contributor) commented Mar 16, 2018

Re the other test, it does seem like

09:30:37 ======================================================================
09:30:37 FAIL: test_backwards_fork (__main__.TestMultiprocessing)
09:30:37 backwards() should succeed when called before and after a fork
09:30:37 ----------------------------------------------------------------------
09:30:37 Traceback (most recent call last):
09:30:37   File "test_multiprocessing.py", line 436, in test_backwards_fork
09:30:37     self.assertFalse(p.is_alive())
09:30:37 AssertionError: True is not false
09:30:37 

has been flaky.

@ezyang (Contributor) commented Mar 16, 2018

@pytorchbot retest this please

@lantiga (Contributor, Author) commented Mar 16, 2018

Thank you @ezyang! What I don't get is why the error shows up on a build that doesn't involve libtorch, since libtorch is only built on pr/pytorch-linux-xenial-cuda9-cudnn7-py3 and pr/pytorch-linux-trusty-py3.6-gcc7.2. Could it be a leftover CMakeCache.txt or the like?
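
(If a stale cache were the culprit, wiping it forces a full reconfigure; the build directory name here is illustrative:)

# force a clean reconfigure by removing the cached configuration
rm -f build/CMakeCache.txt
rm -rf build/CMakeFiles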

@lantiga force-pushed the catch_tests branch 3 times, most recently from 2d5a66f to d985c9a on March 17, 2018
@lantiga (Contributor, Author) commented Mar 17, 2018

Thanks for the hint @ezyang, I think I finally figured it out.

The crash was coming from gcc during the tools/run_aten_tests.sh step, on the ASAN build that does not involve libtorch. Note that the ATen tests (which are built for all builds except *cuda*) are always built with gcc, even in ASAN builds.

Since the ASAN build exports LD_PRELOAD after the main build, LD_PRELOAD is also visible from within run_aten_tests.sh.
It has always been that way, even before this PR, but apparently the fact that we now use Catch in the ATen tests makes gcc crash. Why it wasn't crashing before, I'm not sure.

I now unset LD_PRELOAD in a subshell (so the change has no effect on downstream commands) right before executing run_aten_tests.sh:

( unset LD_PRELOAD; time tools/run_aten_tests.sh )

This way the build succeeds and tests pass.
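
(The subshell is what keeps the unset local; a quick illustration using the preload path from build.sh:)

export LD_PRELOAD=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux/libclang_rt.asan-x86_64.so
( unset LD_PRELOAD; echo "inside:  ${LD_PRELOAD:-<unset>}" )  # prints "inside:  <unset>"
echo "outside: ${LD_PRELOAD:-<unset>}"                        # the parent shell still has it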

@lantiga (Contributor, Author) commented Mar 17, 2018

The current failures are unrelated; I tried re-testing, but it didn't help. I'll try again later.

@ezyang (Contributor) commented Mar 18, 2018

It's not been a good weekend for CI.

@pytorchbot retest this please

@ezyang (Contributor) commented Mar 18, 2018

@pytorchbot retest this please

@ezyang merged commit 6f80023 into pytorch:master on Mar 19, 2018
@lantiga (Contributor, Author) commented Mar 19, 2018

Thanks @ezyang. Just a note: ATen should also get Catch2 as a submodule.
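
(For the standalone ATen repo, that would be along these lines; the destination path is an assumption mirroring this PR's layout:)

git submodule add https://github.com/catchorg/Catch2 aten/utils/catch
git submodule update --init --recursive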

ezyang added a commit that referenced this pull request Mar 19, 2018
soumith added a commit that referenced this pull request Mar 19, 2018
soumith added a commit that referenced this pull request Mar 19, 2018
…simple") moving average" (#5892)

* Revert "Port ATen and JIT C++ tests to Catch2 (#5788)"

This reverts commit 6f80023.

* Revert "Fix error message for cat-ing zero-dim tensors (#5819)"

This reverts commit cf2e176.

* Revert "Softmax symbolic should account for negative dim (#5846)"

This reverts commit ba64724.

* Revert "[fft][1 of 3] build system and helpers to support cuFFT and MKL (#5855)"

This reverts commit 22ef8e5.

* Revert "Don't modify requires_grad when running DataParallel in no_grad mode (#5880)"

This reverts commit d11b7fb.

* Revert "fix some methods not showing up in doc (#5882)"

This reverts commit 24fca0e.

* Revert "ReduceOps cleanup and set_num_threads (#5723)"

This reverts commit 84400d5.

* Revert "introduce shape_as_tensor and reshape_from_variable_shape (#5824)"

This reverts commit f446b82.

* Revert "Enable resetting of batchnorm running moments and cumulative ("simple") moving average (#5766)"

This reverts commit 99b1f6c.
jekbradbury pushed a commit to jekbradbury/pytorch that referenced this pull request Mar 21, 2018
jekbradbury pushed a commit to jekbradbury/pytorch that referenced this pull request Mar 21, 2018
…simple") moving average" (pytorch#5892)

* Revert "Port ATen and JIT C++ tests to Catch2 (pytorch#5788)"

This reverts commit 6f80023.

* Revert "Fix error message for cat-ing zero-dim tensors (pytorch#5819)"

This reverts commit cf2e176.

* Revert "Softmax symbolic should account for negative dim (pytorch#5846)"

This reverts commit ba64724.

* Revert "[fft][1 of 3] build system and helpers to support cuFFT and MKL (pytorch#5855)"

This reverts commit 22ef8e5.

* Revert "Don't modify requires_grad when running DataParallel in no_grad mode (pytorch#5880)"

This reverts commit d11b7fb.

* Revert "fix some methods not showing up in doc (pytorch#5882)"

This reverts commit 24fca0e.

* Revert "ReduceOps cleanup and set_num_threads (pytorch#5723)"

This reverts commit 84400d5.

* Revert "introduce shape_as_tensor and reshape_from_variable_shape (pytorch#5824)"

This reverts commit f446b82.

* Revert "Enable resetting of batchnorm running moments and cumulative ("simple") moving average (pytorch#5766)"

This reverts commit 99b1f6c.
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020