
Conversation

@skrah (Contributor) commented Jun 11, 2019

Try to fix a sporadic failure on some CIs.

I've run this test hundreds of times on my machine (GeForce 1060, MAGMA) but I cannot reproduce this.

@skrah added and then removed the module: ci (Related to continuous integration) label Jun 11, 2019
@skrah (Contributor, Author) commented Jun 11, 2019

Naturally this particular CI run is green. :)

But the cause is unrelated to the diff; after thousands of test runs I have now managed to
reproduce the failure once. Still trying to isolate it.

@ezyang (Contributor) commented Jun 11, 2019

@kostmo can we get a CI scan for the nuclear_norm failure?

@skrah (Contributor, Author) commented Jun 11, 2019

The latest status is unfortunately "Heisenbug": if I repeat the tests often enough (1000+ runs), print out the input values on failure, and then run the test manually with those inputs, the results agree and the failure does not reproduce.
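
Roughly, such a repetition loop might look like the following sketch (not the actual test harness; the shape, axes, and tolerances are copied from the example in the next comment):

import numpy as np
import torch

# Sketch of a "repeat until it fails, then dump the inputs" loop; not the
# actual test code. Shape, axes, and tolerances mirror the example below.
axes = (1, 3)
for i in range(1000):
    r = torch.randn(2, 2, 1, 2, dtype=torch.float32, device='cuda:0')
    x = r[:, :, :, ::2]  # non-contiguous view, as in the failing case
    ans = torch.norm(x, "nuc", dim=axes)
    expected = np.linalg.norm(x.cpu().numpy(), "nuc", axis=axes)
    if not np.allclose(ans.cpu(), expected, rtol=1e-02, atol=1e-03, equal_nan=True):
        print("i: %d\nr: %r\nans: %r\nexpected: %r" % (i, r, ans, expected))
        break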

@skrah (Contributor, Author) commented Jun 11, 2019

This is one example (test_torch.py is modified for printing):

test_torch.py:1075: in _test_nuclear_norm_axes
    check_single_nuclear_norm(rr, x, axes)
test_torch.py:1001: in check_single_nuclear_norm
    self.assertTrue(np.allclose(ans.cpu(), expected, rtol=1e-02, atol=1e-03, equal_nan=True), msg=msg)
E   AssertionError: False is not true : r: tensor([[[[-0.47391583903785255982,  0.99425932417625995097]],
E   
E            [[-0.16560788865645595380, -0.94544471859987488926]]],
E   
E   
E           [[[ 0.86797840599953324237, -0.07809049467944455258]],
E   
E            [[-1.15369110156186338578, -0.83403423341840843275]]]],
E          device='cuda:0')
E   x: tensor([[[[-0.47391583903785255982]],
E   
E            [[-0.16560788865645595380]]],
E   
E   
E           [[[ 0.86797840599953324237]],
E   
E            [[-1.15369110156186338578]]]], device='cuda:0')
E   axes: (1, 3)
E   ans: tensor([[0.50201812245794563694],
E           [0.52862857287650466542]], device='cuda:0')
E   expected: array([[0.50201812],
E          [1.44374155]])
E   strides: (16, 8, 8, 8)

Equivalent to:

import torch
import numpy as np


axes = (1, 3)

# Values dumped by the failing test run above.
r = torch.tensor([[[[-0.47391583903785255982,  0.99425932417625995097]],
                   [[-0.16560788865645595380, -0.94544471859987488926]]],
                  [[[ 0.86797840599953324237, -0.07809049467944455258]],
                   [[-1.15369110156186338578, -0.83403423341840843275]]]],
                 dtype=torch.float32, device='cuda:0')

# Non-contiguous view: every other element of the last dimension.
x = r[:, :, :, ::2]

# NumPy view of the same non-contiguous data on the CPU.
a = np.array(x.cpu(), copy=False)

expected = np.linalg.norm(a, "nuc", axis=axes)

ans = torch.norm(x, "nuc", dim=axes)

print("%r\n%r\n" % (ans, expected))

Which gives matching results (unlike the failure above, where the second entry of ans was 0.5286 rather than 1.4437):

tensor([[0.5020],
        [1.4437]], device='cuda:0')
array([[0.50201815],
       [1.4437416 ]], dtype=float32)

@ezyang (Contributor) commented Jun 12, 2019

Can you run the kernel with cuda-memcheck? Usually, if it doesn't reproduce when you rerun the inputs, it's because you're accessing uninitialized memory of some sort or another. A recent similar bug we fixed was #21392

@pytorchbot added the module: cuda (Related to torch.cuda, and CUDA support in general) label Jun 13, 2019
@skrah (Contributor, Author) commented Jun 13, 2019

@ezyang I tried cuda-memcheck, but it always showed 0 errors, even when the tests fail. It did seem to accelerate the test failures though (no repetitions needed).

So the situation is:

  1. In the unmodified master, cuda-memcheck accelerates the test failure but shows 0 errors.

  2. When extracting all tests from unittest into a separate python script, the tests always pass.

  3. When moving the tests from test_cuda to test_torch, the tests pass, also with cuda-memcheck.

So 3) is what the latest diff implements. It is not quite satisfying, but it may help suppress the flakiness until the cause is found.

@skrah (Contributor, Author) commented Jun 13, 2019

I haven't tried --tool initcheck yet, though. Still, there seems to be some interaction with other tests.

Inline review comment (on this part of the diff):

_TestTorchMixin._test_nuclear_norm_axes(self, device='cuda')

@unittest.skipIf(not TEST_MAGMA, "no MAGMA library detected")
def test_nuclear_norm_exceptions(self):

@vishwakftw (Contributor) commented:

Could you try adding the @skipCUDANonDefaultStreamIf(True) decorator here and check whether the tests pass? I think this might be an issue.

@vishwakftw (Contributor) commented:

I'm just guessing here based on your comments about the tests passing after being moved to a different script.
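
For reference, a minimal sketch of what that suggestion could look like on the affected CUDA test (the import locations, class name, and test name here are assumptions for illustration, not the actual diff):

import unittest

# Assumed import locations for illustration; at the time these helpers
# lived under test/ (common_utils.py / common_cuda.py).
from common_utils import skipCUDANonDefaultStreamIf
from common_cuda import TEST_MAGMA
from test_torch import _TestTorchMixin


class TestCuda(unittest.TestCase):
    # Skip this test when the suite forces a non-default CUDA stream,
    # per the review suggestion above.
    @skipCUDANonDefaultStreamIf(True)
    @unittest.skipIf(not TEST_MAGMA, "no MAGMA library detected")
    def test_nuclear_norm_axes(self):
        _TestTorchMixin._test_nuclear_norm_axes(self, device='cuda')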

@skrah (Contributor, Author) commented Jun 14, 2019

@vishwakftw Thanks, I think @skipCUDANonDefaultStreamIf(True) solves the problem. At least I can no longer observe the earlier behavior where cuda-memcheck reliably triggered the issue instantly.

cuda-memcheck --tool initcheck still shows uninitialized read accesses in magma, but that also occurs in other tests (svd for instance) and may be just a harmless magma issue.

@skrah changed the title from "[WIP] Fix flaky nuclear_norm() test." to "Fix flaky nuclear_norm() test" Jun 14, 2019
@skrah (Contributor, Author) commented Jun 14, 2019

@ezyang I think the latest diff (suggestion by @vishwakftw) resolves #21785.

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 7108218.

@skrah deleted the nuclear_norm_flaky_test branch June 14, 2019 23:03