
Conversation

@skrah (Contributor) commented Jun 11, 2019

Try to fix a sporadic failure on some CIs.

I've run this test hundreds of times on my machine (GeForce 1060, MAGMA) but I cannot reproduce this.

@skrah added and then removed the module: ci (Related to continuous integration) label Jun 11, 2019
@skrah (Contributor, Author) commented Jun 11, 2019

Naturally this particular CI run is green. :)

But the cause is unrelated to the diff; after thousands of test runs I have now managed to
reproduce the failure once. Still trying to isolate it.

@ezyang (Contributor) commented Jun 11, 2019

@kostmo can we get a CI scan for the nuclear_norm failure?

@skrah (Contributor, Author) commented Jun 11, 2019

The latest status is unfortunately "Heisenbug": if I repeat the tests often enough (1000+ runs), print out the input values on failure, and then run the test manually with those inputs, the results agree and the failure does not reproduce.
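
Roughly, such a repetition loop might look like the following sketch (not the actual test harness; the shape, axes, and tolerances are copied from the example in the next comment):

import numpy as np
import torch

# Sketch of a "repeat until it fails, then dump the inputs" loop; not the
# actual test code. Shape, axes, and tolerances mirror the example below.
axes = (1, 3)
for i in range(1000):
    r = torch.randn(2, 2, 1, 2, dtype=torch.float32, device='cuda:0')
    x = r[:, :, :, ::2]  # non-contiguous view, as in the failing case
    ans = torch.norm(x, "nuc", dim=axes)
    expected = np.linalg.norm(x.cpu().numpy(), "nuc", axis=axes)
    if not np.allclose(ans.cpu(), expected, rtol=1e-02, atol=1e-03, equal_nan=True):
        print("i: %d\nr: %r\nans: %r\nexpected: %r" % (i, r, ans, expected))
        break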

@skrah (Contributor, Author) commented Jun 11, 2019

This is one example (test_torch.py is modified for printing):

test_torch.py:1075: in _test_nuclear_norm_axes
    check_single_nuclear_norm(rr, x, axes)
test_torch.py:1001: in check_single_nuclear_norm
    self.assertTrue(np.allclose(ans.cpu(), expected, rtol=1e-02, atol=1e-03, equal_nan=True), msg=msg)
E   AssertionError: False is not true : r: tensor([[[[-0.47391583903785255982,  0.99425932417625995097]],
E   
E            [[-0.16560788865645595380, -0.94544471859987488926]]],
E   
E   
E           [[[ 0.86797840599953324237, -0.07809049467944455258]],
E   
E            [[-1.15369110156186338578, -0.83403423341840843275]]]],
E          device='cuda:0')
E   x: tensor([[[[-0.47391583903785255982]],
E   
E            [[-0.16560788865645595380]]],
E   
E   
E           [[[ 0.86797840599953324237]],
E   
E            [[-1.15369110156186338578]]]], device='cuda:0')
E   axes: (1, 3)
E   ans: tensor([[0.50201812245794563694],
E           [0.52862857287650466542]], device='cuda:0')
E   expected: array([[0.50201812],
E          [1.44374155]])
E   strides: (16, 8, 8, 8)

Equivalent to:

import torch
import numpy as np


axes = (1, 3)

# Values dumped by the failing test run above.
r = torch.tensor([[[[-0.47391583903785255982,  0.99425932417625995097]],
                   [[-0.16560788865645595380, -0.94544471859987488926]]],
                  [[[ 0.86797840599953324237, -0.07809049467944455258]],
                   [[-1.15369110156186338578, -0.83403423341840843275]]]],
                 dtype=torch.float32, device='cuda:0')

# Non-contiguous view: every other element of the last dimension.
x = r[:, :, :, ::2]

# NumPy view of the same non-contiguous data on the CPU.
a = np.array(x.cpu(), copy=False)

expected = np.linalg.norm(a, "nuc", axis=axes)

ans = torch.norm(x, "nuc", dim=axes)

print("%r\n%r\n" % (ans, expected))

Which gives matching results (unlike the failure above, where the second entry of ans was 0.5286 rather than 1.4437):

tensor([[0.5020],
        [1.4437]], device='cuda:0')
array([[0.50201815],
       [1.4437416 ]], dtype=float32)

@ezyang (Contributor) commented Jun 12, 2019

Can you run the kernel with cuda-memcheck? Usually, if it doesn't reproduce when you rerun the inputs, it's because you're accessing uninitialized memory of some sort or another. A recent similar bug we fixed was #21392

@pytorchbot added the module: cuda (Related to torch.cuda, and CUDA support in general) label Jun 13, 2019
@skrah (Contributor, Author) commented Jun 13, 2019

@ezyang I tried cuda-memcheck, but it always showed 0 errors, even when the tests fail. It did seem to accelerate the test failures though (no repetitions needed).

So the situation is:

  1. In the unmodified master, cuda-memcheck accelerates the test failure but shows 0 errors.

  2. When extracting all tests from unittest into a separate python script, the tests always pass.

  3. When moving the tests from test_cuda to test_torch, the tests pass, also with cuda-memcheck.

So 3) is what the latest diff implements. It is not quite satisfying, but it may help suppress the flakiness until the cause is found.

@skrah (Contributor, Author) commented Jun 13, 2019

I haven't tried --tool initcheck yet, though. Still, there seems to be some interaction with other tests.

Inline review comment (on this part of the diff):

_TestTorchMixin._test_nuclear_norm_axes(self, device='cuda')

@unittest.skipIf(not TEST_MAGMA, "no MAGMA library detected")
def test_nuclear_norm_exceptions(self):

@vishwakftw (Contributor) commented:

Could you try adding the @skipCUDANonDefaultStreamIf(True) decorator here and check whether the tests pass? I think this might be an issue.

@vishwakftw (Contributor) commented:

I'm just guessing here based on your comments about the tests passing after being moved to a different script.
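
For reference, a minimal sketch of what that suggestion could look like on the affected CUDA test (the import locations, class name, and test name here are assumptions for illustration, not the actual diff):

import unittest

# Assumed import locations for illustration; at the time these helpers
# lived under test/ (common_utils.py / common_cuda.py).
from common_utils import skipCUDANonDefaultStreamIf
from common_cuda import TEST_MAGMA
from test_torch import _TestTorchMixin


class TestCuda(unittest.TestCase):
    # Skip this test when the suite forces a non-default CUDA stream,
    # per the review suggestion above.
    @skipCUDANonDefaultStreamIf(True)
    @unittest.skipIf(not TEST_MAGMA, "no MAGMA library detected")
    def test_nuclear_norm_axes(self):
        _TestTorchMixin._test_nuclear_norm_axes(self, device='cuda')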

@skrah (Contributor, Author) commented Jun 14, 2019

@vishwakftw Thanks, I think @skipCUDANonDefaultStreamIf(True) solves the problem. At least I can no longer observe the earlier behavior where cuda-memcheck reliably triggered the issue instantly.

cuda-memcheck --tool initcheck still shows uninitialized read accesses in magma, but that also occurs in other tests (svd for instance) and may be just a harmless magma issue.

@skrah changed the title from "[WIP] Fix flaky nuclear_norm() test." to "Fix flaky nuclear_norm() test" Jun 14, 2019
@skrah (Contributor, Author) commented Jun 14, 2019

@ezyang I think the latest diff (suggestion by @vishwakftw) resolves #21785.

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 7108218.

@skrah deleted the nuclear_norm_flaky_test branch June 14, 2019 23:03