Add check-sparse-tensor-invariants flag to Context. #90849
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90849
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 87ee618.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pinging @cpuhrsch @amjames @nikitaved for awareness.
Thank you, @pearu, this is amazing!
With the addition of this flag, could we just drop the safe/unsafe dichotomy altogether? A single API could provide the ability to be either safe or fast.
Yes, I was also thinking about this. There exist two contexts where sparse tensors are constructed:

1. in user code, via the user-facing constructor functions, and
2. inside PyTorch itself, where sparse tensors are created as intermediate or final results.

In context 1, the invariant checks should always be enabled to catch incorrect user inputs without crashing the Python process. In context 2, the invariant checks should be enabled when executing PyTorch tests; otherwise, the checks can be disabled for efficiency, trusting the PyTorch test coverage to guarantee the correct usage of constructors without invariant checks. It is not always possible to determine whether execution happens in context 1 or 2, especially on the Python side. The user-facing constructor functions (…)

UPDATE: However, users could be provided an option of disabling invariant checks in user-facing constructors. The consequences of enabling this option would be on the user.
I think the same philosophy could be applied to both contexts, though. We want to apply the checks in testing to make sure we have written the code that invokes the constructors correctly. Users would want the same thing: enable checks while testing or debugging, but ultimately have them off wherever possible in production.
Yes, makes sense. The above implies that the flag should have three states:

Btw, one of my points above was that even with this flag we cannot eliminate the unsafe constructors.
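For illustration, a minimal sketch of one possible reading of those three states (my assumption: a per-call setting of True, False, or unset, with unset falling back to a global default; the actual design may differ):

```python
from typing import Optional

# Hypothetical module-level default; in the real implementation this state
# would live in the C++ Context rather than in a Python variable.
_check_invariants_by_default = False

def should_check_invariants(check_invariants: Optional[bool] = None) -> bool:
    """Resolve the effective setting from a tri-state per-call argument."""
    if check_invariants is None:
        # state 3: unspecified -> fall back to the global default
        return _check_invariants_by_default
    # state 1 (True): force the checks; state 2 (False): skip them
    return check_invariants
```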
This would be a bit more fine-grained. In this scenario, the unsafe constructors would no longer need to be exported, correct?
The answer depends on the chosen design. Say there is a PyTorch Python module that would use an unsafe sparse tensor constructor. We have two options: (i) use the current state where unsafe constructors are exported, or (ii) the module uses the following idiom (pseudo-code follows):

```python
# save the current state
e = torch._C._check_sparse_tensor_invariants()
# override the unsafe state (unless there is a flag that forces the check)
torch._C._set_check_sparse_tensor_invariants(False)
# perform computations involving unsafe tensor constructors
...
# restore the previous state
torch._C._set_check_sparse_tensor_invariants(e)
```

Atm, there are only two places in torch/_utils.py where the unsafe constructor is used. So, starting to use the above idiom would be reasonable, and the answer to your question would be affirmative.
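For what it's worth, the idiom above could be wrapped in a small helper so that the previous state is restored even when an exception is raised. A minimal sketch, assuming only the two private functions from the pseudo-code above (the helper name `_no_invariant_checks` is hypothetical):

```python
import contextlib

import torch

@contextlib.contextmanager
def _no_invariant_checks():
    # save the current state of the flag
    prev = torch._C._check_sparse_tensor_invariants()
    # override the unsafe state for the duration of the block
    torch._C._set_check_sparse_tensor_invariants(False)
    try:
        yield
    finally:
        # restore the previous state
        torch._C._set_check_sparse_tensor_invariants(prev)
```

A module such as torch/_utils.py could then perform its constructions inside `with _no_invariant_checks(): ...` instead of importing the unsafe constructors.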
This PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariants checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The importance of this feature is that when enabling the invariants checks by default, say, via

<details>

```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c854305..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:
 
 # Populate magic methods on SymInt and SymFloat
 import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```

</details>

a massive number of test failures/errors occur in test_sparse_csr.py tests:

```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```

which means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:

```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.

RuntimeError: expected col_indices to be a strided and contiguous tensor

RuntimeError: expected row_indices to be a strided and contiguous tensor

RuntimeError: expected values to be a strided and contiguous tensor

RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered

RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```

The PR also fixes #90833

[ghstack-poisoned]
This PR revealed a bug in CSR->BSR conversion: #90910
This PR revealed another bug in CSR->CSC conversion when using int32 indices: #91007
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / macos-12-py3-arm64-mps / Run MPS tests. Details for Dev Infra team: raised by workflow job.
This PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariants checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:

- `torch.enable_check_sparse_tensor_invariants` and `torch.is_check_sparse_tensor_invariants_enabled` functions to globally enable/disable the invariant checks and to retrieve the state of the feature, respectively
- `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.

The PR also fixes #90833

# Main issue

*The following content is outdated after merging the PRs in this ghstack but kept for the record.*

The importance of this feature is that when enabling the invariants checks by default, say, via

<details>

```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c854305..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:
 
 # Populate magic methods on SymInt and SymFloat
 import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```

</details>

a massive number of test failures/errors occur in test_sparse_csr.py tests:

```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```

which means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:

```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.

RuntimeError: expected col_indices to be a strided and contiguous tensor

RuntimeError: expected row_indices to be a strided and contiguous tensor

RuntimeError: expected values to be a strided and contiguous tensor

RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered

RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```

cc alexsamardzic nikitaved cpuhrsch amjames bhosmer

[ghstack-poisoned]
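A short usage sketch of the UI described above (my own example, not from the PR; the exact error text may differ, and these functions were later reworked in the replacement PR):

```python
import torch

# Globally enable the invariant checks and query the state back.
torch.enable_check_sparse_tensor_invariants()
assert torch.is_check_sparse_tensor_invariants_enabled()

# Per-call override: the col_indices of row 0 are not sorted, so with
# check_invariants=True the constructor is expected to raise a RuntimeError.
try:
    torch.sparse_csr_tensor(
        torch.tensor([0, 2, 4]),             # crow_indices
        torch.tensor([1, 0, 0, 1]),          # col_indices (unsorted within row 0)
        torch.tensor([1.0, 2.0, 3.0, 4.0]),  # values
        size=(2, 2),
        check_invariants=True,
    )
except RuntimeError as e:
    print(e)
```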
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Break internal build" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
@pearu your PR has been successfully reverted.
This reverts commit b9a035c. Reverted #90849 on behalf of https://github.com/DanilBaibak due to Break internal build
@pearu the error is: … This almost seems unrelated? Maybe we just need to rebase and try again.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here.
Rebase failed due to: … Raised by https://github.com/pytorch/pytorch/actions/runs/3904695373
This PR is a copy of #90849 whose merge was reverted. The PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariants checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:

- the `torch.sparse.check_sparse_tensor_invariants` class provides different ways to enable/disable the invariant checking
- the `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.

The PR fixes #90833

Pull Request resolved: #92094
Approved by: https://github.com/cpuhrsch
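For reference, a rough usage sketch of the reworked UI (based on the description above; I assume the class exposes `enable`/`disable`/`is_enabled` and can be used as a context manager, which may differ in detail):

```python
import torch

# Global switches on the class:
torch.sparse.check_sparse_tensor_invariants.enable()
print(torch.sparse.check_sparse_tensor_invariants.is_enabled())  # True
torch.sparse.check_sparse_tensor_invariants.disable()

# Scoped enabling via the context-manager form; the previous state is
# restored when the block exits.
with torch.sparse.check_sparse_tensor_invariants():
    t = torch.sparse_csr_tensor(
        torch.tensor([0, 2, 4]),             # crow_indices
        torch.tensor([0, 1, 0, 1]),          # col_indices, sorted within each row
        torch.tensor([1.0, 2.0, 3.0, 4.0]),  # values
        size=(2, 2),
    )
```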
Closing as its replacement #92094 has landed.