Add check-sparse-tensor-invariants flag to Context. #90849
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90849
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 87ee618.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pinging @cpuhrsch @amjames @nikitaved for awareness.
Thank you, @pearu, this is amazing!
With the addition of this flag, could we just drop the safe/unsafe dichotomy altogether? A single API could provide the ability to be either safe or fast.
Yes, I was also thinking about this. There exist two contexts where sparse tensors are constructed:

1. in user code, via the user-facing constructor functions, and
2. inside PyTorch itself, where sparse tensors are created as intermediate or final results.

In context 1, the invariant checks should always be enabled to catch incorrect user inputs without crashing the Python process. In context 2, the invariant checks should be enabled when executing PyTorch tests; otherwise, the checks can be disabled for efficiency, trusting the PyTorch test coverage to guarantee the correct usage of constructors without invariant checks. It is not always possible to determine whether execution happens in context 1 or 2, especially on the Python side. The user-facing constructor functions (…)

UPDATE: However, users could be provided an option of disabling invariant checks in user-facing constructors. The consequences of enabling this option would be on the user.
I think the same philosophy could be applied to both contexts, though. We want to apply the checks in testing to make sure we have written the code that invokes the constructors correctly. Users would want the same thing: enable checks while testing or debugging, but ultimately have them off wherever possible in production.
Yes, makes sense. The above implies that the flag should have three states:

Btw, one of my points above was that even with this flag we cannot eliminate the unsafe constructors.
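For illustration, a minimal sketch of one possible reading of those three states (my assumption: a per-call setting of True, False, or unset, with unset falling back to a global default; the actual design may differ):

```python
from typing import Optional

# Hypothetical module-level default; in the real implementation this state
# would live in the C++ Context rather than in a Python variable.
_check_invariants_by_default = False

def should_check_invariants(check_invariants: Optional[bool] = None) -> bool:
    """Resolve the effective setting from a tri-state per-call argument."""
    if check_invariants is None:
        # state 3: unspecified -> fall back to the global default
        return _check_invariants_by_default
    # state 1 (True): force the checks; state 2 (False): skip them
    return check_invariants
```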
This would be a bit more fine-grained. In this scenario, the unsafe constructors would no longer need to be exported, correct?
The answer depends on the chosen design. Say there is a PyTorch Python module that would use an unsafe sparse tensor constructor. We have two options: (i) use the current state where unsafe constructors are exported, or (ii) the module uses the following idiom (pseudo-code follows):

```python
# save the current state
e = torch._C._check_sparse_tensor_invariants()
# override the unsafe state (unless there is a flag that forces the check)
torch._C._set_check_sparse_tensor_invariants(False)
# perform computations involving unsafe tensor constructors
...
# restore the previous state
torch._C._set_check_sparse_tensor_invariants(e)
```

Atm, there are only two places in torch/_utils.py where the unsafe constructor is used. So, starting to use the above idiom would be reasonable, and the answer to your question would be affirmative.
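For what it's worth, the idiom above could be wrapped in a small helper so that the previous state is restored even when an exception is raised. A minimal sketch, assuming only the two private functions from the pseudo-code above (the helper name `_no_invariant_checks` is hypothetical):

```python
import contextlib

import torch

@contextlib.contextmanager
def _no_invariant_checks():
    # save the current state of the flag
    prev = torch._C._check_sparse_tensor_invariants()
    # override the unsafe state for the duration of the block
    torch._C._set_check_sparse_tensor_invariants(False)
    try:
        yield
    finally:
        # restore the previous state
        torch._C._set_check_sparse_tensor_invariants(prev)
```

A module such as torch/_utils.py could then perform its constructions inside `with _no_invariant_checks(): ...` instead of importing the unsafe constructors.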
This PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariants checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The importance of this feature is that when enabling the invariants checks by default, say, via

<details>

```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c854305..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:
 
 # Populate magic methods on SymInt and SymFloat
 import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```

</details>

a massive number of test failures/errors occur in test_sparse_csr.py tests:

```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```

which means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:

```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.

RuntimeError: expected col_indices to be a strided and contiguous tensor

RuntimeError: expected row_indices to be a strided and contiguous tensor

RuntimeError: expected values to be a strided and contiguous tensor

RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered

RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```

The PR also fixes #90833

[ghstack-poisoned]
This PR revealed a bug in CSR->BSR conversion: #90910
This PR revealed another bug in CSR->CSC conversion when using int32 indices: #91007
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / macos-12-py3-arm64-mps / Run MPS tests. Details for Dev Infra team: raised by workflow job.
This PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariants checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:

- `torch.enable_check_sparse_tensor_invariants` and `torch.is_check_sparse_tensor_invariants_enabled` functions to globally enable/disable the invariant checks and to retrieve the state of the feature, respectively
- `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.

The PR also fixes #90833

# Main issue

*The following content is outdated after merging the PRs in this ghstack but kept for the record.*

The importance of this feature is that when enabling the invariants checks by default, say, via

<details>

```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c854305..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:
 
 # Populate magic methods on SymInt and SymFloat
 import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```

</details>

a massive number of test failures/errors occur in test_sparse_csr.py tests:

```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```

which means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:

```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.

RuntimeError: expected col_indices to be a strided and contiguous tensor

RuntimeError: expected row_indices to be a strided and contiguous tensor

RuntimeError: expected values to be a strided and contiguous tensor

RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered

RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```

cc alexsamardzic nikitaved cpuhrsch amjames bhosmer

[ghstack-poisoned]
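A short usage sketch of the UI described above (my own example, not from the PR; the exact error text may differ, and these functions were later reworked in the replacement PR):

```python
import torch

# Globally enable the invariant checks and query the state back.
torch.enable_check_sparse_tensor_invariants()
assert torch.is_check_sparse_tensor_invariants_enabled()

# Per-call override: the col_indices of row 0 are not sorted, so with
# check_invariants=True the constructor is expected to raise a RuntimeError.
try:
    torch.sparse_csr_tensor(
        torch.tensor([0, 2, 4]),             # crow_indices
        torch.tensor([1, 0, 0, 1]),          # col_indices (unsorted within row 0)
        torch.tensor([1.0, 2.0, 3.0, 4.0]),  # values
        size=(2, 2),
        check_invariants=True,
    )
except RuntimeError as e:
    print(e)
```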
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Break internal build" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
@pearu your PR has been successfully reverted.
This reverts commit b9a035c. Reverted #90849 on behalf of https://github.com/DanilBaibak due to Break internal build
@pearu the error is: … This almost seems unrelated? Maybe we just need to rebase and try again.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here.
Rebase failed due to: … Raised by https://github.com/pytorch/pytorch/actions/runs/3904695373
This PR is a copy of #90849 whose merge was reverted. The PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariants checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:

- the `torch.sparse.check_sparse_tensor_invariants` class provides different ways to enable/disable the invariant checking
- the `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.

The PR fixes #90833

Pull Request resolved: #92094
Approved by: https://github.com/cpuhrsch
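For reference, a rough usage sketch of the reworked UI (based on the description above; I assume the class exposes `enable`/`disable`/`is_enabled` and can be used as a context manager, which may differ in detail):

```python
import torch

# Global switches on the class:
torch.sparse.check_sparse_tensor_invariants.enable()
print(torch.sparse.check_sparse_tensor_invariants.is_enabled())  # True
torch.sparse.check_sparse_tensor_invariants.disable()

# Scoped enabling via the context-manager form; the previous state is
# restored when the block exits.
with torch.sparse.check_sparse_tensor_invariants():
    t = torch.sparse_csr_tensor(
        torch.tensor([0, 2, 4]),             # crow_indices
        torch.tensor([0, 1, 0, 1]),          # col_indices, sorted within each row
        torch.tensor([1.0, 2.0, 3.0, 4.0]),  # values
        size=(2, 2),
    )
```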
Closing as its replacement #92094 has landed.