Integrate NVIDIA cuSolver backend into ATen/Linalg (initial implementation for eig/eigval)#166715
johannesz-codes wants to merge 28 commits into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/166715. Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV; if your PR is affected, please view it below.
⏳ No failures, 1 pending as of commit b3f50e0 with merge base 4316df8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Didn't find following labels among repository labels: release notes: performance improvement
@pytorchbot label "release notes: cuda" "module: linear algebra"
```cpp
Tensor& linalg_eigvals_out(const Tensor& input, Tensor& values) {
  squareCheckInputs(input, "linalg.eigvals");
  TORCH_CHECK(input.isfinite().all().item<bool>(), "torch.linalg.eigvals: input tensor should not contain infs or NaNs.");
```
Remove as it synchronizes. This is already documented for all ops in https://docs.pytorch.org/docs/stable/notes/numerical_accuracy.html#non-finite-values
Should I remove this from linalg_eig_out as well?
Doing so causes test failures in test_linalg.py.
Maybe this could be addressed in a separate PR?
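To illustrate why the reviewer flags this check: any host-side branch on device-resident data, like `TORCH_CHECK(input.isfinite().all().item<bool>(), ...)`, blocks the host until the GPU catches up. A toy stand-in (pure Python, not torch; `DeviceScalar` is hypothetical) models that synchronization point:

```python
# Toy model of a device-resident scalar: .item() stands in for
# Tensor.item(), which must synchronize with the GPU before the host
# can see the value. Class and names are illustrative only.
class DeviceScalar:
    def __init__(self, value, log):
        self.value = value
        self.log = log  # records implicit synchronization points

    def item(self):
        self.log.append("sync")  # host blocks until the device catches up
        return self.value

log = []
all_finite = DeviceScalar(True, log)
if all_finite.item():   # mirrors the TORCH_CHECK(... .item<bool>()) pattern
    pass                # validation passed
assert log == ["sync"]  # the check itself forced a host-device sync
```

This is why dropping the check matters on CUDA: it turns an otherwise asynchronous op into a blocking one.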
```cpp
auto eigenvalues_cpu = eigenvalues.device().is_cpu() ? eigenvalues : eigenvalues.cpu();
auto eigenvectors_cpu = eigenvectors.device().is_cpu() ? eigenvectors : eigenvectors.cpu();
auto infos_cpu = infos.device().is_cpu() ? infos : infos.cpu();
```
again, just call `.cpu()` unconditionally (it is a no-op for tensors already on the CPU)
Should I fix the failing lintrunner test? The affected line wasn’t touched in this PR.
lezcano left a comment:
LGTM. cc @IvanYashchuk in case he wants to have a quick look. If he doesn't answer by Monday, let's merge it.
Thank you for the contribution! I think this is literally the first OSS contribution to linalg in 2 years :P
regarding the test, if you rebase on
CC @nikitaved |
This perf improvement is definitely user-facing and should have patch notes
```cpp
jobvl,
jobvr,
n,
CUDA_R_64F, // Datentyp der Matrix
```
Non-English comment ("Datentyp der Matrix" is German for "data type of the matrix")? Comments should be in English for consistency.
Left over from testing. Will remove tomorrow.
```cpp
#include <utility>
#include <vector>

#include "jit/tensorexpr/bounds_overlap.h"
```
What do we need this one for?
Don't know how that slipped in there. Removed it; compiles and tests fine.
Seems there was another stray import in the same commit. Probably auto-import kicked in.
```cpp
// move tensors to CPU if they are not already there for post-processing
auto vectors_cpu = vectors.cpu();
auto values_cpu = values.cpu();
auto maybe_complex_vectors_cpu = maybe_complex_vectors.cpu();

vectors = linalg_eig_make_complex_eigenvectors(vectors_cpu, values_cpu, maybe_complex_vectors_cpu);
```
nit: let's change the comment to something like "we move to the CPU because linalg_eig_make_complex_eigenvectors requires that. Performance note: this function can be implemented via a TensorIterator -- and then we can avoid explicit host-device sync, IIUC."
By the way, do we need this in the cuSolver path at all?
As far as I understand, yes. CuSolver seems to output the eigenvectors in the same format used by MAGMA/CPU. The NVIDIA docs mention the same column-pair structure for real datatypes, though the wording is a bit unclear: https://docs.nvidia.com/cuda/cusolver/index.html
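For reference, that column-pair layout can be sketched in pure Python (hypothetical helper, not the ATen code, which operates on Tensors): a real eigenvalue's eigenvector occupies one real column as-is, while a complex-conjugate pair of eigenvalues shares two consecutive columns holding the common real and imaginary parts:

```python
# Sketch of the unpacking that linalg_eig_make_complex_eigenvectors
# performs for real dtypes (illustrative stand-in, not the real code).
# values: eigenvalues as Python complex numbers.
# real_cols: eigenvector columns as lists of floats, geev-packed.
def make_complex_eigenvectors(values, real_cols):
    n = len(values)
    out = [None] * n
    j = 0
    while j < n:
        if values[j].imag == 0.0:
            # real eigenvalue: column j is the eigenvector as-is
            out[j] = [complex(x, 0.0) for x in real_cols[j]]
            j += 1
        else:
            # conjugate pair: columns j and j+1 hold Re and Im parts,
            # giving v_j = re + i*im and v_{j+1} = re - i*im
            re, im = real_cols[j], real_cols[j + 1]
            out[j] = [complex(a, b) for a, b in zip(re, im)]
            out[j + 1] = [complex(a, -b) for a, b in zip(re, im)]
            j += 2
    return out
```

If cuSolver indeed emits the same packing as LAPACK/MAGMA, the existing unpacking helper applies unchanged, which matches the observation above.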
```cpp
vectors = linalg_eig_make_complex_eigenvectors(vectors_cpu, values_cpu, maybe_complex_vectors_cpu);
```

```cpp
vectors.copy_(vectors_cpu.to(vectors.device()));
```
The suggested change

```diff
-    vectors.copy_(vectors_cpu.to(vectors.device()));
+    vectors.copy_(vectors_cpu);
```

is sufficient, since `copy_` already performs the device transfer.
Let's do `vectors_cpu = linalg_eig_make_complex_eigenvectors(vectors_cpu, ...)`. Otherwise, having `vectors = vectors_cpu` (because make_complex_eigenvectors is in-place) looks confusing (I am not sure how operator= is implemented for Tensor objects, and how it is aligned with the clone operation that follows). Do we have tests for this path?
Should be tested in test_eig_compare_backends_cuda_floatXX in test_linalg.py; that is also how I test it. The suggested change tests fine for me.
Ok, but the test ignores the new dispatch strategy. We have to be sure that MAGMA is being dispatched to and tested.
True. Should this just be done via the backends.cuda.preferred_linalg_library() option in Python?
Yes, and we can just parametrize the test with backend names.
Makes sense. I currently loop over backends like in test_svd, but parametrizing the test would be the more elegant approach.
Let's at least make a follow-up PR with the test, unless you are able to sneak it in here before the merge happens :)
:D I don't want to upset the merge... will do. I will make a PR with what I have and start converting all the backend-dependent tests to be properly parametrized in a follow-up.
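The parametrized version could be shaped roughly like this (pure-Python stand-ins so the sketch is self-contained; in test_linalg.py the stand-ins would be torch.linalg.eig calls wrapped in torch.backends.cuda.preferred_linalg_library, and the backend list is an assumption):

```python
# Hypothetical stand-ins for the real test pieces: eig_reference models
# the CPU result, eig_with_backend the CUDA result under one preferred
# linalg backend.
def eig_reference(matrix):
    return sorted(matrix)  # placeholder for the CPU eigendecomposition

def eig_with_backend(matrix, backend):
    return sorted(matrix)  # placeholder for the CUDA result

def check_all_backends(matrix, backends=("cusolver", "magma")):
    # mirrors parametrizing the test over backend names: every backend
    # is compared against the same CPU reference, so each one is
    # guaranteed to be dispatched to and checked
    ref = eig_reference(matrix)
    return [b for b in backends if eig_with_backend(matrix, b) != ref]

assert check_all_backends([3.0, 1.0, 2.0]) == []  # all backends agree
```

Compared with a hand-written loop, each backend then shows up as its own named sub-test, so a MAGMA-only failure is reported as such.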
```cpp
Tensor vectors;
if (!vectors.defined()) {
  vectors = at::empty({0}, input.options());
}
```
Tensor vectors; already makes it undefined, right? So the if path looks redundant. Or am I missing something?
I think this condition was always true. Removed the if, compiles and tests fine.
Force-pushed from 1253540 to b3f50e0
albanD left a comment:
Awesome. Thanks for the quick update.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Integrate NVIDIA cuSolver backend into ATen/Linalg (initial implementation for eig/eigval) (#166715)

### Summary
Adds support for NVIDIA’s cuSolver backend to torch.linalg.eig and torch.linalg.eigvals within the ATen/Linalg framework.

### Motivation
Extending PyTorch’s Linalg backends with NVIDIA’s cuSolver enables faster execution of torch.linalg.eig and torch.linalg.eigvals, complementing the existing MAGMA and CPU implementations. The speedup observed on consumer hardware (RTX 4070 / Ryzen 5700X) is on the order of **2x**, with preliminary testing on HPC hardware (H100, EPYC 9454) suggesting **up to 10x**.

### Details
- Implements cuSolver support for linalg_eig and linalg_eigvals using the interface described in the [NVIDIA cuSolver documentation](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxgeev), introduced in CUDA 12.8 ([CUDA 12.8 release notes](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html)).
- Follows the existing MAGMA backend design, adapting it for cuSolver’s cusolverDnXgeev API.
- Integrates with the existing eig/eigvals dispatch mechanism.
- No automatic CPU↔GPU backend switching. (Happy to discuss.)
- Verified via existing Linalg test coverage; no new tests introduced in this PR.
- Tested successfully against test_linalg.py, including the slow test suites.
- Tested the MAGMA fallback successfully using CUDA 12.4 (observed unrelated test failures).

### Impact
- Enables much faster solving of eigenvalue problems.
- Maintains numerical consistency and test stability across backends.
- No change to public API or user-facing behavior.

Special thanks to @albanD for prior feedback and discussions regarding the PR and @lezcano for feedback on the related testing PR https://github.com/pytorch/pytorch/pull/166322.

Happy to discuss the backend dispatch strategy; results from performance and stability testing can be seen at https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/7

Pull Request resolved: #166715
Approved by: https://github.com/lezcano, https://github.com/albanD
cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @lezcano @albanD