
Integrate NVIDIA cuSolver backend into ATen/Linalg (initial implementation for eig/eigval) #166715

Closed

johannesz-codes wants to merge 28 commits into pytorch:main from johannesz-codes:feature/eig-xgeev

Conversation


@johannesz-codes commented Oct 31, 2025

Summary

Adds support for NVIDIA’s cuSolver backend to torch.linalg.eig and torch.linalg.eigvals within the ATen/Linalg framework.

Motivation

Extending PyTorch’s Linalg backends with NVIDIA’s cuSolver enables faster execution of torch.linalg.eig and torch.linalg.eigvals, complementing existing MAGMA and CPU implementations.

The speedup observed on consumer hardware (RTX 4070 / Ryzen 5700X) is on the order of 2x, with preliminary testing on HPC hardware (H100, EPYC 9454) suggesting up to a 10x speedup.

Details

  • Implements cuSolver support for linalg_eig and linalg_eigvals using the cusolverDnXgeev interface described in the [NVIDIA cuSolver documentation](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxgeev), introduced in CUDA 12.8 ([release notes](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html)).
  • Follows the existing MAGMA backend design, adapting it for cuSolver’s cusolverDnXgeev API.
  • Integrates with existing eig/eigvals dispatch mechanism.
  • No automatic CPU↔GPU backend switching (happy to discuss; a backend-selection sketch follows this list).
  • Verified via existing Linalg test coverage; no new tests introduced in this PR.
  • Tested successfully against test_linalg.py, including the slow test suites.
  • Tested the MAGMA fallback successfully using CUDA 12.4 (observed unrelated test failures).
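
For reference, a minimal sketch of exercising the new path from Python; torch.backends.cuda.preferred_linalg_library is the existing global backend switch, and the matrix size and residual check here are illustrative only:

```python
import torch

# Route CUDA linalg calls through cuSolver explicitly
# ("magma" and "default" are the other accepted values).
torch.backends.cuda.preferred_linalg_library("cusolver")

A = torch.randn(512, 512, device="cuda", dtype=torch.float64)

# Real input still yields complex eigenvalues/eigenvectors,
# as documented for torch.linalg.eig.
w, v = torch.linalg.eig(A)

# Sanity check: A @ v should match v scaled column-wise by w,
# up to numerical error.
residual = (A.to(v.dtype) @ v - v * w).abs().max()
print(residual)
```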

Impact

  • Enables significantly faster solving of eigenvalue problems on CUDA devices.
  • Maintains numerical consistency and test stability across backends.
  • No change to public API or user-facing behavior.

Special thanks to @albanD for prior feedback and discussions regarding the PR and @lezcano for feedback on the related testing PR #166322.

Happy to discuss the backend dispatch strategy; results from performance and stability testing can be seen here: https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/7

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @lezcano @albanD


pytorch-bot bot commented Oct 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166715

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⏳ No Failures, 1 Pending

As of commit b3f50e0 with merge base 4316df8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@johannesz-codes (Contributor Author):

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the topic: not user facing label Oct 31, 2025

pytorch-bot bot commented Oct 31, 2025

Didn't find following labels among repository labels: release notes: performance improvement

@johannesz-codes (Contributor Author):

@pytorchbot label "release notes: cuda" "module: linear algebra"

pytorch-bot bot added the module: linear algebra and release notes: cuda labels Oct 31, 2025

Tensor& linalg_eigvals_out(const Tensor& input, Tensor& values) {
squareCheckInputs(input, "linalg.eigvals");
TORCH_CHECK(input.isfinite().all().item<bool>(), "torch.linalg.eigvals: input tensor should not contain infs or NaNs.");
Collaborator:

Remove as it synchronizes. This is already documented for all ops in https://docs.pytorch.org/docs/stable/notes/numerical_accuracy.html#non-finite-values

Contributor Author:

Should I remove this from linalg_eig_out as well? Doing so causes test failures in test_linalg.py.

Maybe this could be addressed in a separate PR?
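
For context, a minimal standalone illustration of why the check synchronizes (the tensor here is illustrative): .item() must copy the reduced scalar from the device to the host, which blocks the host until the queued GPU work finishes.

```python
import torch

x = torch.randn(1024, 1024, device="cuda")

finite = x.isfinite().all()  # reduction runs asynchronously on the GPU
ok = finite.item()           # device->host copy of the scalar result:
                             # the host blocks here until the stream drains
```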

Comment on lines 2100 to 2102
auto eigenvalues_cpu = eigenvalues.device().is_cpu() ? eigenvalues : eigenvalues.cpu();
auto eigenvectors_cpu = eigenvectors.device().is_cpu() ? eigenvectors : eigenvectors.cpu();
auto infos_cpu = infos.device().is_cpu() ? infos : infos.cpu();
Collaborator:

Again, unconditionally call .cpu().
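
A quick illustration of the point via the Python API, where the semantics match ATen's Tensor::cpu(): .cpu() already returns the tensor itself when it lives on the CPU, so the device guard is redundant.

```python
import torch

a = torch.randn(4)       # already a CPU tensor
assert a.cpu() is a      # no-op: the very same tensor comes back

b = torch.randn(4, device="cuda")
assert b.cpu().device.type == "cpu"  # only a genuine transfer copies
```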

@johannesz-codes (Contributor Author):

Should I fix the failing lintrunner test? The affected line wasn’t touched in this PR.

@lezcano (Collaborator) left a comment:

LGTM. cc @IvanYashchuk in case he wants to have a quick look. If he doesn't answer by Monday, let's merge it.

Thank you for the contribution! I think this is literally the first OSS contribution to linalg in 2 years :P


lezcano commented Oct 31, 2025

Regarding the test: if you rebase on viable/strict, CI should be green.

@lezcano lezcano added ciflow/trunk Trigger trunk jobs on your pull request ciflow/slow labels Oct 31, 2025

eqy commented Oct 31, 2025

CC @nikitaved

@Skylion007 added the release notes: rocm and topic: performance labels and removed the topic: not user facing label Oct 31, 2025
@Skylion007 (Collaborator):

This perf improvement is definitely user facing and should have patch notes.

jobvl,
jobvr,
n,
CUDA_R_64F, // Datentyp der Matrix ("data type of the matrix")
Collaborator:

Language typo? Comments should be in English for consistency.

Contributor Author:

Left over from testing. Will remove tomorrow.

@lezcano lezcano removed the release notes: rocm mandatorylabel label Oct 31, 2025
@nikitaved nikitaved added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 31, 2025
#include <utility>
#include <vector>

#include "jit/tensorexpr/bounds_overlap.h"
Collaborator:

What do we need this one for?

Contributor Author:

Don't know how that slipped in there. Removed; compiles and tests fine.

Contributor Author:

Seems there was another stray import in the same commit. Probably auto-import kicked in.

Comment on lines 3009 to 3014
// move tensors to CPU if they are not already there for post-processing
auto vectors_cpu = vectors.cpu();
auto values_cpu = values.cpu();
auto maybe_complex_vectors_cpu = maybe_complex_vectors.cpu();

vectors = linalg_eig_make_complex_eigenvectors(vectors_cpu, values_cpu, maybe_complex_vectors_cpu);
@nikitaved (Collaborator) commented Nov 3, 2025:

nit: let's change the comment to something like "we move to the CPU because linalg_eig_make_complex_eigenvectors requires that". Performance note: this function can be implemented via a TensorIterator -- and then we can avoid the explicit host-device sync, IIUC.

Collaborator:

By the way, do we need this in the cuSolver path at all?

Contributor Author:

As far as I understand, yes. cuSolver seems to output the eigenvectors in the same format used by MAGMA/CPU. The NVIDIA docs mention the same column-pair structure for real datatypes, though the wording is a bit unclear: https://docs.nvidia.com/cuda/cusolver/index.html
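
For readers following along, a minimal sketch of the column-pair packing being discussed, assuming cuSolver follows the LAPACK *geev convention; the function name and dtype handling are illustrative, not the ATen implementation:

```python
import torch

def unpack_real_geev_eigenvectors(w: torch.Tensor, vr: torch.Tensor) -> torch.Tensor:
    """For a real input matrix, *geev stores a complex-conjugate eigenvalue
    pair (w[j], w[j+1] = conj(w[j])) as two real columns:
        v_j     = vr[:, j] + i * vr[:, j+1]
        v_{j+1} = vr[:, j] - i * vr[:, j+1]
    while a real eigenvalue keeps a single, purely real column."""
    cdtype = torch.complex128 if vr.dtype == torch.float64 else torch.complex64
    out = vr.to(cdtype)  # real columns carry over unchanged
    j = 0
    while j < vr.shape[-1]:
        if w[j].imag != 0:  # start of a conjugate pair
            re, im = vr[:, j], vr[:, j + 1]
            out[:, j] = torch.complex(re, im)
            out[:, j + 1] = torch.complex(re, -im)
            j += 2
        else:
            j += 1
    return out
```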


vectors = linalg_eig_make_complex_eigenvectors(vectors_cpu, values_cpu, maybe_complex_vectors_cpu);

vectors.copy_(vectors_cpu.to(vectors.device()));
Collaborator:

Suggested change:

```diff
- vectors.copy_(vectors_cpu.to(vectors.device()));
+ vectors.copy_(vectors_cpu);
```

Is sufficient.
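
Context: Tensor::copy_ performs the cross-device transfer itself, so the explicit .to() round trip is redundant. A minimal illustration via the matching Python API:

```python
import torch

src = torch.randn(3)                 # CPU tensor
dst = torch.empty(3, device="cuda")
dst.copy_(src)                       # copy_ handles host->device itself
assert torch.equal(dst.cpu(), src)
```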

@nikitaved (Collaborator) commented Nov 3, 2025:

Let's do vectors_cpu = linalg_eig_make_complex_eigenvectors(vectors_cpu...). Otherwise, having vectors = vectors_cpu (because linalg_eig_make_complex_eigenvectors is in-place) looks confusing (I am not sure how operator= is implemented for Tensor objects, and how it is aligned with the clone operation that follows). Do we have tests for this path?

Contributor Author:

Should be tested by test_eig_compare_backends_cuda_floatXX in test_linalg.py; that is also how I test it. The suggested change tests fine for me.

Collaborator:

Ok, but the test ignores the new dispatch strategy. We have to be sure that MAGMA is being dispatched to and tested.

Contributor Author:

True. Should this just be done via the backends.cuda.preferred_linalg_library() option in Python?

Collaborator:

Yes, and we can just parametrize the test with backend names.

Contributor Author:

Makes sense. I currently loop over backends like in test_svd, but parametrizing the test would be the more elegant approach.

@nikitaved (Collaborator) commented Nov 3, 2025:

Let's at least make a follow-up PR with the test, unless you are able to sneak it in here before the merge happens :)

Contributor Author:

:D Don't want to upset the merge... will do. I will just make a PR with what I have, and start converting all the tests with different backends to be properly parametrized in a follow-up.
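
For context, a minimal sketch of the parametrization being discussed, using PyTorch's internal test utilities; the test name, matrix size, and tolerances are illustrative and not the actual follow-up PR:

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase, run_tests, parametrize, instantiate_parametrized_tests)

class TestEigBackends(TestCase):
    @parametrize("backend", ["cusolver", "magma", "default"])
    def test_eig_matches_cpu(self, backend):
        # Route torch.linalg.eig on CUDA through the requested backend.
        torch.backends.cuda.preferred_linalg_library(backend)
        a = torch.randn(64, 64, device="cuda", dtype=torch.float64)
        w_gpu, _ = torch.linalg.eig(a)
        w_cpu, _ = torch.linalg.eig(a.cpu())
        # Eigenvalue order is not guaranteed, so compare sorted real parts
        # as a smoke check.
        self.assertEqual(torch.sort(w_gpu.real.cpu()).values,
                         torch.sort(w_cpu.real).values,
                         atol=1e-8, rtol=1e-6)

instantiate_parametrized_tests(TestEigBackends)

if __name__ == "__main__":
    run_tests()
```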

Comment on lines 3017 to 3018


Collaborator:

nit: stray lines

Comment on lines 3155 to 3158
Tensor vectors;
if (!vectors.defined()) {
vectors = at::empty({0}, input.options());
}
Collaborator:

Tensor vectors; already makes it undefined, right? So the if path looks redundant. Or am I missing something?

Contributor Author:

I think this condition was always true. Removed the if; compiles and tests fine.

@albanD (Collaborator) left a comment:

Awesome. Thanks for the quick update.


albanD commented Nov 3, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Nov 3, 2025
@pytorchmergebot:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorch-bot bot pushed a commit that referenced this pull request Nov 4, 2025
…ation for eig/eigval) (#166715)


Pull Request resolved: #166715
Approved by: https://github.com/lezcano, https://github.com/albanD

Labels

  • ciflow/trunk (trigger trunk jobs on your pull request)
  • Merged
  • module: linear algebra
  • open source
  • release notes: cuda
  • topic: performance

Projects

None yet


8 participants