
Conversation


@nWEIdia nWEIdia commented Sep 26, 2025

See also #163972, which was intended to be this PR.

Triton (release/3.5.x) ships the CUDA 12.8 ptxas by default.
This PR bundles a CUDA 13 ptxas so that #163801 is addressed when users run on new devices such as THOR and Spark.

Fixes #163801

Test Plan:

Check the binary size increase against nightly or the v2.9 RC.
On a working THOR and GB200/GH100 machine, reproduce the original issue first (on THOR), then install the binary built from this PR; the issue is expected to be gone without any additional user settings. Testing on GB200 ensures there is no regression.
Reference: #119750 and pytorch/builder@5c814e2

Note: with this PR, torch.compile on the PyTorch side is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the CUDA 13 ptxas binary.
However, as is, the Triton side does not know that this new CUDA 13 ptxas exists. So if a user assumes pytorch/bin/ptxas is sufficient and deletes the ptxas shipped with Triton, then https://github.com/triton-lang/triton/blob/c6ad34f7eb42630533412d93ca2cc00a4b4f8f3c/python/triton/knobs.py#L216 would still complain that ptxas is not found (once its own copy is removed, Triton does not know about the new one).

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @ptrblck @eqy @tinglvv @atalman @malfet

@nWEIdia nWEIdia requested a review from a team as a code owner September 26, 2025 20:03

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163988

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 1a3aea4 with merge base 5880996:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@nWEIdia nWEIdia added release notes: build release notes category ciflow/binaries Trigger all binary build and upload jobs on the PR ciflow/linux-aarch64 linux aarch64 CI workflow and removed ciflow/linux-aarch64 linux aarch64 CI workflow labels Sep 26, 2025

nWEIdia commented Sep 27, 2025

Binary size information: (this PR's artifact, python3.12)
523M Sep 27 02:00 torch-2.10.0.dev20250926+cu128-cp312-cp312-manylinux_2_28_aarch64.whl
ptxas size and location: -rwxr-xr-x 1 root root 31M Sep 27 02:12 /usr/local/lib/python3.12/dist-packages/torch/bin/ptxas

Check functionality: (e.g. on THOR)
unset TRITON_PTXAS_PATH
#clone pytorch
Run: python test/inductor/test_control_flow.py CondTests.test_cond_mismatched_branch_output_size_device_cuda_dynamic_False
Still encounters: "ptxas fatal : Value 'sm_110a' is not defined for option 'gpu-name'"
This part still needs to be fixed.


nWEIdia commented Sep 27, 2025

torch.compile currently expects ptxas at: /usr/local/lib/python3.12/dist-packages/torch/_inductor/bin/ptxas
So either we need to change that expectation, or we need to package the binary as torch/_inductor/bin/ptxas rather than torch/bin/ptxas.

I would just change the expected directory back to /usr/local/lib/python3.12/dist-packages/torch/bin/ptxas to reduce packaging risk.

to /usr/local/lib/python3.12/dist-packages/torch/bin/ptxas

nWEIdia commented Sep 27, 2025

Test Results on THOR with the latest wheels:

gh run download 18053880730 -n manywheel-py3_12-cuda-aarch64-13_0
pip install torch-2.10.0.dev20250927+cu130-cp312-cp312-manylinux_2_28_aarch64.whl --index-url https://download.pytorch.org/whl/nightly/cu130 (--index-url is used to satisfy the pytorch_triton dependency)

root@:/workspace/pytorch# python test/inductor/test_control_flow.py CondTests.test_cond_mismatched_branch_output_size_device_cuda_dynamic_False
inline_call []
stats [('calls_captured', 22), ('unique_graphs', 2)]
inductor [('async_compile_cache_miss', 5), ('extern_calls', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('ok', 1)]
graph_break []
.

Ran 1 test in 2.326s

OK
root@:/workspace/pytorch# echo $TRITON_PTXAS_PATH

root@:/workspace/pytorch# pip list |grep torch
pytorch-triton 3.5.0+gitbbb06c03
torch 2.10.0.dev20250927+cu130


nWEIdia commented Sep 27, 2025

On the other device:
Found GPU0 NVIDIA **** which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)

warnings.warn(
inline_call []
stats [('calls_captured', 22), ('unique_graphs', 2)]
inductor [('async_compile_cache_miss', 5), ('extern_calls', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('ok', 1)]
graph_break []
.

Ran 1 test in 1.441s

OK

@nWEIdia nWEIdia requested review from atalman and malfet September 28, 2025 01:28
@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 29, 2025
@nWEIdia nWEIdia moved this to Hi Priority in PyTorch + CUDA Sep 29, 2025
@nWEIdia nWEIdia self-assigned this Sep 29, 2025

nWEIdia commented Sep 29, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 29, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-project-automation github-project-automation bot moved this from Hi Priority to Done in PyTorch + CUDA Sep 30, 2025

atalman commented Sep 30, 2025

@pytorchbot cherry-pick --onto release/2.9 --fixes "Critical CI fix" -c critical

pytorchbot pushed a commit that referenced this pull request Sep 30, 2025
…63988)


Pull Request resolved: #163988
Approved by: https://github.com/atalman

(cherry picked from commit 3b4ad4a)
@pytorchbot

Cherry picking #163988

The cherry-pick PR is at #164236 and it is linked with the issue "Critical CI fix".

atalman pushed a commit that referenced this pull request Sep 30, 2025
…64236)

[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1   (#163988)


Pull Request resolved: #163988
Approved by: https://github.com/atalman

(cherry picked from commit 3b4ad4a)

Co-authored-by: Wei Wang <weiwan@nvidia.com>


Development

Successfully merging this pull request may close these issues.

[CUDA][Triton][PTXAS] Triton Wheel Missing CUDA13 PTXAS - Breakage exists for the environment where CTK is not present
