
Continue to build nightly CUDA 12.9 for internal #163029

Closed

huydhn wants to merge 3 commits into pytorch:main from huydhn:continue-build-cu129-for-vllm

Conversation

@huydhn
Contributor

@huydhn huydhn commented Sep 16, 2025

Revert part of #161916 to continue building CUDA 12.9 nightly

cc @albanD

@huydhn huydhn requested review from atalman and malfet September 16, 2025 00:57
@huydhn huydhn requested a review from a team as a code owner September 16, 2025 00:57
@huydhn huydhn added the ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR) and test-config/default labels Sep 16, 2025
@pytorch-bot

pytorch-bot bot commented Sep 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163029

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6295362 with merge base 12d7cc5:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@malfet malfet left a comment


Please mention an issue that sets a deadline for when to revert it, but sure, why not

@huydhn
Contributor Author

huydhn commented Sep 17, 2025

Please mention an issue that sets a deadline for when to revert it, but sure, why not

I also doubt my sanity in doing this, so let's get this one ready but not land it unless we really need it. Also, a note here: pytorch/test-infra#7074 needs to be reverted too to build domains on 12.9

@atalman
Contributor

atalman commented Sep 22, 2025

@huydhn please provide some context on this. Supporting 4 CUDA versions across 3 platforms is quite expensive. Can we build only a specific Python version and only Linux?

@huydhn
Contributor Author

huydhn commented Sep 22, 2025

@huydhn please provide some context on this. Supporting 4 CUDA versions across 3 platforms is quite expensive. Can we build only a specific Python version and only Linux?

I'm keeping this around in case people ask for it internally (post). From the responses so far, I don't think there is enough incentive to land this yet

@huydhn
Contributor Author

huydhn commented Oct 11, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
@pytorchmergebot
Collaborator

Successfully rebased continue-build-cu129-for-vllm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout continue-build-cu129-for-vllm && git pull --rebase)
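The bot's follow-up instruction is standard practice after a force-push: rebase your local copy onto the rewritten remote branch rather than merging. The sandbox below is a minimal sketch of why `git pull --rebase` is the right sync command; the throwaway repositories, paths, and commit messages are hypothetical, not part of this PR. After the remote branch is rewritten, the contributor's stale local commit is patch-identical to the rewritten one, so the rebase drops it and the local branch ends up exactly at the new remote tip.

```shell
# Hypothetical sandbox: simulate a bot force-pushing a rewritten branch,
# then sync a contributor clone with `git pull --rebase`.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare remote.git
git clone -q remote.git contributor 2>/dev/null
git clone -q remote.git bot 2>/dev/null
for d in contributor bot; do
  git -C "$d" config user.email dev@example.com
  git -C "$d" config user.name Dev
done

# Contributor publishes the PR branch with one commit.
cd contributor
git checkout -q -b continue-build-cu129-for-vllm
echo change > fix.txt
git add fix.txt
git commit -qm "Continue to build CUDA 12.9"
git push -q origin continue-build-cu129-for-vllm

# "Bot" rewrites the branch (an amend stands in for a rebase) and force-pushes.
cd ../bot
git fetch -q origin
git checkout -q continue-build-cu129-for-vllm
git commit -q --amend -m "Continue to build CUDA 12.9 (rebased)"
git push -qf origin continue-build-cu129-for-vllm

# Contributor syncs as the bot suggests: the stale local commit is
# patch-identical to the rewritten one, so the rebase skips it.
cd ../contributor
git pull -q --rebase origin continue-build-cu129-for-vllm
synced_title=$(git log --format=%s -1)
echo "$synced_title"
```

A plain `git pull` in the same situation would instead try to merge the old and rewritten histories, leaving a confusing merge commit on the PR branch.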

@pytorchmergebot pytorchmergebot force-pushed the continue-build-cu129-for-vllm branch from cc72948 to b1f78ae on October 11, 2025 00:14
Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn
Contributor Author

huydhn commented Oct 11, 2025

@pytorchbot drci

@huydhn
Contributor Author

huydhn commented Oct 11, 2025

@pytorchbot merge -f '12.9 build looks ok'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

@huydhn
Contributor Author

huydhn commented Oct 14, 2025

@pytorchbot --help

@pytorch-bot

pytorch-bot bot commented Oct 14, 2025

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditing purposes, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as a last resort; prefer `--ignore-current` to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re-run checks before merging. Accepts viable/strict or main as branch options and defaults to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, which will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributors may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds labels to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO, --into ONTO
                        Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

@huydhn
Contributor Author

huydhn commented Oct 14, 2025

@pytorchbot cherry-pick --onto release/2.9 --fixes 'vLLM CUDA 12.9 build' -c release

pytorchbot pushed a commit that referenced this pull request Oct 14, 2025
Revert part of #161916 to continue building CUDA 12.9 nightly

Pull Request resolved: #163029
Approved by: https://github.com/malfet

(cherry picked from commit 4400c5d)
@pytorchbot
Collaborator

Cherry picking #163029

The cherry-pick PR is at #165466 and is linked with the issue "vLLM CUDA 12.9 build". The following tracker issues are updated:


Camyll pushed a commit that referenced this pull request Oct 16, 2025
* Continue to build nightly CUDA 12.9 for internal (#163029)

Revert part of #161916 to continue building CUDA 12.9 nightly

Pull Request resolved: #163029
Approved by: https://github.com/malfet

(cherry picked from commit 4400c5d)

* Fix lint

Signed-off-by: Huy Do <huydhn@gmail.com>

---------

Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
pytorchmergebot pushed a commit that referenced this pull request Oct 18, 2025
[CD] Apply the fix from #162455 to aarch64+cu129 build (#165794)

When trying to bring cu129 back in #163029, I mainly looked at #163029 and missed another tweak coming from #162455.

I discovered this issue when testing aarch64+cu129 builds in https://github.com/pytorch/test-infra/actions/runs/18603342105/job/53046883322?pr=7373. Surprisingly, there is no test running for the aarch64 CUDA build from what I see in https://hud.pytorch.org/pytorch/pytorch/commit/79a37055e790482c12bf32e69b28c8e473d0209d.
Pull Request resolved: #165794
Approved by: https://github.com/malfet
pytorchbot pushed a commit that referenced this pull request Oct 18, 2025
[CD] Apply the fix from #162455 to aarch64+cu129 build (#165794)

(cherry picked from commit 9095a9d)
huydhn added a commit that referenced this pull request Oct 18, 2025
[CD] Apply the fix from #162455 to aarch64+cu129 build (#165794)

(cherry picked from commit 9095a9d)

Co-authored-by: Huy Do <huydhn@gmail.com>
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Continue to build nightly CUDA 12.9 for internal (pytorch#163029)
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
[CD] Apply the fix from pytorch#162455 to aarch64+cu129 build (pytorch#165794)
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
[CD] Apply the fix from pytorch#162455 to aarch64+cu129 build (pytorch#165794)
@huydhn huydhn deleted the continue-build-cu129-for-vllm branch December 16, 2025 08:03

Labels

ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR), Merged, skip-pr-sanity-checks, test-config/default, topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants