[inductor] Reduce cold compilation time caused by duplicated user-defined Triton kernels #168292

Closed
desertfire wants to merge 2 commits into gh/desertfire/611/base from gh/desertfire/611/head

desertfire (Contributor) commented Nov 20, 2025

Stack from ghstack (oldest at bottom):

Summary: Similar to #167132, but that PR didn't consider user-defined Triton kernels. When cudagraphs-partition is enabled in Inductor, different partitions can use the same user-defined Triton kernels. Each user-defined Triton kernel should be defined and compiled only once.
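
For illustration only, the "define and compile once" idea can be sketched as a small cache keyed by the kernel's source text. This is not the code in this PR: `KernelDefinitionCache`, its `define` method, and the kernel names are hypothetical, and the kernel source is treated as a plain string so the sketch runs without Triton installed.

```python
import hashlib


class KernelDefinitionCache:
    """Hypothetical helper: emit each distinct kernel source only once."""

    def __init__(self):
        self._name_by_hash = {}  # source hash -> name of the first definition
        self.definitions = []    # (name, source) pairs that were actually emitted

    def define(self, name, source):
        # Identical source text means the same kernel, whatever partition asks for it.
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        existing = self._name_by_hash.get(key)
        if existing is not None:
            return existing  # reuse the earlier definition, skip re-emitting it
        self._name_by_hash[key] = name
        self.definitions.append((name, source))
        return name


cache = KernelDefinitionCache()
kernel_src = "@triton.jit\ndef add_kernel(x_ptr, y_ptr, out_ptr, n): ...\n"

# Two partitions reference the same user-defined kernel (names are made up).
name_a = cache.define("add_kernel_0", kernel_src)
name_b = cache.define("add_kernel_1", kernel_src)
```

Both partitions end up referring to `add_kernel_0`, and only one definition is recorded, so only one compilation would be needed in this simplified model.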

Local measurements show that this PR can reduce the cold compilation time of Qwen/Qwen3-VL-235B-A22B-Instruct from 243.65s to 114.69s.
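
As a quick check on those two numbers (only the timings come from the measurement above; the variable names are illustrative), the change works out to roughly a 2.1x speedup, or about a 53% reduction:

```python
before_s = 243.65  # cold compilation time without the change, in seconds
after_s = 114.69   # cold compilation time with the change, in seconds

speedup = before_s / after_s
reduction_pct = (before_s - after_s) / before_s * 100

print(f"speedup: {speedup:.2f}x, reduction: {reduction_pct:.1f}%")
```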

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos

desertfire added a commit that referenced this pull request Nov 20, 2025
…ined Triton kernels

ghstack-source-id: 5556193
Pull Request resolved: #168292
desertfire added a commit that referenced this pull request Nov 21, 2025
…ined Triton kernels

ghstack-source-id: 2ffe739
Pull Request resolved: #168292
desertfire (Contributor, Author) commented

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Nov 21, 2025
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
…ined Triton kernels (#168292)

Pull Request resolved: #168292
Approved by: https://github.com/eellison
ghstack dependencies: #168281
github-actions bot deleted the gh/desertfire/611/head branch December 22, 2025 02:20