
Fix static launcher 16-bit float scalars #184065

Draft

jansel wants to merge 1 commit into gh/jansel/727/base from gh/jansel/727/head

Conversation


jansel (Contributor) commented May 17, 2026

Stack from ghstack (oldest at bottom):

Encode fp16 and bf16 scalar arguments with distinct launcher ABI codes so static and fast CUDA launchers pack them as 16-bit values instead of 32-bit floats. Add regression coverage for fp16 and bf16 scalar launches.
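
A quick illustration of the size mismatch this fixes, using Python's struct module rather than the launcher code itself: packing a scalar with the float32 code writes 4 bytes where a 16-bit kernel parameter expects 2.

```python
import struct

v = 1.5
# IEEE 754 binary16 ('e'): the 2 bytes an fp16 kernel argument expects.
print(struct.pack("<e", v).hex(), len(struct.pack("<e", v)))  # 003e 2
# float32 ('f'): the 4 bytes the old shared 'f' ABI code produced.
print(struct.pack("<f", v).hex(), len(struct.pack("<f", v)))  # 0000c03f 4
```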

Fixes #158521

Generated by my agent

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

[ghstack-poisoned]

pytorch-bot Bot commented May 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/184065

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 495da93 with merge base 75ab8d6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot Bot commented May 17, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


jansel (Contributor, Author) commented May 17, 2026

@claude review these changes


claude Bot commented May 17, 2026

Claude finished @jansel's task in 2m 5s


Code Review: Fix static launcher 16-bit float scalars

  • Gather context and read changed files
  • Analyze the implementation for correctness
  • Check test coverage
  • Provide review feedback

LGTM — this is a clean, well-scoped fix. The implementation correctly adds distinct ABI codes for fp16 ('e') and bf16 ('y') so the static and fast CUDA launchers pack them as 2-byte values instead of incorrectly treating them as 32-bit floats.

Details

Python mapping (static_triton_launcher.py:165-166): Previously both fp16 and bf16 mapped to 'f' (float32), which caused the C++ side to write a 4-byte float into the kernel argument slot instead of the expected 2-byte half-precision value. Now they map to 'e' and 'y' respectively.
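
For illustration, the corrected mapping looks roughly like this (a sketch with hypothetical names; the actual table and keys in static_triton_launcher.py may differ, and 'd' as double is inferred from the "Oefd" assertion in the tests):

```python
# Hypothetical names -- a sketch of the per-dtype ABI codes, not the
# actual dictionary in static_triton_launcher.py.
SCALAR_ABI_CODES = {
    "fp32": "f",  # 4-byte float
    "fp64": "d",  # 8-byte double (inferred from the "Oefd" arg_tys)
    "fp16": "e",  # 2-byte IEEE half (previously collapsed to 'f')
    "bf16": "y",  # 2-byte bfloat16 (previously collapsed to 'f')
}
```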

C++ parsing (cuda.cpp and xpu.cpp): The new case 'e' / case 'y' branches use convertType<c10::Half> and convertType<c10::BFloat16> with THPUtils_unpackDouble as the Python converter. The flow is:

  1. THPUtils_unpackDouble extracts a double from the Python object
  2. static_cast<c10::Half>(double_val) narrows the double to the 2-byte c10::Half (or c10::BFloat16) representation
  3. The 2-byte result is written to the start of the zero-initialized 8-byte uint64_t slot
  4. CUDA's cuLaunchKernel reads exactly 2 bytes from the kernelArgs pointer — correct behavior
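
The same flow can be mimicked in pure Python (a minimal sketch, assuming little-endian layout; struct's 'e' format is IEEE binary16, which matches what convertType<c10::Half> produces):

```python
import struct

def pack_half_arg(py_obj: float) -> bytes:
    # 1. Extract a double from the Python object (THPUtils_unpackDouble analog).
    d = float(py_obj)
    # 2. Narrow to 2-byte IEEE binary16 (static_cast<c10::Half> analog).
    half_bytes = struct.pack("<e", d)
    # 3. Write the 2 bytes at the start of a zero-initialized 8-byte slot.
    slot = bytearray(8)
    slot[:2] = half_bytes
    # 4. cuLaunchKernel would read only the low 2 bytes from this slot.
    return bytes(slot)

print(pack_half_arg(1.5).hex())  # 003e000000000000
```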

Alignment: Each argument slot is a uint64_t (8-byte aligned), and the pointers in kernelArgs point to the start of each slot. CUDA reads the natural size of the kernel parameter (2 bytes for half types), so the zero-padding in the upper 6 bytes is harmless.
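
A small ctypes sketch of that slot layout (illustrative only, not launcher code): the half value occupies the low 2 bytes of a zeroed uint64 slot, and the upper 6 bytes stay zero.

```python
import ctypes
import struct

slot = ctypes.c_uint64(0)  # zero-initialized 8-byte argument slot
ctypes.memmove(ctypes.addressof(slot), struct.pack("<e", 1.5), 2)
raw = slot.value.to_bytes(8, "little")
assert raw[:2] == struct.pack("<e", 1.5)  # the 2 bytes CUDA reads
assert raw[2:] == b"\x00" * 6             # harmless zero-padding
```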

Tests: Good regression coverage for both TestStaticTritonLauncher and TestFastCudaLauncher, testing fp16 and bf16 separately. The bf16 test correctly gates on SM80OrLater. The arg_tys assertions ("Oefd" and "Oyfd") verify the encoding, and the numerical equality checks verify end-to-end correctness.

No issues found.

