Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask #156384

lgeiger · 2025-06-19T01:01:03Z

index_put with a boolean mask (target[mask] = src) causes a cudaStreamSynchronize. When both the mask and the target tensor are on the GPU, this is expected.

However, the sync can be prevented if the mask is a CPU tensor.
Internally, a new index tensor is created with mask.nonzero(). Because the user cannot accidentally mutate that tensor between its creation and the device copy, we can transfer it to the GPU with a non-blocking copy. @ngimel Let me know if I'm missing something.

I think this is useful because, unlike with other ops, users can't prevent the sync simply by making sure all tensors are on the same device. Instead, one would need to do something like this, which is much less readable:

indices = mask.nonzero().squeeze(1).to("cuda", non_blocking=True)
target[indices] = src
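
For readers who want to try the workaround, here is a minimal, self-contained sketch of it. The helper name `masked_assign_via_indices` and the sample values are made up for illustration and are not part of the PR; the sketch falls back to the CPU when no GPU is available, in which case it only demonstrates that both forms write the same values.

```python
import torch


def masked_assign_via_indices(target, mask, src):
    """Hypothetical helper: write src into target[mask] via integer indices.

    Assumes mask is a 1-D boolean CPU tensor; target may live on any device.
    """
    # nonzero() runs on the CPU here, so the number of selected elements is
    # known on the host without reading anything back from the GPU.
    indices = mask.nonzero().squeeze(1)
    # non_blocking=True is only a request; as far as I understand, the copy
    # is truly asynchronous only when the source lives in pinned memory.
    indices = indices.to(target.device, non_blocking=True)
    target[indices] = src
    return target


device = "cuda" if torch.cuda.is_available() else "cpu"
mask = torch.tensor([True, False, True, False])  # CPU boolean mask
src = torch.tensor([10.0, 20.0], device=device)

# Reference result: boolean mask on the same device as the target.
expected = torch.zeros(4, device=device)
expected[mask.to(device)] = src

result = masked_assign_via_indices(torch.zeros(4, device=device), mask, src)
print(result)
```

On a CUDA build, wrapping the call with `torch.cuda.set_sync_debug_mode("warn")` should be one way to check whether a given form still triggers a synchronization, though what it reports depends on the PyTorch version in use.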

Fixes #12461

pytorch-bot · 2025-06-19T01:01:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156384

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c172e4e with merge base 6b7767f:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable) (gh) (#153987)
MISSING REGRESSION TEST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2025-06-19T01:01:10Z

The committers listed above are authorized under a signed CLA.

✅ login: lgeiger / name: Lukas Geiger (c172e4e, 9efc333, cc3987f, 97a4fd9)

lgeiger · 2025-06-19T01:10:04Z

@pytorchbot label "release notes: cuda"

lgeiger · 2025-06-24T23:49:24Z

@pytorchmergebot merge

pytorch-bot · 2025-06-24T23:49:28Z

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

lgeiger · 2025-06-25T16:13:18Z

dfdf9bd should fix the build failure on CI

lgeiger · 2025-06-27T10:09:44Z

@ngimel any chance you could trigger CI to verify that this PR is good to go?

cyyever · 2025-06-28T02:17:07Z

@pytorchmergebot merge -r

pytorchmergebot · 2025-06-28T02:18:54Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-06-28T02:18:58Z

Successfully rebased index-put-no-sync onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout index-put-no-sync && git pull --rebase)

pytorchmergebot · 2025-06-28T02:20:16Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchbot added the open source label Jun 19, 2025

pytorch-bot bot added the "release notes: cuda" label Jun 19, 2025

ngimel reviewed Jun 19, 2025

aten/src/ATen/native/IndexingUtils.h (outdated, resolved)

lgeiger requested a review from ngimel June 19, 2025 17:14

mikaylagawarecki added the triaged label Jun 20, 2025

ngimel approved these changes Jun 23, 2025

lgeiger mentioned this pull request Jun 23, 2025

[Models] Remove GPU-CPU sync when do_pan_and_scan=false in Gemma3 vllm-project/vllm#19999

Open

lgeiger requested a review from ngimel June 24, 2025 23:50

pytorch-bot bot added the ciflow/trunk label Jun 28, 2025

lgeiger added 4 commits June 28, 2025 02:18

Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask

97a4fd9

Create nonzero tensor in pinned memory

9efc333

Fix whitespace

cc3987f

Fix build for USE_PER_OPERATOR_HEADERS=OFF

c172e4e

pytorchmergebot force-pushed the index-put-no-sync branch from dfdf9bd to c172e4e Compare June 28, 2025 02:18

pytorch-bot bot removed the ciflow/trunk label Jun 28, 2025

pytorchmergebot added the merging label Jun 28, 2025

pytorchmergebot added the Merged label Jun 28, 2025

pytorchmergebot closed this in a92b24c Jun 28, 2025

pytorchmergebot removed the merging label Jun 28, 2025

lgeiger deleted the index-put-no-sync branch June 28, 2025 08:16

lgeiger mentioned this pull request Oct 21, 2025

Remove index_put from MM embeddings merging vllm-project/vllm#22105

Merged
