[Qwen] avoid creating attention masks when there is no padding by kashif · Pull Request #12987 · huggingface/diffusers

kashif · 2026-01-16T14:25:25Z

What does this PR do?

This pull request sets all-ones attention masks to None, so that no mask is passed to attention when the inputs contain no padding.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2026-01-16T14:36:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sayakpaul

Thanks!

Could you update with the following things?

Shed light on what caused the speed regression
Add a test with masks in the compilation tests here
Compare the outputs before and after the PR

kashif · 2026-01-16T15:18:21Z

will do!

sayakpaul · 2026-01-19T03:57:08Z

tests/models/transformers/test_models_transformer_qwenimage.py

+        with torch.no_grad():
+            output_no_mask = compiled_model(**inputs_no_mask)


Does it lead to graph breaks? If not, then we should add additional contexts:

diffusers/tests/models/test_modeling_common.py

Line 2145 in 3996788

torch._dynamo.config.patch(error_on_recompile=True),

dxqb · 2026-01-19T21:01:50Z

Does this PR lack a merge, or is the amount of code changes intentional and really part of this PR only? (+318 −170)
If intentional, could you explain what it does and why all these changes are necessary?

Avoiding masks when they're not necessary used to be a one-liner:
attention_mask = attention_mask if not torch.all(text_attention_mask) else None
(on a boolean mask)
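
As a rough illustration of that one-liner, here is a self-contained sketch. The helper name `drop_all_true_mask` and the example shapes are hypothetical and not taken from the PR.

```python
from typing import Optional

import torch


def drop_all_true_mask(mask: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
    """Return None when a boolean mask contains no padding (every entry is True)."""
    if mask is None:
        return None
    # torch.all returns a 0-dim bool tensor; bool() reads its value on the host,
    # which forces a device-to-host sync when the mask lives on a GPU.
    return None if bool(torch.all(mask)) else mask


no_padding = torch.ones(2, 5, dtype=torch.bool)
with_padding = no_padding.clone()
with_padding[1, 3:] = False  # last two tokens of the second prompt are padding

mask_a = drop_all_true_mask(no_padding)    # mask is dropped
mask_b = drop_all_true_mask(with_padding)  # mask is kept unchanged
```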

kashif · 2026-01-19T21:09:20Z

@dxqb I initially miscalculated and was adding support for the pipeline to be compiled; I will revert and simplify.

dxqb · 2026-01-19T21:16:33Z

@dxqb I initially miscalculated and was adding support for the pipeline to be compiled; I will revert and simplify.

thanks!
regarding compile,

Regional compilation is usually as efficient as full compilation (even if full compilation were possible). Regional compilation compiles the transformer blocks, but not the entire pipeline.
Whenever you branch on GPU data, that is a graph break for torch.compile: it is either inefficient or fails outright, depending on fullgraph. Or, in less abstract terms:
you want to set your attention mask to None if the entire attention mask is True. But the attention mask lives on the GPU. Checking whether all values of a GPU tensor are True requires a transfer back to the CPU, which is always a graph break and cannot be compiled (efficiently).

Therefore, I'd suggest to:

check for an all-True mask in the pipeline
pass a None mask to the transformer block in that case, but don't check any GPU data in the transformer block, so the transformer block can still be compiled.
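
The split suggested above could be sketched roughly as follows. `toy_block` and `toy_pipeline_step` are hypothetical names, not the actual QwenImage code: the pipeline-level function does the data-dependent check once, and the block only forwards whatever it receives. The example also compares the outputs with and without an all-True mask.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def toy_block(q, k, v, attn_mask: Optional[torch.Tensor] = None):
    # Never inspects tensor values, so it stays friendly to torch.compile.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)


def toy_pipeline_step(q, k, v, text_mask: torch.Tensor):
    # Data-dependent check done once, outside the block that would be compiled.
    if bool(text_mask.all()):
        attn_mask = None
    else:
        attn_mask = text_mask[:, None, None, :]  # broadcast to (batch, 1, 1, kv_len)
    return toy_block(q, k, v, attn_mask)


torch.manual_seed(0)
q = torch.randn(2, 4, 3, 8)  # (batch, heads, query_len, head_dim)
k = torch.randn(2, 4, 6, 8)
v = torch.randn(2, 4, 6, 8)

all_true = torch.ones(2, 6, dtype=torch.bool)
padded = all_true.clone()
padded[1, 4:] = False

out_without_mask = toy_pipeline_step(q, k, v, all_true)          # mask dropped
out_with_mask = toy_block(q, k, v, all_true[:, None, None, :])   # mask kept
out_padded = toy_pipeline_step(q, k, v, padded)                  # real padding

same = torch.allclose(out_without_mask, out_with_mask, atol=1e-5)
```

Dropping the all-True mask should leave the result numerically unchanged up to floating-point tolerance, while a mask with real padding is still honored.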

dxqb · 2026-01-19T21:31:34Z

Thanks!

Could you update with the following things?
* Shed light on what caused the speed regression

Here is a benchmark of the impact of using a mask unnecessarily (second graph): #12870 (comment)
torch SDPA falls back to a non-flash algorithm if a mask is used.

kashif · 2026-01-19T21:36:04Z

yes i also benchmarked the mask and got:

kashif · 2026-01-19T21:58:30Z

thanks @dxqb please check now

DefTruth · 2026-01-22T01:57:48Z

I tested this PR (#12987) on an L20 GPU, and there was an 11% performance improvement in my test.

sayakpaul · 2026-01-22T03:05:35Z

And do you find that to resolve your concerns?

DefTruth · 2026-01-22T05:10:32Z

I think this has solved my problem. Based on this PR, the performance in my test cases has been working as expected.

dxqb · 2026-01-24T07:14:33Z

thanks @dxqb please check now

I don't have a simple test case, but code looks simple and good now

yiyixuxu

ohh thanks!

yiyixuxu · 2026-01-27T22:43:20Z

thanks again @kashif !!

avoid creating attention masks when there is no padding

661febb

kashif mentioned this pull request Jan 16, 2026

Fix QwenImage txt_seq_lens handling #12702

Merged

6 tasks

make fix-copies

5507b5e

kashif requested a review from sayakpaul January 16, 2026 15:10

sayakpaul reviewed Jan 16, 2026

yiyixuxu reviewed Jan 17, 2026

kashif added 4 commits January 17, 2026 12:09

Merge branch 'main' into fix-reg

cd85aae

torch compile tests

4839fcf

set all ones mask to none

23150e4

fix positional encoding from becoming > 4096

47f6585

sayakpaul reviewed Jan 19, 2026

sayakpaul reviewed Jan 19, 2026

kashif added 2 commits January 19, 2026 08:49

fix from review

da6e128

slice freqs_cis to match the input sequence length

283df92

dxqb mentioned this pull request Jan 19, 2026

Flux2 Dev+Klein support Nerogar/OneTrainer#1261

Merged

12 tasks

Merge branch 'main' into fix-reg

19d9d09

keep only attenton masking change

3a0fd2d

dxqb mentioned this pull request Jan 21, 2026

Split attention backends #12870

Draft

Merge branch 'main' into fix-reg

0f70ef5

sayakpaul requested a review from yiyixuxu January 22, 2026 03:06

yiyixuxu approved these changes Jan 27, 2026

yiyixuxu merged commit d54669a into huggingface:main Jan 27, 2026
9 of 11 checks passed

This was referenced Feb 3, 2026

Fix qwen image prompt padding #12075 #12643

Open

Qwen Image prompt encoding is not padding to max seq len #12075

Open

fix(hooks): Add padding support to context parallel hooks #12595

Open
