[CPU] add onednn context cache for qlinear to improve performance #168150

Closed

Xia-Weiwen wants to merge 4 commits into pytorch:main from Xia-Weiwen:onednn_qlinear_cache

Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Nov 19, 2025

Summary
We noticed significant framework overhead in qlinear. To call oneDNN's primitive, we have to prepare a number of data structures as its arguments, and preparing them on every call is costly. Previously, these structures were cached in a context object attached to the TorchScript (JIT) graph; however, Inductor does not support non-tensor data on the graph.

This PR adds a cache for those data structures using a static std::unordered_map, whose key is the weight's data address (as an int64) and whose value is a struct containing everything needed to run the primitive.

This cache is safe in the common case, where the weight's data address does not change during inference and the weight data is not reused by different layers. Since we cannot guarantee this in general, the feature is gated behind an environment variable, "ONEDNN_CACHE_CONTEXT_UNSAFE", and users enable it at their own risk.
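
For illustration only, a minimal sketch of what such a cache could look like (CachedQlinearContext, get_cached_context, and qlinear_run are hypothetical names, not the actual code in this PR):

```cpp
#include <cstdint>
#include <cstdlib>
#include <unordered_map>

// Hypothetical container for everything the oneDNN primitive needs at run time
// (primitive descriptor, reordered weight memory, attributes, etc.).
struct CachedQlinearContext {
  bool initialized = false;
  // ... oneDNN primitive and memory descriptors would live here ...
};

// The cache is opt-in: it is only consulted when the user sets the
// ONEDNN_CACHE_CONTEXT_UNSAFE environment variable.
static bool context_cache_enabled() {
  static const bool enabled = std::getenv("ONEDNN_CACHE_CONTEXT_UNSAFE") != nullptr;
  return enabled;
}

// Looks up (or creates) the cached context for a given weight tensor, keyed by
// the weight's data address. This assumes the address is stable during
// inference and is not shared by different linear layers.
static CachedQlinearContext& get_cached_context(const void* weight_data) {
  static std::unordered_map<int64_t, CachedQlinearContext> cache;
  const auto key = reinterpret_cast<int64_t>(weight_data);
  auto& ctx = cache[key];
  if (!ctx.initialized) {
    // Slow path: prepare the oneDNN argument structs once and reuse them on
    // every subsequent call for this weight.
    ctx.initialized = true;
  }
  return ctx;
}

// Hypothetical call site: only touch the cache when the user has opted in.
void qlinear_run(const void* weight_data) {
  if (context_cache_enabled()) {
    CachedQlinearContext& ctx = get_cached_context(weight_data);
    (void)ctx;  // ... execute the oneDNN primitive with the cached args ...
  } else {
    // ... prepare the argument structs from scratch and execute ...
  }
}
```

A real implementation would also have to deal with thread safety of the static map and with cache invalidation, which this sketch leaves out.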

We measured a >5% end-to-end performance gain when running ViT with PT2E static quantization on a 6th-gen Intel Xeon CPU.

Test plan

pytest -sv test/test_quantization.py -k "qlinear and pt2e"

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: quantization release notes category labels Nov 19, 2025
@pytorch-bot

pytorch-bot bot commented Nov 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168150

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e058379 with merge base 7a963ff:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Xia-Weiwen Xia-Weiwen added the intel This tag is for PR from Intel label Nov 19, 2025
@Xia-Weiwen Xia-Weiwen requested a review from mingfeima November 20, 2025 01:17
@mingfeima mingfeima added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 20, 2025
Collaborator

@mingfeima mingfeima left a comment

Leaving users to choose the cache behavior is not a good design.

I suggest we fix the crash issue from the ground up.

@Xia-Weiwen
Collaborator Author

Xia-Weiwen commented Nov 20, 2025

Leaving users to choose the cache behavior is not a good design.

I suggest we fix the crash issue from the ground up.

Thanks for reviewing. The issue with this feature is not a crash. The issue is that we use the weight's data address as the cache key, and we cannot guarantee that the same weight tensor is never shared among different linear layers. This actually happens: I recall a model we once encountered in which several linear layers shared the same weight. That's why this feature cannot be enabled by default and is marked as unsafe.
The reason we use the weight's data address rather than a combination of all parameters as the key is performance: constructing a key from all parameters at runtime to guarantee safety would be slow. We avoid an LRU cache for the same reason.
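
To illustrate the sharing hazard described above (continuing the hypothetical sketch from the summary; the buffers and layer names are made up):

```cpp
// Two layers that alias the same weight storage collide on the address-only key.
static float shared_weight[1024];

void run_layer_a() {
  CachedQlinearContext& ctx = get_cached_context(shared_weight);  // builds and caches layer A's context
  (void)ctx;  // ... execute with layer A's scales / zero points ...
}

void run_layer_b() {
  CachedQlinearContext& ctx = get_cached_context(shared_weight);  // false hit: layer A's context comes back
  (void)ctx;  // ... layer B's scales / zero points differ, so reusing this context is wrong ...
}
```

A composite key built from scales, zero points, and shapes would avoid the collision, but constructing and hashing such a key on every call is exactly the overhead this design tries to avoid.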

@Xia-Weiwen Xia-Weiwen requested a review from mingfeima November 20, 2025 05:56
@mingfeima mingfeima marked this pull request as ready for review November 24, 2025 05:20
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168, could you please review this PR? It only affects the qlinear op of the onednn backend. Thanks.

@jerryzh168
Contributor

@Xia-Weiwen Any plans to migrate these ops to torchao? We are likely to delete the PT2E flow in PyTorch soon.

@Xia-Weiwen
Collaborator Author

@Xia-Weiwen Any plans to migrate these ops to torchao? We are likely to delete the PT2E flow in PyTorch soon.

Hi @jerryzh168. I previously asked you where quantized ops should live after the PT2E API migration, and at that time you did not have a plan. Do you have one now? There may also be technical issues with the migration, such as (1) how to call oneDNN from torchao and (2) if the ops are migrated, the fusion and lowering passes in Inductor would need to be migrated as well. Do you have any suggestions? Thanks.

@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Nov 25, 2025
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Xia-Weiwen added a commit to yanbing-j/pytorch that referenced this pull request Dec 19, 2025
yanbing-j pushed a commit to yanbing-j/pytorch that referenced this pull request Dec 22, 2025

Pull Request resolved: #168150
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168