[CPU] add onednn context cache for qlinear to improve performance #168150

Closed

Xia-Weiwen wants to merge 4 commits into pytorch:main from Xia-Weiwen:onednn_qlinear_cache

Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Nov 19, 2025

Summary
We noticed significant framework overhead in qlinear. To call oneDNN's primitive, we have to prepare a number of data structures as its arguments, and preparing them on every call is costly. Previously, these structures were cached in a context object attached to the TorchScript (JIT) graph; however, Inductor does not support non-tensor data on the graph.

This PR adds a cache for those data structures using a static std::unordered_map, whose key is the weight's data address (as an int64) and whose value is a struct containing everything needed to run the primitive.

This cache is safe in the common case, where the weight's data address does not change during inference and the weight data is not reused by different layers. Since we cannot guarantee this in general, the feature is gated behind an environment variable, "ONEDNN_CACHE_CONTEXT_UNSAFE", and users enable it at their own risk.
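
For illustration only, a minimal sketch of what such a cache could look like (CachedQlinearContext, get_cached_context, and qlinear_run are hypothetical names, not the actual code in this PR):

```cpp
#include <cstdint>
#include <cstdlib>
#include <unordered_map>

// Hypothetical container for everything the oneDNN primitive needs at run time
// (primitive descriptor, reordered weight memory, attributes, etc.).
struct CachedQlinearContext {
  bool initialized = false;
  // ... oneDNN primitive and memory descriptors would live here ...
};

// The cache is opt-in: it is only consulted when the user sets the
// ONEDNN_CACHE_CONTEXT_UNSAFE environment variable.
static bool context_cache_enabled() {
  static const bool enabled = std::getenv("ONEDNN_CACHE_CONTEXT_UNSAFE") != nullptr;
  return enabled;
}

// Looks up (or creates) the cached context for a given weight tensor, keyed by
// the weight's data address. This assumes the address is stable during
// inference and is not shared by different linear layers.
static CachedQlinearContext& get_cached_context(const void* weight_data) {
  static std::unordered_map<int64_t, CachedQlinearContext> cache;
  const auto key = reinterpret_cast<int64_t>(weight_data);
  auto& ctx = cache[key];
  if (!ctx.initialized) {
    // Slow path: prepare the oneDNN argument structs once and reuse them on
    // every subsequent call for this weight.
    ctx.initialized = true;
  }
  return ctx;
}

// Hypothetical call site: only touch the cache when the user has opted in.
void qlinear_run(const void* weight_data) {
  if (context_cache_enabled()) {
    CachedQlinearContext& ctx = get_cached_context(weight_data);
    (void)ctx;  // ... execute the oneDNN primitive with the cached args ...
  } else {
    // ... prepare the argument structs from scratch and execute ...
  }
}
```

A real implementation would also have to deal with thread safety of the static map and with cache invalidation, which this sketch leaves out.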

We measured a >5% end-to-end performance gain when running ViT with PT2E static quantization on a 6th-gen Intel Xeon CPU.

Test plan

pytest -sv test/test_quantization.py -k "qlinear and pt2e"

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: quantization release notes category labels Nov 19, 2025
@pytorch-bot

pytorch-bot bot commented Nov 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168150

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e058379 with merge base 7a963ff:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Xia-Weiwen Xia-Weiwen added the intel This tag is for PR from Intel label Nov 19, 2025
@Xia-Weiwen Xia-Weiwen requested a review from mingfeima November 20, 2025 01:17
@mingfeima mingfeima added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 20, 2025
Collaborator

@mingfeima mingfeima left a comment

Leaving users to choose the cache behavior is not a good design.

I suggest we fix the crash issue from the ground up.

@Xia-Weiwen
Collaborator Author

Xia-Weiwen commented Nov 20, 2025

Leaving users to choose the cache behavior is not a good design.

I suggest we fix the crash issue from the ground up.

Thanks for reviewing. The issue with this feature is not a crash. The issue is that we use the weight's data address as the cache key, and we cannot guarantee that the same weight tensor is never shared among different linear layers. This actually happens: I recall a model we once encountered in which several linear layers shared the same weight. That's why this feature cannot be enabled by default and is marked as unsafe.
The reason we use the weight's data address rather than a combination of all parameters as the key is performance: constructing a key from all parameters at runtime to guarantee safety would be slow. We avoid an LRU cache for the same reason.
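
To illustrate the sharing hazard described above (continuing the hypothetical sketch from the summary; the buffers and layer names are made up):

```cpp
// Two layers that alias the same weight storage collide on the address-only key.
static float shared_weight[1024];

void run_layer_a() {
  CachedQlinearContext& ctx = get_cached_context(shared_weight);  // builds and caches layer A's context
  (void)ctx;  // ... execute with layer A's scales / zero points ...
}

void run_layer_b() {
  CachedQlinearContext& ctx = get_cached_context(shared_weight);  // false hit: layer A's context comes back
  (void)ctx;  // ... layer B's scales / zero points differ, so reusing this context is wrong ...
}
```

A composite key built from scales, zero points, and shapes would avoid the collision, but constructing and hashing such a key on every call is exactly the overhead this design tries to avoid.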

@Xia-Weiwen Xia-Weiwen requested a review from mingfeima November 20, 2025 05:56
@mingfeima mingfeima marked this pull request as ready for review November 24, 2025 05:20
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168, could you please review this PR? It only affects the qlinear op of the onednn backend. Thanks.

@jerryzh168
Contributor

@Xia-Weiwen Any plans to migrate these ops to torchao? We are likely to delete the PT2E flow in PyTorch soon.

@Xia-Weiwen
Collaborator Author

@Xia-Weiwen Any plans to migrate these ops to torchao? We are likely to delete the PT2E flow in PyTorch soon.

Hi @jerryzh168. I previously asked you where quantized ops should live after the PT2E API migration, and at that time you did not have a plan. Do you have one now? There may also be technical issues with the migration, such as (1) how to call oneDNN from torchao and (2) if the ops are migrated, the fusion and lowering passes in Inductor would need to be migrated as well. Do you have any suggestions? Thanks.

@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Nov 25, 2025
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Xia-Weiwen added a commit to yanbing-j/pytorch that referenced this pull request Dec 19, 2025
yanbing-j pushed a commit to yanbing-j/pytorch that referenced this pull request Dec 22, 2025

Pull Request resolved: #168150
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168