[CPU] add onednn context cache for qlinear to improve performance #168150
Xia-Weiwen wants to merge 4 commits into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168150
✅ No Failures as of commit e058379 with merge base 7a963ff. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
mingfeima left a comment:
Leaving users to choose the cache behavior is not a good design. I suggest we fix the crash issue from the ground up.
Thanks for reviewing. The issue with this feature is not a crash. The issue is that we use the weight's data address as the cache key, and we cannot guarantee that the same weight tensor is never shared among different linear layers. This actually happens in practice: I recall we once encountered a model in which some linear layers shared the same weight. That's why this feature cannot be enabled by default and is marked as unsafe.
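To make the hazard concrete, here is a minimal libtorch sketch (not from this PR) showing that two linear layers sharing one weight tensor report the same data address, so an address-keyed cache cannot tell them apart:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // Two independent layers that deliberately alias one weight tensor.
  auto shared_w = torch::randn({16, 16});
  torch::nn::Linear l1(16, 16), l2(16, 16);
  l1->weight = shared_w;
  l2->weight = shared_w;
  // Both layers yield the identical cache key, so a context cached for l1
  // would silently be reused for l2.
  std::cout << std::boolalpha
            << (l1->weight.data_ptr() == l2->weight.data_ptr())
            << std::endl;  // prints true
  return 0;
}
```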
Hi @jerryzh168, could you please review this PR? It only affects the qlinear op of the onednn backend. Thanks.
@Xia-Weiwen, any plans to migrate these ops to torchao? We are likely to delete the PT2E flow in PyTorch soon.
Hi @jerryzh168, I asked you before about where quantized ops should live after the PT2E API migration, and you didn't have a plan at that time. Do you have a plan now? There are also potential technical issues with the migration, such as (1) how to call oneDNN from torchao and (2) if the ops are migrated, the fusion and lowering passes in Inductor would need to be migrated as well. Do you have any suggestions? Thanks.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
**Summary**

We noticed significant framework overhead in `qlinear`. To call a oneDNN primitive, we must prepare a number of data structs as its arguments, which is costly. In the past, these structs were cached in a context attached to the torch JIT graph; however, Inductor does not support non-tensor data on the graph.

This PR adds a cache for those data structs using a static `std::unordered_map`, whose key is the weight's data address as an `int64` and whose value is a struct containing all the data needed to run a primitive.

This cache is safe in the common case, where the weight's data address does not change during inference and weight data are not shared between different layers. However, since we cannot guarantee that assumption, the feature is gated behind an environment variable, `ONEDNN_CACHE_CONTEXT_UNSAFE`. Users should enable it at their own risk.

We measured a >5% end-to-end performance gain when running ViT with PT2E static quantization on a 6th-gen Intel Xeon CPU.

**Test plan**

```
pytest -sv test/test_quantization.py -k "qlinear and pt2e"
```

Pull Request resolved: pytorch#168150
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168
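As a rough illustration of the mechanism described above, here is a minimal C++ sketch of such an address-keyed context cache. It is not the actual ATen implementation; names such as `CachedContext` and `lookup_context` are hypothetical, and locking is simplified.

```cpp
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

// Placeholder for the prepared primitive arguments. In the real kernel this
// would hold the oneDNN primitive, memory descriptors, scales, etc.
struct CachedContext {
  // dnnl::matmul primitive; dnnl::memory src_mem, wei_mem, dst_mem; ...
};

// Opt-in only: the user accepts the risk that two layers sharing one weight
// buffer would alias a single cache entry.
static bool cache_enabled() {
  static const bool enabled =
      std::getenv("ONEDNN_CACHE_CONTEXT_UNSAFE") != nullptr;
  return enabled;
}

// Process-wide cache keyed by the weight's data address. References into an
// unordered_map stay valid across inserts, so the returned entry can be
// filled in once and reused on later calls with the same weight.
static CachedContext& lookup_context(int64_t weight_addr) {
  static std::unordered_map<int64_t, CachedContext> cache;
  static std::mutex mtx;
  std::lock_guard<std::mutex> guard(mtx);
  return cache[weight_addr];  // default-constructs on first use
}
```

With this layout, a `qlinear` call would check `cache_enabled()`, cast `weight.data_ptr()` to `int64_t`, and reuse the prepared context on every subsequent call that sees the same address, skipping the argument-preparation overhead.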
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01