[PyTorch Edge] Make contexts thread local for quantized matmul #74676
Conversation
💊 CI failures summary and remediations (Dr. CI): As of commit cf49eb1, 💚 Looks good so far! There are no failures yet. 💚 (More details on the Dr. CI page.)
This pull request was exported from Phabricator. Differential Revision: D34756288
Force-pushed from d85a8be to 38fa1a7.
Force-pushed from 38fa1a7 to cf49eb1.
Summary:
Pull Request resolved: #74676

We don't want to create and destroy a new context with each multiplication.

Test Plan:
From fbcode:
```
buck test caffe2/test:quantization -- test_qmatmul
```

# Performance Improvement

*Benchmarking was done on a model which performs matmuls of the same shapes and counts as the Transformer Model, as determined in D30901505.*

*Notebook in which benchmarking was performed: https://www.internalfb.com/intern/anp/view/?id=1582075&revision_id=1891629751047842*

**Improvement from this diff alone**

~9.71% reduction in latency
- Non Thread Local Contexts (before this diff, D35087184 v2): [8.5410ms](https://www.internalfb.com/intern/aibench/details/661728682381311)
- Thread Local Contexts (this diff, v12): [7.7113ms](https://www.internalfb.com/intern/aibench/details/956655867696198)

**FP32 Matmul vs Quantized Matmul, overall improvement from this diff stack**

56% reduction in latency compared to FP32 Matmul, 71% reduction in latency compared to Naive QMatmul
- FP32 Matmul: [17.4910ms](https://www.internalfb.com/intern/aibench/details/875394396322469)
- Quantized Matmul (after this diff): [7.7113ms](https://www.internalfb.com/intern/aibench/details/956655867696198)
- Naive Quantized Matmul (dequantize → fp32 matmul → quantize): [26.8639ms](https://www.internalfb.com/intern/aibench/details/52181682131461)

Reviewed By: kimishpatel

Differential Revision: D34756288

fbshipit-source-id: b000658152cf71b4185dcd34a3cccc71b4cec1f0
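The diff body itself is internal, but the pattern the summary describes can be sketched as below: instead of building a fresh operator context inside every quantized matmul call, each thread builds one lazily and reuses it. This is a minimal, illustrative sketch only; `QMatmulContext`, `thread_local_context`, and `quantized_matmul` are hypothetical names and signatures, not the actual PyTorch or QNNPACK API.

```cpp
// Illustrative sketch of the thread-local context pattern (names are hypothetical).
#include <cstddef>
#include <cstdint>

namespace {

// Stand-in for an operator context that is expensive to create and destroy.
struct QMatmulContext {
  QMatmulContext() {
    // ... expensive one-time setup (operator creation, packing, etc.) ...
  }

  void run(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out,
           std::size_t m, std::size_t n, std::size_t k) {
    // ... run the quantized matmul using the cached setup (elided) ...
    (void)a; (void)b; (void)out; (void)m; (void)n; (void)k;
  }
};

QMatmulContext& thread_local_context() {
  // One context per thread: constructed lazily on first use and reused by
  // every later call on that thread, rather than being created and destroyed
  // inside each multiplication. Making it thread_local (instead of a single
  // shared global) also avoids locking between concurrently running threads.
  thread_local QMatmulContext context;
  return context;
}

}  // namespace

void quantized_matmul(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out,
                      std::size_t m, std::size_t n, std::size_t k) {
  thread_local_context().run(a, b, out, m, n, k);
}
```

This reuse is what the "Non Thread Local Contexts" vs "Thread Local Contexts" rows in the benchmark above compare.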
Hey @salilsdesai.