
Conversation

@salilsdesai
Contributor

@salilsdesai commented Mar 24, 2022

Summary: We don't want to create and destroy a new context with each multiplication
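
A minimal sketch of the idea, using hypothetical names (`Context`, `runMatmul`, `quantizedMatmul`) rather than the actual ATen/QNNPACK symbols touched by this diff: the expensive backend context is created once per thread and reused across calls, instead of being created and destroyed inside every quantized matmul.

```cpp
#include <memory>

// Stand-in for the backend state that is costly to construct and tear down.
struct Context {
  Context() { /* expensive backend initialization */ }
};

// Per-call work that only needs an already-initialized context.
void runMatmul(Context& /*ctx*/ /*, quantized inputs ... */) {}

void quantizedMatmul(/* quantized inputs ... */) {
  // Created lazily on first use in each thread, then reused for every
  // subsequent call on that thread; destroyed only at thread exit.
  thread_local std::unique_ptr<Context> ctx = std::make_unique<Context>();
  runMatmul(*ctx);
}
```

Holding the cached context in a `thread_local` rather than a single shared static also keeps concurrently running threads from contending over mutable backend state.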

Test Plan:
From fbcode:
buck test caffe2/test:quantization -- test_qmatmul

Performance Improvement

Benchmarking was done on a model which performs matmuls of the same shapes and counts as the Transformer model, as determined in D30901505

Notebook in which Benchmarking was performed: https://www.internalfb.com/intern/anp/view/?id=1582075&revision_id=1891629751047842

Improvement from this diff alone
~9.71% Reduction in Latency

  • Non Thread Local Contexts (before this diff, D35087184 v2): 8.5410ms
  • Thread Local Contexts (this diff, v12): 7.7113ms

FP32 Matmul vs Quantized Matmul, Overall Improvement from this diff stack
56% reduction in latency compared to FP32 Matmul, 71% reduction in latency compared to Naive QMatmul

  • FP32 Matmul: 17.4910ms
  • Quantized Matmul (after this diff): 7.7113ms
  • Naive Quantized Matmul (dequantize → fp32matmul → quantize; see the sketch below): 26.8639ms
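
For reference, the naive baseline in the last bullet follows a pattern like the sketch below (illustrative ATen/C++ with a hypothetical `naive_qmatmul` helper and assumed output scale/zero-point arguments, not the actual benchmark code):

```cpp
#include <ATen/ATen.h>

// dequantize -> fp32 matmul -> quantize: the baseline measured above
at::Tensor naive_qmatmul(const at::Tensor& qa, const at::Tensor& qb,
                         double out_scale, int64_t out_zero_point) {
  at::Tensor fa = qa.dequantize();     // quantized -> fp32
  at::Tensor fb = qb.dequantize();
  at::Tensor fc = at::matmul(fa, fb);  // ordinary fp32 matmul
  // fp32 -> quantized, using the caller-supplied output quantization params
  return at::quantize_per_tensor(fc, out_scale, out_zero_point, at::kQUInt8);
}
```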

Reviewed By: kimishpatel

Differential Revision: D34756288

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 24, 2022

💊 CI failures summary and remediations

As of commit cf49eb1 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D34756288


…ch#74676)

Summary:
Pull Request resolved: pytorch#74676

We don't want to create and destroy a new context with each multiplication

Test Plan:
From fbcode:
```buck test caffe2/test:quantization -- test_qmatmul```

# Performance Improvement
*Benchmarking was done on a model which performs matmuls of the same shapes and counts as the Transformer model, as determined in D30901505*

*Notebook in which Benchmarking was performed: https://www.internalfb.com/intern/anp/view/?id=1582075&revision_id=1891629751047842*

**Improvement from this diff alone**
~9.71% Reduction in Latency
- Non Thread Local Contexts (before this diff, D35087184 v2): [8.5410ms](https://www.internalfb.com/intern/aibench/details/661728682381311)
- Thread Local Contexts (this diff, v12): [7.7113ms](https://www.internalfb.com/intern/aibench/details/956655867696198)

**FP32 Matmul vs Quantized Matmul, Overall Improvement from this diff stack**
56% reduction in latency compared to FP32 Matmul, 71% reduction in latency compared to Naive QMatmul
- FP32 Matmul: [17.4910ms](https://www.internalfb.com/intern/aibench/details/875394396322469)
- Quantized Matmul (after this diff): [7.7113ms](https://www.internalfb.com/intern/aibench/details/956655867696198)
- Naive Quantized Matmul (dequantize → fp32matmul → quantize): [26.8639ms](https://www.internalfb.com/intern/aibench/details/52181682131461)

Reviewed By: kimishpatel

Differential Revision: D34756288

fbshipit-source-id: 27c46645f1084a07974dbe2be9b52c15f539928b

facebook-github-bot pushed a commit that referenced this pull request Mar 25, 2022
Summary:
Pull Request resolved: #74676

fbshipit-source-id: b000658152cf71b4185dcd34a3cccc71b4cec1f0
@github-actions
Contributor

Hey @salilsdesai.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.
