Skip to content

Conversation

@robieta
Copy link
Contributor

@robieta robieta commented Mar 21, 2022

Summary:
In my first attempt at this in December I stamped out specializations using variadic templates. However I'm able to get comparable performance using simple conditionals since the branch is very predictable and AppendOnlyList::emplace_back is low enough overhead that multiple calls don't cause an issue.

This is also a chance to do some BE: rather than force ops and backend events to use the same fields (which in practice means setting a bunch of default values when reporting backend events), I just split them and use a variant.

Test Plan: The single threaded benchmark (with no extra options set) improved considerably from ~0.88 us to ~0.62 us. The stress test benchmark improved modestly from ~6.1 us to ~5.8 us. So the bottleneck for multi-threading is somewhere else, but doing less wasted work is still able to move the needle a little bit.

Reviewed By: swolchok

Differential Revision: D34779994

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 21, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 9af2568 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D34779994

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D34779994

@albanD albanD removed their request for review March 22, 2022 19:45
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D34779994

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D34779994

Summary:
Pull Request resolved: pytorch#74484

In my first attempt at this in December I stamped out specializations using variadic templates. However I'm able to get comparable performance using simple conditionals since the branch is very predictable and AppendOnlyList::emplace_back is low enough overhead that multiple calls don't cause an issue.

This is also a chance to do some BE: rather than force ops and backend events to use the same fields (which in practice means setting a bunch of default values when reporting backend events), I just split them and use a variant.

Test Plan: The single threaded benchmark (with no extra options set) improved considerably from ~0.88 us to ~0.62 us. The stress test benchmark improved modestly from ~6.1 us to ~5.8 us. So the bottleneck for multi-threading is somewhere else, but doing less wasted work is still able to move the needle a little bit.

Reviewed By: swolchok

Differential Revision: D34779994

fbshipit-source-id: 0b0503678929b2d61e9ddb1b27887e64fdcb7fe4
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D34779994

Copy link
Member

@aaronenyeshi aaronenyeshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, reviewed internally! Thanks a lot for this patch!

facebook-github-bot pushed a commit that referenced this pull request Mar 24, 2022
Summary:
Pull Request resolved: #74484

In my first attempt at this in December I stamped out specializations using variadic templates. However I'm able to get comparable performance using simple conditionals since the branch is very predictable and AppendOnlyList::emplace_back is low enough overhead that multiple calls don't cause an issue.

This is also a chance to do some BE: rather than force ops and backend events to use the same fields (which in practice means setting a bunch of default values when reporting backend events), I just split them and use a variant.

Test Plan: The single threaded benchmark (with no extra options set) improved considerably from ~0.88 us to ~0.62 us. The stress test benchmark improved modestly from ~6.1 us to ~5.8 us. So the bottleneck for multi-threading is somewhere else, but doing less wasted work is still able to move the needle a little bit.

Reviewed By: swolchok

Differential Revision: D34779994

fbshipit-source-id: 392bc7c6f12797fa5e18777063aa21210d9d2067
@github-actions
Copy link
Contributor

Hey @robieta.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@robieta robieta added the release notes: profiler release notes category label Mar 24, 2022
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
Summary:
Pull Request resolved: #74484

In my first attempt at this in December I stamped out specializations using variadic templates. However I'm able to get comparable performance using simple conditionals since the branch is very predictable and AppendOnlyList::emplace_back is low enough overhead that multiple calls don't cause an issue.

This is also a chance to do some BE: rather than force ops and backend events to use the same fields (which in practice means setting a bunch of default values when reporting backend events), I just split them and use a variant.

Test Plan: The single threaded benchmark (with no extra options set) improved considerably from ~0.88 us to ~0.62 us. The stress test benchmark improved modestly from ~6.1 us to ~5.8 us. So the bottleneck for multi-threading is somewhere else, but doing less wasted work is still able to move the needle a little bit.

Reviewed By: swolchok

Differential Revision: D34779994

fbshipit-source-id: 392bc7c6f12797fa5e18777063aa21210d9d2067
(cherry picked from commit f0a49ff)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants