[Profiler] Prefer TSC to wall clock when available #73855
This pull request was exported from Phabricator. Differential Revision: D34231071 |
aaronenyeshi
left a comment
LGTM! See comments internal!
Summary:
Pull Request resolved: pytorch#73855

Calling the clock is one of the most expensive parts of profiling. We can reduce the profiling overhead by using `rdtsc` instead. The tradeoff is that we have to measure and convert (shift and scale).

Test Plan: I added a cpp unit test with *very* aggressive anti-flake measures. I also ran the overhead benchmark (9 replicates) with `--stressTestKineto` (0.94 -> 0.89 us) and `--stressTestKineto --kinetoProfileMemory` (1.27 -> 1.17 us).

Reviewed By: chaekit

Differential Revision: D34231071

fbshipit-source-id: a7c2e9a05d5f1328444231dc439570afe8b82fc8
Summary: Calling the clock is one of the most expensive parts of profiling. We can reduce the profiling overhead by using `rdtsc` instead. The tradeoff is that we have to measure and convert (shift and scale).

Test Plan: I added a cpp unit test with *very* aggressive anti-flake measures. I also ran the overhead benchmark (9 replicates) with `--stressTestKineto` (0.94 -> 0.89 us) and `--stressTestKineto --kinetoProfileMemory` (1.27 -> 1.17 us).

Differential Revision: D34231071
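
To make the "measure and convert (shift and scale)" tradeoff concrete, here is a minimal sketch of one way to calibrate a raw tick counter against the wall clock. It is not the code from this PR: `wall_ns`, `raw_ticks`, `TickConverter`, and `calibrate` are hypothetical names chosen for illustration, and the sketch assumes an invariant TSC on x86 with GCC or Clang (it falls back to the steady clock elsewhere).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc on GCC/Clang
#define SKETCH_HAS_TSC 1
#else
#define SKETCH_HAS_TSC 0
#endif

// Wall-clock time in nanoseconds (the comparatively expensive call).
inline int64_t wall_ns() {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

// Cheap raw counter: the TSC where available, otherwise the wall clock.
inline uint64_t raw_ticks() {
#if SKETCH_HAS_TSC
  return __rdtsc();
#else
  return static_cast<uint64_t>(wall_ns());
#endif
}

// Hypothetical converter: ns = ns_origin + (ticks - tick_origin) * ns_per_tick.
struct TickConverter {
  uint64_t tick_origin;  // shift: tick count at calibration start
  int64_t ns_origin;     // shift: wall-clock ns at calibration start
  double ns_per_tick;    // scale: measured nanoseconds per tick

  int64_t to_ns(uint64_t ticks) const {
    const auto delta = static_cast<int64_t>(ticks - tick_origin);
    return ns_origin +
        static_cast<int64_t>(static_cast<double>(delta) * ns_per_tick);
  }
};

// Measure the tick rate by sampling both clocks around a busy-wait.
inline TickConverter calibrate(std::chrono::milliseconds interval) {
  assert(interval.count() > 0);
  const int64_t interval_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(interval).count();

  const uint64_t t0 = raw_ticks();
  const int64_t n0 = wall_ns();
  int64_t n1 = n0;
  while (n1 - n0 < interval_ns) {
    n1 = wall_ns();
  }
  const uint64_t t1 = raw_ticks();

  // Guard against a counter that did not advance.
  const double scale = (t1 > t0)
      ? static_cast<double>(n1 - n0) / static_cast<double>(t1 - t0)
      : 1.0;
  return TickConverter{t0, n0, scale};
}
```

In a profiler built this way, the hot path would record only `raw_ticks()` and defer `to_ns` until the trace is post-processed, so the conversion cost is paid once per event outside the measured region. A longer calibration interval gives a more accurate scale at the cost of startup time.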