[logging] Fix duration logging for dynamo_compile #151749

masnesral · 2025-04-19T16:40:18Z

Stack from ghstack (oldest at bottom):

Summary: There are a few issues I'm solving:.

It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use end_time - start_time in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events.
The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., CachingAutotuner.benchmark_all_configs. For example TORCHINDUCTOR_BENCHMARK_FUSION=1 enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling.
If log_compilation_events were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety.

Test Plan:
Ran internal models and compared dynamo_compile to pt2_compile_events:
TORCHINDUCTOR_BENCHMARK_FUSION=0

tlparse: https://fburl.com/itciwnxc
dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb
pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7

TORCHINDUCTOR_BENCHMARK_FUSION=1

tlparse: https://fburl.com/jgurcvkw
dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb
pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Summary: There are a few issues I'm solving:. 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety. Test Plan: Ran internal models and compared dynamo_compile to pt2_compile_events: `TORCHINDUCTOR_BENCHMARK_FUSION=0` * tlparse: https://fburl.com/itciwnxc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7 `TORCHINDUCTOR_BENCHMARK_FUSION=1` * tlparse: https://fburl.com/jgurcvkw * dynamo_compile: https://fburl.com/scuba/dynamo_compile/pzcweldv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/q9orpv1e [ghstack-poisoned]

pytorch-bot · 2025-04-19T16:40:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151749

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 533d65c with merge base ed511cd ():

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / cuda12.6-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

linux-aarch64 / linux-jammy-aarch64-py3.10 / test (default, 2, 4, linux.arm64.2xlarge) (gh) (trunk failure)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary: There are a few issues I'm solving:. 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety. Test Plan: Ran internal models and compared dynamo_compile to pt2_compile_events: `TORCHINDUCTOR_BENCHMARK_FUSION=0` * tlparse: https://fburl.com/itciwnxc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7 `TORCHINDUCTOR_BENCHMARK_FUSION=1` * tlparse: https://fburl.com/jgurcvkw * dynamo_compile: https://fburl.com/scuba/dynamo_compile/pzcweldv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/q9orpv1e ghstack-source-id: 206e891 Pull Request resolved: #151749

Summary: There are a few issues I'm solving:. 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety. Test Plan: Ran internal models and compared dynamo_compile to pt2_compile_events: `TORCHINDUCTOR_BENCHMARK_FUSION=0` * tlparse: https://fburl.com/itciwnxc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7 `TORCHINDUCTOR_BENCHMARK_FUSION=1` * tlparse: https://fburl.com/jgurcvkw * dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]

Summary: There are a few issues I'm solving:. 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety. Test Plan: Ran internal models and compared dynamo_compile to pt2_compile_events: `TORCHINDUCTOR_BENCHMARK_FUSION=0` * tlparse: https://fburl.com/itciwnxc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7 `TORCHINDUCTOR_BENCHMARK_FUSION=1` * tlparse: https://fburl.com/jgurcvkw * dynamo_compile: https://fburl.com/scuba/dynamo_compile/pzcweldv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/q9orpv1e ghstack-source-id: 07bb6c2 Pull Request resolved: #151749

Skylion007 · 2025-04-20T14:54:05Z

torch/_dynamo/metrics_context.py

+                )
+            except Exception:
+                log.exception("Unexpected exception logging compilation metrics")



Do you not want to reraise the exception after logging it?

pytorchmergebot · 2025-04-22T03:23:18Z

Starting merge as part of PR stack under #151757

Summary: The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename to "compile_runtime_overheads". Since we have a couple WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any toplevel timed regions that get added in the future. Also, dynamo_timed already has to figure out if we're timing a runtime vs. compile-time event, so we can reuse some of that logic. Test Plan: Ran an internal model with `TORCHINDUCTOR_BENCHMARK_FUSION=1` (to get benchmarking at compile time in addition to runtime). Overall compile time from various sources matches up: * tlparse: https://fburl.com/9fgsstkr. Eyeballing, total time should be 32 ranks x 2175 = ~69.6k s * ods: https://fburl.com/canvas/r4clhnb7. Right on. * dynamo_compile: https://fburl.com/scuba/dynamo_compile/ax71aqox. Right on. * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/shcjd9ql. Right on. And the runtime overhead: * ods: https://fburl.com/canvas/nvgjb282 * dynamo_compile: https://fburl.com/scuba/dynamo_compile/f2dtv0qh If we compare that to a run of the same model without the changes in this stack, results can mismatch by a lot: * tlparse: https://fburl.com/cchxwd1s. Eyeballing, total time should be 32 ranks x 2300s = ~73.5k s * ods: https://fburl.com/canvas/x1i3wvf4. It's kinda close * dynamo_compile: https://fburl.com/scuba/dynamo_compile/l7sgxdxd. Waaay too high. * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/jb4s9z1u. This is the only one that's actually correct. The discrepancy is even worse if we focus on the runtime events: * ods: https://fburl.com/canvas/a4o9f7ou * dynamo_compile: https://fburl.com/scuba/dynamo_compile/95izaes1 Pull Request resolved: #151757 Approved by: https://github.com/ppanchalia ghstack dependencies: #151749

Summary: There are a few issues I'm solving:. 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety. X-link: pytorch/pytorch#151749 Approved by: https://github.com/Skylion007 Reviewed By: wdvr Differential Revision: D73440137 fbshipit-source-id: 7f176a9ffb4a87bc7176cf737f4bed04a5879a34

Summary: There are a few issues I'm solving:. 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety. X-link: pytorch/pytorch#151749 Approved by: https://github.com/Skylion007 Reviewed By: wdvr Differential Revision: D73440137 fbshipit-source-id: 7f176a9ffb4a87bc7176cf737f4bed04a5879a34 Put "everything" WaitCounters in dynamo_timed (#151757) Summary: The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename to "compile_runtime_overheads". Since we have a couple WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any toplevel timed regions that get added in the future. Also, dynamo_timed already has to figure out if we're timing a runtime vs. compile-time event, so we can reuse some of that logic. X-link: pytorch/pytorch#151757 Approved by: https://github.com/ppanchalia ghstack dependencies: #151749 Reviewed By: wdvr Differential Revision: D73440149 fbshipit-source-id: 1b9074bef52b902da09001b4c006661c7d537477 context manager/decorator for dynamo config patching during tracing (#150586) Summary: Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code. The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature. Implementation: - Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch - Dynamo doesn't trace into the context manager but updates config at compile time - Correctness is based on our correctness for handling supported context managers - Implementation is inspired by how `GradModeVariable` is implemented. Previous attempts: pytorch/pytorch#148736 (decorator-only global approach) and pytorch/pytorch#149439 (decorator-only traceback approach) See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches. NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally. X-link: pytorch/pytorch#150586 Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305 Reviewed By: ZainRizvi Differential Revision: D73519157 fbshipit-source-id: 37a42c1aedbe27c27b3eda8514cdb67a6fc54793 Opt-in unified cuBLAS + cuBLASLt workspaces (#151163) Summary: opt-in version of pytorch/pytorch#145130 as there was a lack of repro for the 70% forward issue `TORCH_CUBLASLT_UNIFIED_WORKSPACE=1` izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else? X-link: pytorch/pytorch#151163 Approved by: https://github.com/ngimel Reviewed By: ZainRizvi Differential Revision: D73519132 fbshipit-source-id: cf3f786893a3b64e742411a1c027b375ed9103c9 Improve metadata skip log message

pytorch-bot bot added ciflow/inductor module: dynamo module: inductor labels Apr 19, 2025

masnesral added the topic: not user facing topic category label Apr 19, 2025

Skylion007 approved these changes Apr 19, 2025

View reviewed changes

masnesral added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 19, 2025

masnesral mentioned this pull request Apr 20, 2025

[logging] Put "everything" WaitCounters in dynamo_timed #151757

Closed

Skylion007 reviewed Apr 20, 2025

View reviewed changes

pytorchmergebot closed this in edba20b Apr 22, 2025

pytorchmergebot added the Merged label Apr 22, 2025

github-actions bot deleted the gh/masnesral/192/head branch May 28, 2025 02:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[logging] Fix duration logging for dynamo_compile #151749

[logging] Fix duration logging for dynamo_compile #151749

Uh oh!

masnesral commented Apr 19, 2025 •

edited

Loading

Uh oh!

pytorch-bot bot commented Apr 19, 2025 •

edited

Loading

Uh oh!

Skylion007 Apr 20, 2025

Uh oh!

pytorchmergebot commented Apr 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[logging] Fix duration logging for dynamo_compile #151749

[logging] Fix duration logging for dynamo_compile #151749

Uh oh!

Conversation

masnesral commented Apr 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Apr 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151749

✅ You can merge normally! (2 Unrelated Failures)

Uh oh!

Skylion007 Apr 20, 2025

Choose a reason for hiding this comment

Uh oh!

pytorchmergebot commented Apr 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

masnesral commented Apr 19, 2025 •

edited

Loading

pytorch-bot bot commented Apr 19, 2025 •

edited

Loading