
[pytorch][torchelastic] Duplicate stdout and stderr and apply custom filter in torchrun #160712

Closed
cnphil wants to merge 1 commit into pytorch:main from cnphil:export-D80188995

cnphil (Member) commented Aug 15, 2025

Summary:
Part of an effort to extract some important error logs (e.g. #157996) that were tee'ed to stdout and stderr.

The general idea is to:

  • Duplicate the tees on stdout and stderr to separate files, filtered_stdout.log and filtered_stderr.log, respectively.
  • As the names suggest, write to these files only the log lines that match a customizable filter.
  • Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

  • Enhance TailLog to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
  • Add filtered_stdout and filtered_stderr to LogsDest and have LogsSpecs reify them.
  • In start_processes() and PContext, add params duplicate_stdout_filters and duplicate_stderr_filters to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
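
To make the filtering idea above concrete, here is a minimal standalone sketch. It is not the PR's `TailLog` code: the helper name `write_filtered`, its signature, and the choice of plain substring matching are all assumptions made for illustration, since the description does not say how a filter is matched. It only shows the two behaviors described above: matching lines are copied to a destination file, and passing no filters creates no file at all.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional, Sequence


def write_filtered(
    lines: Iterable[str],
    dst: Path,
    filters: Optional[Sequence[str]],
) -> int:
    """Copy to dst only the lines that contain at least one filter string.

    Hypothetical helper (assumption: filters are plain substrings).
    Returns the number of lines written. When no filters are given,
    the destination file is not created.
    """
    if not filters:
        return 0
    written = 0
    with dst.open("w") as out:
        for line in lines:
            if any(f in line for f in filters):
                # Normalize so every kept line ends with a newline.
                out.write(line if line.endswith("\n") else line + "\n")
                written += 1
    return written


if __name__ == "__main__":
    log_dir = Path(tempfile.mkdtemp())
    stderr_lines = ["step 1 ok\n", "RuntimeError: boom\n", "step 2 ok\n"]
    target = log_dir / "filtered_stderr.log"
    count = write_filtered(stderr_lines, target, ["Error"])
    print(count, target.read_text(), end="")
```

The real change streams from a live tee rather than from an in-memory list, so treat this only as an illustration of the filter-then-write step.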

Test Plan:

$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D80188995

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

pytorch-bot bot commented Aug 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160712

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 280aefc with merge base 05b2e02:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "oncall: distributed" and "release notes: distributed (torchelastic)" labels Aug 15, 2025
facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D80188995

github-actions (Contributor) commented

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

meta-codesync bot commented Oct 21, 2025

@cnphil has exported this pull request. If you are a Meta employee, you can view the originating Diff in D80188995.

cnphil (Member, Author) commented Oct 21, 2025

@fduwjj Exported new changes from Phabricator, PTAL :)

cnphil removed the Stale label Oct 21, 2025
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

Summary:

Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- As the names suggest, write to these files only the log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining     0/5                                                                                                                                          533 actions, 552 artifacts declared
Executing actions. Remaining     0/176                                                                                                                                        1:22.7s exec time total
Command: test.     Finished 51 local, 13 remote, 50 cache (44% hit)                                                                                                           19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

cnphil force-pushed the export-D80188995 branch 2 times, most recently from 4440f65 to 9d89360 Compare October 21, 2025 16:56
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

cnphil force-pushed the export-D80188995 branch 2 times, most recently from 99841b8 to f814794 Compare October 21, 2025 17:52
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

@cnphil cnphil force-pushed the export-D80188995 branch 2 times, most recently from 9fa3b52 to b997417 October 21, 2025 18:53
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

Summary:
Pull Request resolved: pytorch#160712

Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that were `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- As their names suggest, these files contain only the log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
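
The rule that no duplicated stream is created without filters could look roughly like the sketch below. The function `filtered_log_paths` and the per-rank directory layout are hypothetical and are not taken from `LogsSpecs` or `LogsDest`; only the two file names come from the summary above.

```python
import os
from typing import Optional


def filtered_log_paths(
    log_dir: str,
    local_rank: int,
    stdout_filters: Optional[list[str]] = None,
    stderr_filters: Optional[list[str]] = None,
) -> dict[str, str]:
    """Return a filtered-log path per stream, only for streams that have filters.

    Hypothetical helper; the <log_dir>/<local_rank>/ layout is an assumption.
    """
    rank_dir = os.path.join(log_dir, str(local_rank))
    paths: dict[str, str] = {}
    if stdout_filters:  # None or an empty list means: do not duplicate stdout
        paths["stdout"] = os.path.join(rank_dir, "filtered_stdout.log")
    if stderr_filters:  # None or an empty list means: do not duplicate stderr
        paths["stderr"] = os.path.join(rank_dir, "filtered_stderr.log")
    return paths


no_filters = filtered_log_paths("logs", 0)
stderr_only = filtered_log_paths("logs", 0, stderr_filters=["RuntimeError"])
```

With no filters the mapping is empty, so a caller would have nothing to open or tail for the duplicated streams.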

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining     0/5                                                                                                                                          533 actions, 552 artifacts declared
Executing actions. Remaining     0/176                                                                                                                                        1:22.7s exec time total
Command: test.     Finished 51 local, 13 remote, 50 cache (44% hit)                                                                                                           19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 21, 2025
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

Summary:

Part of an effort to extract some important error logs (e.g. [pytorch#157996](pytorch#157996)) that was `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as its name suggests, only log lines matching a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/34f426fd-25a0-4cf5-8da3-2f3d84767d1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173871977118
Network: Up: 1.0MiB  Down: 2.9GiB  (reSessionID-048daa50-9ad4-4826-886f-08cec54c7d72)
Analyzing targets. Remaining     0/5                                                                                                                                          533 actions, 552 artifacts declared
Executing actions. Remaining     0/176                                                                                                                                        1:22.7s exec time total
Command: test.     Finished 51 local, 13 remote, 50 cache (44% hit)                                                                                                           19.8s exec time cached (23%)
Time elapsed: 1:45.2s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test:local_agent_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
[DISABLED]
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:launch_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher:test_run
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:api_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:local_launch_mast_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:fb_run_test
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/launcher/fb:launch_test
```

Reviewed By: mradmila

Differential Revision: D80188995
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 21, 2025
…filter in torchrun (pytorch#160712)

Contributor

@fduwjj fduwjj left a comment


LGTM

cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 22, 2025
…filter in torchrun (pytorch#160712)

Reviewed By: fduwjj, mradmila

Differential Revision: D80188995
cnphil added a commit to cnphil/pytorch that referenced this pull request Oct 22, 2025
…filter in torchrun (pytorch#160712)

@cnphil
Member Author

cnphil commented Oct 23, 2025

@pytorchbot merge

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team



Labels

ciflow/trunk (Trigger trunk jobs on your pull request); fb-exported; Merged; meta-exported; oncall: distributed (Add this issue/PR to distributed oncall triage queue); release notes: distributed (torchelastic)


4 participants