MultiThreaded FileSystemWriter for distributed checkpointing. #86327

kumpera · 2022-10-05T21:33:28Z

This adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead.

Together, these two settings yield up to a 50% performance improvement on 32-GPU workloads on AWS.
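To make the two knobs concrete, here is a minimal, torch-free sketch of the general idea. It is not the code in this PR. It assumes that `thread_count` is the number of writer threads and that `per_thread_copy_ahead` is a byte budget each thread may stage in memory before flushing. The function `write_items` and the in-memory sinks are hypothetical, used only for illustration.

```python
import io
import queue
import threading


def write_items(items, thread_count=2, per_thread_copy_ahead=1024):
    """Write (name, payload) pairs using `thread_count` worker threads.

    Hypothetical illustration: each worker stages payloads in memory until
    about `per_thread_copy_ahead` bytes are buffered, then flushes them to
    its own sink. Returns {worker_idx: (blob, {name: (offset, length)})}.
    """
    work = queue.Queue()
    for item in items:
        work.put(item)

    results = {}
    lock = threading.Lock()

    def worker(idx):
        sink = io.BytesIO()  # stand-in for a per-thread output file
        index = {}           # name -> (offset, length) inside this sink
        staged = []
        staged_bytes = 0

        def flush():
            nonlocal staged, staged_bytes
            for name, data in staged:
                index[name] = (sink.tell(), len(data))
                sink.write(data)
            staged, staged_bytes = [], 0

        while True:
            try:
                name, data = work.get_nowait()
            except queue.Empty:
                break
            # "Copy ahead": take a private copy before it is written out.
            staged.append((name, bytes(data)))
            staged_bytes += len(data)
            if staged_bytes >= per_thread_copy_ahead:
                flush()
        flush()  # write whatever is still staged
        with lock:
            results[idx] = (sink.getvalue(), index)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    payloads = [(f"tensor_{i}", bytes([i]) * 100) for i in range(10)]
    out = write_items(payloads, thread_count=3, per_thread_copy_ahead=250)
    print({idx: sorted(index) for idx, (_, index) in out.items()})
```

In this sketch a larger staging budget means fewer, larger flushes per thread at the cost of more memory held at once. Whether the same trade-off applies to the real writer depends on its implementation.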

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

pytorch-bot · 2022-10-05T21:33:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86327

✅ No Failures, 27 Pending

As of commit 03d6713:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2022-10-05T21:33:31Z

The committers listed above are authorized under a signed CLA.

✅ login: kumpera / name: Rodrigo Kumpera (a78bb7e, 03d6713)

kumpera · 2022-10-11T15:32:31Z

@pytorchmergebot rebase

pytorchmergebot · 2022-10-11T15:34:05Z

@pytorchbot successfully started a rebase job.

pytorchmergebot · 2022-10-11T15:34:10Z

Successfully rebased `dcp_fast_writer` onto `refs/remotes/origin/viable/strict`. Please pull locally before adding more changes (for example, via `git checkout dcp_fast_writer && git pull --rebase`).

wz337 · 2022-10-20T20:42:18Z

LGTM. Claiming this PR and adding tests to it.

@kumpera

…checkpointing and Update tests (#87987)

This PR includes:

- Changes from @kumpera (#86327): adds a multi-threaded FileSystemWriter for distributed checkpointing, which introduces two knobs on FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
- Adds parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- Modifies @with_comms in ShardedTensorTestBase to take *args and **kwargs.

Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but times out on CI. We will use a thread-based PG and update this test in a following PR. [T134844615]

## Add docstring and update comments in the following PRs.

Pull Request resolved: #87987
Approved by: https://github.com/fduwjj

github-actions · 2022-12-19T21:34:20Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

kumpera requested review from H-Huang, awgu, kwen2501, mingzhe09088, mrshenli, pritamdamania87, rohan-varma and zhaojuanmao as code owners October 5, 2022 21:33

facebook-github-bot added the "cla signed" and "oncall: distributed" labels Oct 5, 2022

kumpera requested a review from wz337 October 5, 2022 21:33

Rodrigo Kumpera and others added 2 commits October 11, 2022 15:34

MultiThreaded FileSystemWriter for distributed checkpointing.

a78bb7e

fix linter errors

03d6713

pytorchmergebot force-pushed the dcp_fast_writer branch from 6d82e25 to 03d6713 October 11, 2022 15:34

This was referenced Oct 24, 2022

[PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests #87652

Closed

[PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests #87987

Closed

github-actions bot added the Stale label Dec 19, 2022

github-actions bot closed this Jan 18, 2023
