@kumpera
Contributor

@kumpera kumpera commented Oct 5, 2022

This adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead.

Together, these two knobs yield up to a 50% performance improvement on 32-GPU workloads on AWS.
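To illustrate the idea behind the two knobs, here is a minimal standard-library sketch. It is not the code from this PR: `write_items` is a hypothetical helper, and it assumes that `thread_count` is the number of writer threads and that `per_thread_copy_ahead` is a byte budget each thread may stage in memory before it has to flush to disk. The defaults shown are illustrative only.

```python
import os
import tempfile
import threading


def write_items(items, out_dir, thread_count=1, per_thread_copy_ahead=10_000_000):
    """Write {name: bytes} entries under out_dir; return {name: bytes_written}.

    Hypothetical illustration only; this is not FileSystemWriter's code.
    """
    # Split the work round-robin, largest payloads first, to balance threads.
    ordered = sorted(items.items(), key=lambda kv: len(kv[1]), reverse=True)
    buckets = [ordered[i::thread_count] for i in range(thread_count)]

    written = {}
    lock = threading.Lock()

    def drain(staged):
        # Flush everything staged so far to disk, one file per entry.
        for name, payload in staged:
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(payload)
            with lock:
                written[name] = len(payload)
        staged.clear()

    def worker(bucket):
        staged, staged_bytes = [], 0
        for name, payload in bucket:
            # Stage a private copy; this stands in for a device-to-host copy.
            staged.append((name, bytes(payload)))
            staged_bytes += len(payload)
            # Once the copy-ahead budget is used up, write before staging more.
            if staged_bytes >= per_thread_copy_ahead:
                drain(staged)
                staged_bytes = 0
        drain(staged)

    threads = [threading.Thread(target=worker, args=(b,)) for b in buckets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written


if __name__ == "__main__":
    data = {f"shard_{i}.bin": bytes([i]) * (1024 * (i + 1)) for i in range(8)}
    with tempfile.TemporaryDirectory() as tmp:
        result = write_items(data, tmp, thread_count=4, per_thread_copy_ahead=4096)
        print(len(result), "files written")
```

In this sketch a larger budget means fewer, larger flushes per thread at the cost of more staged memory, and more threads let several files be written concurrently. Whether that trade-off matches the real writer depends on details not shown in this thread.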

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

@pytorch-bot

pytorch-bot bot commented Oct 5, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86327

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 27 Pending

As of commit 03d6713:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 5, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@facebook-github-bot facebook-github-bot added the "cla signed" and "oncall: distributed" labels Oct 5, 2022
@kumpera kumpera requested a review from wz337 October 5, 2022 21:33
@kumpera
Contributor Author

kumpera commented Oct 11, 2022

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased dcp_fast_writer onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dcp_fast_writer && git pull --rebase)

@wz337
Contributor

wz337 commented Oct 20, 2022

LGTM. Claiming this PR and adding tests to it.

pytorchmergebot pushed a commit that referenced this pull request Nov 30, 2022
…checkpointing and Update tests (#87987)

This PR includes:

- Changes from @kumpera (#86327): add a multi-threaded FileSystemWriter for distributed checkpointing by adding two knobs to FileSystemWriter, thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
- Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- Modify @with_comms in ShardedTensorTestBase to accept *args and **kwargs.
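As a rough illustration of why the decorator needs to accept `*args` and `**kwargs`: parametrized tests pass extra values into the test method, so the wrapper has to forward them. The sketch below is a simplified, hypothetical stand-in and not the actual `with_comms` from the PyTorch test utilities; `FakeShardedTest`, `init_comms`, and `destroy_comms` are assumed names used only for this example.

```python
import functools


def with_comms(func):
    """Run func between set-up and tear-down of a (fake) communication layer."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.init_comms()
        try:
            # Forward extra arguments so parametrized tests receive their values.
            return func(self, *args, **kwargs)
        finally:
            self.destroy_comms()
    return wrapper


class FakeShardedTest:
    """Hypothetical test base that records what the decorator does."""

    def __init__(self):
        self.events = []

    def init_comms(self):
        self.events.append("init")

    def destroy_comms(self):
        self.events.append("destroy")

    @with_comms
    def test_save(self, thread_count):
        self.events.append(f"run thread_count={thread_count}")
        return thread_count
```

A wrapper declared as `wrapper(self)` would raise a `TypeError` as soon as a parametrized value such as `thread_count` was passed in, which is the situation this change avoids.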
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but times out on CI. We will switch to a thread-based process group (PG) and update this test in a follow-up PR.

[T134844615]

## Add docstring and update comments in the following PRs.
Pull Request resolved: #87987
Approved by: https://github.com/fduwjj
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…checkpointing and Update tests (pytorch#87987)

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 19, 2022
@github-actions github-actions bot closed this Jan 18, 2023