@wz337 wz337 commented Oct 24, 2022

This PR includes:

  1. Changes from @kumpera (MultiThreaded FileSystemWriter for distributed checkpointing. #86327): add a multi-threaded FileSystemWriter for distributed checkpointing, which introduces two knobs on FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
  2. Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
  3. Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
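
For readers unfamiliar with the idea behind item 1, here is a minimal standard-library sketch of writing several payloads to disk with a bounded pool of threads. It is not the code from this PR: `write_blobs` and its arguments are hypothetical names chosen for illustration, `thread_count` only mirrors the name of the knob mentioned above, and `per_thread_copy_ahead` is not modeled at all.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_blobs(directory, blobs, thread_count=1):
    """Write each (name, bytes) pair in `blobs` to its own file.

    Hypothetical helper for illustration only; it is not FileSystemWriter.
    Returns a list of (path, bytes_written) in the same order as `blobs`.
    """

    def write_one(item):
        name, payload = item
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(payload)
        return path, len(payload)

    # Each file is written by whichever worker thread picks it up;
    # pool.map returns results in input order regardless of completion order.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(write_one, blobs.items()))
```

File writes spend most of their time in I/O, during which Python threads can run concurrently, so a thread pool is a common choice for this kind of work. Whether it helps in a given setup depends on the storage backend and payload sizes.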

Tests:

python3 test/distributed/_shard/checkpoint/test_file_system_checkpoint.py
python3 test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
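
As a rough illustration of why a test decorator might need to accept *args and **kwargs (item 3), the sketch below shows a wrapper that forwards any extra arguments to the wrapped test method, presumably what parametrized tests require. `with_setup` and `ExampleTest` are hypothetical stand-ins; this is not the actual @with_comms implementation, and the setup and teardown steps are placeholders.

```python
import functools


def with_setup(func):
    """Hypothetical stand-in for a test decorator such as @with_comms."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.setup_calls += 1  # placeholder for real setup work
        try:
            # Forward every extra argument so parametrized values reach the test body.
            return func(self, *args, **kwargs)
        finally:
            self.teardown_calls += 1  # placeholder for real teardown work

    return wrapper


class ExampleTest:
    """Hypothetical test class; a real one would subclass a test base."""

    def __init__(self):
        self.setup_calls = 0
        self.teardown_calls = 0

    @with_setup
    def check_save(self, thread_count=1):
        return thread_count
```

A wrapper declared as `wrapper(self)` alone would raise a TypeError as soon as a parametrized argument such as `thread_count=2` was passed to the decorated method.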

[T134844615]

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx

pytorch-bot bot commented Oct 24, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87652

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit 2e48b47:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (sharded) release notes category label Oct 24, 2022
@wz337 wz337 requested review from fduwjj and wanchaol October 24, 2022 23:32
@wz337 wz337 changed the title Add tests to fast writer Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests Oct 24, 2022
@wz337 wz337 changed the title Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests [WIP]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests Oct 25, 2022
@wz337 wz337 marked this pull request as draft October 25, 2022 23:46
@wz337 wz337 marked this pull request as ready for review October 28, 2022 03:41
@pytorch-bot pytorch-bot bot added the ciflow/mps Run MPS tests (subset of trunk) label Oct 28, 2022
@wz337 wz337 changed the title [WIP]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests [PT-D][Checkpointing]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests Oct 28, 2022
@wz337 wz337 marked this pull request as draft October 28, 2022 04:37
@wz337 wz337 closed this Oct 28, 2022
@wz337 wz337 changed the title [PT-D][Checkpointing]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests Nov 16, 2022