@wz337
Contributor

@wz337 wz337 commented Oct 28, 2022

This PR includes:

Changes from @kumpera (#86327): add a multi-threaded FileSystemWriter for distributed checkpointing by introducing two knobs on FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
Modify @with_comms in ShardedTensorTestBase to accept *args and **kwargs.
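
For readers who want a feel for what the two knobs control, here is a minimal stdlib-only sketch. It is not the FileSystemWriter code from this PR. The helper names (`split_by_size`, `write_bucket`, `write_all`), the index layout, the default values, and the reading of per_thread_copy_ahead as a per-thread byte budget staged before each write are all assumptions made for illustration.

```python
import os
import tempfile
import threading
from typing import Dict, List, Tuple

Item = Tuple[str, bytes]                 # (name, payload); hypothetical item shape
Index = Dict[str, Tuple[str, int, int]]  # name -> (file path, offset, length)


def split_by_size(items: List[Item], bucket_count: int) -> List[List[Item]]:
    """Greedily assign items to buckets so the bytes per bucket stay balanced."""
    buckets: List[List[Item]] = [[] for _ in range(bucket_count)]
    sizes = [0] * bucket_count
    for item in sorted(items, key=lambda it: len(it[1]), reverse=True):
        idx = sizes.index(min(sizes))  # the currently lightest bucket
        buckets[idx].append(item)
        sizes[idx] += len(item[1])
    return buckets


def write_bucket(path: str, bucket: List[Item], copy_ahead: int,
                 index: Index, lock: threading.Lock) -> None:
    """Write one bucket to its own file, flushing once copy_ahead bytes are staged."""
    staged: List[Item] = []
    staged_bytes = 0
    with open(path, "wb") as f:

        def flush() -> None:
            nonlocal staged, staged_bytes
            for name, payload in staged:
                offset = f.tell()
                f.write(payload)
                with lock:  # the index is shared by all writer threads
                    index[name] = (path, offset, len(payload))
            staged, staged_bytes = [], 0

        for name, payload in bucket:
            # bytes(...) stands in for a device-to-host copy (assumption).
            staged.append((name, bytes(payload)))
            staged_bytes += len(payload)
            if staged_bytes >= copy_ahead:
                flush()
        flush()  # write whatever is still staged


def write_all(directory: str, items: List[Item], thread_count: int = 1,
              per_thread_copy_ahead: int = 10_000_000) -> Index:
    """Write items with thread_count threads, one output file per thread."""
    index: Index = {}
    lock = threading.Lock()
    threads = []
    for i, bucket in enumerate(split_by_size(items, thread_count)):
        path = os.path.join(directory, f"part_{i}.bin")
        t = threading.Thread(target=write_bucket,
                             args=(path, bucket, per_thread_copy_ahead, index, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return index


if __name__ == "__main__":
    data = [(f"tensor_{i}", bytes(1000 * (i + 1))) for i in range(6)]
    with tempfile.TemporaryDirectory() as tmp:
        written = write_all(tmp, data, thread_count=2, per_thread_copy_ahead=2048)
        print(sorted(written))
```

In this sketch each thread owns one file, so threads never contend on a file handle; only the shared index needs a lock.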
Tests:

python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but times out on CI. We will switch to a thread-based PG and update this test in a follow-up PR.
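
As a rough illustration of why a setup/teardown decorator has to forward *args and **kwargs once tests are parametrized, consider the sketch below. It uses only `unittest` and `functools`. The `with_comms` and `parametrize` defined here are hypothetical stand-ins and not the PyTorch helpers; `init_comms`, `destroy_comms`, and `CheckpointTest` are made-up names as well.

```python
import functools
import unittest


def with_comms(func):
    """Stand-in decorator: set up comms, run the test, always tear down."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.init_comms()
        try:
            # Forwarding *args/**kwargs lets parametrized values reach the test.
            return func(self, *args, **kwargs)
        finally:
            self.destroy_comms()
    return wrapper


def parametrize(name, values):
    """Toy parametrizer: run the wrapped test once per value under subTest."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            for value in values:
                with self.subTest(**{name: value}):
                    func(self, *args, **{**kwargs, name: value})
        return wrapper
    return decorator


class CheckpointTest(unittest.TestCase):
    def init_comms(self):
        self.comms_ready = True

    def destroy_comms(self):
        self.comms_ready = False

    @parametrize("thread_count", [1, 2])
    @with_comms
    def test_save(self, thread_count):
        # The decorator must have set up comms and passed thread_count through.
        self.assertTrue(self.comms_ready)
        self.assertIn(thread_count, (1, 2))


if __name__ == "__main__":
    unittest.main()
```

If `wrapper` in `with_comms` accepted only `self`, the call made by the parametrizer would fail with a TypeError because `thread_count` could not be passed through.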

[T134844615]

Docstrings and updated comments will be added in follow-up PRs.

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Oct 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87987

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ac5583a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (sharded) release notes category label Oct 28, 2022
@wz337 wz337 requested a review from fduwjj October 28, 2022 15:25
@wz337 wz337 marked this pull request as draft November 4, 2022 01:23
@wz337 wz337 changed the title [PT-D][Checkpointing]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests Nov 16, 2022
@github-actions github-actions bot added ciflow/inductor module: amp (automated mixed precision) autocast module: cpu CPU specific problem (e.g., perf, algorithm) module: dynamo module: inductor module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration NNC oncall: quantization Quantization support in PyTorch labels Nov 16, 2022
@pytorch-bot pytorch-bot bot added the ciflow/mps Run MPS tests (subset of trunk) label Nov 16, 2022
@wz337 wz337 marked this pull request as ready for review November 17, 2022 21:54
@wz337 wz337 removed oncall: quantization Quantization support in PyTorch module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration module: amp (automated mixed precision) autocast NNC ciflow/mps Run MPS tests (subset of trunk) module: inductor module: dynamo ciflow/inductor labels Nov 17, 2022
@pytorchmergebot
Collaborator

Successfully rebased fast_writer onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fast_writer && git pull --rebase)

@wz337
Contributor Author

wz337 commented Nov 30, 2022

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…checkpointing and Update tests (pytorch#87987)

Pull Request resolved: pytorch#87987
Approved by: https://github.com/fduwjj

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged module: cpu CPU specific problem (e.g., perf, algorithm) release notes: distributed (sharded) release notes category
