Conversation

@sinannasir (Contributor) commented Aug 20, 2020

Stack from ghstack:

In this diff, we prepare some example DDP communication hooks (#40848):

1. `allreduce_hook`: This DDP communication hook simply calls `allreduce` on the `GradBucket` tensors. Once the gradient tensors are aggregated across all workers, its `then` callback takes the mean and returns the result. If a user registers this hook, the DDP results are expected to be the same as in the case where no hook was registered. Hence, this hook does not change the behavior of DDP, and users can treat it as a reference, or modify it to log useful information or for other purposes, without affecting DDP behavior. A minimal sketch is given right after this list.

2. `allgather_then_aggregate_hook`: Similar to `allreduce_hook`, this hook first gathers the `GradBucket` tensors, and its `then` callback aggregates the gathered gradient tensors and takes the mean. Instead of `allreduce`, this hook uses `allgather`. Note that with W workers, both the computation and communication time scale as O(W) for allgather, compared to O(log W) for allreduce. Therefore, this hook is expected to be much slower than `allreduce_hook`, although both essentially do the same thing with the gradients.

3. `fp16_compress_hook`: This DDP communication hook implements a simple gradient-compression approach that casts the `GradBucket` tensors, whose type is assumed to be `torch.float32`, to half-precision floating point (`torch.float16`), and then allreduces those `float16` gradient tensors. Once the compressed gradient tensors are allreduced, its `then` callback, called `decompress`, casts the aggregated result back to `float32` and takes the mean. A sketch of this hook follows the allreduce example below.

4. `quantization_pertensor_hook`: Applies quantization per tensor, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send the `scale` and `zero_point` (two floats per rank) before the quantized tensors.

5. `quantization_perchannel_hook`: Applies quantization per channel, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that, after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor containing 6 million floats, dividing the tensor into smaller channels (512-float chunks) and quantizing each chunk independently may significantly increase the resolution and result in lower error. A sketch of this chunked quantization appears at the end of this description.
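To make the hook interface concrete, here is a minimal sketch of an allreduce-style hook in the spirit of `allreduce_hook`. It assumes the communication-hook signature `hook(state, bucket) -> torch.futures.Future` and a hypothetical `bucket.get_tensors()` accessor; the exact `GradBucket` API and the DDP registration entry point may differ from what is shown here.

```python
import torch
import torch.distributed as dist

def allreduce_mean_hook(process_group, bucket):
    # Sketch of an allreduce-based communication hook: sum the bucket's
    # flattened gradients across all workers asynchronously, then divide
    # by the world size in the `then` callback to obtain the mean.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    tensor = bucket.get_tensors()[0]  # hypothetical accessor for the bucket's tensor

    # get_future() availability depends on the backend (e.g. NCCL).
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()

    def take_mean(fut):
        # fut.value() holds the allreduced tensor(s); return the averaged result.
        return [t.div_(world_size) for t in fut.value()]

    return fut.then(take_mean)
```

Registering it would look roughly like `ddp_model.register_comm_hook(state=None, hook=allreduce_mean_hook)`; at the time of this PR the registration entry point was still experimental, so the exact method name may differ.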

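Along the same lines, here is a sketch of the compression idea in `fp16_compress_hook`, under the same assumed hook signature and hypothetical bucket accessor: cast to `torch.float16` before communication, allreduce, then decompress back to `torch.float32` and take the mean in the `then` callback.

```python
import torch
import torch.distributed as dist

def fp16_compress_mean_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)

    # Compress: cast the float32 gradients to float16 before communication,
    # halving the number of bytes sent over the wire.
    compressed = bucket.get_tensors()[0].to(torch.float16)  # hypothetical accessor

    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # Decompress: cast the aggregated result back to float32 and average it.
        return [t.to(torch.float32).div_(world_size) for t in fut.value()]

    return fut.then(decompress)
```
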
**Test Plan:**
`python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py`
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Please ignore the changes in `reducer.cpp` while reviewing; see #43307.
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)
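As a rough illustration of the chunked quantization in `quantization_perchannel_hook` (with `quantization_pertensor_hook` as the single-chunk special case), here is a standalone sketch of quantizing a flat gradient tensor in independent 512-float chunks. It only shows the local quantize/dequantize round trip; the actual hook would additionally communicate the per-chunk `scale` and `zero_point` pairs before the quantized payload, and the chunk size and affine scheme here are illustrative.

```python
import torch

def quantize_in_chunks(grad, chunk_size=512, n_bits=8):
    """Quantize a flat float32 tensor chunk by chunk with per-chunk affine
    parameters, then dequantize; returns the reconstruction and the
    (scale, zero_point) pairs a hook would send alongside the payload."""
    levels = 2 ** n_bits - 1
    recon = torch.empty_like(grad)
    params = []
    for chunk, out in zip(grad.split(chunk_size), recon.split(chunk_size)):
        lo, hi = chunk.min(), chunk.max()
        scale = (hi - lo).clamp(min=1e-12) / levels  # per-chunk resolution
        zero_point = lo
        q = torch.round((chunk - zero_point) / scale).clamp(0, levels)
        out.copy_(q * scale + zero_point)  # dequantized reconstruction
        params.append((scale.item(), zero_point.item()))
    return recon, params

# Per-chunk parameters track local ranges, so a large gradient split into
# 512-float chunks typically reconstructs with lower error than one global scale.
grad = torch.randn(2048)
recon, params = quantize_in_chunks(grad)
print((grad - recon).abs().max())
```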

sinannasir added a commit that referenced this pull request Aug 20, 2020

ghstack-source-id: 110260865
Pull Request resolved: #43310
@dr-ci bot commented Aug 20, 2020

💊 CI failures summary and remediations

As of commit c88ed7e (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

@mrshenli (Contributor) left a comment

What's the future plan for the torch/distributed/algorithms package? Are we going to add other algorithms?

sinannasir added a commit that referenced this pull request Aug 21, 2020
Pull Request resolved: #43310

ghstack-source-id: 110463009

@sinannasir (Contributor, Author) replied:

> What's the future plan for the torch/distributed/algorithms package? Are we going to add other algorithms?

I'd like to mention @mrshenli's idea of moving these hooks to https://github.com/pytorch/pytorch/tree/master/benchmarks/distributed.

sinannasir added a commit that referenced this pull request Aug 22, 2020
Pull Request resolved: #43310

ghstack-source-id: 110501907

sinannasir added a commit that referenced this pull request Aug 25, 2020
Pull Request resolved: #43310

ghstack-source-id: 110620825

sinannasir added a commit that referenced this pull request Aug 26, 2020
Pull Request resolved: #43310

ghstack-source-id: 110706550

sinannasir added a commit that referenced this pull request Aug 27, 2020
Pull Request resolved: #43310

ghstack-source-id: 110868178

sinannasir added a commit that referenced this pull request Aug 28, 2020
Pull Request resolved: #43310

ghstack-source-id: 110908936

sinannasir added a commit that referenced this pull request Aug 28, 2020
Pull Request resolved: #43310

ghstack-source-id: 110923269

@sinannasir (Contributor, Author) commented:

I'm wondering whether there is a CircleCI test that runs test/run_test.py. If not, I'll probably just try locally to validate that the hook tests were executed.

@codecov bot commented Aug 28, 2020

Codecov Report

❗ No coverage uploaded for pull request base (gh/sinannasir/13/base@2082307).
The diff coverage is n/a.


```
@@                   Coverage Diff                    @@
##             gh/sinannasir/13/base   #43310   +/-   ##
========================================================
  Coverage                         ?   69.34%
========================================================
  Files                            ?      378
  Lines                            ?    46680
  Branches                         ?        0
========================================================
  Hits                             ?    32369
  Misses                           ?    14311
  Partials                         ?        0
```


@facebook-github-bot (Contributor) commented:

This pull request has been merged in 1a79d7b.

facebook-github-bot deleted the gh/sinannasir/13/head branch September 1, 2020 14:17
sinannasir mentioned this pull request Sep 4, 2020