Conversation

@beback4u (Contributor) commented Mar 1, 2022

Summary:
Optimized the GRU operator in the Vulkan GPU backend by pre-packing its weights and biases:

  • The weights and biases are always on the CPU side by design.
  • To reduce the overhead of retrieving the weight and bias tensors on every invocation, the best approach is to store them pre-packed:
    • A custom op context `GruOpContext` (derived from `torch::jit::CustomClassHolder`) holds both packed and unpacked data. The `unpacked_` struct holds the data needed to construct the op context; that data is pre-packed and stored in the `packed_` struct. The `GruOpContext` constructor populates both structs (see the sketch after this list).
    • The `at::native::vulkan::ops::gru_prepack` and `at::native::vulkan::ops::gru_run` methods use the op context. `gru_prepack` takes whatever data is needed to construct the op context and returns a pointer to the created context; `gru_run` takes the input tensors plus a pointer to the op context, and uses the data stored in the context to process the inputs.
    • Lastly, the op context class and the ops are registered in [Register.cpp](https://github.com/pytorch/pytorch/blob/11dc1581298c5bb2b322897c7b3999d1a3971720/aten/src/ATen/native/vulkan/ops/Register.cpp), and the GRU subgraph rewrite in [vulkan_rewrite.cpp](https://github.com/pytorch/pytorch/blob/11dc1581298c5bb2b322897c7b3999d1a3971720/torch/csrc/jit/passes/vulkan_rewrite.cpp) is updated so that the `gru_prepack` and `gru_run` ops are executed instead on the Vulkan GPU backend (see the registration and rewrite sketch after this list).
  • To avoid an "Undefined symbols for architecture x86_64" linker error on the x86_64 platform, the `c10::Dispatcher::callBoxed()` API is used to call `vulkan_prepack::gru_prepack` and `vulkan_prepack::gru_run` by name; otherwise the test methods can't resolve the symbols (see the boxed-call sketch after this list).
  • Added new tests for the GRU pre-packing and run operations: `gru_prepack_success` and `gru_prepack_invalidinputs_exceptions`.
  • To build PyTorch OSS on your local machine:
```
python setup.py clean
git submodule update --init --recursive
USE_VULKAN=1 USE_VULKAN_FP16_INFERENCE=1 python3 setup.py install --cmake
python setup.py develop && python -c "import torch"
```
  • To run and dump a model containing GRU operators in Python:
```
import torch
from torch.utils import mobile_optimizer
model = torch.jit.load("Mclaren_traced.pt")
vk_model = mobile_optimizer.optimize_for_mobile(model, backend="vulkan")
print(vk_model.graph)
```
  • The TorchScript below shows the graph after the GRU pre-packing rewrite:
```
%15 : Tensor[] = prim::ListConstruct(%weight_ih_l0.1, %weight_hh_l0.1, %bias_ih_l0.1, %bias_hh_l0.1, %weight_ih_l1.1, %weight_hh_l1.1, %bias_ih_l1.1, %bias_hh_l1.1)
%19 : __torch__.torch.classes.vulkan.GruOpContext = vulkan_prepack::gru_prepack(%15, %4, %5, %6, %3, %3, %4)
%20 : Tensor, %21 : Tensor = vulkan_prepack::gru_run(%input.1, %hx.1, %19)
%18 : (Tensor, Tensor) = prim::TupleConstruct(%21, %20)
return (%18)
```
  • This implementation has some limitations (the shape example after this list shows a configuration that satisfies them):
    • Tensor dim should be 3 for input sequence and hidden state.
    • has_biases=True
    • train=False
    • bidirectional=False
    • batch_first=True
    • dropout=0.0
    • D=1 since bidirectional=False
    • N=1 (batch size)
    • L=1 (sequence length)
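
The following is a minimal C++ sketch of the op-context pattern described above. It is illustrative only, not the code from this PR; the field and parameter names (params_cpu, Unpacked, Packed, packed_params) are assumptions, and the actual packing layout and GRU math are elided.

```
#include <torch/custom_class.h>
#include <ATen/ATen.h>
#include <tuple>
#include <vector>

// Sketch of an op context: holds the original (unpacked) CPU tensors and
// the pre-packed Vulkan copies, built once at construction time.
class GruOpContext : public torch::jit::CustomClassHolder {
 public:
  GruOpContext(
      std::vector<at::Tensor> params_cpu, // weights/biases, CPU by design
      bool has_biases, int64_t num_layers, double dropout,
      bool train, bool bidirectional, bool batch_first) {
    // Keep the original CPU tensors (needed e.g. for serialization).
    unpacked_ = {params_cpu, has_biases, num_layers, dropout,
                 train, bidirectional, batch_first};
    // Pre-pack once: move each tensor to the Vulkan backend so gru_run
    // never has to re-upload the weights on every invocation.
    for (const auto& t : params_cpu) {
      packed_.params_vk.push_back(t.vulkan());
    }
  }

  const std::vector<at::Tensor>& packed_params() const {
    return packed_.params_vk;
  }

 private:
  struct Unpacked {
    std::vector<at::Tensor> params;
    bool has_biases;
    int64_t num_layers;
    double dropout;
    bool train;
    bool bidirectional;
    bool batch_first;
  } unpacked_;
  struct Packed {
    std::vector<at::Tensor> params_vk; // weights/biases on the GPU
  } packed_;
};

// gru_prepack builds the context once; gru_run reuses it on every call.
c10::intrusive_ptr<GruOpContext> gru_prepack(
    std::vector<at::Tensor> params_cpu, bool has_biases, int64_t num_layers,
    double dropout, bool train, bool bidirectional, bool batch_first) {
  return c10::make_intrusive<GruOpContext>(
      std::move(params_cpu), has_biases, num_layers, dropout,
      train, bidirectional, batch_first);
}

std::tuple<at::Tensor, at::Tensor> gru_run(
    const at::Tensor& input_vk, const at::Tensor& hx_vk,
    const c10::intrusive_ptr<GruOpContext>& context) {
  // The real op runs the GRU cells against context->packed_params();
  // the math is elided in this sketch.
  return std::make_tuple(input_vk, hx_vk); // placeholder
}
```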
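Registration and the graph rewrite could then look roughly like the sketch below. The schema strings and IR patterns are paraphrased from the description above, not copied from Register.cpp or vulkan_rewrite.cpp.

```
#include <torch/library.h>
#include <torch/csrc/jit/passes/subgraph_rewrite.h>

// Register.cpp analogue: expose the class and declare the two ops.
TORCH_LIBRARY(vulkan, m) {
  m.class_<GruOpContext>("GruOpContext"); // serialization hooks omitted
}
TORCH_LIBRARY(vulkan_prepack, m) {
  m.def(
      "gru_prepack(Tensor[] params_cpu, bool has_biases, int num_layers, "
      "float dropout, bool train, bool bidirectional, bool batch_first) "
      "-> __torch__.torch.classes.vulkan.GruOpContext");
  m.def(
      "gru_run(Tensor input_vk, Tensor hx_vk, "
      "__torch__.torch.classes.vulkan.GruOpContext context) "
      "-> (Tensor, Tensor)");
}

// vulkan_rewrite.cpp analogue: swap aten::gru for the prepack/run pair so
// that packing happens once, when the model is optimized for mobile.
void insertPrePackedGruOp(std::shared_ptr<torch::jit::Graph>& graph) {
  const std::string gru_pattern = R"(
    graph(%input, %hx, %params, %has_biases, %num_layers, %dropout, %train, %bidirectional, %batch_first):
      %y : Tensor, %hn : Tensor = aten::gru(%input, %hx, %params, %has_biases, %num_layers, %dropout, %train, %bidirectional, %batch_first)
      return (%y, %hn))";
  const std::string prepacked_gru_pattern = R"(
    graph(%input, %hx, %params, %has_biases, %num_layers, %dropout, %train, %bidirectional, %batch_first):
      %ctx = vulkan_prepack::gru_prepack(%params, %has_biases, %num_layers, %dropout, %train, %bidirectional, %batch_first)
      %y : Tensor, %hn : Tensor = vulkan_prepack::gru_run(%input, %hx, %ctx)
      return (%y, %hn))";
  torch::jit::SubgraphRewriter rewriter;
  rewriter.RegisterRewritePattern(gru_pattern, prepacked_gru_pattern);
  rewriter.runOnGraph(graph);
}
```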
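The boxed-call workaround for the x86_64 linker issue can look roughly like this in the tests; the helper name and its arguments are placeholders.

```
#include <ATen/ATen.h>
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/core/stack.h>

// Look the op up by name at runtime, so the x86_64 test binary never
// references the gru_prepack symbol at link time.
c10::IValue prepack_by_name(
    c10::List<at::Tensor> params_cpu, bool has_biases, int64_t num_layers,
    double dropout, bool train, bool bidirectional, bool batch_first) {
  auto op = c10::Dispatcher::singleton()
                .findSchemaOrThrow("vulkan_prepack::gru_prepack", "");
  torch::jit::Stack stack;
  torch::jit::push(stack, params_cpu, has_biases, num_layers, dropout,
                   train, bidirectional, batch_first);
  op.callBoxed(&stack);          // dispatch through the boxed API
  return torch::jit::pop(stack); // IValue holding the GruOpContext
}
```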
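For concreteness, here is a hypothetical configuration that satisfies all of the constraints above (the sizes are arbitrary; only the shapes and flags matter):

```
#include <ATen/ATen.h>
#include <tuple>
#include <vector>

// N = 1 (batch), L = 1 (sequence), D = 1 (since bidirectional=False);
// both input and hidden state are 3-dimensional, and batch_first=True.
std::tuple<at::Tensor, at::Tensor> run_supported_gru() {
  const int64_t input_size = 17, hidden_size = 50, num_layers = 2;
  auto input = at::rand({1, 1, input_size});        // (N, L, H_in)
  auto hx = at::rand({num_layers, 1, hidden_size}); // (D*num_layers, N, H_out)
  // weight_ih, weight_hh, bias_ih, bias_hh per layer, in the same order
  // as the prim::ListConstruct in the TorchScript above (3 gates per GRU).
  std::vector<at::Tensor> params;
  for (int64_t l = 0; l < num_layers; ++l) {
    const int64_t in = (l == 0) ? input_size : hidden_size;
    params.push_back(at::rand({3 * hidden_size, in}));
    params.push_back(at::rand({3 * hidden_size, hidden_size}));
    params.push_back(at::rand({3 * hidden_size}));
    params.push_back(at::rand({3 * hidden_size}));
  }
  return at::gru(input, hx, params, /*has_biases=*/true, num_layers,
                 /*dropout=*/0.0, /*train=*/false,
                 /*bidirectional=*/false, /*batch_first=*/true);
}
```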

Test Plan:
Build & test on Android:

```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```

Build & test on MacOS (x86_64):

```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```

Test result on Android (Google Pixel 5):

```
Running main() from gtest_main.cc
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.gru_mclareninputs_success
[       OK ] VulkanAPITest.gru_mclareninputs_success (1037 ms)
[ RUN      ] VulkanAPITest.gru_invalidinputs_exceptions
[       OK ] VulkanAPITest.gru_invalidinputs_exceptions (16 ms)
[ RUN      ] VulkanAPITest.gru_prepack_success
[       OK ] VulkanAPITest.gru_prepack_success (45 ms)
[ RUN      ] VulkanAPITest.gru_prepack_invalidinputs_exceptions
[       OK ] VulkanAPITest.gru_prepack_invalidinputs_exceptions (16 ms)
[----------] 4 tests from VulkanAPITest (1114 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1114 ms total)
[  PASSED  ] 4 tests.
```

Test result on MacOS (x86_64):

```
Running main() from gtest_main.cc
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.gru_mclareninputs_success
[       OK ] VulkanAPITest.gru_mclareninputs_success (1012 ms)
[ RUN      ] VulkanAPITest.gru_invalidinputs_exceptions
[       OK ] VulkanAPITest.gru_invalidinputs_exceptions (40 ms)
[ RUN      ] VulkanAPITest.gru_prepack_success
[       OK ] VulkanAPITest.gru_prepack_success (99 ms)
[ RUN      ] VulkanAPITest.gru_prepack_invalidinputs_exceptions
[       OK ] VulkanAPITest.gru_prepack_invalidinputs_exceptions (39 ms)
[----------] 4 tests from VulkanAPITest (1190 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1190 ms total)
[  PASSED  ] 4 tests.
```

Differential Revision: D34556940


pytorch-bot bot commented Mar 1, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/beback4u/pytorch/blob/c7758f5274d6f530e993621ef95f78e3dbfc313f/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

@facebook-github-bot added the cla signed and oncall: jit (Add this issue/PR to JIT oncall triage queue) labels on Mar 1, 2022

facebook-github-bot commented Mar 1, 2022

As of commit b52aa4d (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D34556940

@beback4u requested a review from SS-JIA on March 16, 2022 at 18:09
@beback4u changed the title from "[Vulkan] Optimize GRU operator with prepacking" to "[Vulkan] Optimize GRU operator with pre-packing" on Mar 16, 2022

Pull Request resolved: pytorch#73599

Reviewed By: SS-JIA

Differential Revision: D34556940

fbshipit-source-id: 79ed13e81b804521e7dc7c7c1a28404ced8d3100

facebook-github-bot pushed a commit that referenced this pull request Mar 17, 2022
Pull Request resolved: #73599

Reviewed By: SS-JIA

Differential Revision: D34556940

fbshipit-source-id: dce918de238fb8a4a0ea5e966e05ca99ed910c28
@github-actions

Hey @beback4u.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.
