[aten] Call fbgemm functions for embedding prepack/unpack #44845
fbgemm functions are vectorized and faster.

```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786
Summary (total time 15.08s):
  PASS: 7
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Performance before and after (PyTorch/Caffe2 operator micro-benchmarks, tag: short, Eager mode, num_embeddings: 80, forward execution time in us):

| Benchmark | embedding_dim | Before (us) | After (us) |
|---|---|---|---|
| qembeddingbag_byte_prepack | 128 | 68.727 | 10.782 |
| qembeddingbag_byte_prepack | 256 | 131.500 | 17.443 |
| qembeddingbag_byte_prepack | 512 | 248.190 | 25.898 |
| qembeddingbag_4bit_prepack | 128 | 172.742 | 13.903 |
| qembeddingbag_4bit_prepack | 256 | 333.008 | 18.575 |
| qembeddingbag_4bit_prepack | 512 | 652.423 | 30.650 |
| qembeddingbag_2bit_prepack | 128 | 167.282 | 14.158 |
| qembeddingbag_2bit_prepack | 256 | 398.901 | 19.818 |
| qembeddingbag_2bit_prepack | 512 | 785.254 | 30.852 |
| qembeddingbag_byte_unpack | 128 | 122.653 | 47.596 |
| qembeddingbag_byte_unpack | 256 | 230.617 | 91.025 |
| qembeddingbag_byte_unpack | 512 | 408.807 | 131.425 |
| qembeddingbag_4bit_unpack | 128 | 176.087 | 12.637 |
| qembeddingbag_4bit_unpack | 256 | 337.514 | 20.856 |
| qembeddingbag_4bit_unpack | 512 | 659.716 | 33.944 |
| qembeddingbag_2bit_unpack | 128 | 342.529 | 21.181 |
| qembeddingbag_2bit_unpack | 256 | 665.197 | 34.213 |
| qembeddingbag_2bit_unpack | 512 | 1307.923 | 59.622 |

Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)

Pull Request resolved: #44845
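For reference, the benchmarked ops are exposed under `torch.ops.quantized`. Below is a minimal sketch of what the byte prepack/unpack pair computes, assuming the fused row layout fbgemm uses (per row: `embedding_dim` uint8 values followed by a float32 scale and a float32 bias/min); `byte_prepack_reference` is an illustrative helper written for this sketch, not the actual aten implementation:

```python
import torch

def byte_prepack_reference(weight: torch.Tensor) -> torch.Tensor:
    # Row-wise 8-bit quantization: scale = (max - min) / 255 per row,
    # q = round((x - min) / scale), clamped to [0, 255].
    mins = weight.min(dim=1, keepdim=True).values
    maxs = weight.max(dim=1, keepdim=True).values
    scales = ((maxs - mins) / 255.0).clamp(min=1e-8)  # guard constant rows (simplification)
    q = torch.clamp(torch.round((weight - mins) / scales), 0, 255).to(torch.uint8)
    # Fuse the per-row float32 scale and bias into the packed row as raw bytes:
    # [N, 2] float32 reinterpreted as [N, 8] uint8.
    scale_bias = torch.cat([scales, mins], dim=1).to(torch.float32).contiguous()
    return torch.cat([q, scale_bias.view(torch.uint8)], dim=1)

w = torch.randn(80, 128)
packed = torch.ops.quantized.embedding_bag_byte_prepack(w)
print(packed.shape, byte_prepack_reference(w).shape)  # both (80, 136): 128 values + 8 fused bytes

unpacked = torch.ops.quantized.embedding_bag_byte_unpack(packed)
print((unpacked - w).abs().max())  # quantization error, bounded per row by about scale/2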
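The numbers in the tables above come from the PyTorch operator micro-benchmark suite (`benchmarks/operator_benchmark` in the repo). A rough standalone timing sketch with `torch.utils.benchmark`, reusing the shapes from the tables (this is not the exact harness that produced the numbers above):

```python
import torch
import torch.utils.benchmark as benchmark

op = torch.ops.quantized.embedding_bag_byte_prepack
for dim in (128, 256, 512):
    w = torch.randn(80, dim)  # num_embeddings=80, embedding_dim=dim
    timer = benchmark.Timer(stmt="op(w)", globals={"op": op, "w": w})
    print(f"embedding_dim={dim}:", timer.timeit(1000))
```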
💊 CI failures summary (Dr. CI, as of commit 168d4f8): 💚 Looks good so far! There are no failures yet. 💚
Codecov Report

```
@@            Coverage Diff             @@
##   gh/dskhudia/26/base   #44845   +/- ##
==========================================
  Coverage           ?   68.06%
  Files              ?      393
  Lines              ?    50918
  Branches           ?        0
==========================================
  Hits               ?    34655
  Misses             ?    16263
  Partials           ?        0
```

Continue to review the full report at Codecov.
raghuramank100 left a comment:
Nice speedup!
fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: 
qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 
512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/) [ghstack-poisoned]
Pull Request resolved: #44845 fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager 
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, 
embedding_dim: 512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` ghstack-source-id: 112470869 Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)
fbgemm's embedding prepack/unpack functions are vectorized and therefore faster than the existing ATen implementations, so this change calls them from the quantized embedding-bag prepack/unpack operators.

```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786
Summary (total time 15.08s):
  PASS: 7
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Performance (PyTorch/Caffe2 operator micro-benchmarks, tag `short`, Eager mode, `num_embeddings: 80`; forward execution time in us, before vs. after this change; speedup computed from the two columns):

| Operator | embedding_dim | Before (us) | After (us) | Speedup |
| --- | ---: | ---: | ---: | ---: |
| qembeddingbag_byte_prepack | 128 | 68.727 | 10.782 | 6.4× |
| qembeddingbag_byte_prepack | 256 | 131.500 | 17.443 | 7.5× |
| qembeddingbag_byte_prepack | 512 | 248.190 | 25.898 | 9.6× |
| qembeddingbag_4bit_prepack | 128 | 172.742 | 13.903 | 12.4× |
| qembeddingbag_4bit_prepack | 256 | 333.008 | 18.575 | 17.9× |
| qembeddingbag_4bit_prepack | 512 | 652.423 | 30.650 | 21.3× |
| qembeddingbag_2bit_prepack | 128 | 167.282 | 14.158 | 11.8× |
| qembeddingbag_2bit_prepack | 256 | 398.901 | 19.818 | 20.1× |
| qembeddingbag_2bit_prepack | 512 | 785.254 | 30.852 | 25.5× |
| qembeddingbag_byte_unpack | 128 | 122.653 | 47.596 | 2.6× |
| qembeddingbag_byte_unpack | 256 | 230.617 | 91.025 | 2.5× |
| qembeddingbag_byte_unpack | 512 | 408.807 | 131.425 | 3.1× |
| qembeddingbag_4bit_unpack | 128 | 176.087 | 12.637 | 13.9× |
| qembeddingbag_4bit_unpack | 256 | 337.514 | 20.856 | 16.2× |
| qembeddingbag_4bit_unpack | 512 | 659.716 | 33.944 | 19.4× |
| qembeddingbag_2bit_unpack | 128 | 342.529 | 21.181 | 16.2× |
| qembeddingbag_2bit_unpack | 256 | 665.197 | 34.213 | 19.4× |
| qembeddingbag_2bit_unpack | 512 | 1307.923 | 59.622 | 21.9× |

Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)
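For illustration (not part of this PR), a minimal round trip through the byte variant of these operators; this sketch assumes the `torch.ops.quantized.embedding_bag_byte_prepack` / `embedding_bag_byte_unpack` registrations that the fbgemm-backed kernels sit behind, and a PyTorch build with fbgemm enabled:

```python
# Minimal sketch of the prepack/unpack round trip benchmarked above.
# Assumes the byte-variant ops under torch.ops.quantized; the 4bit/2bit
# variants take additional quantization parameters.
import torch

num_embeddings, embedding_dim = 80, 128
weight = torch.randn(num_embeddings, embedding_dim, dtype=torch.float32)

# Prepack: row-wise quantize to uint8; each row carries its own fp32
# scale and zero point, so the packed row width is embedding_dim + 8.
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)

# Unpack: dequantize back to float32 with the original shape.
unpacked = torch.ops.quantized.embedding_bag_byte_unpack(packed)

assert unpacked.shape == weight.shape
# The round trip is lossy (8-bit row-wise quantization), so values only
# match the original up to quantization error.
print((unpacked - weight).abs().max().item())
```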
Pull Request resolved: #44845 fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager 
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, 
embedding_dim: 512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` ghstack-source-id: 112496321 Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)
fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: 
qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 
512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/) [ghstack-poisoned]
Pull Request resolved: #44845 fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager 
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, 
embedding_dim: 512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` ghstack-source-id: 112664490 Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)
fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: 
qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 
512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/) [ghstack-poisoned]
Pull Request resolved: #44845 fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager 
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, 
embedding_dim: 512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` ghstack-source-id: 112699964 Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)
fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 167.282 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 398.901 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 785.254 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 122.653 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 230.617 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 408.807 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 176.087 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 337.514 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: 
qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 659.716 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 342.529 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 665.197 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 1307.923 ``` Performance After: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 10.782 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 17.443 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 25.898 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 13.903 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 18.575 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.650 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 14.158 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 19.818 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 30.852 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 47.596 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 91.025 # Benchmarking PyTorch: qembeddingbag_byte_unpack # Mode: Eager # Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 
512 Forward Execution Time (us) : 131.425 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 12.637 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 20.856 # Benchmarking PyTorch: qembeddingbag_4bit_unpack # Mode: Eager # Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 33.944 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 21.181 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 34.213 # Benchmarking PyTorch: qembeddingbag_2bit_unpack # Mode: Eager # Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 59.622 ``` Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/) [ghstack-poisoned]
Pull Request resolved: #44845

fbgemm's prepack/unpack routines are vectorized, so routing the quantized embedding prepack and unpack operators through them makes these ops significantly faster.

```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786
Summary (total time 15.08s):
  PASS: 7
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Performance (PyTorch/Caffe2 operator micro-benchmarks, tag `short`, Eager mode, num_embeddings = 80; forward execution time in us):

| Operator | embedding_dim | Before (us) | After (us) |
| --- | --- | --- | --- |
| qembeddingbag_byte_prepack | 128 | 68.727 | 10.782 |
| qembeddingbag_byte_prepack | 256 | 131.500 | 17.443 |
| qembeddingbag_byte_prepack | 512 | 248.190 | 25.898 |
| qembeddingbag_4bit_prepack | 128 | 172.742 | 13.903 |
| qembeddingbag_4bit_prepack | 256 | 333.008 | 18.575 |
| qembeddingbag_4bit_prepack | 512 | 652.423 | 30.650 |
| qembeddingbag_2bit_prepack | 128 | 167.282 | 14.158 |
| qembeddingbag_2bit_prepack | 256 | 398.901 | 19.818 |
| qembeddingbag_2bit_prepack | 512 | 785.254 | 30.852 |
| qembeddingbag_byte_unpack | 128 | 122.653 | 47.596 |
| qembeddingbag_byte_unpack | 256 | 230.617 | 91.025 |
| qembeddingbag_byte_unpack | 512 | 408.807 | 131.425 |
| qembeddingbag_4bit_unpack | 128 | 176.087 | 12.637 |
| qembeddingbag_4bit_unpack | 256 | 337.514 | 20.856 |
| qembeddingbag_4bit_unpack | 512 | 659.716 | 33.944 |
| qembeddingbag_2bit_unpack | 128 | 342.529 | 21.181 |
| qembeddingbag_2bit_unpack | 256 | 665.197 | 34.213 |
| qembeddingbag_2bit_unpack | 512 | 1307.923 | 59.622 |

ghstack-source-id: 112812505
Differential Revision: [D23675777](https://our.internmc.facebook.com/intern/diff/D23675777/)
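The numbers above come from PyTorch's operator micro-benchmark harness. A minimal sketch of how one of these cases can be reproduced with the `operator_benchmark` package follows; the class name and config are illustrative, not the PR's actual benchmark file.

```python
import operator_benchmark as op_bench
import torch

# Illustrative config mirroring the shapes reported above.
configs = op_bench.cross_product_configs(
    num_embeddings=[80],
    embedding_dim=[128, 256, 512],
    tags=["short"],
)

class QEmbeddingBagBytePrepackBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, num_embeddings, embedding_dim):
        # Float weight matrix to be rowwise-quantized and packed.
        self.inputs = {"weight": torch.randn(num_embeddings, embedding_dim)}
        self.set_module_name("qembeddingbag_byte_prepack")

    def forward(self, weight):
        return torch.ops.quantized.embedding_bag_byte_prepack(weight)

op_bench.generate_pt_test(configs, QEmbeddingBagBytePrepackBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```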
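As a quick functional sanity check (not part of the PR's test suite), the prepack/unpack pair can be exercised directly through the quantized op namespace; the tolerance below is an assumption, since 8-bit rowwise quantization is lossy.

```python
import torch

weight = torch.randn(80, 128)  # num_embeddings x embedding_dim, as benchmarked

# Rowwise 8-bit quantize-and-pack: output is a uint8 tensor with the per-row
# scale and zero point stored as floats at the end of each row.
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)

# Unpack back to float32; this is the inverse op that this PR also routes to fbgemm.
unpacked = torch.ops.quantized.embedding_bag_byte_unpack(packed)

# The round trip is lossy; check agreement within a loose (assumed) tolerance.
assert torch.allclose(weight, unpacked, atol=0.1)
```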
This pull request has been merged in 677a59d.