
Conversation

@ashishfarmer

Summary:
This pull request is a partial revert of #44833 for ROCm to fix the performance of the concatenate operator. The changes only affect execution on ROCm and are guarded by the define `__HIP_PLATFORM_HCC__`.
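
For illustration, the guard pattern looks roughly like the sketch below (simplified, not the exact code in this diff; `__HIP_PLATFORM_HCC__` is defined when the hipified sources are built for ROCm):

```
// Minimal sketch of the platform guard. ROCm-only code compiles only when
// building with HIP; CUDA builds are completely unaffected.
#ifdef __HIP_PLATFORM_HCC__
  // ROCm-specific implementation (restores the pre-#44833 behavior)
#else
  // Existing CUDA implementation, unchanged
#endif
```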

Test plan:
Benchmark
`python -m pt.cat_test --tag_filter all --device cuda`

Results on ROCm before the PR:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 10828.314

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11888.028

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11898.945

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 11787.744

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11792.479

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 11769.718

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f989e5c2510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f989e5c2510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 11633.882

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f989e5c2620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f989e5c2620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 11617.768

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f96eee4df28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f96eee4df28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 11625.143

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 13079.204

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f96ef8740d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f96ef8740d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 13095.620

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f96ef874158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f96ef874158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 13403.086

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 118.704

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 263.273

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 463.024

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8741e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8741e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 23818.032

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 234778.296

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8742f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8742f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 470288.132

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 704361.221
```

Results on ROCm after the PR:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 29.292

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 46.320

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 36.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 92.816

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 93.943

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 163.914

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1da3186510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1da3186510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 75.475

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1da3186620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1da3186620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 68.880

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f1bf3c50f28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f1bf3c50f28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 85.268

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 111.543

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f1bf46690d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f1bf46690d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 110.644

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f1bf4669158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f1bf4669158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 116.201

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 117.708

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 264.953

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 480.304

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46691e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46691e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 116.385

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 913.591

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46692f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46692f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 2003.212

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 3004.174
```

@jeffdaily jeffdaily added the module: rocm AMD GPU support for Pytorch label Oct 9, 2020
@jeffdaily jeffdaily requested a review from lly-zero-one October 9, 2020 17:24
@ashishfarmer ashishfarmer marked this pull request as ready for review October 9, 2020 17:24
@jeffdaily
Collaborator

Requested review from @lly-zero-one as code owner.

@jeffdaily
Collaborator

jeffdaily commented Oct 9, 2020

CC @ezyang @xw285cornell @malfet @sunway513 for awareness. We plan to submit this PR as a cherry-pick against release/1.7 once this passes CI and is merged to master. Refer to the performance comparison as justification.

@lly-zero-one
Contributor

lly-zero-one commented Oct 9, 2020

@jeffdaily Thanks for the fix. I'd like to understand why #44833 introduces a regression for the HIP case. Could you elaborate? Basically, that PR got rid of the H2D copy by using constant memory instead.

@jeffdaily
Collaborator

@lly-zero-one our compiler does not yet handle this use case efficiently: passing aggregate structs by value as kernel arguments. This PR restores the old behavior of passing a pointer to the struct instead. Our compiler team is working on a fix.
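
For illustration, the difference between the two schemes is roughly the following sketch (hypothetical type and kernel names, not the code in this PR):

```
// Hypothetical metadata struct; the real one carries input pointers, offsets,
// and shape/stride information for the batched copy.
struct CatMetadata {
  const float* inputs[32];
  int offsets[32];
  int nTensors;
};

// After #44833: the struct is passed by value as a kernel argument, so it is
// delivered through constant/kernel-argument memory and no explicit H2D copy
// is needed. The ROCm compiler currently handles large by-value aggregates
// like this inefficiently.
__global__ void catByValue(float* out, CatMetadata meta) { /* ... */ }

// What this PR restores for ROCm: the struct is copied to device memory once
// (one H2D transfer) and the kernel receives only a pointer to it.
__global__ void catByPointer(float* out, const CatMetadata* meta) { /* ... */ }
```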

@malfet
Contributor

malfet commented Oct 9, 2020

Since the implementations seem to have diverged sufficiently now, perhaps a HIP-specific implementation can be checked into shapes.hip?

@codecov

codecov bot commented Oct 9, 2020

Codecov Report

Merging #46097 into master will increase coverage by 0.00%.
The diff coverage is n/a.


```
@@           Coverage Diff           @@
##           master   #46097   +/-   ##
=======================================
  Coverage   68.17%   68.17%
=======================================
  Files         410      410
  Lines       53422    53422
=======================================
+ Hits        36422    36423    +1
+ Misses      17000    16999    -1
```

| Impacted Files | Coverage Δ |
| --- | --- |
| torch/testing/_internal/expecttest.py | 78.57% <0.00%> (+1.02%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 362d9a9...3128c39.

@jeffdaily
Collaborator

Certainly not ideal, but there is some precedent established with diverging CUDA/ROCm implementations in the Loops.h file for elementwise operators, or the cublas vs rocblas differences. We make every effort to keep these differences to a minimum. We intend to revert this change once our compiler improves, but we do not have an estimate at this time.

We're adding ~160 lines to this file; it's not a complete duplication since the file was originally ~400 lines, and we're only providing ROCm-specific implementations of CatArrayBatchedCopy and parallel_cat, not the handful of remaining functions. The CUDA path is unchanged and is free to continue changing in the future, as needed. The #ifdef is much easier than adding new source files. We're hoping the benefit of undoing the massive ROCm performance regression is worth the price of this short-term divergence.

@agolynski agolynski added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 12, 2020
@jeffdaily
Collaborator

@malfet any response to my comment? Again, I want to stress that we will revert this workaround as soon as it becomes possible.

@malfet malfet requested a review from ngimel October 12, 2020 19:22
Contributor

@malfet malfet left a comment


@jeffdaily having two different implementations of functions with the same name makes reading this file quite confusing for the developer.
On the other hand, since an uninstantiated template is essentially a no-op, please consider adding a HIP_ prefix to the ROCm-specific template implementations and keeping just a single small #ifdef in cat_out_cuda that calls hip_parallel_cat when compiling with ROCm.
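
For illustration, the resulting dispatch would look roughly like this sketch (simplified signature; the argument list is hypothetical):

```
// Single dispatch point in cat_out_cuda. The HIP_-prefixed templates are
// never instantiated in CUDA builds, so they cost nothing there.
#ifdef __HIP_PLATFORM_HCC__
  hip_parallel_cat<scalar_t>(out, inputs, dimension, nDims);
#else
  parallel_cat<scalar_t>(out, inputs, dimension, nDims);
#endif
```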


```
template <typename T, typename IndexType, int Dims>
C10_LAUNCH_BOUNDS_1(512)
__global__ void CatArrayBatchedCopy(
```
Contributor


Suggested change
```
-__global__ void CatArrayBatchedCopy(
+__global__ void HIP_CatArrayBatchedCopy(
```


```
#ifdef __HIP_PLATFORM_HCC__
template <typename scalar_t>
void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,
```
Contributor


Suggested change
```
-void parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,
+void hip_parallel_cat(Tensor &out, const TensorList &inputs, int64_t dimension,
```

```
}
// Template Declarations for dim = 1, 2, 3, 4
#define HANDLE_CASE(DIMS) \
CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
```
Contributor


Suggested change
```
-CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
+HIP_CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<\
```

Author


Updated the branch with suggested changes

@jeffdaily
Collaborator

@ashishfarmer please update based on comments from @malfet. Thank you.

@dr-ci

dr-ci bot commented Oct 12, 2020

💊 CI failures summary and remediations

As of commit c29834d (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.



```
dim3 catGrid;
getCatGrid(batchCounter, catGrid);

```

Contributor


nit

Suggested change

Contributor

@facebook-github-bot facebook-github-bot left a comment


@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@malfet merged this pull request in d5ca53c.

ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Oct 14, 2020
malfet pushed a commit that referenced this pull request Oct 14, 2020
Reviewed By: bdhirsh

Differential Revision: D24286324

Pulled By: malfet

fbshipit-source-id: 291f3f3f80f9d2f9ba52a455a942f3fb0406e7d2
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Mar 11, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 14, 2022
pytorchmergebot pushed a commit that referenced this pull request Mar 21, 2022
Revert d5ca53c (#46097). The changes only affect ROCm. Reverts a workaround for a compiler performance issue that is no longer needed.

`python -m pt.cat_test --tag_filter all --device cuda`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 48.833
NEW Forward Execution Time (us) : 8.318

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 54.508
NEW Forward Execution Time (us) : 23.824

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.117
NEW Forward Execution Time (us) : 14.942

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.790
NEW Forward Execution Time (us) : 74.334

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
OLD Forward Execution Time (us) : 102.063
NEW Forward Execution Time (us) : 76.008

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 167.786
NEW Forward Execution Time (us) : 123.679

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1b1dec7b00>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1b1dec7b00>, 111, 65], N: 5, dim: 0, device: cuda
OLD Forward Execution Time (us) : 98.320
NEW Forward Execution Time (us) : 67.436

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1b1dec7a70>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1b1dec7a70>, 64], N: 5, dim: 1, device: cuda
OLD Forward Execution Time (us) : 91.484
NEW Forward Execution Time (us) : 59.230

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f18db09d290>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f18db09d290>], N: 5, dim: 2, device: cuda
OLD Forward Execution Time (us) : 109.569
NEW Forward Execution Time (us) : 76.557

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d560>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d560>, 32, 64], N: 50, dim: 0, device: cuda
OLD Forward Execution Time (us) : 106.603
NEW Forward Execution Time (us) : 87.635

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f18db09d5f0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f18db09d5f0>, 64], N: 50, dim: 1, device: cuda
OLD Forward Execution Time (us) : 106.693
NEW Forward Execution Time (us) : 88.902

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f18db09d680>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f18db09d680>], N: 50, dim: 2, device: cuda
OLD Forward Execution Time (us) : 110.881
NEW Forward Execution Time (us) : 94.361

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
OLD Forward Execution Time (us) : 122.925
NEW Forward Execution Time (us) : 123.046

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
OLD Forward Execution Time (us) : 272.442
NEW Forward Execution Time (us) : 271.932

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
OLD Forward Execution Time (us) : 457.329
NEW Forward Execution Time (us) : 456.767

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d710>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d710>], N: 100, dim: 0, device: cuda
OLD Forward Execution Time (us) : 117.688
NEW Forward Execution Time (us) : 87.133

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d7a0>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d7a0>], N: 1000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 873.764
NEW Forward Execution Time (us) : 865.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d830>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d830>], N: 2000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 1746.831
NEW Forward Execution Time (us) : 1730.252

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f18db09d8c0>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f18db09d8c0>], N: 3000, dim: 0, device: cuda
OLD Forward Execution Time (us) : 2619.303
NEW Forward Execution Time (us) : 2598.717

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cuda
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.063
NEW Forward Execution Time (us) : 7.904

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cuda
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.275
NEW Forward Execution Time (us) : 8.118

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cuda
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.896
NEW Forward Execution Time (us) : 7.938

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cuda
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 51.745
NEW Forward Execution Time (us) : 7.922

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cuda
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.575
NEW Forward Execution Time (us) : 13.299

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cuda
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cuda
OLD Forward Execution Time (us) : 52.090
NEW Forward Execution Time (us) : 8.015
```
Pull Request resolved: #74129
Approved by: https://github.com/ngimel
facebook-github-bot pushed a commit that referenced this pull request Mar 22, 2022
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/14bf20cd922c3ba33c32343c19fd9ac490d4f7a6

Reviewed By: anjali411

Differential Revision: D34990460

fbshipit-source-id: 2bb09b9f60342f7bd23e856d4861d513dd3d104f
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
revert d5ca53c (pytorch#46097).  The changes only affect ROCm.  Reverts a work-around for a compiler performance issue that is no longer needed.

`python -m pt.cat_test --tag_filter all --device cuda`

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 7, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 8, 2022
jithunnair-amd pushed a commit to jithunnair-amd/pytorch that referenced this pull request Sep 20, 2022
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Sep 28, 2022

Labels

Merged · module: rocm (AMD GPU support for PyTorch) · open source · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
