Improved perfs for vectorized interpolate cpu uint8 RGB-case #96651
…element in the line except the last line
```cpp
auto yout = unpacked_output.size(1);
TORCH_INTERNAL_ASSERT(num_channels == unpacked_input.size(0));

auto xout_stride = xout * num_channels;
```
Why is this a template parameter when num_channels is never used in a constexpr context? Do you expect ImageResampleVerticalConvolution8u to always be inlined and specialized?
I rechecked the assembly code, reran the benchmarks, and do not see any advantage in using templates for ImageResampleVerticalConvolution8u. Will remove them in the other PR.
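The trade-off discussed in this thread can be illustrated with a small scalar sketch. The helper names `sum_pixel_tpl` and `sum_pixel_rt` are hypothetical and do not appear in the PR. A template parameter only changes the generated code when the value is used at compile time (for example as a fixed loop bound the compiler can unroll); otherwise a runtime argument computes the same result with a single instantiation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: channel count fixed at compile time.
// The loop bound is a constant, so the compiler may fully unroll it.
template <int num_channels>
std::uint32_t sum_pixel_tpl(const std::uint8_t* pixel) {
  std::uint32_t acc = 0;
  for (int c = 0; c < num_channels; ++c) {
    acc += pixel[c];
  }
  return acc;
}

// Hypothetical helper: channel count passed at run time.
// Same arithmetic, one instantiation for every channel count.
std::uint32_t sum_pixel_rt(const std::uint8_t* pixel, int num_channels) {
  assert(num_channels > 0);
  std::uint32_t acc = 0;
  for (int c = 0; c < num_channels; ++c) {
    acc += pixel[c];
  }
  return acc;
}
```

Whether the templated form is measurably faster depends on the surrounding code, which is why checking the assembly and benchmarks, as done above, is the deciding step.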
```cpp
// Define various shuffling masks
const auto kmask_low = _mm256_set_epi8(
    11, 10, 9, 8, 11, 10, 9, 8, 11, 10, 9, 8, 11, 10, 9, 8,
```
These need descriptive names, what shuffle are they performing?
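A scalar model can make the shuffle concrete. The sketch below emulates `_mm256_shuffle_epi8` byte by byte and applies a mask with indices 11, 10, 9, 8 in the upper 128-bit lane and 3, 2, 1, 0 in the lower lane. The lower half of `kmask_low` is cut off in the excerpt, so the lower-lane pattern is an assumption taken from the inline mask elsewhere in this diff. `shuffle_epi8_model` and `make_broadcast_mask` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;

// Scalar model of _mm256_shuffle_epi8: each output byte picks a source byte
// from the SAME 128-bit lane using the low 4 bits of the mask byte; a mask
// byte with its high bit set produces zero.
Bytes32 shuffle_epi8_model(const Bytes32& src, const Bytes32& mask) {
  Bytes32 out{};
  for (int i = 0; i < 32; ++i) {
    const int lane_base = (i / 16) * 16;
    out[i] = (mask[i] & 0x80) ? 0 : src[lane_base + (mask[i] & 0x0F)];
  }
  return out;
}

// The mask in memory order (byte 0 first). _mm256_set_epi8 lists bytes from
// highest to lowest, so "11, 10, 9, 8" in the source reads 8, 9, 10, 11 here.
Bytes32 make_broadcast_mask() {
  Bytes32 m{};
  for (int i = 0; i < 16; ++i) {
    m[i] = static_cast<std::uint8_t>(i % 4);           // lower lane: bytes 0..3
    m[16 + i] = static_cast<std::uint8_t>(8 + i % 4);  // upper lane: bytes 8..11
  }
  return m;
}
```

Under these assumptions the shuffle broadcasts one 32-bit element per lane: element 0 of the lower lane and element 2 of the upper lane, each repeated four times. A name that says which element is broadcast would answer the question above.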
```cpp
mmk = _mm256_shuffle_epi8(ksource, _mm256_set_epi8(
    11,10, 9,8, 11,10, 9,8, 11,10, 9,8, 11,10, 9,8,
    3,2, 1,0, 3,2, 1,0, 3,2, 1,0, 3,2, 1,0));
auto sss256 = _mm256_set1_epi32(1 << (coefs_precision - 2));
```
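One plausible reading of the `1 << (coefs_precision - 2)` initial value is sketched below. It assumes, which the excerpt does not show, that the two 128-bit halves of `sss256` are later added together before the final right shift by `coefs_precision`. In that case each half carries a quarter unit, and their sum gives the usual half-unit bias for round-to-nearest. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Round-to-nearest fixed-point reduction with a single accumulator:
// start from half of the final divisor, add the products, then shift.
std::int32_t reduce_single(const std::int16_t* w, const std::uint8_t* px,
                           int n, int coefs_precision) {
  assert(coefs_precision >= 1);
  std::int32_t acc = 1 << (coefs_precision - 1);
  for (int i = 0; i < n; ++i) {
    acc += w[i] * px[i];
  }
  return acc >> coefs_precision;
}

// Same reduction with two partial accumulators, standing in for the two
// 128-bit halves of a 256-bit register (an assumption, see the lead-in).
// Each starts from a quarter unit, so their sum carries the half-unit bias.
std::int32_t reduce_split(const std::int16_t* w, const std::uint8_t* px,
                          int n, int coefs_precision) {
  assert(coefs_precision >= 2);
  std::int32_t lo = 1 << (coefs_precision - 2);
  std::int32_t hi = 1 << (coefs_precision - 2);
  for (int i = 0; i < n; ++i) {
    (i % 2 == 0 ? lo : hi) += w[i] * px[i];
  }
  return (lo + hi) >> coefs_precision;
}
```

With weights `{64, 128, 64}` at 8 bits of precision and pixels `{10, 20, 31}`, the exact weighted mean is 20.25 and both variants return 20.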
Can you make this a stacked PR, with the style changes in a separate PR? It would make the diff much nicer to read.
Closed in favor of this stack: #96848
Is int8 also supported via the same codepath? #5580
- Based on #96651 - Fixed mem pointer alignment [ghstack-poisoned]
@vadimkantorov no, it does not. Can you share some info on why supporting int8 would be interesting compared to uint8?
In some rare cases it could be useful for resizing label maps with values like -1, 0, 1 (e.g. for speaker separation the first speaker could be -1, the second speaker 1 and "unknown" 0). Or the other way around: -1 could mean "unknown" and 0/1 could mean two class labels. More practical cases are dtypes bool (two classes) / int16 (more classes) / int32. But if one can use a smaller memory footprint, it's useful. And of course it's useful for consistency: it's good UX when basic ops like interpolate are efficiently supported for all dtypes. Support for torch.bool should be easy to implement if uint8 is already supported: just reinterpret the torch.bool input as uint8 and then convert back. It's also very useful and natural for processing binary images / segmentation masks / time series segmentation masks.
## Description

- Based on #96651
- Improved performance for vectorized interpolate uint8 RGB-case
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD
- RGBA-case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------ Resize ------------------------------------------]
                                                         | Pillow (9.0.0.post1) | torch (2.1.0a0+gitcc42a3f) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear 256 -> 32 aa=True   |  38.8 |  56.0 |  133.2 | 2.4
3 torch.uint8 channels_last bilinear 256 -> 32 aa=False  |       |  37.5 |  112.8 | 3.0
3 torch.uint8 channels_last bilinear 256 -> 224 aa=True  | 128.7 | 157.0 |  305.4 | 1.9
3 torch.uint8 channels_last bilinear 256 -> 224 aa=False |       | 146.4 |  288.7 | 2.0
3 torch.uint8 channels_last bilinear 256 -> 320 aa=True  | 179.4 | 215.8 |  442.5 | 2.1
3 torch.uint8 channels_last bilinear 256 -> 320 aa=False |       | 212.5 |  436.9 | 2.1
3 torch.uint8 channels_last bilinear 520 -> 32 aa=True   | 113.3 | 127.9 |  464.8 | 3.6
3 torch.uint8 channels_last bilinear 520 -> 32 aa=False  |       |  56.8 |  365.5 | 6.4
3 torch.uint8 channels_last bilinear 520 -> 224 aa=True  | 281.7 | 325.2 |  722.4 | 2.2
3 torch.uint8 channels_last bilinear 520 -> 224 aa=False |       | 239.1 |  593.5 | 2.5
3 torch.uint8 channels_last bilinear 712 -> 32 aa=True   | 186.2 | 200.7 |  833.8 | 4.2
3 torch.uint8 channels_last bilinear 712 -> 32 aa=False  |       |  75.2 |  651.4 | 8.7
3 torch.uint8 channels_last bilinear 712 -> 224 aa=True  | 410.0 | 444.5 | 1128.4 | 2.5
3 torch.uint8 channels_last bilinear 712 -> 224 aa=False |       | 309.3 |  917.6 | 3.0
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements ...

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-144416-pr_vs_nightly_speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
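The "Speed-up: PR vs nightly" column appears to be the nightly time divided by the PR time. This definition is an assumption, but it reproduces the published values to one decimal place. A minimal sketch, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Speed-up as it appears in the results: nightly time divided by PR time.
// (Assumed definition; the commit message does not state the formula.)
double speedup(double nightly_time, double pr_time) {
  assert(pr_time > 0.0);
  return nightly_time / pr_time;
}

// Round to one decimal place, matching the formatting of the results.
double round1(double x) {
  return std::round(x * 10.0) / 10.0;
}
```

For the `256 -> 32 aa=True` row, `round1(speedup(133.2, 56.0))` gives 2.4, and for `712 -> 32 aa=False`, `round1(speedup(651.4, 75.2))` gives 8.7, both matching the reported column.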
@vadimkantorov thanks for the explanation, however I'm not sure I understand your point here:
Resizing label maps of any dtype should use
int16 is also useful for "resampling" PCM audio, where int16 is quite common. There, some proper filtering / anti-aliasing during "resampling" would be better than simple "nearest".
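The difference between "nearest" and filtered resampling of int16 samples can be shown with a deliberately tiny sketch. It downsamples by a factor of 2 and uses a 2-tap box average as a stand-in for a real anti-aliasing filter. Both function names are hypothetical, and the box filter is chosen only for brevity; it is not what any library discussed here implements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Downsample by 2 by keeping every other sample ("nearest"): no filtering.
std::vector<std::int16_t> decimate_nearest(const std::vector<std::int16_t>& x) {
  std::vector<std::int16_t> y;
  for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
    y.push_back(x[i]);
  }
  return y;
}

// Downsample by 2 with a 2-tap box filter (average of each pair), a very
// crude low-pass used here only to illustrate anti-aliasing.
std::vector<std::int16_t> decimate_box(const std::vector<std::int16_t>& x) {
  std::vector<std::int16_t> y;
  for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
    const std::int32_t sum = std::int32_t{x[i]} + std::int32_t{x[i + 1]};
    y.push_back(static_cast<std::int16_t>(sum / 2));
  }
  return y;
}
```

For a tone at the highest representable frequency, `{1000, -1000, 1000, -1000, ...}`, the nearest variant returns a constant 1000 (the tone aliases to a DC offset), while the box variant returns zeros, because the tone cannot be represented at the lower rate.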
…cpu uint8 RGB-case"

## Description

- Based on #96651
- Improved performance for vectorized interpolate uint8 RGB-case
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD
- RGBA-case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------ Resize ------------------------------------------]
                                                         | Pillow (9.0.0.post1) | torch (2.1.0a0+git0968a5d) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear 256 -> 32 aa=True   |  39.0 |  56.6 |  133.2 | 2.4
3 torch.uint8 channels_last bilinear 256 -> 32 aa=False  |       |  36.9 |  112.8 | 3.1
3 torch.uint8 channels_last bilinear 256 -> 224 aa=True  | 128.1 | 152.5 |  305.4 | 2.0
3 torch.uint8 channels_last bilinear 256 -> 224 aa=False |       | 141.1 |  288.7 | 2.0
3 torch.uint8 channels_last bilinear 256 -> 320 aa=True  | 179.6 | 208.8 |  442.5 | 2.1
3 torch.uint8 channels_last bilinear 256 -> 320 aa=False |       | 206.4 |  436.9 | 2.1
3 torch.uint8 channels_last bilinear 520 -> 32 aa=True   | 113.3 | 132.1 |  464.8 | 3.5
3 torch.uint8 channels_last bilinear 520 -> 32 aa=False  |       |  57.2 |  365.5 | 6.4
3 torch.uint8 channels_last bilinear 520 -> 224 aa=True  | 281.7 | 327.4 |  722.4 | 2.2
3 torch.uint8 channels_last bilinear 520 -> 224 aa=False |       | 230.2 |  593.5 | 2.6
3 torch.uint8 channels_last bilinear 712 -> 32 aa=True   | 186.9 | 210.5 |  833.8 | 4.0
3 torch.uint8 channels_last bilinear 712 -> 32 aa=False  |       |  75.6 |  651.4 | 8.6
3 torch.uint8 channels_last bilinear 712 -> 224 aa=True  | 410.3 | 450.9 | 1128.4 | 2.5
3 torch.uint8 channels_last bilinear 712 -> 224 aa=False |       | 298.7 |  917.6 | 3.1
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements ...

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved performance for vectorized bilinear interpolate uint8 RGB-case, channels last
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD (`Pillow (9.0.0.post1)`)
- RGBA-case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------ Resize ------------------------------------------]
                                                         | Pillow (9.0.0.post1) | torch (2.1.0a0+git0968a5d) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear 256 -> 32 aa=True   |  39.0 |  56.6 |  133.2 | 2.4
3 torch.uint8 channels_last bilinear 256 -> 32 aa=False  |       |  36.9 |  112.8 | 3.1
3 torch.uint8 channels_last bilinear 256 -> 224 aa=True  | 128.1 | 152.5 |  305.4 | 2.0
3 torch.uint8 channels_last bilinear 256 -> 224 aa=False |       | 141.1 |  288.7 | 2.0
3 torch.uint8 channels_last bilinear 256 -> 320 aa=True  | 179.6 | 208.8 |  442.5 | 2.1
3 torch.uint8 channels_last bilinear 256 -> 320 aa=False |       | 206.4 |  436.9 | 2.1
3 torch.uint8 channels_last bilinear 520 -> 32 aa=True   | 113.3 | 132.1 |  464.8 | 3.5
3 torch.uint8 channels_last bilinear 520 -> 32 aa=False  |       |  57.2 |  365.5 | 6.4
3 torch.uint8 channels_last bilinear 520 -> 224 aa=True  | 281.7 | 327.4 |  722.4 | 2.2
3 torch.uint8 channels_last bilinear 520 -> 224 aa=False |       | 230.2 |  593.5 | 2.6
3 torch.uint8 channels_last bilinear 712 -> 32 aa=True   | 186.9 | 210.5 |  833.8 | 4.0
3 torch.uint8 channels_last bilinear 712 -> 32 aa=False  |       |  75.6 |  651.4 | 8.6
3 torch.uint8 channels_last bilinear 712 -> 224 aa=True  | 410.3 | 450.9 | 1128.4 | 2.5
3 torch.uint8 channels_last bilinear 712 -> 224 aa=False |       | 298.7 |  917.6 | 3.1
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements ...

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
- Based on pytorch#96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: pytorch#96848
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved performance for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[---------------------------------------------------------------- Resize ----------------------------------------------------------------]
                                                                         | Pillow (9.0.0.post1) | torch (2.1.0a0+gitc005105) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True      | 38.670 (+-0.445)   | 57.366 (+-0.799)   | 132.147 (+-1.236)   | 2.304 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False     |                    | 37.825 (+-0.417)   | 111.789 (+-1.175)   | 2.955 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True    | 127.898 (+-1.335)  | 153.081 (+-2.346)  | 302.518 (+-2.632)   | 1.976 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False   |                    | 141.695 (+-1.415)  | 286.663 (+-2.494)   | 2.023 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True    | 179.735 (+-2.054)  | 210.613 (+-3.116)  | 439.375 (+-4.014)   | 2.086 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False   |                    | 207.601 (+-1.639)  | 438.537 (+-4.143)   | 2.112 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True      | 112.679 (+-1.321)  | 130.863 (+-1.987)  | 446.804 (+-3.283)   | 3.414 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False     |                    | 57.968 (+-0.270)   | 374.244 (+-13.598)  | 6.456 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True    | 282.398 (+-3.485)  | 322.986 (+-1.947)  | 720.197 (+-3.467)   | 2.230 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False   |                    | 231.625 (+-2.006)  | 592.834 (+-3.903)   | 2.559 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True      | 185.711 (+-1.666)  | 201.069 (+-2.182)  | 787.868 (+-3.648)   | 3.918 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False     |                    | 75.975 (+-0.696)   | 651.016 (+-3.926)   | 8.569 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True    | 410.236 (+-6.021)  | 451.486 (+-3.939)  | 1123.923 (+-14.988) | 2.489 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False   |                    | 299.597 (+-1.887)  | 915.347 (+-4.486)   | 3.055 (+-0.000)
# More test-cases from #90771
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True      | 60.751 (+-0.285)   | 78.538 (+-1.282)   | 170.465 (+-1.830)   | 2.170 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True    | 133.619 (+-2.035)  | 159.614 (+-1.587)  | 330.971 (+-3.249)   | 2.074 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True  | 950.243 (+-10.641) | 891.369 (+-17.946) | 2805.510 (+-25.503) | 3.147 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True      | 52.771 (+-0.961)   | 72.253 (+-1.020)   | 135.933 (+-1.625)   | 1.881 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True    | 139.107 (+-2.143)  | 165.844 (+-2.177)  | 321.112 (+-2.904)   | 1.936 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True  | 691.470 (+-9.566)  | 764.942 (+-11.192) | 2050.880 (+-22.188) | 2.681 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False     |                    | 77.375 (+-1.345)   | 169.646 (+-1.640)   | 2.193 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False   |                    | 159.115 (+-3.935)  | 329.754 (+-2.590)   | 2.072 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False |                    | 877.248 (+-5.736)  | 2815.870 (+-22.589) | 3.210 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False     |                    | 53.120 (+-0.316)   | 112.024 (+-1.225)   | 2.109 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False   |                    | 147.330 (+-1.871)  | 299.152 (+-3.353)   | 2.030 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False |                    | 472.182 (+-10.785) | 1698.601 (+-16.785) | 3.597 (+-0.000)
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements ...

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved performance for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[---------------------------------------------------------------- Resize ----------------------------------------------------------------]
                                                                         | Pillow (9.0.0.post1) | torch (2.1.0a0+git8d955df) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True      | 38.649 (+-0.306)   | 55.828 (+-0.370)   | 132.147 (+-1.236)   | 2.367 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False     |                    | 36.826 (+-0.229)   | 111.789 (+-1.175)   | 3.036 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True    | 128.233 (+-1.313)  | 153.827 (+-1.229)  | 302.518 (+-2.632)   | 1.967 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False   |                    | 143.886 (+-1.409)  | 286.663 (+-2.494)   | 1.992 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True    | 179.504 (+-1.825)  | 211.569 (+-1.336)  | 439.375 (+-4.014)   | 2.077 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False   |                    | 209.888 (+-1.443)  | 438.537 (+-4.143)   | 2.089 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True      | 112.891 (+-1.118)  | 129.373 (+-1.396)  | 446.804 (+-3.283)   | 3.454 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False     |                    | 56.858 (+-0.227)   | 374.244 (+-13.598)  | 6.582 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True    | 282.917 (+-2.992)  | 324.378 (+-1.694)  | 720.197 (+-3.467)   | 2.220 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False   |                    | 236.078 (+-1.679)  | 592.834 (+-3.903)   | 2.511 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True      | 185.595 (+-1.633)  | 202.000 (+-1.920)  | 787.868 (+-3.648)   | 3.900 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False     |                    | 75.421 (+-0.512)   | 651.016 (+-3.926)   | 8.632 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True    | 409.691 (+-2.735)  | 449.927 (+-2.500)  | 1123.923 (+-14.988) | 2.498 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False   |                    | 306.691 (+-2.095)  | 915.347 (+-4.486)   | 2.985 (+-0.000)
# More test-cases from #90771
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True      | 60.740 (+-0.278)   | 78.745 (+-0.286)   | 170.465 (+-1.830)   | 2.165 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True    | 133.029 (+-1.619)  | 162.393 (+-1.289)  | 330.971 (+-3.249)   | 2.038 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True  | 948.849 (+-2.749)  | 896.127 (+-3.696)  | 2805.510 (+-25.503) | 3.131 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True      | 52.505 (+-0.319)   | 70.617 (+-0.344)   | 135.933 (+-1.625)   | 1.925 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True    | 138.671 (+-1.953)  | 165.638 (+-1.473)  | 321.112 (+-2.904)   | 1.939 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True  | 689.492 (+-2.917)  | 758.162 (+-3.719)  | 2050.880 (+-22.188) | 2.705 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False     |                    | 77.300 (+-0.307)   | 169.646 (+-1.640)   | 2.195 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False   |                    | 159.525 (+-1.225)  | 329.754 (+-2.590)   | 2.067 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False |                    | 890.106 (+-3.358)  | 2815.870 (+-22.589) | 3.164 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False     |                    | 52.399 (+-0.314)   | 112.024 (+-1.225)   | 2.138 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False   |                    | 148.780 (+-1.282)  | 299.152 (+-3.353)   | 2.011 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False |                    | 479.273 (+-3.432)  | 1698.601 (+-16.785) | 3.544 (+-0.000)
4
```

Note: There is no performance regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest, the speed-up is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements ...

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
…t8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case perfs are unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[------------------------------------------------ Resize ------------------------------------------------]
  | Pillow (9.0.0.post1) | torch (2.1.0a0+git8d955df) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------------------
  3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True | 38.649 (+-0.306) | 55.828 (+-0.370) | 132.147 (+-1.236) | 2.367 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False | | 36.826 (+-0.229) | 111.789 (+-1.175) | 3.036 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True | 128.233 (+-1.313) | 153.827 (+-1.229) | 302.518 (+-2.632) | 1.967 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False | | 143.886 (+-1.409) | 286.663 (+-2.494) | 1.992 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True | 179.504 (+-1.825) | 211.569 (+-1.336) | 439.375 (+-4.014) | 2.077 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False | | 209.888 (+-1.443) | 438.537 (+-4.143) | 2.089 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True | 112.891 (+-1.118) | 129.373 (+-1.396) | 446.804 (+-3.283) | 3.454 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False | | 56.858 (+-0.227) | 374.244 (+-13.598) | 6.582 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True | 282.917 (+-2.992) | 324.378 (+-1.694) | 720.197 (+-3.467) | 2.220 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False | | 236.078 (+-1.679) | 592.834 (+-3.903) | 2.511 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True | 185.595 (+-1.633) | 202.000 (+-1.920) | 787.868 (+-3.648) | 3.900 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False | | 75.421 (+-0.512) | 651.016 (+-3.926) | 8.632 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True | 409.691 (+-2.735) | 449.927 (+-2.500) | 1123.923 (+-14.988) | 2.498 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False | | 306.691 (+-2.095) | 915.347 (+-4.486) | 2.985 (+-0.000)
  # More test-cases from #90771
  3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True | 60.740 (+-0.278) | 78.745 (+-0.286) | 170.465 (+-1.830) | 2.165 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True | 133.029 (+-1.619) | 162.393 (+-1.289) | 330.971 (+-3.249) | 2.038 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True | 948.849 (+-2.749) | 896.127 (+-3.696) | 2805.510 (+-25.503) | 3.131 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True | 52.505 (+-0.319) | 70.617 (+-0.344) | 135.933 (+-1.625) | 1.925 (+-0.000)
  3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True | 138.671 (+-1.953) | 165.638 (+-1.473) | 321.112 (+-2.904) | 1.939 (+-0.000)
  3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True | 689.492 (+-2.917) | 758.162 (+-3.719) | 2050.880 (+-22.188) | 2.705 (+-0.000)
  3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False | | 77.300 (+-0.307) | 169.646 (+-1.640) | 2.195 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False | | 159.525 (+-1.225) | 329.754 (+-2.590) | 2.067 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False | | 890.106 (+-3.358) | 2815.870 (+-22.589) | 3.164 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False | | 52.399 (+-0.314) | 112.024 (+-1.225) | 2.138 (+-0.000)
  3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False | | 148.780 (+-1.282) | 299.152 (+-3.353) | 2.011 (+-0.000)
  3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False | | 479.273 (+-3.432) | 1698.601 (+-16.785) | 3.544 (+-0.000)
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
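The `Speed-up: PR vs nightly` column in the table above appears to be the nightly time divided by the PR time. A minimal check of that reading, using four rows copied from the table; the tuple layout and the `speedup` helper are illustrative names, not part of the PR:

```python
# Each tuple: (PR time, nightly time, speed-up as reported in the table).
# Values are copied from the git8d955df results above.
rows = [
    (55.828, 132.147, 2.367),    # (256, 256) -> (32, 32), aa=True
    (36.826, 111.789, 3.036),    # (256, 256) -> (32, 32), aa=False
    (75.421, 651.016, 8.632),    # (712, 712) -> (32, 32), aa=False
    (896.127, 2805.510, 3.131),  # (256, 256) -> (1024, 1024), aa=True
]


def speedup(pr_time: float, nightly_time: float) -> float:
    """Assumed definition: how many times faster the PR build is than nightly."""
    return nightly_time / pr_time


for pr_time, nightly_time, reported in rows:
    ratio = speedup(pr_time, nightly_time)
    print(f"computed {ratio:.3f}, reported {reported:.3f}")
```

Under this reading a value above 1.0 means the PR build is faster, which matches the note that unaffected cases sit around 1.0 +/- 0.1.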
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code such that RGB input is not copied into RGBA
- Performance is closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case perfs are unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[------------------------------------------------ Resize ------------------------------------------------]
  | Pillow (9.0.0.post1) | torch (2.1.0a0+gitce4be01) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------------------
  3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True | 38.548 (+-0.280) | 57.536 (+-0.210) | 132.147 (+-1.236) | 2.297 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False | | 38.532 (+-0.219) | 111.789 (+-1.175) | 2.901 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True | 127.689 (+-1.348) | 156.262 (+-1.213) | 302.518 (+-2.632) | 1.936 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False | | 145.483 (+-1.077) | 286.663 (+-2.494) | 1.970 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True | 178.117 (+-1.956) | 215.053 (+-1.470) | 439.375 (+-4.014) | 2.043 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False | | 211.340 (+-2.239) | 438.537 (+-4.143) | 2.075 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True | 112.593 (+-1.266) | 130.414 (+-1.633) | 446.804 (+-3.283) | 3.426 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False | | 58.767 (+-0.203) | 374.244 (+-13.598) | 6.368 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True | 283.210 (+-2.937) | 324.157 (+-1.895) | 720.197 (+-3.467) | 2.222 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False | | 239.800 (+-2.492) | 592.834 (+-3.903) | 2.472 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True | 186.255 (+-1.629) | 204.834 (+-1.496) | 787.868 (+-3.648) | 3.846 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False | | 77.335 (+-0.341) | 651.016 (+-3.926) | 8.418 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True | 410.286 (+-2.439) | 443.934 (+-2.899) | 1123.923 (+-14.988) | 2.532 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False | | 312.220 (+-2.307) | 915.347 (+-4.486) | 2.932 (+-0.000)
  # More test-cases from #90771
  3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True | 60.611 (+-0.337) | 80.849 (+-1.780) | 170.465 (+-1.830) | 2.108 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True | 132.971 (+-1.624) | 164.892 (+-1.426) | 330.971 (+-3.249) | 2.007 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True | 948.467 (+-3.179) | 891.414 (+-5.282) | 2805.510 (+-25.503) | 3.147 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True | 52.539 (+-0.327) | 72.471 (+-0.367) | 135.933 (+-1.625) | 1.876 (+-0.000)
  3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True | 138.669 (+-1.867) | 168.628 (+-1.213) | 321.112 (+-2.904) | 1.904 (+-0.000)
  3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True | 689.933 (+-3.175) | 746.911 (+-2.985) | 2050.880 (+-22.188) | 2.746 (+-0.000)
  3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False | | 78.347 (+-0.338) | 169.646 (+-1.640) | 2.165 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False | | 162.194 (+-1.089) | 329.754 (+-2.590) | 2.033 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False | | 894.476 (+-2.738) | 2815.870 (+-22.589) | 3.148 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False | | 52.728 (+-0.406) | 112.024 (+-1.225) | 2.125 (+-0.000)
  3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False | | 151.560 (+-1.128) | 299.152 (+-3.353) | 1.974 (+-0.000)
  3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False | | 500.053 (+-4.288) | 1698.601 (+-16.785) | 3.397 (+-0.000)
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
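The description says RGB input is no longer copied into RGBA. As a rough, pure-Python illustration of the difference (not the PR's AVX2 kernel), the sketch below contrasts materialising a padded 4-byte-per-pixel copy with reading pixels directly from the interleaved channels-last buffer at a stride of 3. The helper names `pad_rgb_to_rgba` and `pixel_at` are hypothetical, and the assumption that the earlier path padded with an alpha byte is mine:

```python
def pad_rgb_to_rgba(rgb: bytes, alpha: int = 255) -> bytes:
    """Assumed 'copy' approach: build a new buffer with 4 bytes per pixel."""
    out = bytearray()
    for start in range(0, len(rgb), 3):
        out += rgb[start:start + 3]  # R, G, B
        out.append(alpha)            # padding byte
    return bytes(out)


def pixel_at(buf: bytes, index: int, num_channels: int) -> tuple:
    """Read the RGB part of pixel `index` from an interleaved buffer.

    `num_channels` is the per-pixel stride in bytes: 3 for RGB, 4 for RGBA.
    """
    start = index * num_channels
    return tuple(buf[start:start + 3])


rgb = bytes([10, 20, 30, 40, 50, 60])  # two channels-last RGB pixels
rgba = pad_rgb_to_rgba(rgb)            # extra allocation, 4/3 the size

# Same pixel either way; the stride-3 read needs no intermediate copy.
print(pixel_at(rgb, 1, 3), pixel_at(rgba, 1, 4), len(rgb), len(rgba))
```

Treating the channel count as a stride is what lets one code path serve both layouts in this sketch; how the actual kernel loads and masks the bytes is not shown in this part of the thread.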
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved performance of vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
- Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[------------------------------------------------------------ Resize ------------------------------------------------------------]
 | Pillow (9.0.0.post1) | torch (2.1.0a0+gitd6e220c) PR | torch (2.1.0a0+git2b75955) nightly | Speed-up: PR vs nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True | 38.674 (+-0.323) | 57.591 (+-0.244) | 131.033 (+-1.448) | 2.275 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False | | 39.471 (+-0.166) | 113.911 (+-1.736) | 2.886 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True | 128.512 (+-1.916) | 161.592 (+-1.242) | 299.679 (+-2.099) | 1.855 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False | | 150.994 (+-1.180) | 285.331 (+-1.919) | 1.890 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True | 180.045 (+-2.223) | 220.581 (+-1.363) | 431.057 (+-3.536) | 1.954 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False | | 219.391 (+-1.409) | 429.410 (+-3.620) | 1.957 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True | 113.911 (+-1.024) | 129.457 (+-1.295) | 459.610 (+-13.322) | 3.550 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False | | 59.800 (+-0.199) | 400.015 (+-11.815) | 6.689 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True | 283.050 (+-2.664) | 339.143 (+-1.209) | 683.555 (+-4.466) | 2.016 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False | | 250.601 (+-1.236) | 603.545 (+-2.644) | 2.408 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True | 186.723 (+-2.213) | 199.960 (+-1.343) | 860.867 (+-21.763) | 4.305 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False | | 79.188 (+-0.261) | 703.019 (+-25.805) | 8.878 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True | 412.353 (+-4.476) | 462.230 (+-1.983) | 1101.673 (+-49.299) | 2.383 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False | | 327.973 (+-1.852) | 941.062 (+-5.549) | 2.869 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True | 61.191 (+-0.926) | 80.795 (+-0.518) | 160.853 (+-1.506) | 1.991 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True | 134.488 (+-2.129) | 169.147 (+-1.324) | 327.343 (+-2.846) | 1.935 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True | 1037.045 (+-24.982) | 938.623 (+-9.010) | 2603.360 (+-20.530) | 2.774 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True | 52.792 (+-0.613) | 73.692 (+-0.264) | 131.829 (+-1.333) | 1.789 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True | 139.596 (+-1.944) | 173.778 (+-1.039) | 320.063 (+-2.562) | 1.842 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True | 690.132 (+-10.946) | 772.758 (+-2.864) | 2036.860 (+-36.109) | 2.636 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False | | 78.747 (+-0.799) | 158.479 (+-1.702) | 2.013 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False | | 167.046 (+-1.077) | 322.104 (+-2.764) | 1.928 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False | | 918.967 (+-5.251) | 2611.388 (+-29.917) | 2.842 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False | | 55.336 (+-0.251) | 113.869 (+-1.243) | 2.058 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False | | 156.505 (+-1.095) | 299.861 (+-2.710) | 1.916 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False | | 514.344 (+-1.905) | 1776.796 (+-19.660) | 3.454 (+-0.000)
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest, the ratio is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
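For readers checking the table, the `Speed-up: PR vs nightly` column appears to be the nightly time divided by the PR time. The sketch below assumes that interpretation and uses three rows copied from the results above; the helper names `speedup` and `is_within_noise` are hypothetical and are not part of the benchmark script.

```python
# Timings copied from the results table; both columns use the same unit.
# Assumption: speed-up = nightly time / PR time (not stated explicitly in the PR).
rows = [
    # (case, pr_time, nightly_time, reported_speedup)
    ("(256, 256) -> (32, 32) aa=True", 57.591, 131.033, 2.275),
    ("(256, 256) -> (32, 32) aa=False", 39.471, 113.911, 2.886),
    ("(256, 256) -> (224, 224) aa=True", 161.592, 299.679, 1.855),
]


def speedup(pr_time, nightly_time):
    """Ratio of nightly time to PR time; values above 1.0 mean the PR is faster."""
    return nightly_time / pr_time


def is_within_noise(ratio, tolerance=0.1):
    """True when a ratio falls in the 1.0 +/- 0.1 band the note attributes to noise."""
    return abs(ratio - 1.0) <= tolerance


for case, pr_time, nightly_time, reported in rows:
    ratio = speedup(pr_time, nightly_time)
    print(f"{case}: computed {ratio:.3f}, reported {reported}")
```

Under this reading, every RGB channels-last row in the table sits well outside the noise band, while the cases described in the note as "roughly 1.0 +/- 0.1" would fall inside it.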
… (channels last) (#96848)

Pull Request resolved: #96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10
Closed in favor of this stack: #96848
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10