
Conversation

Contributor

@vfdev-5 vfdev-5 commented Mar 13, 2023

Closed in favor of this stack: #96848


Description

  • Improved performance of vectorized interpolate for the uint8 RGB case
    • unified the RGB and RGBA processing code so that RGB input is not copied into RGBA
  • Performance is now closer to Pillow-SIMD
  • RGBA-case performance is unchanged after the refactoring (see the Source link below)
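For context, the bilinear-with-antialiasing path being optimized here follows Pillow-style separable resampling: for each output pixel it precomputes a window of input pixels and fixed-point integer weights, which the SIMD kernels then apply. A minimal pure-Python sketch of that weight computation (names such as `compute_weights` and `precision` are illustrative, not the PR's actual identifiers):

```python
def bilinear_filter(x):
    # triangle (tent) filter with support 1
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0

def compute_weights(in_size, out_size, precision=15, antialias=True):
    """Per-output-pixel integer weights, Pillow-style (illustrative sketch)."""
    scale = in_size / out_size
    # antialiasing widens the filter when downsampling
    filterscale = max(scale, 1.0) if antialias else 1.0
    support = 1.0 * filterscale  # bilinear support is 1
    result = []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        ws = [bilinear_filter((x + 0.5 - center) / filterscale)
              for x in range(xmin, xmax)]
        total = sum(ws)
        # fixed-point weights: kernels accumulate in int and shift back down
        iw = [round(w / total * (1 << precision)) for w in ws]
        result.append((xmin, iw))
    return result
```

Each output pixel then becomes the shifted sum of `weight * input_pixel` over its window, which is what the AVX2 kernels below vectorize across channels.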

Results

[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git1d3a939) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          38.5          |                56.3             |                 132.5                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.2             |                 110.6                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         127.0          |               149.9             |                 292.2                |            1.9
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               134.2             |                 276.8                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         178.1          |               200.3             |                 416.4                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               198.0             |                 414.4                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         112.9          |               129.3             |                 441.3                |            3.4
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                54.9             |                 364.2                |            6.6
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         282.7          |               324.8             |                 691.6                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               211.9             |                 583.1                |            2.8
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         185.9          |               201.1             |                 783.1                |            3.9
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                72.1             |                 649.8                |            9.0
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         408.7          |               436.7             |                1100.5                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               268.8             |                 906.6                |            3.4

Source

Context

- #90771

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10


pytorch-bot bot commented Mar 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96651

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 Failures

As of commit c7a7f13:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base f418e1f:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Mar 13, 2023
@vfdev-5 vfdev-5 changed the title Improved perfs for vectorized interpolate uint8 RGB-case Improved perfs for vectorized interpolate cpu uint8 RGB-case Mar 13, 2023
@vfdev-5 vfdev-5 added the topic: not user facing topic category label Mar 13, 2023
```
auto yout = unpacked_output.size(1);
TORCH_INTERNAL_ASSERT(num_channels == unpacked_input.size(0));

auto xout_stride = xout * num_channels;
```
Collaborator


Why is this a template parameter when num_channels is never used in a constexpr context? Do you expect ImageResampleVerticalConvolution8u to always be inlined and specialized?

Contributor Author


I rechecked the assembly and re-ran the benchmarks, and I do not see any advantage to using a template for ImageResampleVerticalConvolution8u. I will remove it in the other PR.


```
// Define various shuffling masks
const auto kmask_low = _mm256_set_epi8(
    11, 10, 9, 8, 11, 10, 9, 8, 11, 10, 9, 8, 11, 10, 9, 8,
```
Collaborator


These need descriptive names; what shuffle are they performing?

```
mmk = _mm256_shuffle_epi8(ksource, _mm256_set_epi8(
    11,10, 9,8, 11,10, 9,8, 11,10, 9,8, 11,10, 9,8,
    3,2, 1,0, 3,2, 1,0, 3,2, 1,0, 3,2, 1,0));
auto sss256 = _mm256_set1_epi32(1 << (coefs_precision - 2));
```
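To answer the "what shuffle" question in general terms: `_mm256_shuffle_epi8` gathers bytes independently within each 128-bit lane. A pure-Python model of its semantics (the function name `mm256_shuffle_epi8` is illustrative) shows that this particular mask broadcasts one 4-byte group, i.e. two 16-bit coefficients, across each lane:

```python
def mm256_shuffle_epi8(src, mask):
    """Pure-Python model of _mm256_shuffle_epi8: within each 128-bit
    lane, dst[i] = src[lane_base + (mask[i] & 0x0F)], or 0 when the
    mask byte has its high bit set. src/mask are 32-byte sequences."""
    out = bytearray(32)
    for lane in (0, 16):
        for i in range(16):
            m = mask[lane + i]
            out[lane + i] = 0 if m & 0x80 else src[lane + (m & 0x0F)]
    return bytes(out)

# The mask above, written low byte first (_mm256_set_epi8 lists its
# arguments high-to-low): the low lane repeats bytes 0..3, the high
# lane repeats its lane-relative bytes 8..11 (absolute bytes 24..27).
mask = bytes([0, 1, 2, 3] * 4 + [8, 9, 10, 11] * 4)

src = bytes(range(32))
shuffled = mm256_shuffle_epi8(src, mask)
# Each lane now holds one 4-byte coefficient pair broadcast 4x, ready
# to be multiplied against interleaved pixel pairs by maddubs.
```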
Collaborator


Can you make this a stacked PR, with the style changes in a separate PR? It would make the diff much easier to read.

Contributor Author


Closed in favor of this stack: #96848

@vadimkantorov
Contributor

Is int8 also supported via the same codepath? #5580

@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 13, 2023
@vfdev-5 vfdev-5 marked this pull request as draft March 15, 2023 10:40
vfdev-5 added a commit that referenced this pull request Mar 15, 2023
- Based on #96651
- Fixed mem pointer alignment

[ghstack-poisoned]
@vfdev-5 vfdev-5 closed this Mar 15, 2023
@vfdev-5 vfdev-5 deleted the interp_uint8_rgb_vec_no_copy branch March 15, 2023 14:05
Contributor Author

vfdev-5 commented Mar 15, 2023

Is int8 also supported via the same codepath? #5580

@vadimkantorov no, it is not. Can you share why int8 support is interesting compared to uint8?

Contributor

vadimkantorov commented Mar 15, 2023

In some rare cases it could be useful for resizing label maps with values like -1, 0, 1 (e.g. for speaker separation, the first speaker could be -1, the second speaker 1, and "unknown" 0). Or the other way around: -1 could mean "unknown" and 0/1 could be two class labels. A more practical case for this is dtype bool (two classes) / int16 (more classes) / int32. And if one can use a smaller memory footprint, it's useful.

And of course, it's useful for consistency - it's good UX when basic ops like interpolate are efficiently supported for all dtypes.

Support for torch.bool should be easy to implement if uint8 is already supported: just reinterpret the torch.bool input as uint8 and back. And it's very useful/natural for processing binary images / segmentation masks / time-series segmentation masks.

vfdev-5 added a commit that referenced this pull request Mar 15, 2023
## Description

- Based on #96651
  - Improved perfs for vectorized interpolate uint8 RGB-case
    - unified RGB and RGBA processing code such that RGB input is not copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitcc42a3f) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          38.8          |                56.0             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                37.5             |                 112.8                |            3.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.7          |               157.0             |                 305.4                |            1.9
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               146.4             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.4          |               215.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               212.5             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               127.9             |                 464.8                |            3.6
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                56.8             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               325.2             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               239.1             |                 593.5                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.2          |               200.7             |                 833.8                |            4.2
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.2             |                 651.4                |            8.7
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.0          |               444.5             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               309.3             |                 917.6                |            3.0
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-144416-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 15, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: d961810
Pull Request resolved: #96848
Contributor Author

vfdev-5 commented Mar 15, 2023

@vadimkantorov thanks for the explanation; however, I'm not sure I understand your point here:

In some rare cases it could be useful for resizing label maps with values like -1, 0, 1 (e.g. for speaker separation, the first speaker could be -1, the second speaker 1, and "unknown" 0). Or the other way around: -1 could mean "unknown" and 0/1 could be two class labels. A more practical case for this is dtype bool (two classes) / int16 (more classes) / int32. And if one can use a smaller memory footprint, it's useful.

Resizing label maps of any dtype should use mode="nearest" in order to keep the same labels, right? In this PR we are optimizing the uint8 RGB/RGBA cases for bilinear mode with/without anti-aliasing.
Nearest mode is already supported for dtypes like byte (uint8) and all floats. Adding bool, int16, int32 shouldn't be a big deal, IMO...
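The point about nearest mode can be illustrated with a small sketch (pure Python; `resize_nearest` is a hypothetical helper, not a torch API): nearest-neighbor resizing only ever copies existing values, so label maps keep their exact labels regardless of dtype.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of a 2D list-of-lists label map.
    The output contains only values present in the input, so labels
    (-1 / 0 / 1, bool, int16, ...) survive unchanged."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[int(y * in_h / out_h)][int(x * in_w / out_w)]
         for x in range(out_w)]
        for y in range(out_h)
    ]

labels = [[-1, -1, 0, 0],
          [-1, -1, 0, 0],
          [ 1,  1, 0, 0],
          [ 1,  1, 0, 0]]
small = resize_nearest(labels, 2, 2)
# small == [[-1, 0], [1, 0]] -- no new values were invented
```

A bilinear kernel, by contrast, would blend -1 and 1 into intermediate values that are not valid labels, which is why label maps are resized with nearest mode.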

Contributor

vadimkantorov commented Mar 16, 2023

int16 is also useful for "resampling" PCM audio, where int16 is quite common, and there proper filtering / anti-aliasing during resampling would be better than simple "nearest".
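A rough sketch of that difference for 1D int16 data (illustrative pure Python, not the PR's kernel; `resample_linear_aa` is a hypothetical name): resampling with a widened triangle filter averages neighboring samples instead of picking the nearest one, suppressing aliasing.

```python
def resample_linear_aa(samples, out_len):
    """Downsample a 1D int16 sequence with a widened triangle filter
    (simple anti-aliasing), instead of nearest-sample picking."""
    in_len = len(samples)
    scale = in_len / out_len
    support = max(scale, 1.0)  # widen the filter when downsampling
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale
        lo = max(int(center - support + 0.5), 0)
        hi = min(int(center + support + 0.5), in_len)
        # triangle weights over the window, normalized below
        ws = [max(0.0, 1.0 - abs((x + 0.5 - center) / support))
              for x in range(lo, hi)]
        total = sum(ws)
        acc = sum(w * samples[x] for w, x in zip(ws, range(lo, hi)))
        out.append(int(round(acc / total)))
    return out
```

Since the output is a weighted average of int16 inputs, it stays within the int16 range, so the same accumulate-and-shift structure as the uint8 image kernels would apply.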

vfdev-5 added a commit that referenced this pull request Mar 17, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 4ee5e45
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
…cpu uint8 RGB-case"


## Description

- Based on #96651
  - Improved perfs for vectorized interpolate uint8 RGB-case
    - unified RGB and RGBA processing code such that RGB input is not copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved perfs for vectorized bilinear interpolate uint8 RGB-case, channels last
    - unified RGB and RGBA processing code such that RGB input is not copied into RGBA
  - Performance is now closer to Pillow-SIMD (`Pillow (9.0.0.post1)`)
  - RGBA case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
…t8 RGB-case (channels last)"


vfdev-5 added a commit to vfdev-5/pytorch that referenced this pull request Mar 21, 2023
- Based on pytorch#96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: pytorch#96848
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
    - unified RGB and RGBA processing code such that RGB input is not copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA case performance is unchanged after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitc005105) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.670 (+-0.445)    |         57.366 (+-0.799)        |          132.147 (+-1.236)           |      2.304 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         37.825 (+-0.417)        |          111.789 (+-1.175)           |      2.955 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.898 (+-1.335)    |        153.081 (+-2.346)        |          302.518 (+-2.632)           |      1.976 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        141.695 (+-1.415)        |          286.663 (+-2.494)           |      2.023 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.735 (+-2.054)    |        210.613 (+-3.116)        |          439.375 (+-4.014)           |      2.086 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        207.601 (+-1.639)        |          438.537 (+-4.143)           |      2.112 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.679 (+-1.321)    |        130.863 (+-1.987)        |          446.804 (+-3.283)           |      3.414 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         57.968 (+-0.270)        |          374.244 (+-13.598)          |      6.456 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.398 (+-3.485)    |        322.986 (+-1.947)        |          720.197 (+-3.467)           |      2.230 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        231.625 (+-2.006)        |          592.834 (+-3.903)           |      2.559 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.711 (+-1.666)    |        201.069 (+-2.182)        |          787.868 (+-3.648)           |      3.918 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.975 (+-0.696)        |          651.016 (+-3.926)           |      8.569 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.236 (+-6.021)    |        451.486 (+-3.939)        |         1123.923 (+-14.988)          |      2.489 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        299.597 (+-1.887)        |          915.347 (+-4.486)           |      3.055 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.751 (+-0.285)    |         78.538 (+-1.282)        |          170.465 (+-1.830)           |      2.170 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.619 (+-2.035)    |        159.614 (+-1.587)        |          330.971 (+-3.249)           |      2.074 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   950.243 (+-10.641)   |        891.369 (+-17.946)       |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.771 (+-0.961)    |         72.253 (+-1.020)        |          135.933 (+-1.625)           |      1.881 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.107 (+-2.143)    |        165.844 (+-2.177)        |          321.112 (+-2.904)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   691.470 (+-9.566)    |        764.942 (+-11.192)       |         2050.880 (+-22.188)          |      2.681 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.375 (+-1.345)        |          169.646 (+-1.640)           |      2.193 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.115 (+-3.935)        |          329.754 (+-2.590)           |      2.072 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        877.248 (+-5.736)        |         2815.870 (+-22.589)          |      3.210 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         53.120 (+-0.316)        |          112.024 (+-1.225)           |      2.109 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        147.330 (+-1.871)        |          299.152 (+-3.353)           |      2.030 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        472.182 (+-10.785)       |         1698.601 (+-16.785)          |      3.597 (+-0.000)    
```

Note: for the other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which can likely be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)
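A hedged sketch of how a single entry in the table above can be reproduced with `torch.utils.benchmark` (the input shape, target size, and run count are illustrative, not the exact harness used for these numbers):

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

# Single-threaded timing, matching the "1 threads" rows in the table.
timer = benchmark.Timer(
    stmt="F.interpolate(img, size=(224, 224), mode='bilinear', antialias=True)",
    globals={"F": F, "img": img},
    num_threads=1,
)
m = timer.timeit(100)  # Measurement over 100 runs; m.mean is in seconds
```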


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitc005105) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.670 (+-0.445)    |         57.366 (+-0.799)        |          132.147 (+-1.236)           |      2.304 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         37.825 (+-0.417)        |          111.789 (+-1.175)           |      2.955 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.898 (+-1.335)    |        153.081 (+-2.346)        |          302.518 (+-2.632)           |      1.976 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        141.695 (+-1.415)        |          286.663 (+-2.494)           |      2.023 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.735 (+-2.054)    |        210.613 (+-3.116)        |          439.375 (+-4.014)           |      2.086 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        207.601 (+-1.639)        |          438.537 (+-4.143)           |      2.112 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.679 (+-1.321)    |        130.863 (+-1.987)        |          446.804 (+-3.283)           |      3.414 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         57.968 (+-0.270)        |          374.244 (+-13.598)          |      6.456 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.398 (+-3.485)    |        322.986 (+-1.947)        |          720.197 (+-3.467)           |      2.230 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        231.625 (+-2.006)        |          592.834 (+-3.903)           |      2.559 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.711 (+-1.666)    |        201.069 (+-2.182)        |          787.868 (+-3.648)           |      3.918 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.975 (+-0.696)        |          651.016 (+-3.926)           |      8.569 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.236 (+-6.021)    |        451.486 (+-3.939)        |         1123.923 (+-14.988)          |      2.489 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        299.597 (+-1.887)        |          915.347 (+-4.486)           |      3.055 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.751 (+-0.285)    |         78.538 (+-1.282)        |          170.465 (+-1.830)           |      2.170 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.619 (+-2.035)    |        159.614 (+-1.587)        |          330.971 (+-3.249)           |      2.074 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   950.243 (+-10.641)   |        891.369 (+-17.946)       |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.771 (+-0.961)    |         72.253 (+-1.020)        |          135.933 (+-1.625)           |      1.881 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.107 (+-2.143)    |        165.844 (+-2.177)        |          321.112 (+-2.904)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   691.470 (+-9.566)    |        764.942 (+-11.192)       |         2050.880 (+-22.188)          |      2.681 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.375 (+-1.345)        |          169.646 (+-1.640)           |      2.193 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.115 (+-3.935)        |          329.754 (+-2.590)           |      2.072 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        877.248 (+-5.736)        |         2815.870 (+-22.589)          |      3.210 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         53.120 (+-0.316)        |          112.024 (+-1.225)           |      2.109 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        147.330 (+-1.871)        |          299.152 (+-3.353)           |      2.030 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        472.182 (+-10.785)       |         1698.601 (+-16.785)          |      3.597 (+-0.000)    
```

Note: for the other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which can likely be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: f807362
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git8d955df) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.649 (+-0.306)    |         55.828 (+-0.370)        |          132.147 (+-1.236)           |      2.367 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         36.826 (+-0.229)        |          111.789 (+-1.175)           |      3.036 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.233 (+-1.313)    |        153.827 (+-1.229)        |          302.518 (+-2.632)           |      1.967 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        143.886 (+-1.409)        |          286.663 (+-2.494)           |      1.992 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.504 (+-1.825)    |        211.569 (+-1.336)        |          439.375 (+-4.014)           |      2.077 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        209.888 (+-1.443)        |          438.537 (+-4.143)           |      2.089 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.891 (+-1.118)    |        129.373 (+-1.396)        |          446.804 (+-3.283)           |      3.454 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         56.858 (+-0.227)        |          374.244 (+-13.598)          |      6.582 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.917 (+-2.992)    |        324.378 (+-1.694)        |          720.197 (+-3.467)           |      2.220 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        236.078 (+-1.679)        |          592.834 (+-3.903)           |      2.511 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.595 (+-1.633)    |        202.000 (+-1.920)        |          787.868 (+-3.648)           |      3.900 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.421 (+-0.512)        |          651.016 (+-3.926)           |      8.632 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   409.691 (+-2.735)    |        449.927 (+-2.500)        |         1123.923 (+-14.988)          |      2.498 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        306.691 (+-2.095)        |          915.347 (+-4.486)           |      2.985 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.740 (+-0.278)    |         78.745 (+-0.286)        |          170.465 (+-1.830)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.029 (+-1.619)    |        162.393 (+-1.289)        |          330.971 (+-3.249)           |      2.038 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.849 (+-2.749)    |        896.127 (+-3.696)        |         2805.510 (+-25.503)          |      3.131 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.505 (+-0.319)    |         70.617 (+-0.344)        |          135.933 (+-1.625)           |      1.925 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.671 (+-1.953)    |        165.638 (+-1.473)        |          321.112 (+-2.904)           |      1.939 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.492 (+-2.917)    |        758.162 (+-3.719)        |         2050.880 (+-22.188)          |      2.705 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.300 (+-0.307)        |          169.646 (+-1.640)           |      2.195 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.525 (+-1.225)        |          329.754 (+-2.590)           |      2.067 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        890.106 (+-3.358)        |         2815.870 (+-22.589)          |      3.164 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.399 (+-0.314)        |          112.024 (+-1.225)           |      2.138 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        148.780 (+-1.282)        |          299.152 (+-3.353)           |      2.011 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        479.273 (+-3.432)        |         1698.601 (+-16.785)          |      3.544 (+-0.000)    
```

Note: there is no performance regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can likely be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git8d955df) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.649 (+-0.306)    |         55.828 (+-0.370)        |          132.147 (+-1.236)           |      2.367 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         36.826 (+-0.229)        |          111.789 (+-1.175)           |      3.036 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.233 (+-1.313)    |        153.827 (+-1.229)        |          302.518 (+-2.632)           |      1.967 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        143.886 (+-1.409)        |          286.663 (+-2.494)           |      1.992 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.504 (+-1.825)    |        211.569 (+-1.336)        |          439.375 (+-4.014)           |      2.077 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        209.888 (+-1.443)        |          438.537 (+-4.143)           |      2.089 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.891 (+-1.118)    |        129.373 (+-1.396)        |          446.804 (+-3.283)           |      3.454 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         56.858 (+-0.227)        |          374.244 (+-13.598)          |      6.582 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.917 (+-2.992)    |        324.378 (+-1.694)        |          720.197 (+-3.467)           |      2.220 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        236.078 (+-1.679)        |          592.834 (+-3.903)           |      2.511 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.595 (+-1.633)    |        202.000 (+-1.920)        |          787.868 (+-3.648)           |      3.900 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.421 (+-0.512)        |          651.016 (+-3.926)           |      8.632 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   409.691 (+-2.735)    |        449.927 (+-2.500)        |         1123.923 (+-14.988)          |      2.498 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        306.691 (+-2.095)        |          915.347 (+-4.486)           |      2.985 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.740 (+-0.278)    |         78.745 (+-0.286)        |          170.465 (+-1.830)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.029 (+-1.619)    |        162.393 (+-1.289)        |          330.971 (+-3.249)           |      2.038 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.849 (+-2.749)    |        896.127 (+-3.696)        |         2805.510 (+-25.503)          |      3.131 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.505 (+-0.319)    |         70.617 (+-0.344)        |          135.933 (+-1.625)           |      1.925 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.671 (+-1.953)    |        165.638 (+-1.473)        |          321.112 (+-2.904)           |      1.939 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.492 (+-2.917)    |        758.162 (+-3.719)        |         2050.880 (+-22.188)          |      2.705 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.300 (+-0.307)        |          169.646 (+-1.640)           |      2.195 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.525 (+-1.225)        |          329.754 (+-2.590)           |      2.067 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        890.106 (+-3.358)        |         2815.870 (+-22.589)          |      3.164 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.399 (+-0.314)        |          112.024 (+-1.225)           |      2.138 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        148.780 (+-1.282)        |          299.152 (+-3.353)           |      2.011 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        479.273 (+-3.432)        |         1698.601 (+-16.785)          |      3.544 (+-0.000)    
```

Note: there is no performance regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can likely be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6132906
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitce4be01) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.548 (+-0.280)    |         57.536 (+-0.210)        |          132.147 (+-1.236)           |      2.297 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         38.532 (+-0.219)        |          111.789 (+-1.175)           |      2.901 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.689 (+-1.348)    |        156.262 (+-1.213)        |          302.518 (+-2.632)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        145.483 (+-1.077)        |          286.663 (+-2.494)           |      1.970 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   178.117 (+-1.956)    |        215.053 (+-1.470)        |          439.375 (+-4.014)           |      2.043 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        211.340 (+-2.239)        |          438.537 (+-4.143)           |      2.075 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.593 (+-1.266)    |        130.414 (+-1.633)        |          446.804 (+-3.283)           |      3.426 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         58.767 (+-0.203)        |          374.244 (+-13.598)          |      6.368 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.210 (+-2.937)    |        324.157 (+-1.895)        |          720.197 (+-3.467)           |      2.222 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        239.800 (+-2.492)        |          592.834 (+-3.903)           |      2.472 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.255 (+-1.629)    |        204.834 (+-1.496)        |          787.868 (+-3.648)           |      3.846 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         77.335 (+-0.341)        |          651.016 (+-3.926)           |      8.418 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.286 (+-2.439)    |        443.934 (+-2.899)        |         1123.923 (+-14.988)          |      2.532 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        312.220 (+-2.307)        |          915.347 (+-4.486)           |      2.932 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: There is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)
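As a minimal illustration of the per-pixel arithmetic that the vectorized kernel computes, here is a pure-Python sketch of non-antialiased bilinear sampling on a channels-last (H, W, C) uint8 image. This is not the actual C++ implementation (which also handles antialiasing, SIMD, and the RGB/RGBA unification this PR adds); the coordinate mapping assumes the `align_corners=False` convention.

```python
def bilinear_resize(img, out_h, out_w):
    # img: nested lists, shape (in_h, in_w, channels), uint8-range values.
    in_h, in_w = len(img), len(img[0])
    channels = len(img[0][0])
    out = [[[0] * channels for _ in range(out_w)] for _ in range(out_h)]
    for oy in range(out_h):
        # Source y coordinate (align_corners=False style), then the two
        # neighboring rows and the interpolation weight between them.
        sy = (oy + 0.5) * in_h / out_h - 0.5
        y0 = max(min(int(sy), in_h - 1), 0)
        y1 = min(y0 + 1, in_h - 1)
        wy = min(max(sy - y0, 0.0), 1.0)
        for ox in range(out_w):
            sx = (ox + 0.5) * in_w / out_w - 0.5
            x0 = max(min(int(sx), in_w - 1), 0)
            x1 = min(x0 + 1, in_w - 1)
            wx = min(max(sx - x0, 0.0), 1.0)
            for c in range(channels):
                # Horizontal lerp on the two rows, then vertical lerp.
                top = img[y0][x0][c] * (1 - wx) + img[y0][x1][c] * wx
                bot = img[y1][x0][c] * (1 - wx) + img[y1][x1][c] * wx
                out[oy][ox][c] = int(round(top * (1 - wy) + bot * wy))
    return out
```

In the channels-last layout the three (or four) channel values of a pixel are contiguous in memory, which is what lets the inner loop over `c` be replaced by SIMD loads without first padding RGB into RGBA.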


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…t8 RGB-case (channels last)"


vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…erpolate cpu uint8 RGB-case (channels last)"


vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…t8 RGB-case (channels last)"


vfdev-5 added a commit that referenced this pull request Mar 23, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 93bd276
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
…erpolate cpu uint8 RGB-case (channels last)"


vfdev-5 added a commit that referenced this pull request Mar 29, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 8b90000
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
…t8 RGB-case (channels last)"



      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)
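A single benchmarked configuration can be reproduced with a minimal sketch (tensor contents are random; only the shape, dtype, and memory format match the table above):

```python
import torch
import torch.nn.functional as F

# 3-channel uint8 image, channels-last, resized 256x256 -> 224x224 with antialias,
# matching the "3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True" row
img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

out = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=True)
print(out.shape, out.dtype)  # torch.Size([1, 3, 224, 224]) torch.uint8
```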


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6c30da9
Pull Request resolved: #96848
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)
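The per-row timings can be approximated with `torch.utils.benchmark` on a single thread, as in the tables (a sketch; the actual benchmark script lives in the linked gist):

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

t = benchmark.Timer(
    stmt="F.interpolate(img, size=(32, 32), mode='bilinear', antialias=True)",
    globals={"F": F, "img": img},
    num_threads=1,  # the tables above report single-thread timings
)
m = t.blocked_autorange()
print(m)
```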


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
… (channels last) (#96848)

## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)

```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- #90771

Pull Request resolved: #96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10