Conversation

NicolasHug (Member) commented Oct 27, 2022

This is heavily adapted / ported from PILLOW-SIMD.

This PR adds support for uint8 images in the following case:

  • mode=bilinear, antialias=True -- support for other modes is possible
  • shape=(1, 3, H, W) -- could be extended to batch_size > 1; not sure about other channel counts yet
  • layout=channels_last -- we could easily support contiguous inputs as well, since we have to copy (pack / unpack) the input anyway
  • device=CPU

This may sound restrictive, but it's not: this is exactly the setting in which torchvision's Resize() is used for training jobs.

This is still WIP with lots of TODOs, but it seems to be working decently. On the inputs I tried, comparing against torchvision's Resize() (which first converts uint8 to float, runs interpolate(), and converts back to uint8), I'm getting a ~3-5X speedup. It also seems correct so far: the absolute difference between the outputs is never > 1, and only ~15% of the pixel values differ (by exactly 1).
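For reference, here is a rough sketch (in ATen C++; the shape, the output size, and the use of the internal _upsample_bilinear2d_aa op are assumptions for illustration) of the float round-trip described above, which the new uint8 path is benchmarked against:

```cpp
#include <ATen/ATen.h>

// uint8 -> float -> antialiased bilinear resize -> uint8: roughly what
// torchvision's Resize() does today for uint8 inputs.
at::Tensor resize_via_float(const at::Tensor& img_u8) {
  auto out_f = at::_upsample_bilinear2d_aa(
      img_u8.to(at::kFloat), /*output_size=*/{256, 256},
      /*align_corners=*/false);
  return out_f.round_().clamp_(0, 255).to(at::kByte);
}
```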

This addresses pytorch/vision#2289

@vfdev-5 @mingfeima @fmassa I'd love your initial thoughts on this!

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

NicolasHug changed the title from "WIP Add uint8 support for interpolate on channels_last (1, 3, H, W) CPU images" to "WIP Add uint8 support for interpolate on channels_last (1, 3, H, W) CPU images, mode=bilinear, antialias=True" on Oct 27, 2022
vadimkantorov (Contributor) commented Oct 27, 2022

Is it vectorizing the reads/writes? If so, it may also be nice to support 4-channel inputs (RGBA?), or any channel count divisible by 4 or 8 -- vectorization is probably even simpler in those cases.

Also, interpolate support for 1-channel uint8/int16/uint32 inputs would be useful (nearest mode for label maps, and interpolation of audio signals).

mingfeima (Collaborator) commented

Is this a migration of PIL-SIMD's kernel? Shall we parallelize it at the same time? It shouldn't be too much additional work ~

NicolasHug (Member, PR author) commented

Thanks for the feedback @vadimkantorov. Yes, the vectorized part is roughly output[i] = sum_j (w_j * input[j]); the pre-computation of the weights and of the index mapping isn't vectorized. Regarding the next features to support, we can certainly add these to our backlog, but I'll have to gauge interest internally to decide what to prioritize.
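For illustration, a scalar sketch of that fixed-point weighted sum (hypothetical names; the actual kernel operates on packed 32-bit pixels in AVX2 registers, with scaled int16 weights as in the PIL-SIMD kernels):

```cpp
#include <cstdint>

// One output value as output[i] = sum_j (w_j * input[j]), with int16
// weights scaled by 2^coefs_precision.
uint8_t weighted_sum(const uint8_t* input, const int16_t* w,
                     int xmin, int xmax, int coefs_precision) {
  int32_t acc = 1 << (coefs_precision - 1);  // rounding bias
  for (int x = xmin; x < xmax; x++) {
    acc += int32_t(input[x]) * w[x - xmin];
  }
  acc >>= coefs_precision;
  // clamp to the uint8 range
  return uint8_t(acc < 0 ? 0 : (acc > 255 ? 255 : acc));
}
```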

@mingfeima yes, this is a direct port from PIL-SIMD. Indeed, I think we should be able to parallelize over the batch size. Did you have something else in mind?

vadimkantorov (Contributor) commented

Maybe I wasn't clear; I meant: are the reads and writes themselves vectorized? Are all three R, G, B channels read and written in one go? For 3-channel uint8, a pixel may just fit in one uint32, but for 3-channel float32, SSE registers could be used. The memory accesses would probably not be well aligned, since memory is read in triplets (rather than quadruplets), but maybe that doesn't hurt much on modern processors.

NicolasHug (Member, PR author) commented

Hm, I'm honestly not sure what you mean by read-ins and write-outs. If you mean the packing/unpacking of the input/output, that part isn't vectorized. The vectorized part is the writing of the unpacked output from the unpacked input.

> Are all three R, G, B channels read and written in one go?

Yes.

vadimkantorov (Contributor) commented

I guess the packing/unpacking part wasn't very clear to me. Is it first copying the memory into an unpacked format and then processing it in a vectorized way? My question is whether this first unpacking is needed, or whether it could be fused with the processing (but this would incur some useless computation for the non-existent 4th channel).

NicolasHug (Member, PR author) commented Oct 28, 2022

> Is it first copying the memory into an unpacked format and then processing it in a vectorized way?

Yes. The input tensor arrives as R G B R G B R G B ..., where each letter is a uint8. For the vectorized code to run, we first have to unpack each RGB triplet into 32 bits (the last 8 bits are set to 255), so the data becomes R G B P R G B P ..., where P is the 255 padding (the value 255 is arbitrary; I guess it's just there for consistency with RGBA images). The output is written in that same unpacked format and needs to be re-packed to get a proper tensor.
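A minimal sketch of that unpack step (illustrative only, not the exact code from this PR):

```cpp
#include <cstdint>

// R G B R G B ... (3 bytes per pixel) -> R G B P R G B P ... (4 bytes per
// pixel), so each pixel can then be read and written as a single uint32.
void unpack_rgb_to_rgbp(uint8_t* unpacked, const uint8_t* packed,
                        int num_pixels) {
  for (int i = 0; i < num_pixels; i++) {
    unpacked[4 * i + 0] = packed[3 * i + 0];  // R
    unpacked[4 * i + 1] = packed[3 * i + 1];  // G
    unpacked[4 * i + 2] = packed[3 * i + 2];  // B
    unpacked[4 * i + 3] = 255;                // P: arbitrary padding value
  }
}
```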

This does involve extra copies, and there might be opportunities to improve this in the future. But even with the extra copies, this is still way faster than torchvision's Resize(), and better than no uint8 support at all, so it's definitely worth it.

vadimkantorov (Contributor) commented Oct 28, 2022

Then yes, maybe in the future the copies can be avoided and fused directly with the reading/writing (probably worth adding a TODO or an explanation about this in the code...)

vadimkantorov (Contributor) commented

Hope that soon Pillow won't be needed for simple pipelines :)


Review thread on:

```cpp
separable_upsample_generic_Nd_kernel_impl<2, scale_t, HelperInterpLinear>(
    output, input, align_corners, {scales_h, scales_w});
if (input.dtype() == at::kByte) {
```
Contributor:

you might want to use something like _use_vectorized_kernel_cond here (a few functions in the file seem to be using that condition to decide whether or not to vectorize)

Contributor (reply):

synced offline -- currently the focus is to only support this for uint8. In the future, we can extend this to other data types.

Review thread on:

```cpp
return unpacked_output_p;
}

void beepidiboop(const Tensor& input, const Tensor& output) {
```
anjali411 (Contributor) commented Oct 31, 2022:

I think you should add a template for scalar type here

Review thread on:

```cpp
// - There's a segfault when input_shape == output_shape
// - This could be extended to other filters, not just bilinear
// - License?
beepidiboop(input, output);
```
Contributor:

call this with the appropriate AT_DISPATCH_{} macro once the function is templated
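For example, a hypothetical sketch of that dispatch (assuming the function gains a scalar-type template parameter as suggested above; the name string is arbitrary):

```cpp
// AT_DISPATCH_ALL_TYPES selects on the runtime dtype and defines scalar_t
// inside the lambda for each supported type (including uint8 / at::kByte).
AT_DISPATCH_ALL_TYPES(input.scalar_type(), "upsample_bilinear2d_aa", [&] {
  beepidiboop<scalar_t>(input, output);
});
```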

Review thread on:

```cpp
UINT32 *lineOut, UINT32 *imIn,
int xmin, int xmax, INT16 *k, int coefs_precision, int xin)
{
#ifdef CPU_CAPABILITY_AVX2
```
Contributor:

add a comment sharing the link from where this was borrowed

anjali411 requested a review from ngimel on October 31, 2022

Review thread on:

```cpp
return 0.0;
}

void unpack_rgb(uint8_t * unpacked, const uint8_t * packed, int num_pixels)
```
Member:

oh ok, this is a trick to perform vector reads / writes in the kernel with uint32_t more easily. This means that we could also "easily" support num_channels <= 4 by just changing this function maybe? It wouldn't be the most efficient implementation, but might be worth checking if it would be faster for num_channels==1 compared to what we already have
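A hypothetical generalization along those lines (illustrative only):

```cpp
#include <cstdint>

// Unpack any num_channels <= 4 into 4-byte pixels, padding the missing
// channels, so the same uint32-based kernel could run unchanged.
void unpack_pixels(uint8_t* unpacked, const uint8_t* packed,
                   int num_pixels, int num_channels) {
  for (int i = 0; i < num_pixels; i++) {
    for (int c = 0; c < 4; c++) {
      unpacked[4 * i + c] =
          (c < num_channels) ? packed[num_channels * i + c] : 255;
    }
  }
}
```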


Review thread on:

```cpp
/* coefficient buffer */
/* malloc check ok, overflow checked above */
kk = (double *)malloc(outSize * ksize * sizeof(double));
```
Member:

Ideally, all mallocs should use PyTorch's allocator, so that you don't need to handle the frees yourself.
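For instance, a sketch of that suggestion (assuming the buffer can simply be owned by a tensor):

```cpp
// Let a Tensor own the coefficient buffer instead of a raw malloc; the
// memory is freed automatically when the tensor goes out of scope.
auto kk_tensor = at::empty({outSize * ksize}, at::kDouble);
double* kk = kk_tensor.data_ptr<double>();
```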

github-actions bot added the "module: cpu" label (CPU specific problem, e.g., perf, algorithm) on Nov 3, 2022
NicolasHug (Member, PR author) commented

Closing this in favor of #90771, which is more complete.

NicolasHug closed this on Dec 13, 2022