WIP Add uint8 support for interpolate on channels_last (1, 3, H,W) CPU images, mode=bilinear, antialias=True #87863
Conversation
Is it vectorizing the reads/writes? If so, it may also be nice to support 4-channel inputs (RGBA?) or any inputs with a channel count divisible by 4/8 -- vectorization is probably even simpler in that case. Also, interpolate support for 1-channel uint8/int16/uint32 inputs is useful (for label maps (nearest) and for audio signal interpolation).
Is this a migration of PIL-SIMD's kernel? Shall we parallelize it at the same time? Shouldn't be too much additional work ~
Thanks for the feedback @vadimkantorov. Yes, the vectorized part is roughly the computation that writes the unpacked output from the unpacked input. @mingfeima yes, this is a direct port from PIL-SIMD. Indeed, I think we should be able to parallelize over the batch size. Did you have something else in mind?
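As a hedged illustration of the batch-parallelization idea mentioned above (not part of this PR; `resize_one_image` is a hypothetical stand-in for the per-image kernel):

```cpp
#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <c10/util/irange.h>

// Hypothetical per-image kernel from this PR (declaration only, for the sketch).
void resize_one_image(const at::Tensor& in, const at::Tensor& out);

// Sketch: split the work over the batch dimension with at::parallel_for.
void resize_batch(const at::Tensor& input, const at::Tensor& output) {
  const int64_t batch_size = input.size(0);
  at::parallel_for(0, batch_size, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    for (const auto n : c10::irange(begin, end)) {
      resize_one_image(input[n], output[n]);  // each image handled independently
    }
  });
}
```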
Maybe I wasn't clear: I meant, are the read-ins and write-outs themselves vectorized? Are all three R, G, B channels read and written in one go? For uint8 3-channel inputs, maybe they just fit in one uint32, but for 3-channel float32, SSE registers may be used. The memory access would probably not be very well aligned, as the memory is read in triplets (not quadruplets), but maybe on modern processors that doesn't hurt much.
Hm, I'm not sure what you mean by read-ins and write-outs, honestly. If you mean the packing/unpacking of the input/output, that part isn't vectorized. The vectorized part is the writing to the unpacked output from the unpacked input.

Yes.
I guess the packing/unpacking part wasn't very clear to me. Is it first copying the memory into an unpacked format and then processing it in a vectorized way? I guess my question is whether this first unpacking is needed, or whether it can be fused with the processing (though this would incur some useless computation for the non-existing 4th channel).
Yes, the input tensor arrives packed and is first unpacked (copied) into a 4-bytes-per-pixel layout before the vectorized processing, and the result is packed back into the output at the end. This does involve extra copies, and there might be opportunities to improve this in the future. But for now, even with the extra copies, this is still way faster than the float path (converting to float, interpolating, and converting back).
Then yes, maybe in the future the copies can be avoided and fused directly with the reading/writing (probably worth adding a TODO or an explanation about this in the code...).
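For illustration, here is a rough sketch (not the PR's actual code; names are illustrative) of the kind of unpack step being discussed: packed RGB is widened so that each pixel occupies one uint32.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch: packed RGB input (3 bytes per pixel) is copied into a
// 4-bytes-per-pixel buffer, so that each pixel can then be read or written as
// a single uint32 by the vectorized resampling code.
void unpack_rgb_to_rgbx(uint8_t* unpacked, const uint8_t* packed, int num_pixels) {
  for (int i = 0; i < num_pixels; i++) {
    std::memcpy(unpacked + 4 * i, packed + 3 * i, 3);  // copy R, G, B
    unpacked[4 * i + 3] = 0;                           // padding byte ("X")
  }
}
```

The reverse "packing" step would copy the first three bytes of each uint32 pixel back into the 3-bytes-per-pixel output buffer.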
Hope soon Pillow won't be needed for simple pipelines :)
```cpp
  separable_upsample_generic_Nd_kernel_impl<2, scale_t, HelperInterpLinear>(
      output, input, align_corners, {scales_h, scales_w});
  if (input.dtype() == at::kByte) {
```
you might want to use something like _use_vectorized_kernel_cond here (a few functions in the file seem to be using that condition to decide whether or not to vectorize)
synced offline -- currently the focus is to only support this for uint8. In the future, we can extend this to other data types.
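For illustration only, the gating could look roughly like the sketch below if `_use_vectorized_kernel_cond` were reused as suggested (this is not the PR's code; the exact helper signature and the name of the uint8 path are assumptions):

```cpp
// Hypothetical dispatch inside the existing upsample kernel: route uint8
// inputs that satisfy the vectorization condition to the new PIL-SIMD-style
// path, and everything else to the existing generic kernel.
if (input.dtype() == at::kByte && _use_vectorized_kernel_cond(output, input)) {
  upsample_bilinear_aa_uint8(output, input, align_corners, {scales_h, scales_w});  // placeholder name
} else {
  separable_upsample_generic_Nd_kernel_impl<2, scale_t, HelperInterpLinear>(
      output, input, align_corners, {scales_h, scales_w});
}
```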
```cpp
  return unpacked_output_p;
}
```

```cpp
void beepidiboop(const Tensor& input, const Tensor& output) {
```
I think you should add a template for scalar type here
```cpp
  // - There's a segfault when input_shape == output_shape
  // - This could be extended to other filters, not just bilinear
  // - License?
  beepidiboop(input, output);
```
call this with the appropriate AT_DISPATCH_{} macro once the function is templated
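A rough sketch of what the two suggestions above (templating on the scalar type, then dispatching) could look like together. Hedged: `resize_image_aa` is an illustrative name standing in for the PR's placeholder function, and the exact set of dispatched types is an assumption.

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

// Hypothetical: the kernel templated on the element type...
template <typename scalar_t>
void resize_image_aa(const at::Tensor& input, const at::Tensor& output) {
  // ... antialiased bilinear resampling on scalar_t elements ...
}

// ...and a call site that selects the instantiation from the runtime dtype.
void resize_image_aa_dispatch(const at::Tensor& input, const at::Tensor& output) {
  AT_DISPATCH_FLOATING_TYPES_AND(at::ScalarType::Byte, input.scalar_type(),
      "resize_image_aa", [&] {
    resize_image_aa<scalar_t>(input, output);
  });
}
```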
```cpp
    UINT32 *lineOut, UINT32 * imIn,
    int xmin, int xmax, INT16 *k, int coefs_precision, int xin)
{
#ifdef CPU_CAPABILITY_AVX2
```
add a comment sharing the link from where this was borrowed
```cpp
  return 0.0;
}
```

```cpp
void unpack_rgb(uint8_t * unpacked, const uint8_t * packed, int num_pixels)
```
oh ok, this is a trick to perform vector reads / writes in the kernel with uint32_t more easily. This means that we could also "easily" support num_channels <= 4 by just changing this function maybe? It wouldn't be the most efficient implementation, but might be worth checking if it would be faster for num_channels==1 compared to what we already have
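As a hedged illustration of that suggestion (not part of the PR), the unpack step could be parameterized over the channel count, zero-padding the unused bytes of each uint32 pixel:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical generalization: widen 1-, 2-, 3- or 4-channel uint8 pixels to
// one uint32 per pixel; unused bytes are zero-padded.
void unpack_to_rgbx(uint8_t* unpacked, const uint8_t* packed,
                    int num_pixels, int num_channels /* must be <= 4 */) {
  for (int i = 0; i < num_pixels; i++) {
    std::memcpy(unpacked + 4 * i, packed + num_channels * i, num_channels);
    for (int c = num_channels; c < 4; c++) {
      unpacked[4 * i + c] = 0;  // pad the unused channels
    }
  }
}
```

As the comment notes, for num_channels == 1 this wastes most of the arithmetic, so it would need benchmarking against the existing path.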
```cpp
  /* coefficient buffer */
  /* malloc check ok, overflow checked above */
  kk = (double *)malloc(outSize * ksize * sizeof(double));
```
Ideally, all mallocs should use PyTorch's allocator, so that you don't need to handle the frees yourself.
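A minimal sketch of that suggestion (hedged: `kk_buffer` is an illustrative name; `outSize` and `ksize` are the variables from the excerpt above):

```cpp
#include <ATen/ATen.h>

// Hypothetical replacement for the malloc above: allocate the coefficient
// buffer through a Tensor, so PyTorch's CPU allocator owns the memory and
// no explicit free() is needed.
at::Tensor kk_buffer = at::empty({outSize * ksize}, at::kDouble);
double* kk = kk_buffer.data_ptr<double>();
// ... fill and use kk; the buffer is released when kk_buffer goes out of scope.
```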
Closing this in favor of #90771, which is more complete.
This is heavily adapted / ported from PILLOW-SIMD.

This PR adds support for `uint8` images in the following case:

- channels_last memory format, (1, 3, H, W) shape, i.e. a batch of one RGB image
- CPU
- mode=bilinear
- antialias=True -- support for other modes is possible

This may sound restrictive, but it's not: this is exactly the setting in which torchvision's `Resize()` is used for training jobs.

This is still WIP with lots of TODOs, but it seems to be working decently. On the inputs I tried, comparing with torchvision's `Resize()` (which first converts uint8 to float, runs `interpolate()`, and converts back to uint8), I'm getting a ~3-5x speedup. It seems correct so far: the absolute difference in the outputs is never > 1, and only ~15% of the pixel values differ (by exactly 1).

This addresses pytorch/vision#2289.
@vfdev-5 @mingfeima @fmassa I'd love your initial thoughts on this!
cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10