
Commit 2435d94

Michael Carilli authored and facebook-github-bot committed
Fix FP16 fastAtomicAdd for one case where tensor start address is not 32 bit aligned (#44642)
Summary: For #44206 and #42218, I'd like to update trilinear interpolate backward and grid_sample backward to use `fastAtomicAdd`. As a prelude, I spotted a UB risk in `fastAtomicAdd`: I think the existing code incurs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (`index % 2 == 1` and `reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) != 0`). In this case we think we're `!low_bit` and go down the `!low_bit` code path, but in fact the target element is the low half of its 32-bit word. The discussion on the original [fastAtomicAdd PR](#21879) does not appear to have considered that case explicitly. I wanted to push my tentative fix for discussion ASAP; cc jjsjann123 and mkolod as the original authors of `fastAtomicAdd`. (I'm also curious why we need the `reinterpret_cast<std::uintptr_t>(tensor)` for the address modding, but that's minor.)

Pull Request resolved: #44642
Reviewed By: mruberry
Differential Revision: D23699820
Pulled By: ngimel
fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd
1 parent 2fd142a commit 2435d94

File tree: 1 file changed (+7, −6 lines)


aten/src/ATen/native/cuda/KernelUtils.cuh

Lines changed: 7 additions & 6 deletions
```diff
@@ -21,20 +21,21 @@ __device__ __forceinline__ void fastSpecializedAtomicAdd(
         reinterpret_cast<at::Half*>(tensor) + index,
         static_cast<at::Half>(value));
 #else
-  bool low_bit = (index % 2 == 0) &&
-    (reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) == 0);
+  // Accounts for the chance tensor falls on an odd 16 bit alignment (ie, not 32 bit aligned)
+  __half* target_addr = reinterpret_cast<__half*>(tensor + index);
+  bool low_byte = (reinterpret_cast<std::uintptr_t>(target_addr) % sizeof(__half2) == 0);

-  if (low_bit && index < (numel - 1)) {
+  if (low_byte && index < (numel - 1)) {
     __half2 value2;
     value2.x = value;
     value2.y = __int2half_rz(0);
-    atomicAdd(reinterpret_cast<__half2*>(tensor) + index / 2, value2);
+    atomicAdd(reinterpret_cast<__half2*>(target_addr), value2);

-  } else if (!low_bit && index > 0) {
+  } else if (!low_byte && index > 0) {
     __half2 value2;
     value2.x = __int2half_rz(0);
     value2.y = value;
-    atomicAdd(reinterpret_cast<__half2*>(tensor) + index / 2, value2);
+    atomicAdd(reinterpret_cast<__half2*>(target_addr - 1), value2);

   } else {
     atomicAdd(
```
