Add clang-tidy, fix narrowing issues, fix constness #65
copybara-service[bot] merged 3 commits into google:dev from
Conversation
ops.h
Outdated
  constexpr float eps = 1e-6f;
  float ss = SquaredL2(x, size);
- ss = 1.0f / sqrtf(ss / static_cast<int>(size) + eps);
+ ss = 1.0f / sqrtf(ss / static_cast<float>(size) + eps);
FYI, this is actually a slight pessimization: it is cheaper to convert int32 to float than u64 to float. If we want to be explicit, static_cast<float>(static_cast<int>(size)) would be ideal.
I agree, that is true
What do you think of adding a function like FloatFromSmallInt? Could be a template supporting types such as size_t, and would avoid replicating these two casts in various places.
Thanks BTW for adding const and fixing these issues :)
I was thinking of creating a safe_cast function to ensure casting without causing narrowing. But in the meantime I will go with the FloatFromSmallInt approach.
Sounds good. How about we only do range checks (if any) in debug builds, to avoid overhead? You can use HWY_DASSERT for such things.
This looks pretty good. @jan-wassenberg, if this looks alright to you we can probably go ahead and merge.
ops.h
Outdated
  std::is_arithmetic_v<To> && std::is_arithmetic_v<From>, To>
  StaticCast(From from) noexcept {
    if constexpr (std::is_unsigned_v<From> && std::is_floating_point_v<To>)
+     return static_cast<To>(static_cast<int64_t>(from));
int64_t -> float is also somewhat expensive. How about we replace int64_t with hwy::SignedFromSize<sizeof(from)>? This evaluates to int for From=float.
Ah, it seems I've created traits similar to hwy::SignedFromSize. No need to reinvent the wheel; I'll just use that.
I think it would be hwy::SignedFromSize<2 * sizeof(from)>, right?
Just out of curiosity, I've experimented with this function on Compiler Explorer and noticed that on 64-bit machines, int64 can sometimes yield faster performance. What's your take on this?
float func1(size_t size)
{
return static_cast<float>(static_cast<int64_t>(size));
}
float func2(size_t size)
{
return static_cast<float>(static_cast<int>(size));
}
On x86-64:
**clang-14**
func1(unsigned long): # @func1(unsigned long)
push rbp
mov rbp, rsp
mov qword ptr [rbp - 8], rdi
cvtsi2ss xmm0, qword ptr [rbp - 8]
pop rbp
ret
func2(unsigned long): # @func2(unsigned long)
push rbp
mov rbp, rsp
mov qword ptr [rbp - 8], rdi
mov rax, qword ptr [rbp - 8]
cvtsi2ss xmm0, eax
pop rbp
ret
**clang-17**
func1(unsigned long): # @func1(unsigned long)
push rbp
mov rbp, rsp
mov qword ptr [rbp - 8], rdi
cvtsi2ss xmm0, qword ptr [rbp - 8]
pop rbp
ret
func2(unsigned long): # @func2(unsigned long)
push rbp
mov rbp, rsp
mov qword ptr [rbp - 8], rdi
mov rax, qword ptr [rbp - 8]
cvtsi2ss xmm0, eax
pop rbp
ret
**gcc-13**
func1(unsigned long):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
pxor xmm0, xmm0
cvtsi2ss xmm0, rax
pop rbp
ret
func2(unsigned long):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
pxor xmm0, xmm0
cvtsi2ss xmm0, eax
pop rbp
ret
On Arm:
**clang-17**
func1(unsigned long): // @func1(unsigned long)
sub sp, sp, #16
str x0, [sp, #8]
ldr d0, [sp, #8]
fmov x8, d0
scvtf s0, x8
add sp, sp, #16
ret
func2(unsigned long): // @func2(unsigned long)
sub sp, sp, #16
str x0, [sp, #8]
ldr x8, [sp, #8]
scvtf s0, w8
add sp, sp, #16
ret
**gcc-13**
func1(unsigned long):
sub sp, sp, #16
str x0, [sp, 8]
ldr x0, [sp, 8]
scvtf s0, x0
add sp, sp, 16
ret
func2(unsigned long):
sub sp, sp, #16
str x0, [sp, 8]
ldr x0, [sp, 8]
fmov s0, w0
scvtf s0, s0
add sp, sp, 16
ret
Interesting, thanks for showing the exact codegen. It's not 100% obvious which is better because 64-bit cvtsi2ss has lower throughput, and we can afford an extra reg. Looks like same-size is indeed better on Arm, though, and it's not super time-critical, so let's go with your current code 👍
Looks good to me as well, thanks for adding the StaticCast :) Just one more small suggestion, then we're good to go. Or let me know if you'd prefer we make this change.
Unfortunately, I did not find tests in the repo to check against.