Conversation
…te the most significant ON bit's index using the assembly instruction 'bsr' and add 1 to find the next highest 2**N >= x
|
Any reason to use inline assembly directly, as opposed to whatever compiler intrinsics are available? It'd be nice to get the compiler+arch-specificity down to just compiler-specificity. |
|
I thought that with "bsrq" it would be "what we see is what we get", i.e. what is in the code is what is executed, which makes our intent clear to the reader. Below is the complete context.

Ideally we could have simply used `1ULL << (64 - __builtin_clzll(x - 1))` for our purpose. We can see that xorq and subl are just adding 1 to the result of bsrq (the first is an xor with 63 and the second a subtraction from 64, which together just add 1 to the result of bsrq in %eax). What we needed was a single `addl $1, %eax` instead of 2 instructions. This is why we needed the second version, `(63 ^ __builtin_clzll(x - 1)) + 1`, to force the gcc compiler to produce optimized code.

Just for completeness' sake: `(64 - __builtin_clzll(x - 1))` is equivalent to calculating the 2's complement of `__builtin_clzll(x - 1)` in a 6-bit world, and thus equivalent to an xor with `(2 ** 6 - 1) = 63` followed by adding 1. A second way to look at it: `(64 - __builtin_clzll(x - 1))` => `(63 - __builtin_clzll(x - 1)) + 1` => `(63 ^ __builtin_clzll(x - 1)) + 1`.

As a reader of the code, it was not immediately obvious why `(64 - __builtin_clzll(x - 1))` was not used and its convoluted version `(63 ^ __builtin_clzll(x - 1)) + 1` was used instead. I thought it might be easier and more straightforward for someone (as well as for the compiler) to read the "bsrq" assembly instruction and understand quickly that we simply wanted to calculate the index of the most significant ON bit and shift one more to get the answer. Either way we end up with "bsrq"/"addl" in the generated assembly, so I thought: why not use this instruction directly instead of arriving at it in a roundabout way through `(63 ^ __builtin_clzll(x - 1)) + 1`?

I agree that compiler intrinsics are more portable/better and should be preferred. I am thinking that the presence of the "if" condition is probably a deal breaker and defeats the purpose of this code change. The same problem would exist with `__builtin_clzll(x - 1)`, as this intrinsic is undefined for x = 0 and 1 and we would need the "if" to handle the edge case.
I am not sure whether the current way of setting all the lower bits to 1 by bit shifts (and finally adding 1) is better (with 6 ORs and 6 SHIFTs), or whether the new approach with the "if" would perform better. |
|
IIUC (please correct me if I am wrong), given:

```c
uint64_t pow2_ceil_u64(uint64_t x) {
	if (x <= 1) { return x; }
	return 1ULL << (64 - __builtin_clzll(x - 1));
}
```

this invokes undefined behavior if x > 2**63: `__builtin_clzll(x - 1)` is then 0, so the shift count is 64, and shifting a 64-bit value by 64 is undefined in C.
(emphasis mine) This would apply to all the proposed versions that use a shift, including the one implemented in the PR that uses inline assembly, whenever x > 2**63. |
|
FWIW, a possible fix for the undefined behavior would be to, e.g., zero-extend 1ULL to a 128-bit-wide integer, shift that by N (where N can be 64), and then truncate to a 64-bit integer. Such an implementation generates the following assembly with LLVM:

```asm
pow2_ceil::pow2_ceil_clz:
	push rbp
	mov rbp, rsp
	cmp rdi, 2
	jb LBB1_2
	dec rdi
	bsr rcx, rdi
	xor ecx, 192
	add ecx, 65
	mov eax, 1
	shl rax, cl
	xor edi, edi
	test cl, 64
	cmove rdi, rax
LBB1_2:
	mov rax, rdi
	pop rbp
	ret
```

which is far from perfect, but still appears to offer a slight performance advantage over the current implementation. I've benchmarked this implementation against the current one using the sequence of
And in a new Skylake:
That's 1.14x faster on Skylake and 1.4x faster on Haswell. If the assembly is suboptimal, we should file GCC and Clang bugs. |
|
I think if we're trying to compute the next power of two for numbers > 2**63, we'll be in trouble no matter what. I think in practice we filter these out with size checks against large_maxclass, so the "0 clz" issue won't happen. (We should probably add an assert, though). |
|
TBH though, this discussion has convinced me that we should just use inline assembly on x86, and be OK with so-so code generation everywhere else. I can't find anything that gets both gcc and clang to emit something reasonable. |
|
I think we should do both: use inline assembly on x86. In both cases we could add a debug assert to the code, but if the situations that might trigger it cannot happen (I don't know where this is used) then a comment might be enough: my point was only that the replacements were not functionally equivalent to the original code for all inputs, and that there didn't seem to be any tests covering these edge cases. |
|
@rkmisra if you don't have time to finish this let me know and I can send a PR to your branch with the requested changes. |
|
The proposed changes are only in the bit_util.h file. Please ignore the other files (extents.c & extents_struct.h) in the PR, which got included from other commits by mistake. Once we have finalized the changes, I will create a new, clean PR with only the bit_util.h changes. |
|
This LGTM, but on MSVC you probably need to use |
Fall back to the older XOR way of doing things on Windows and other platforms where the gcc builtin is not available.
|
I didn't use _BitScanReverse on MSVC in the current commit, as there were just too many if/elses for the different platforms. In the MSVC case, we fall back to the older xor implementation. Please let me know if you feel strongly about using _BitScan* instead of xor for MSVC.
I don't, and I think it would be the wrong call anyway. What we could do is implement the |
|
@rkmisra could you resolve the conflicts here? This looks ready. |
|
Closing this as #1303 overrides this. |
This is more of a review request, and maybe a way to get more feedback. I was trying to optimize pow2_ceil_u64 (which gets the next 2 ** N >= x) based on the comments given in #674. The logic is very simple and straightforward:
Calculate the index "i" of the most significant "1" bit of the number using the instruction bsr (first subtracting 1 from the number to handle the special case of the number being exactly 2 ** N), add one to the index, and calculate 2 ** (i + 1) (through a bit shift). It works fine except that "bsr" cannot handle the cases number = 0 or number = 1 (neither is well defined after subtracting 1, since bsr is undefined for a 0 operand). To handle these 2 corner cases, I had to add an "if" statement, but that kind of defeats the purpose of the optimization, as it could cause a pipeline stall due to the cmp and jne instructions.
If we know that we will never encounter 0 and 1 in the real world, we could remove the if statement so the generated assembly is a straight fall-through, or accept the new (probably wrong) values for those 2 cases in exchange for optimizing the others. Without the if statement, the new code gives 1 for x = 0 and 2 for x = 1.
Any other suggestions?