This repository was archived by the owner on Jun 2, 2025. It is now read-only.

Optimize pow2 674 #1118

Closed

rkmisra wants to merge 11 commits into jemalloc:dev from rkmisra:optimize_pow2_674

Conversation

@rkmisra
Contributor

rkmisra commented Jan 25, 2018

This is more of a review request, and maybe a way to get more feedback. I was trying to optimize the pow2_ceil_u64 calculation (which returns the next 2 ** N that is >= x) based on the comments in #674. The logic is simple and straightforward:

Calculate the index "i" of the most significant "1" bit of the number using the bsr instruction (after subtracting 1 from the number to handle the special case of the number already being 2 ** N), add one to the index, and compute 2 ** (i + 1) through a bit shift. This works fine except that "bsr" cannot handle number = 0 or number = 1 (after subtracting 1 the operand is 0, for which bsr is undefined). To handle these two corner cases I had to add an "if" statement, but that somewhat defeats the purpose of the optimization, since the cmp and jne could cause a pipeline stall.

If we know we will never encounter 0 or 1 in the real world, we could remove the if statement so that the generated assembly is straight fall-through, or we could accept new (probably wrong) values for those two cases as the cost of optimizing all the others. Without the if statement, the new code returns 1 for x=0 and 2 for x=1.
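
A rough sketch of this approach in C with GCC-style inline assembly (illustrative only; the exact code and asm constraints in the PR may differ):

#include <stdint.h>

/* Sketch: next power of two via bsr on x86-64 (GCC/Clang inline asm). */
static inline uint64_t pow2_ceil_u64_bsr(uint64_t x) {
    uint64_t i;
    /* bsr is undefined for a zero operand, which (x - 1) would be for
     * x = 0 or x = 1, hence the "if" discussed above. */
    if (x <= 1) {
        return x;
    }
    /* i = index of the most significant set bit of (x - 1). */
    __asm__("bsrq %1, %0" : "=r"(i) : "r"(x - 1));
    /* The next power of two is one bit position above that index.
     * (For x > 2 ** 63 the shift count would be 64; see the undefined
     * behavior discussion later in the thread.) */
    return 1ULL << (i + 1);
}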

Any other suggestion?

@davidtgoldblatt
Contributor

Any reason to use inline assembly directly, as opposed to whatever compiler intrinsics are available? It'd be nice to get the compiler+arch specificity down to just compiler specificity.

@rkmisra
Contributor Author

rkmisra commented Jan 31, 2018

I thought that with "bsrq" it would be a case of "what you see is what you get", i.e. what is in the code is what gets executed, and it makes our intent clear to the reader of the code.

Below is the complete context:

Ideally we could have simply used 1ULL << (64 - __builtin_clzll(x - 1)) for our purpose, but gcc produces the following kind of code for it:


movl	$64, %ecx
movl	$1, %edx
subq	$1, %rax
bsrq	%rax, %rax
xorq	$63, %rax
subl	%eax, %ecx
xorl	%eax, %eax
salq	%cl, %rdx

We can see that the xorq and subl together just add 1 to the result of bsrq (the first is an xor with 63, the second a subtraction from 64, which combined add 1 to the bsrq result in %eax). What we want is a single addl $1, %eax instead of two instructions. This is why the second version, (63 ^ __builtin_clzll(x - 1)) + 1, is needed to force gcc to produce the optimized code.

Just for completeness' sake, (64 - __builtin_clzll(x - 1)) is equivalent to calculating the 2's complement of __builtin_clzll(x - 1) in a 6-bit world, and thus equivalent to xor with (2 ** 6 - 1) = 63 followed by adding 1. Another way to look at it is (64 - __builtin_clzll(x - 1)) => (63 - __builtin_clzll(x - 1)) + 1 => (63 ^ __builtin_clzll(x - 1)) + 1.

As a reader of the code, it was not immediately obvious why (64 - __builtin_clzll(x - 1)) was not used and why its convoluted version, (63 ^ __builtin_clzll(x - 1)) + 1, was used instead.

I thought it might be easier and more straightforward for someone (as well as for the compiler) to read the "bsrq" assembly instruction and understand quickly that we simply want to find the index of the most significant set bit and shift one position past it to get the answer. Either way we end up with "bsrq"/"addl" in the generated assembly, so I thought: why not use the instruction directly instead of reaching it in a roundabout way through (63 ^ __builtin_clzll(x - 1)) + 1?
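
For reference, the two formulations being compared look roughly like this in C (a sketch only; the function names here are made up for illustration):

#include <stdint.h>

/* Both compute the next power of two for 2 <= x <= 2 ** 63 and are
 * mathematically identical; gcc just emits tighter code for the xor
 * form (bsr + add rather than bsr + xor + sub). */
static inline uint64_t pow2_ceil_sub(uint64_t x) {
    return 1ULL << (64 - __builtin_clzll(x - 1));
}

static inline uint64_t pow2_ceil_xor(uint64_t x) {
    return 1ULL << ((63 ^ __builtin_clzll(x - 1)) + 1);
}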

I agree that compiler intrinsics are more portable/better and should be preferred.

I am thinking that the presence of the "if" condition is probably the deal breaker and defeats the purpose of this code change. The same problem would exist with __builtin_clzll(x - 1), which is not well defined for x=0 and x=1, so we would need the "if" to handle those edge cases.

I am not sure whether the current approach of setting all lower bits to 1 via bit shifts (and finally adding 1), with its 6 ORs and 6 shifts, is better, or whether the new approach with the "if" would perform better.
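
The shift-based approach referred to here is the classic bit-smearing pattern; jemalloc's existing pow2_ceil_u64 looks roughly like this (a sketch from memory, the exact bit_util.h code may differ):

#include <stdint.h>

/* Smear the highest set bit of (x - 1) into all lower positions, then
 * add 1 to reach the next power of two. x = 0 wraps around and yields
 * 0, x = 1 yields 1, and neither needs a branch. */
static inline uint64_t pow2_ceil_u64_smear(uint64_t x) {
    x--;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    x++;
    return x;
}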

@gnzlbg
Contributor

gnzlbg commented Aug 1, 2018

IIUC (please correct me if I am wrong), given:

uint64_t pow2_ceil_u64(uint64_t x) {
    if (x <= 1) { return x; }
    return 1ULL << (64 - __builtin_clzll(x - 1));
}

this invokes undefined behavior when __builtin_clzll returns 0, i.e. when the shift count becomes 64 (see https://en.cppreference.com/w/cpp/language/operator_arithmetic):

In any case, if the value of the right operand is negative or is greater or equal to the number of bits in the promoted left operand, the behavior is undefined.

(emphasis mine)

This would apply to all of the proposed versions that use a shift, including the one implemented in the PR with inline assembly, whenever ret + 1 can be 64 and ULL is a 64-bit integer, which is the case on many platforms.

@gnzlbg
Contributor

gnzlbg commented Aug 1, 2018

FWIW a possible fix for the undefined behavior would be to, e.g., zero-extend 1ULL to a 128-bit wide integer, shift that by N (where N can be 64), and then truncate to a 64-bit integer. Such an implementation generates the following assembly with LLVM:

pow2_ceil::pow2_ceil_clz:
 push    rbp
 mov     rbp, rsp
 cmp     rdi, 2
 jb      LBB1_2
 dec     rdi
 bsr     rcx, rdi
 xor     ecx, 192
 add     ecx, 65
 mov     eax, 1
 shl     rax, cl
 xor     edi, edi
 test    cl, 64
 cmove   rdi, rax
LBB1_2:
 mov     rax, rdi
 pop     rbp
 ret

which is far from perfect, but still appears to offer a slight performance advantage over the xor-based one (and would be portable, so we wouldn't have to keep two implementations around).
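
A minimal C sketch of that widening idea (assuming unsigned __int128, which GCC and Clang provide on 64-bit targets; the mangled symbol in the assembly above suggests the original benchmark was Rust, so this is only an approximation):

#include <stdint.h>

/* Shifting a 128-bit 1 by a count of up to 64 is well defined, so the
 * x > 2 ** 63 case truncates to 0 instead of invoking undefined
 * behavior. */
static inline uint64_t pow2_ceil_u64_clz128(uint64_t x) {
    if (x <= 1) {
        return x;
    }
    unsigned __int128 one = 1;
    return (uint64_t)(one << (64 - __builtin_clzll(x - 1)));
}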

I've benchmarked this implementation against the current one (the xor/shift sequence) by timing how long it takes to compute the result for all integers in [0, u32::max/1000] on an old Haswell CPU:

  • clz: 14,236,503 ns / iteration (+/- 1,509,502)
  • xor: 19,303,768 ns / iteration (+/- 1,986,877)

And on a newer Skylake:

  • clz: 17,461,318 ns / iteration (+/- 8,456)
  • xor: 19,990,007 ns / iteration (+/- 337,008)

That's 1.14x faster on Skylake and 1.4x faster on Haswell. If the assembly is suboptimal, we should file GCC and Clang bugs.

@davidtgoldblatt
Contributor

davidtgoldblatt commented Aug 1, 2018

I think if we're trying to compute the next power of two for numbers > 2**63, we'll be in trouble no matter what. I think in practice we filter these out with size checks against large_maxclass, so the "0 clz" issue won't happen.

(We should probably add an assert, though).
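
For example, something like this, using a plain C assert purely as an illustration (jemalloc would use its own internal assert macro):

#include <assert.h>

/* Inputs above 2 ** 63 have no representable next power of two. */
assert(x <= (1ULL << 63));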

@davidtgoldblatt
Contributor

TBH though, this discussion has convinced me that we should just use inline assembly on x86, and be OK with so-so code generation everywhere else. I can't find anything that gets both gcc and clang to emit something reasonable.

@gnzlbg
Contributor

gnzlbg commented Aug 1, 2018

I think we should do both, use inline assembly on x86/x86_64 and use clz everywhere else (to me the clz code feels simpler to understand and produces slightly better code).

In both cases we could add a debug assert to the code, but if the situations that might trigger it cannot happen (I don't know where this is used) then a comment might be enough: my point was only that the replacements were not functionally equivalent to the original code for all inputs, and that there didn't seem to be any tests covering these edge cases.

@gnzlbg
Contributor

gnzlbg commented Aug 2, 2018

@rkmisra if you don't have time to finish this let me know and I can send a PR to your branch with the requested changes.

@rkmisra
Contributor Author

rkmisra commented Aug 3, 2018

The proposed changes are only in the bit_util.h file. Please ignore the other files (extents.c & extents_struct.h) in the PR, which were included from other commits by mistake. Once we have finalized the changes, I will create a new clean PR with only the bit_util.h changes.

@gnzlbg
Contributor

gnzlbg commented Aug 3, 2018

This LGTM, but on MSVC you probably need to use _BitScanReverse, _BitScanReverse64.

Fall back to the older XOR way of doing things for Windows and other platforms where the gcc builtin is not available.
@rkmisra
Contributor Author

rkmisra commented Aug 3, 2018

I didn't use _BitScanReverse on MSVC in the current commit, as there were just too many if/elses for the different platforms. In the MSVC case we fall back to the older xor implementation. Please let me know if you feel strongly about using _BitScan* instead of xor for MSVC.

@gnzlbg
Contributor

gnzlbg commented Aug 3, 2018

Please let me know if you feel strongly about using _BitScan* instead of xor for MSVC.

I don't, and I think it would be the wrong call anyway.

What we could do is implement the __builtin_clz (and friends) on top of _BitScan* for MSVC and enable JEMALLOC_HAVE_BUILTIN_CLZ for the whole library instead, but that should be done in a different PR.
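
A rough sketch of what such a shim might look like (hypothetical, not part of this PR):

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>

/* Emulate __builtin_clzll() with _BitScanReverse64(). As with the GCC
 * builtin, the caller must guarantee x != 0. */
static __forceinline int clzll_msvc(unsigned long long x) {
    unsigned long index;
    _BitScanReverse64(&index, x);
    return 63 - (int)index;
}
#endif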

@gnzlbg
Contributor

gnzlbg commented Aug 10, 2018

@rkmisra could you resolve the conflicts here? This looks ready.

@rkmisra
Contributor Author

rkmisra commented Aug 11, 2018

Closing this, as #1303 supersedes it.

rkmisra closed this Aug 11, 2018