This repository was archived by the owner on Jun 2, 2025. It is now read-only.

Optimize pow2 674 #1118

Closed

rkmisra wants to merge 11 commits into jemalloc:dev from rkmisra:optimize_pow2_674

Conversation

@rkmisra
Contributor

rkmisra commented Jan 25, 2018

This is more of a review request, and maybe a way to get more feedback. I was trying to optimize the pow2_ceil_u64 calculation (which returns the next 2 ** N that is >= x) based on the comments in #674. The logic is simple and straightforward:

Calculate the index "i" of the most significant "1" bit of the number using the bsr instruction (after subtracting 1 from the number to handle the special case of the number already being 2 ** N), add one to the index, and compute 2 ** (i + 1) through a bit shift. This works fine except that "bsr" cannot handle number = 0 or number = 1 (after subtracting 1 the operand is 0, for which bsr is undefined). To handle these two corner cases I had to add an "if" statement, but that somewhat defeats the purpose of the optimization, since the cmp and jne could cause a pipeline stall.

If we know we will never encounter 0 or 1 in the real world, we could remove the if statement so that the generated assembly is straight fall-through, or we could accept new (probably wrong) values for those two cases as the cost of optimizing all the others. Without the if statement, the new code returns 1 for x=0 and 2 for x=1.
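
A rough sketch of this approach in C with GCC-style inline assembly (illustrative only; the exact code and asm constraints in the PR may differ):

#include <stdint.h>

/* Sketch: next power of two via bsr on x86-64 (GCC/Clang inline asm). */
static inline uint64_t pow2_ceil_u64_bsr(uint64_t x) {
    uint64_t i;
    /* bsr is undefined for a zero operand, which (x - 1) would be for
     * x = 0 or x = 1, hence the "if" discussed above. */
    if (x <= 1) {
        return x;
    }
    /* i = index of the most significant set bit of (x - 1). */
    __asm__("bsrq %1, %0" : "=r"(i) : "r"(x - 1));
    /* The next power of two is one bit position above that index.
     * (For x > 2 ** 63 the shift count would be 64; see the undefined
     * behavior discussion later in the thread.) */
    return 1ULL << (i + 1);
}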

Any other suggestion?

@davidtgoldblatt
Contributor

Any reason to use inline assembly directly, as opposed to whatever compiler intrinsics are available? It'd be nice to get the compiler+arch specificity down to just compiler specificity.

@rkmisra
Contributor Author

rkmisra commented Jan 31, 2018

I thought that with "bsrq" it would be a case of "what you see is what you get", i.e. what is in the code is what gets executed, and it makes our intent clear to the reader of the code.

Below is the complete context:

Ideally we could have simply used 1ULL << (64 - __builtin_clzll(x - 1)) for our purpose, but gcc produces the following kind of code for it:


movl	$64, %ecx
movl	$1, %edx
subq	$1, %rax
bsrq	%rax, %rax
xorq	$63, %rax
subl	%eax, %ecx
xorl	%eax, %eax
salq	%cl, %rdx

We can see that the xorq and subl together just add 1 to the result of bsrq (the first is an xor with 63, the second a subtraction from 64, which combined add 1 to the bsrq result in %eax). What we want is a single addl $1, %eax instead of two instructions. This is why the second version, (63 ^ __builtin_clzll(x - 1)) + 1, is needed to force gcc to produce the optimized code.

Just for completeness' sake, (64 - __builtin_clzll(x - 1)) is equivalent to calculating the 2's complement of __builtin_clzll(x - 1) in a 6-bit world, and thus equivalent to xor with (2 ** 6 - 1) = 63 followed by adding 1. Another way to look at it is (64 - __builtin_clzll(x - 1)) => (63 - __builtin_clzll(x - 1)) + 1 => (63 ^ __builtin_clzll(x - 1)) + 1.

As a reader of the code, it was not immediately obvious why (64 - __builtin_clzll(x - 1)) was not used and why its convoluted version, (63 ^ __builtin_clzll(x - 1)) + 1, was used instead.

I thought it might be easier and more straightforward for someone (as well as for the compiler) to read the "bsrq" assembly instruction and understand quickly that we simply want to find the index of the most significant set bit and shift one position past it to get the answer. Either way we end up with "bsrq"/"addl" in the generated assembly, so I thought: why not use the instruction directly instead of reaching it in a roundabout way through (63 ^ __builtin_clzll(x - 1)) + 1?
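
For reference, the two formulations being compared look roughly like this in C (a sketch only; the function names here are made up for illustration):

#include <stdint.h>

/* Both compute the next power of two for 2 <= x <= 2 ** 63 and are
 * mathematically identical; gcc just emits tighter code for the xor
 * form (bsr + add rather than bsr + xor + sub). */
static inline uint64_t pow2_ceil_sub(uint64_t x) {
    return 1ULL << (64 - __builtin_clzll(x - 1));
}

static inline uint64_t pow2_ceil_xor(uint64_t x) {
    return 1ULL << ((63 ^ __builtin_clzll(x - 1)) + 1);
}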

I agree that compiler intrinsics are more portable/better and should be preferred.

I am thinking that the presence of the "if" condition is probably the deal breaker and defeats the purpose of this code change. The same problem would exist with __builtin_clzll(x - 1), which is not well defined for x=0 and x=1, so we would need the "if" to handle those edge cases.

I am not sure whether the current approach of setting all lower bits to 1 via bit shifts (and finally adding 1), with its 6 ORs and 6 shifts, is better, or whether the new approach with the "if" would perform better.
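
The shift-based approach referred to here is the classic bit-smearing pattern; jemalloc's existing pow2_ceil_u64 looks roughly like this (a sketch from memory, the exact bit_util.h code may differ):

#include <stdint.h>

/* Smear the highest set bit of (x - 1) into all lower positions, then
 * add 1 to reach the next power of two. x = 0 wraps around and yields
 * 0, x = 1 yields 1, and neither needs a branch. */
static inline uint64_t pow2_ceil_u64_smear(uint64_t x) {
    x--;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    x++;
    return x;
}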

@gnzlbg
Contributor

gnzlbg commented Aug 1, 2018

IIUC (please correct me if I am wrong), given:

uint64_t pow2_ceil_u64(uint64_t x) {
    if (x <= 1) { return x; }
    return 1ULL << (64 - __builtin_clzll(x - 1));
}

this invokes undefined behavior when __builtin_clzll returns 0, i.e. when the shift count becomes 64 (see https://en.cppreference.com/w/cpp/language/operator_arithmetic):

In any case, if the value of the right operand is negative or is greater or equal to the number of bits in the promoted left operand, the behavior is undefined.

(emphasis mine)

This would apply to all of the proposed versions that use a shift, including the one implemented in the PR with inline assembly, whenever ret + 1 can be 64 and ULL is a 64-bit integer, which is the case on many platforms.

@gnzlbg
Contributor

gnzlbg commented Aug 1, 2018

FWIW a possible fix for the undefined behavior would be to, e.g., zero-extend 1ULL to a 128-bit wide integer, shift that by N (where N can be 64), and then truncate to a 64-bit integer. Such an implementation generates the following assembly with LLVM:

pow2_ceil::pow2_ceil_clz:
 push    rbp
 mov     rbp, rsp
 cmp     rdi, 2
 jb      LBB1_2
 dec     rdi
 bsr     rcx, rdi
 xor     ecx, 192
 add     ecx, 65
 mov     eax, 1
 shl     rax, cl
 xor     edi, edi
 test    cl, 64
 cmove   rdi, rax
LBB1_2:
 mov     rax, rdi
 pop     rbp
 ret

which is far from perfect, but still appears to offer a slight performance advantage over the xor-based one (and would be portable, so we wouldn't have to keep two implementations around).
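
A minimal C sketch of that widening idea (assuming unsigned __int128, which GCC and Clang provide on 64-bit targets; the mangled symbol in the assembly above suggests the original benchmark was Rust, so this is only an approximation):

#include <stdint.h>

/* Shifting a 128-bit 1 by a count of up to 64 is well defined, so the
 * x > 2 ** 63 case truncates to 0 instead of invoking undefined
 * behavior. */
static inline uint64_t pow2_ceil_u64_clz128(uint64_t x) {
    if (x <= 1) {
        return x;
    }
    unsigned __int128 one = 1;
    return (uint64_t)(one << (64 - __builtin_clzll(x - 1)));
}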

I've benchmarked this implementation against the current one (the xor/shift sequence) by timing how long it takes to compute the result for all integers in [0, u32::max/1000] on an old Haswell CPU:

  • clz: 14,236,503 ns / iteration (+/- 1,509,502)
  • xor: 19,303,768 ns / iteration (+/- 1,986,877)

And on a newer Skylake:

  • clz: 17,461,318 ns / iteration (+/- 8,456)
  • xor: 19,990,007 ns / iteration (+/- 337,008)

That's 1.14x faster on Skylake and 1.4x faster on Haswell. If the assembly is suboptimal, we should file GCC and Clang bugs.

@davidtgoldblatt
Contributor

davidtgoldblatt commented Aug 1, 2018

I think if we're trying to compute the next power of two for numbers > 2**63, we'll be in trouble no matter what. I think in practice we filter these out with size checks against large_maxclass, so the "0 clz" issue won't happen.

(We should probably add an assert, though).
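
For example, something like this, using a plain C assert purely as an illustration (jemalloc would use its own internal assert macro):

#include <assert.h>

/* Inputs above 2 ** 63 have no representable next power of two. */
assert(x <= (1ULL << 63));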

@davidtgoldblatt
Contributor

TBH though, this discussion has convinced me that we should just use inline assembly on x86, and be OK with so-so code generation everywhere else. I can't find anything that gets both gcc and clang to emit something reasonable.

@gnzlbg
Contributor

gnzlbg commented Aug 1, 2018

I think we should do both, use inline assembly on x86/x86_64 and use clz everywhere else (to me the clz code feels simpler to understand and produces slightly better code).

In both cases we could add a debug assert to the code, but if the situations that might trigger it cannot happen (I don't know where this is used) then a comment might be enough: my point was only that the replacements were not functionally equivalent to the original code for all inputs, and that there didn't seem to be any tests covering these edge cases.

@gnzlbg
Contributor

gnzlbg commented Aug 2, 2018

@rkmisra if you don't have time to finish this let me know and I can send a PR to your branch with the requested changes.

@rkmisra
Contributor Author

rkmisra commented Aug 3, 2018

The proposed changes are only in the bit_util.h file. Please ignore the other files (extents.c & extents_struct.h) in the PR, which were included from other commits by mistake. Once we have finalized the changes, I will create a new clean PR with only the bit_util.h changes.

@gnzlbg
Contributor

gnzlbg commented Aug 3, 2018

This LGTM, but on MSVC you probably need to use _BitScanReverse, _BitScanReverse64.

Fall back to the older XOR way of doing things for Windows and other platforms where the gcc builtin is not available.
@rkmisra
Contributor Author

rkmisra commented Aug 3, 2018

I didn't use _BitScanReverse on MSVC in the current commit, as there were just too many if/elses for the different platforms. In the MSVC case we fall back to the older xor implementation. Please let me know if you feel strongly about using _BitScan* instead of xor for MSVC.

@gnzlbg
Contributor

gnzlbg commented Aug 3, 2018

Please let me know if you feel strongly about using _BitScan* instead of xor for MSVC.

I don't, and I think it would be the wrong call anyway.

What we could do is implement the __builtin_clz (and friends) on top of _BitScan* for MSVC and enable JEMALLOC_HAVE_BUILTIN_CLZ for the whole library instead, but that should be done in a different PR.
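
A rough sketch of what such a shim might look like (hypothetical, not part of this PR):

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>

/* Emulate __builtin_clzll() with _BitScanReverse64(). As with the GCC
 * builtin, the caller must guarantee x != 0. */
static __forceinline int clzll_msvc(unsigned long long x) {
    unsigned long index;
    _BitScanReverse64(&index, x);
    return 63 - (int)index;
}
#endif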

@gnzlbg
Contributor

gnzlbg commented Aug 10, 2018

@rkmisra could you resolve the conflicts here? This looks ready.

@rkmisra
Contributor Author

rkmisra commented Aug 11, 2018

Closing this, as #1303 supersedes it.

rkmisra closed this Aug 11, 2018