
While trying to measure the impact of branch misprediction, I've noticed that there is no penalty at all for branch misprediction.

Based on the famous Stack Overflow question: Why is processing a sorted array faster than processing an unsorted array?

I wrote a simple piece of code to measure the penalty of branch misprediction:

  • Fill an array with random numbers
  • Count the numbers below 5 (should cause many mispredictions) - measure it
  • Sort the array
  • Count the numbers below 5 (should cause few mispredictions) - measure it

After running the code I got pretty much the same results for both measurements.

Tested on:

  1. Visual Studio 2017, Release (Maximum Optimization (Favor Speed) (/O2)), Windows.
  2. Linux, g++ -Ofast

Then I took the original code from the question linked above, and I still didn't get any improvement for the sorted array. Why is that? Where is the benefit of branch prediction?

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of 100 million elements
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);

    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers below 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int count_below_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_below_5;
        }
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers below 5 (before sorting): " << count_below_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers below 5 in the sorted array (and measure time)
    start = std::chrono::high_resolution_clock::now();
    count_below_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_below_5;
        }
    }
    end = std::chrono::high_resolution_clock::now();

    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers below 5 (after sorting): " << count_below_5 << std::endl;
    std::cout << "Time taken (after sorting):  " << duration_after_sort << " ns" << std::endl;

    return 0;
}
  • @NathanOliver Just run it again in Clang, and you will get different results (I did it in your link). Commented Apr 27 at 13:37
  • @NathanOliver Godbolt is really not reliable for benchmarking; it is not actually designed for that. To prove it, here is a link where I just changed a useless space in the code: I got very different timings and different conclusions ;). Commented Apr 27 at 13:37
  • @JérômeRichard: Godbolt doesn't always run on the same hardware; the AWS instance you get can be Zen 3, Ice Lake Xeon, or some more recent CPU, and not necessarily at the same frequency. At least you can control for that by having your program print a line from /proc/cpuinfo or something to bin results by hardware. But your program is also sharing the physical hardware with other AWS instances, and maybe other users of Godbolt running their stuff. So it's far from an idle system. Commented Apr 27 at 19:33
  • Basically a duplicate of Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?, which is already linked from the bottom of the original question. (I edited it in years ago, but apparently that's not an easy spot to notice, since others before you have commented about not being able to repro with modern C++ compilers.) Commented Apr 27 at 19:36
  • @PeterCordes Thanks for the link, I didn't notice it in the original question. It should probably go at the beginning of the post. Commented Apr 28 at 7:47

1 Answer

TL;DR: GCC, MSVC and Clang all generate branchless assembly code, so there is no actual branch and hence no impact from branch (mis)prediction.


On Linux, with GCC 14.2.0, on my machine, there is no impact from branch prediction because there are actually no data-dependent branches. Indeed, GCC generates branchless assembly code with SIMD instructions:

180:
    movdqu  xmm2,XMMWORD PTR [rax]   ; Load a block of items from the array `vec`
    movdqa  xmm1,XMMWORD PTR [rsp]   ; Reload xmm1 (the comparison constants) from memory
    add     rax,0x10                 ; Move the `rax` pointer to the next block
    pcmpgtd xmm1,xmm2                ; Compare the items of the block with 5; the mask goes into `xmm1`
    psubd   xmm0,xmm1                ; Increment the number of items found in `xmm0`
    cmp     rax,rbp                  ; Loop until we reach the last block
    jne     180

On Godbolt, we can see that both MSVC and Clang also generate branchless code. Here is the code produced by MSVC (it does not use SIMD instructions but cmovge instead, which should be less efficient):

$LL7@main:
    lea     eax, DWORD PTR [rdi+1]
    cmp     DWORD PTR [rsi+rcx*4], 5
    cmovge  eax, edi
    mov     edi, eax
    inc     rcx
    cmp     rcx, r14
    jb      SHORT $LL7@main

Here is the code produced by Clang (it uses SIMD instructions and unrolls the loop 4 times):

.LBB0_10:
    movdqu  xmm0, xmmword ptr [rbx + 4*r12 - 48]
    movdqu  xmm1, xmmword ptr [rbx + 4*r12 - 32]
    movdqu  xmm2, xmmword ptr [rbx + 4*r12 - 16]
    movdqu  xmm3, xmmword ptr [rbx + 4*r12]
    movdqa  xmm4, xmm5
    pcmpgtd xmm4, xmm0
    psubd   xmm6, xmm4
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm1
    psubd   xmm7, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm2
    psubd   xmm6, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm3
    psubd   xmm7, xmm0
    add     r12, 16
    cmp     r12, 100012
    jne     .LBB0_10
