
While trying to measure the impact of branch misprediction, I've noticed that there is no penalty at all for branch misprediction.

Based on the famous Stack Overflow question: Why is processing a sorted array faster than processing an unsorted array?

I wrote a simple piece of code to measure the penalty of branch misprediction:

  • Fill an array with random numbers
  • Count the numbers below 5 (should cause many mispredictions) - measure it
  • Sort the array
  • Count the numbers below 5 (should cause few mispredictions) - measure it

After running the code I got pretty much the same results for both measurements.

Tested on:

  1. Visual Studio 2017, Release (Maximum Optimization (Favor Speed) (/O2)), Windows.
  2. Linux, g++ -Ofast

Then I took the original code from the question linked above, and I still didn't get any improvement for the sorted array. Why is that? Where is the benefit of branch prediction?

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of 100 million elements
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);

    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers below 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int count_below_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_below_5;
        }
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers below 5 (before sorting): " << count_below_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers below 5 in the sorted array (and measure time)
    start = std::chrono::high_resolution_clock::now();
    count_below_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_below_5;
        }
    }
    end = std::chrono::high_resolution_clock::now();

    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers below 5 (after sorting): " << count_below_5 << std::endl;
    std::cout << "Time taken (after sorting):  " << duration_after_sort << " ns" << std::endl;

    return 0;
}
  • @NathanOliver Just run it again in Clang, and you will get different results (I did it in your link). Commented Apr 27 at 13:37
  • @NathanOliver Godbolt is really not reliable for benchmarking; it is not actually designed for that. To prove it, here is a link where I just changed a useless space in the code: I got very different timings and different conclusions ;). Commented Apr 27 at 13:37
  • @JérômeRichard: Godbolt doesn't always run on the same hardware; the AWS instance you get can be Zen 3, Ice Lake Xeon, or some more recent CPU, and not necessarily at the same frequency. At least you can control for that by having your program print a line from /proc/cpuinfo or something to bin results by hardware. But your program is also sharing the physical hardware with other AWS instances, and maybe other users of Godbolt running their stuff. So it's far from an idle system. Commented Apr 27 at 19:33
  • Basically a duplicate of Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?, which is already linked from the bottom of the original question. (I edited it in years ago, but apparently that's not an easy spot to notice, since others before you have commented about not being able to repro with modern C++ compilers.) Commented Apr 27 at 19:36
  • @PeterCordes Thanks for the link, I didn't notice it in the original question. It should probably go at the beginning of the post. Commented Apr 28 at 7:47

1 Answer

TL;DR: GCC, MSVC and Clang all generate branchless assembly code, so there is no actual branch and hence no impact from branch (mis)prediction.


On Linux, with GCC 14.2.0, on my machine, there is no impact from branch prediction because there are actually no data-dependent branches. Indeed, GCC generates branchless assembly code with SIMD instructions:

180:
    movdqu  xmm2,XMMWORD PTR [rax]   ; Load a block of items from the array `vec`
    movdqa  xmm1,XMMWORD PTR [rsp]   ; Reload xmm1 (the comparison constants) from memory
    add     rax,0x10                 ; Move the `rax` pointer to the next block
    pcmpgtd xmm1,xmm2                ; Compare the items of the block with 5; the mask goes into `xmm1`
    psubd   xmm0,xmm1                ; Increment the number of items found in `xmm0`
    cmp     rax,rbp                  ; Loop until we reach the last block
    jne     180

On Godbolt, we can see that both MSVC and Clang also generate branchless code. Here is the code produced by MSVC (it does not use SIMD instructions but cmovge instead, which should be less efficient):

$LL7@main:
    lea     eax, DWORD PTR [rdi+1]
    cmp     DWORD PTR [rsi+rcx*4], 5
    cmovge  eax, edi
    mov     edi, eax
    inc     rcx
    cmp     rcx, r14
    jb      SHORT $LL7@main

Here is the code produced by Clang (it uses SIMD instructions and unrolls the loop 4 times):

.LBB0_10:
    movdqu  xmm0, xmmword ptr [rbx + 4*r12 - 48]
    movdqu  xmm1, xmmword ptr [rbx + 4*r12 - 32]
    movdqu  xmm2, xmmword ptr [rbx + 4*r12 - 16]
    movdqu  xmm3, xmmword ptr [rbx + 4*r12]
    movdqa  xmm4, xmm5
    pcmpgtd xmm4, xmm0
    psubd   xmm6, xmm4
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm1
    psubd   xmm7, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm2
    psubd   xmm6, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm3
    psubd   xmm7, xmm0
    add     r12, 16
    cmp     r12, 100012
    jne     .LBB0_10
