While trying to measure the impact of branch misprediction, I noticed that it seems to carry no penalty at all.
Based on the famous Stack Overflow question "Why is processing a sorted array faster than processing an unsorted array?",
I wrote a simple piece of code to measure the misprediction penalty:
- Fill an array with random numbers
- Count the numbers above 5 (should cause many mispredictions) and measure the time
- Sort the array
- Count the numbers above 5 (should cause few mispredictions) and measure the time
After running the code I got pretty much the same results for both measurements.
Tested on:
- Visual Studio 2017, Release (Maximum Optimization (Favor Speed) (/O2)), Windows.
- Linux, g++ -Ofast
Then I took the original code from the question linked above, and still didn't get any improvement for the sorted array. Why is that? Where is the benefit of branch prediction?
#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of 100 million elements
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);
    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers above 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] > 5) // the branch whose prediction is being measured
        {
            ++count_above_5;
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Count of numbers above 5 (before sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers above 5 in the sorted array (and measure time)
    start = std::chrono::high_resolution_clock::now();
    count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] > 5) // same condition, now over sorted data
        {
            ++count_above_5;
        }
    }
    end = std::chrono::high_resolution_clock::now();
    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Count of numbers above 5 (after sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (after sorting): " << duration_after_sort << " ns" << std::endl;
    return 0;
}
/proc/cpuinfo or something to bin results by hardware. But your program is also sharing the physical hardware with other AWS instances, and maybe other users of Godbolt running their stuff. So it's far from an idle system.
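To make the point about binning by hardware on a noisy machine concrete, here is a minimal sketch. It assumes a Linux system whose /proc/cpuinfo contains a "model name" line (typical on x86, not guaranteed elsewhere). The helper names `parse_model_name`, `cpu_model_name` and `best_of_ns` are hypothetical, not part of any library. Taking the fastest of several runs is one common way to reduce interference from other tenants, though it does not remove it.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <fstream>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: extracts the value of the first "model name" line
// from /proc/cpuinfo-style text, or returns "unknown" if there is none.
std::string parse_model_name(const std::string& cpuinfo_text)
{
    std::istringstream in(cpuinfo_text);
    std::string line;
    while (std::getline(in, line))
    {
        if (line.rfind("model name", 0) != 0) continue;  // key must start the line
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        const auto first = line.find_first_not_of(" \t", colon + 1);
        return first == std::string::npos ? std::string{} : line.substr(first);
    }
    return "unknown";
}

// Hypothetical helper: reads /proc/cpuinfo (assumption: Linux) and returns
// the CPU model, or "unknown" if the file cannot be opened.
std::string cpu_model_name()
{
    std::ifstream file("/proc/cpuinfo");
    if (!file) return "unknown";
    std::ostringstream text;
    text << file.rdbuf();
    return parse_model_name(text.str());
}

// Hypothetical helper: runs `work` `reps` times and returns the fastest
// run in nanoseconds, which is less sensitive to noise than a single run.
template <typename F>
std::int64_t best_of_ns(F&& work, int reps)
{
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    auto best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < reps; ++r)
    {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        best = std::min<std::int64_t>(best, ns);
    }
    return best;
}
```

Each counting loop could be wrapped in a lambda, passed to `best_of_ns`, and the result printed next to `cpu_model_name()`, so that timings collected on different machines can be grouped by CPU afterwards.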