What is Continuous Benchmarking?


Continuous Benchmarking is a software development practice where members of a team benchmark their work frequently, usually each person benchmarks at least daily - leading to multiple benchmarks per day. Each benchmark is verified by an automated build to detect performance regressions as quickly as possible. Many teams find that this approach leads to significantly reduced performance regressions and allows a team to develop performant software more rapidly.

By now, everyone in the software industry is aware of Continuous Integration (CI). At a fundamental level, CI is about detecting and preventing software feature regressions before they make it to production. Similarly, Continuous Benchmarking (CB) is about detecting and preventing software performance regressions before they make it to production. For the same reasons that unit tests are run in CI for each code change, performance tests should be run in CB for each code change. This analogy is so apt, in fact, that the first paragraph of this section is just a Mad Libs version of Martin Fowler’s 2006 intro to Continuous Integration.

🐰 Performance bugs are bugs!

Benchmarking in CI

Myth: You can’t run benchmarks in CI

Most benchmarking harnesses use the system wall clock to measure latency or throughput. This is very helpful, as these are the exact metrics that we as developers care the most about. However, general-purpose CI environments are often noisy and inconsistent when measuring wall-clock time. When performing Continuous Benchmarking, this volatility adds unwanted noise into the results.

There are a few options for handling this:

  1. Bare Metal Runners
  2. Relative Continuous Benchmarking
  3. Switching to a benchmark harness that counts instructions instead of wall-clock time

By far, Bare Metal Runners are the best option in nearly every case. Bencher offers Bare Metal Runners with less than 2% variance. Compare this to GitHub Action Runners, which can see greater than 30% variance between runs. Reducing the volatility and thus the noise in your Continuous Benchmarking environment will allow you to detect ever finer performance regressions.
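To see this run-to-run variance for yourself, here is a minimal Python sketch (the busy_work function is a made-up workload, not from Bencher) that times the same work repeatedly with the wall clock and reports the spread between runs:

```python
import statistics
import time

def busy_work(n: int = 100_000) -> int:
    # Made-up CPU-bound workload standing in for a real benchmark target.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Time the same workload repeatedly using the wall clock.
samples = []
for _ in range(20):
    start = time.perf_counter()
    busy_work()
    samples.append(time.perf_counter() - start)

mean = statistics.mean(samples)
# Relative spread between the fastest and slowest run.
spread = (max(samples) - min(samples)) / mean
print(f"mean: {mean * 1e3:.3f} ms, run-to-run spread: {spread:.0%}")
```

On a quiet local machine the spread is often modest; on a shared CI runner the very same loop can swing far more, which is exactly the volatility described above.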

Performance Matters

Myth: You can’t notice 100ms of latency

It’s common to hear people claim that humans can’t perceive 100ms of latency. A Nielsen Group article on response times is often cited for this claim.

0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.

  • Jakob Nielsen, 1 Jan 1993

But that simply is not true. On some tasks, people can perceive as little as 2ms of latency. An easy way to prove this is an experiment from Dan Luu: open your terminal and run sleep 0; echo "ping" and then run sleep 0.1; echo "pong". You noticed the difference, right‽
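Dan Luu's experiment can be run verbatim; the two commands below are exactly those from the text:

```shell
# No added delay before the output.
sleep 0; echo "ping"
# 100ms of added delay before the output.
sleep 0.1; echo "pong"
```

Run the two lines back to back and the 100ms pause before "pong" is plainly perceptible.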

Another common point of confusion is the distinction between the perception of latency and human reaction times. Even though it takes around 200ms to respond to a visual stimulus, that is independent of the perception of the event itself. By analogy, you can notice that your train is two minutes late (perceived latency) even though the train ride takes two hours (reaction time).

Performance matters! Performance is a feature!

  • Every 100ms faster → 1% more conversions (Mobify, earning +$380,000/yr)
  • 50% faster → 12% more sales (AutoAnything)
  • 20% faster → 10% more conversions (Furniture Village)
  • 40% faster → 15% more sign-ups (Pinterest)
  • 850ms faster → 7% more conversions (COOK)
  • Every 1 second slower → 10% fewer users (BBC)

With the death of Moore’s Law, workloads that can run in parallel will need to be parallelized. However, most workloads need to run in series, and simply throwing more compute at the problem is quickly becoming an intractable and expensive solution.

Continuous Benchmarking is a key component to developing and maintaining performant modern software in the face of this change.

Moore's Law from https://davidwells.io/blog/rise-of-embarrassingly-parallel-serverless-compute

Continuous Benchmarking Tools

Before creating Bencher, we set out to find a tool that could:

  • Execute benchmarks on the exact same bare metal hardware both locally and in CI
  • Track benchmarks across multiple languages
  • Seamlessly ingest each language’s standard benchmark harness output
  • Be extensible for custom benchmark harness output
  • Be open source and able to self-host
  • Work with multiple CI hosts
  • Support user authentication and authorization

Unfortunately, nothing that met all of these criteria existed. See prior art for a comprehensive list of the existing benchmarking tools that we took inspiration from.

Continuous Benchmarking Outside of CI

CI is meant to be a final check, not the sole place where tests are performed. Bencher is the first Continuous Benchmarking tool to allow you to run your benchmarks on the exact same bare metal hardware both locally and in CI. This allows developers and agents to compare their local work in progress against any point in their project’s performance history.

When running on local hardware, Bencher Bare Metal allows you to keep multitasking. No need to stop everything else on your system, pull an old branch, and run a comparison.

When running in a cloud environment, Bencher Bare Metal allows you to trust the results. No worrying about noisy neighbors, throttling, or midstream host swaps.

Continuous Benchmarking in Big Tech

Tools like Bencher have been developed internally at Microsoft, Facebook (now Meta), Apple, Amazon, Netflix, and Google among countless others. As the titans of the industry, they understand the importance of monitoring performance during development and integrating these insights into the development process through Continuous Benchmarking. We built Bencher to bring Continuous Benchmarking from behind the walls of Big Tech to the open source community. For links to posts related to Continuous Benchmarking from Big Tech see prior art.

Bencher: Continuous Benchmarking

Bencher is a suite of continuous benchmarking tools. Have you ever had a performance regression impact your users? Bencher could have prevented that from happening. Bencher allows you to detect and prevent performance regressions before they merge.

  • Run: Run your benchmarks locally or in CI using the exact same bare metal runners and your favorite benchmarking tools. The bencher CLI orchestrates running your benchmarks on bare metal and stores the results.
  • Track: Track the results of your benchmarks over time. Monitor, query, and graph the results using the Bencher web console based on the source branch, testbed, benchmark, and measure.
  • Catch: Catch performance regressions locally or in CI using the exact same bare metal hardware. Bencher uses state-of-the-art, customizable analytics to detect performance regressions before they merge.
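As an illustrative sketch only (the project slug, secret name, and benchmark command are placeholders; consult the Bencher docs for the current GitHub Action and CLI flags), a minimal GitHub Actions workflow wiring the bencher CLI into CI might look like:

```yaml
name: Continuous Benchmarking
on:
  push:
    branches: main

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the bencher CLI (placeholder ref; see the Bencher docs).
      - uses: bencherdev/bencher@main
      # Run the benchmarks and send the results to Bencher.
      - run: bencher run --project my-project --token "${{ secrets.BENCHER_API_TOKEN }}" "cargo bench"
```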

For the same reasons that unit tests are run to prevent feature regressions, benchmarks should be run with Bencher to prevent performance regressions. Performance bugs are bugs!

Start catching performance regressions before they merge — try Bencher Cloud for free.

Continuous Benchmarking vs Local Benchmark Comparison

There are several benchmark harnesses that allow you to compare results locally. Local comparison is great for iterating quickly when performance tuning. However, it should not be relied on to catch performance regressions on an ongoing basis. Just as being able to run unit tests locally doesn’t obviate the need for CI, being able to run and compare benchmarks locally doesn’t obviate the need for Continuous Benchmarking.

There are several features Bencher offers that local benchmark comparison tools cannot:

  • Comparison of the same benchmark between different testbeds
  • Comparison of benchmarks across languages and harnesses
  • Collaboration and sharing of benchmark results
  • Running benchmarks on dedicated testbeds to minimize noise
  • No more copypasta

Continuous Benchmarking vs Application Performance Management (APM)

Application Performance Management (APM) is a vital tool for modern software services. However, APM is designed to be used in production. By the time a performance regression is detected, it is already impacting your customers.

Most defects end up costing more than it would have cost to prevent them. Defects are expensive when they occur, both the direct costs of fixing the defects and the indirect costs because of damaged relationships, lost business, and lost development time.

— Kent Beck, Extreme Programming Explained

There are several features Bencher offers that APM tools cannot:

  • Catch performance regressions before they merge
  • Performance changes and impacts included in code review
  • No overhead in production environments
  • Effective for on-prem deployments
  • No changes to production source code

Continuous Benchmarking vs Observability

A rose by any other name would smell as sweet. See Continuous Benchmarking vs Application Performance Management above.

Continuous Benchmarking vs Continuous Integration (CI)

Continuous Benchmarking (CB) is complementary to Continuous Integration (CI). For the same reasons that unit tests are run in CI for each code change, performance tests should be run in CB for each code change.

While unit and acceptance testing are widely embraced as standard development practices, this trend has not continued into the realm of performance testing. Currently, the common tooling drives testers towards creating throw away code and a click-and-script mentality. Treating performance testing as a first-class citizen enables the creation of better tests that cover more functionality, leading to better tooling to create and run performance tests, resulting in a test suite that is maintainable and can itself be tested.

— Thoughtworks Technology Radar, 22 May 2013

Continuous Benchmarking vs Continuous Load Testing

In order to understand the difference between Continuous Benchmarking and Continuous Load Testing, you need to understand the difference between benchmarking and load testing.

  Test Kind      Test Scope           Test Users
  Benchmarking   Function - Service   One - Many
  Load Testing   Service              Many

Benchmarking can test the performance of software from the function level (micro-benchmarks) all the way up to the service level (macro-benchmarks). Benchmarks are great for testing the performance of a particular part of your code in an isolated manner. Load testing only tests the performance of software at the service level and mocks multiple concurrent users. Load tests are great for testing the performance of the entire service under a specific load.

🍦 Imagine we wanted to track the performance of an ice-cream truck. Benchmarking could be used to measure how long it takes to scoop an ice-cream cone (micro-benchmark), and benchmarking could also be used to measure how long it takes a single customer to order, get their ice-cream, and pay (macro-benchmark). Load testing could be used to see how well the ice-cream truck serves 100 customers on a hot summer day.
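The micro vs macro distinction can be sketched in Python with the standard timeit module (both functions here are made-up stand-ins for the ice-cream analogy, not real Bencher code):

```python
import timeit

def scoop() -> int:
    # Hypothetical unit of work standing in for scooping one cone.
    return sum(f * f for f in range(10))

def serve_customer() -> int:
    # Hypothetical end-to-end flow: take the order, scoop, take payment.
    order = sum(range(100))  # take the order
    cone = scoop()           # scoop the ice cream
    return order + cone      # process the payment

# Micro-benchmark: one small function in isolation.
micro = timeit.timeit(scoop, number=10_000)
# Macro-benchmark: the whole flow a single customer sees.
macro = timeit.timeit(serve_customer, number=10_000)
print(f"micro: {micro:.4f}s, macro: {macro:.4f}s")
```

A load test, by contrast, would spin up many concurrent "customers" against the running service rather than timing a single code path.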

Track your benchmarks in CI

Have you ever had a performance regression impact your users? Bencher could have prevented that from happening with continuous benchmarking.



Published: Sat, August 12, 2023 at 4:07:00 PM UTC | Last Updated: Wed, March 27, 2024 at 7:50:00 AM UTC