3,622 questions
0
votes
0
answers
54
views
How to change my slurm script to perform benchmark on Thin node
I'm trying to run the following script, but after a few seconds (approximately 4) the job ends without creating the output_file with the data and without creating the output and error Slurm files.
#!/bin/...
Advice
0
votes
6
replies
131
views
Why is FileReader as efficient as BufferedReader in reading 1KB chunks of data?
I was trying to read data (chars) from a large text file (~250MB) in 1KB chunks and was very surprised that reading that file using either FileReader or BufferedReader takes exactly the same time, ...
1
vote
1
answer
201
views
How to benchmark atomic<int> vs atomic<size_t>?
I have a bounded queue with a small size that definitely fits in an int. So I want to use atomic<int> instead of atomic<size_t> for indexing/counter; since int is smaller, it should be faster.
...
1
vote
0
answers
113
views
Golang benchmarks involving goroutines show higher than expected allocations when controlling the timer manually
Using go version go1.25.3 darwin/arm64.
The below implementation is a simplified version of the actual implementation.
type WaitObject struct{ c chan struct{} }
func StartNewTestObject(d time....
0
votes
1
answer
75
views
Why does MSVC AVX2 /FP:strict sometimes generate inferior (slower) code to SSE2?
I was testing various expressions of a sixth order polynomial to find the fastest possible throughput. I have stumbled upon a simple polynomial expression of length 6 that provokes poor code generation ...
1
vote
1
answer
91
views
Browser debugger shows less time taken to download a base64 over a multi-part file despite the larger file size [closed]
Today, I researched Base64 encoding versus other methods and whether to use it in a JSON API, considering the 33-37% size overhead that Base64 introduces and all sorts of related topics.
To ...
2
votes
0
answers
128
views
Why is Horner's method for evaluation faster than expected when loop unrolled for Polynomial lengths N<90?
I have spent some time trying to speed up code that uses Horner's method for evaluating modest length polynomials (N < 32). I have a solution using loop unrolling that works very well at -O2 or ...
3
votes
1
answer
152
views
BenchmarkDotNet: OutOfMemoryException when benchmarking parsing a JSON file
I'm trying to benchmark the performance of a library I've written that can parse large JSON files into both an object model and a JsonDocument. So far as I can tell I'm doing everything right, but I'...
7
votes
1
answer
342
views
How to use plain RDTSC without using asm?
I want to use RDTSC in Rust to benchmark how many ticks my function takes.
There's a built-in std::arch::x86_64::_rdtsc, alas it always translates into:
rdtsc
shl rdx, 32
or rax, rdx
...
4
votes
2
answers
223
views
Why is the generic implementation of Vector.Log so much slower than the non-generic implementations for me?
I've run some benchmarks on Math.Log, System.Numerics.Vector.Log, System.Runtime.Intrinsics.Vector128.Log, Vector256.Log and Vector512.Log and the results were pretty surprising to me. I was expecting ...
1
vote
1
answer
60
views
Looking for simple garbage collector load test
I'm looking for some code or some benchmark to roughly assess the pause times or CPU load caused by some GC in order to get a rough estimate of how efficient it is. I just want to see whether some GC ...
3
votes
1
answer
202
views
Custom hasher is faster for insert and remove, but when done together is slower, when comparing to std::collections::HashMap
I wish to benchmark various hashmaps for the <K,V> pair <u8, BoxedFnMut>, where BoxedFnMut is:
type BoxedFnMut = Box<dyn FnMut() + Send + 'static>;
To do this, I am using divan(0.1.21) ...
4
votes
1
answer
203
views
Why is a ConcurrentDictionary faster than a Dictionary in benchmark?
I have a really simple benchmark to measure and compare performance of Dictionary<string, int> and ConcurrentDictionary<string, int>:
[MemoryDiagnoser]
public class ...
0
votes
0
answers
165
views
cargo bench throws "MallocStackLogging: can't turn off malloc stack logging because it was not enabled..." on Apple M4
I'm trying to run cargo bench on my new MacBook (Apple Silicon, macOS [Sequoia version 15.5]), but I get this error:
cargo(31826) MallocStackLogging: can't turn off malloc stack logging because it was ...
2
votes
0
answers
59
views
Statistical assumptions for Criterion benchmarks
This question is somewhat specific to Rust's Criterion, but I have kept it general so that anybody with knowledge about benchmarking can help.
In my Rust codebase, I have a struct Model that is very ...
27
votes
5
answers
9k
views
Why is this code 5 times slower in C# compared to Java?
First of all we create a random binary file with 100.000.000 bytes. I used Python for this:
import random
import os
def Main():
length = 100000000
randomArray = random.randbytes(length)
...
1
vote
1
answer
101
views
Minimize noise for benchmarking in docker
I am writing a benchmarking framework for compiler-like programs.
For benchmarking, I use a docker container (for reproducibility).
However, I still measure quite a bit of noise (up to 5%!). My ...
2
votes
1
answer
197
views
What is the reason of this performance discrepancy between NumPy and Numba?
This Python 3.12.7 script with NumPy 2.2.4 and Numba 0.61.2:
import numpy as np, timeit as ti, numba as nb
def f0(a):
p0 = a[:-2]
p1 = a[1:-1]
p2 = a[2:]
return (p0 < p1) & (p1 > p2)
...
0
votes
1
answer
93
views
How do I benchmark queries in PostgreSQL? [closed]
I'm learning PostgreSQL's clustering abilities and I would like to compare the performance of the same query with the table not clustered and with the table clustered.
I tried to generate 25 million user events and ...
2
votes
0
answers
125
views
Best way to trigger lazy evaluation in PySpark and Polars for benchmarking
I'm currently benchmarking PySpark vs the growing alternative Polars.
Basically I'm writing various queries (aggregations, filtering, sorting etc.) and measure the execution time, RAM and CPU. I ...
1
vote
2
answers
157
views
C# SoA vs AoS performance
My benchmark attempts to compare AoS vs SoA for 1 000 000 items.
The result for 1 000 000 items:
| Method | Mean | Error | StdDev |
|---------------------- |---------:|-------...
0
votes
0
answers
52
views
Why does a trailing slash significantly impact performance in benchmarking with wrk or ab?
In the following FastAPI app:
from fastapi import FastAPI
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session
engine = create_engine("postgresql+psycopg://postgres:...
0
votes
0
answers
63
views
Python equivalent to the R `bench::bench_memory()` function?
I'm benchmarking some functions' memory in R and python, and had a great time getting results with the bench package in R that tracks all allocations by each call, allowing me to get the total and ...
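One possible starting point on the Python side, offered as a rough sketch and not as a true equivalent of the R function: the standard library's tracemalloc module can report current and peak traced memory around a call. The helper name `measure_peak` is hypothetical, and tracemalloc only sees allocations made through Python's own memory allocators, so memory allocated directly by C extensions may not be counted.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Run func and return (result, peak_bytes, retained_bytes).

    peak_bytes is the highest traced allocation level reached during the
    call; retained_bytes is what is still allocated once it returns
    (the result object itself is still alive at that point).
    """
    tracemalloc.start()
    try:
        baseline, _ = tracemalloc.get_traced_memory()
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        # Stop tracing so later code is not slowed down by the tracer.
        tracemalloc.stop()
    return result, peak - baseline, current - baseline


if __name__ == "__main__":
    data, peak, retained = measure_peak(bytearray, 1_000_000)
    print(f"peak: {peak} bytes, retained: {retained} bytes")
```

This reports a peak and a retained figure, not a per-allocation log; tracemalloc snapshots would be the place to look for a breakdown by source line.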
1
vote
0
answers
116
views
RISC-V instruction equivalent to ARM's DSB execution barrier instruction for benchmarking to time loads?
I am writing a RISC-V assembly program whose goal is to assess the performance of main memory, in read access only for now.
I have thought about a simple benchmark code, that would load multiple ...
-1
votes
1
answer
78
views
Best practices for running high-granularity benchmark [closed]
I am trying to run a benchmark on some family of algorithms.
I have multiple algorithms, each of them with one hyperparameter, and I want to test them with multiple data sizes. Each run takes ~60 ...
0
votes
2
answers
75
views
How can I extract the data from profvis in R?
I'm using profvis to profile my functions in R, but I want to extract specific timings for subfunctions. For example if I run
a = profvis({ dat <- data.frame(
x = rnorm(5e4),
y = ...
0
votes
1
answer
126
views
Why is my Python implementation of selection sort seemingly so fast? [closed]
I'm starting to study algorithms and their efficiency. I started with selection sort. For the sake of interest, I wanted to compare the running time of the same algorithm implementation in Python and C++...
0
votes
1
answer
177
views
DragonFly benchmark: slow on Cluster
I need help regarding Dragonfly DB, particularly benchmarking.
Here is the story: I tried benchmarking Dragonfly as a cache to replace Redis. I got the expected result when testing a single node; it ...
1
vote
0
answers
81
views
How to implement a Swift analogue of `benchmark::DoNotOptimize`?
I would like to do some (micro)benchmarking in Swift. I have been using package-benchmark for this. It comes with a blackHole helper function that forces the compiler to assume that a variable is read ...
7
votes
1
answer
177
views
Pointer chasing benchmark - unexpected lack of out of order execution?
I wanted a reliable benchmark which has a lot of cache misses, and the obvious go-to seemed to be a pointer chasing benchmark. I ended up trying google/multichase but the results don't seem to be what ...
0
votes
0
answers
19
views
Tackling variability when recording execution times while benchmarking a library under Linux
We have designed benchmarks to test execution times of some algorithms coded within a library that we are working on.
Those algorithms are mono-threaded. So the algorithms can use at most 100% of the ...
-2
votes
1
answer
101
views
Benchmarking Two C++ Implementations for Counting Pairs Divisible by 7
I have two C++ implementations that count pairs (x, y) satisfying (x + y) % 7 == 0. Method 1 skips unnecessary iterations using y += 6, while Method 2 checks every y. I performed benchmarks on ...
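A minimal sketch of the two approaches described above, written in Python for brevity and not in the question's C++. The range 1..n for both x and y and the function names are assumptions, since the excerpt does not show the original bounds. The strided version starts at the first matching y and steps by 7, which is presumably what the y += 6 skip achieves together with the loop's own increment.

```python
def count_pairs_every_y(n):
    """Method 2 analogue: test every (x, y) pair with 1 <= x, y <= n."""
    count = 0
    for x in range(1, n + 1):
        for y in range(1, n + 1):
            if (x + y) % 7 == 0:
                count += 1
    return count


def count_pairs_strided(n):
    """Method 1 analogue: visit only the y values that can match."""
    count = 0
    for x in range(1, n + 1):
        # Smallest y >= 1 with (x + y) % 7 == 0; when x is a multiple
        # of 7 the remainder is 0, so the first valid y is 7.
        first_y = (-x) % 7 or 7
        for _ in range(first_y, n + 1, 7):
            count += 1
    return count


if __name__ == "__main__":
    import timeit

    n = 500
    assert count_pairs_every_y(n) == count_pairs_strided(n)
    for fn in (count_pairs_every_y, count_pairs_strided):
        elapsed = timeit.timeit(lambda: fn(n), number=5)
        print(f"{fn.__name__}: {elapsed:.4f} s for 5 runs")
```

Checking that both functions return the same count before timing them guards against benchmarking two routines that do different work.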
7
votes
1
answer
176
views
Why is it faster to do a branch than a lookup?
I have just been trying to benchmark some PRNG code for generating hex characters - I have an input of random values buf, which I want to turn into a series of hex characters, in-place.
Code
#define ...
6
votes
0
answers
97
views
Smaller Vec allocation in serialization results in slower code
There is a binary serialization called BorshSerialize. And in its Rust implementation, when alloc is enabled, it allocates a Vector with initial capacity of 1024 before each serialization. I thought ...
4
votes
0
answers
166
views
Why is the Java Stream API filter faster than an imperative loop?
I have an important question.
I performed a benchmark using the following tool versions:
# JMH version: 1.37
# VM version: JDK 22, OpenJDK 64-Bit Server VM, 22+36-FR
# VM invoker: /home/jack/.sdkman/...
1
vote
0
answers
126
views
How to increase the frequency of the CPU from C
I am writing C code for the Raspberry Pi 4 (ARM Cortex-A72), which relies on precise timing in periods of less than 1μs. To get precise timing, I use the following algorithm:
clock_gettime(...
0
votes
1
answer
76
views
Perform a benchmarking test on different cores on a VM Ubuntu system
I want to perform a benchmarking test (BPFM, IOR, FIO & Sysbench) on an Ubuntu VM. The benchmark should use the available number of cores in steps of 2^2 (so 2, 4, 8, 16, ... up to the available ...
0
votes
0
answers
37
views
WandB for benchmark results aggregation
I'm trying to use WandB to store and view benchmarking results.
I've got a snippet of code that looks basically like this
for model in models_to_eval:
wandb.init(project="benchmarking", ...
0
votes
1
answer
98
views
Rust Conditional flag set by Cargo bench
A function which is only being used with debug assertions is deactivated by:
#[cfg(debug_assertions)]
fn some_debug_support_fn() {}
But this makes cargo bench fail to compile, as it is missing the ...
2
votes
1
answer
101
views
Getting a trace of memory address accesses using Valgrind
I have a microbenchmark which I'm using to generate memory traffic. I've profiled the application and it seems to constantly hit in L1 cache. I have a Core i5-7260U.
I want to understand the actual ...
0
votes
1
answer
110
views
Why is this benchmark not measuring any branch prediction penalty?
I came across the question Wrapping real numbers which asks for help on implementing a "wrap around" functionality for double values within a given range, stating the following constraint:
...
-6
votes
2
answers
110
views
Two empty functions produce different runtime when measured in Rust with Instant::now calls
I wrote two empty functions in Rust, hoping to use each to test either retain on vectors or filter on iterators. After writing the empty function for each case, on running the ...
1
vote
0
answers
33
views
Configure AirspeedVelocity for Python package with PyO3 and Maturin
I'm trying to set up Airspeed Velocity to use with my Python project. In other setups, such as GitHub workflows, the only build commands needed are:
"python -m pip install --upgrade pip",
...
0
votes
0
answers
92
views
How to accurately measure memory usage for iterative and recursive binary search algorithms in Java?
I am benchmarking the performance of iterative and recursive binary search algorithms in Java, specifically measuring both execution time and memory usage for different dataset sizes. However, I am ...
1
vote
1
answer
121
views
Cannot explain alloc/op in benchmark result
I have a pretty basic benchmark comparing performance of mutex vs atomic:
const (
numCalls = 1000
)
var (
wg sync.WaitGroup
)
func BenchmarkCounter(b *testing.B) {
var ...
4
votes
0
answers
367
views
Facing issues with NATS Jetstreams performance with large number of parallel consumers. Publisher throughput decreases by 50 times for 100 consumers
Overview
We have been using the nats bench CLI tool for benchmarking the performance of NATS Jetstream over different variables.
In our use-case, we require 100K ephemeral parallel consumers ...
16
votes
3
answers
1k
views
C++ implementation of a simple map slower than equivalent implementation in Java: Code/Benchmark Issue?
The goal of this research is to explore the performance differences between JIT (just-in-time compilation) and AOT (ahead-of-time compilation) strategies and to understand their respective advantages ...
0
votes
0
answers
37
views
What is the most "empty" Linux system call to benchmark against? [duplicate]
I want to benchmark some performance aspects of a Linux device driver (a loadable module). Specifically, how fast certain code paths are when they are invoked from userspace via system calls.
In ...
5
votes
2
answers
211
views
Why are inner parallel streams faster with new pools than with the commonPool for this scenario?
So I recently ran a benchmark where I compared the performance of nested streams in 3 cases:
Parallel outer stream and sequential inner stream
Parallel outer and inner streams (using parallelStream) -...
1
vote
0
answers
130
views
Spec '06 benchmarks on Gem5: Increasing stack size by one page
I am trying to run the Spec '06 benchmarks on Gem5. All of the benchmarks I've tried seem to start up normally and then output the following and stall indefinitely:
src/sim/mem_state.cc:448: info: ...