3,413 questions
2
votes
1
answer
56
views
rustc logging for the optimization stage of a build
The optimization stage of building my rust program takes a really long time. I determined this by adjusting the cargo profile settings opt-level and lto.
I want to better understand what is happening ...
Tooling
0
votes
5
replies
122
views
How can I make a custom programming language run closer to C-level performance?
I'm building a custom programming language and want performance close to C.
Currently I'm deciding between:
Generating C code and compiling with GCC/Clang
Generating LLVM IR directly
Using a JIT
My ...
3
votes
1
answer
161
views
Looking for deeper understanding of std::launder and compiler behavior
I'm trying to get a deeper understanding about the implementation of std::launder. I'll use a classical example to illustrate:
#include <cstddef>
#include <memory>
#include <new>
...
0
votes
0
answers
113
views
Can compiler merge distinct tail call sites in template-generated functions
I am writing my first interpreter and am interested in using tail calls to make branch prediction better.
Consider I have something like
template<void *Handler>
void wrapper(Cpu& cpu) {
...
Advice
1
vote
3
replies
138
views
C Compiler Optimization MUL 0x70078071?
I've been looking at the zlib1.dll that comes with Win 11 Pro and I was hoping for some assistance with the following passage:
56b: b8 71 80 07 80 mov eax,0x80078071
[570: 41 0f 42 ...
5
votes
1
answer
187
views
Optimal code generation for std::isnan in Visual Studio 2026
As far as I know the fastest assembly code for testing that a floating point variable contains Not-a-Number (std::isnan) is comparing it to itself, where not-equal will signify NaN. And as I can see, ...
3
votes
2
answers
155
views
How to get GCC to inline strcpy (string copy)
I'm using www.godbolt.org website and the x86-64 gcc 15.2 compiler selection and the -O2 compiler options for optimization but it still makes a call to strcpy in libc.
I would like this compiler to ...
5
votes
2
answers
155
views
Can the compiler choose memory ordering for atomic operations?
Recently, when working with C atomics, I've noticed that there are two variants: a "normal" version, whose order is always seq_cst, and an "explicit" version, where the programmer ...
2
votes
3
answers
260
views
Is it possible to eliminate the this pointer in a singleton?
This question is about the singleton pattern in modern C++ and one of its limitations in particular.
I can implement the singleton pattern like this:
class Logger
{
public:
static Logger& ...
5
votes
3
answers
242
views
Log2 implementation with Cygwin 64-bit seems inefficient
I noticed there was a double precision divide after all log2 calls when using Cygwin 64-bit 3.6.4 and using compiler options -O3 and -D_FILE_OFFSET_BITS=64, so I wrote a small program to illustrate it....
3
votes
0
answers
130
views
inlined strcmp() on ppc64le (sometimes) returns a wrong value when the strings are same [closed]
gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
comparing 2 identical strings with strcmp() sometimes returns a negative value. The code is compiled with -O and strcmp() seems to be inlined.
...
13
votes
3
answers
3k
views
Why don't C overflow checks use CPU flags?
Both amd64 and arm64 architecture processors have an overflow flag. However, in C, the most common method to detect whether an operation causes overflow/underflow is to make functions like these:
int ...
6
votes
0
answers
248
views
Hoist an optimizer-inserted test out of a loop
Clang's optimizer is inserting a run-time test for instruction selection into a tight loop. How can I communicate to it that the condition it's checking is unnecessary?
Details
Here's a function to ...
0
votes
0
answers
49
views
ONNX Script rewriter: how to match patterns with multiple outputs?
I am trying to implement this rewrite rule from the TASO paper with ONNX Script rewriter. However, I cannot figure out how to implement a pattern with multiple outputs X and Y.
The ONNX Script does ...
Best practices
0
votes
0
replies
34
views
How to extract nested loop features from CUDA kernels for LLM-based optimization?
Question
I am working on an experimental project where I aim to have a large language model (LLM) automatically optimize CUDA kernels’ nested loops. The key requirement is to extract static loop and ...
-2
votes
1
answer
221
views
Is the C compiler allowed to emit spaghetti assembly for the sake of optimization? How can I get the compiler to perform such an optimization? [duplicate]
So I have the following code:
float param1 = SOME_VALUE;
switch (State)
{
case A:
{
foo(param1);
statement1;
break;
}
case B:
{
bar();
...
Advice
2
votes
7
replies
150
views
determine cpu after c++ compilation with gcc?
Does anyone know if there is, in c++, any way to determine at runtime the cpu characteristics of the machine that compiled the code? For example, in gcc (which I'm using) the preprocessor variable ...
3
votes
2
answers
312
views
Why do GCC and Clang fail to auto-vectorize simple loop?
I have two functions counting the occurrences of a target char in the given input buffer. The functions vary only in how they communicate the result back to the caller; one returns the result and the ...
3
votes
1
answer
114
views
Why does the compiler give strncpy 'stringop-truncation' warning only with -O2?
With avr-gcc 14, the compiler gives this warning:
avrenv.c: In function ‘main’
avrenv.c:12:13: warning: ‘strncpy’ output may be truncated copying 255 bytes from a string of length 509 [-Wstringop-...
Best practices
2
votes
10
replies
316
views
How to tell the C compiler that data pointed to by a pointer won't be constantly modified by another thread after being passed to a function?
In C, when I pass a pointer to a function, the compiler always seems to assume that the data pointed to by that pointer might be continuously modified in another thread, even though in actual API ...
0
votes
1
answer
127
views
How to call inline functions by using dl library?
I have to use dl library for calling libapt-pkg functions. The method I need to call is pkgCacheFile::GetPkgCache() that is declared as inline. The problem is that dlsym returns error while trying to ...
4
votes
5
answers
693
views
What is special about a ternary statement instead of an if-else statement in terms of optimization?
I'm talking from a language point of view in C or C++, where the compiler sees:
return condition ? a : b;
vs:
if (condition)
return a;
else
return b;
I've tried in my code, and both of them ...
2
votes
2
answers
155
views
Why would MSVC 2022 create two idiv calls for one std::div without any optimizations?
Using CMAKE_BUILD_TYPE="Debug" my MSVC 2022 [17.4.33403.182] produced one idiv call for the quotient and an identical idiv call for the remainder. The code was simply [see here for the ...
26
votes
2
answers
4k
views
Does excessive use of [[likely]] and [[unlikely]] really degrade program performance in C++?
The C++ standard [dcl.attr.likelihood] says:
[Note 2: Excessive usage of either of these attributes is liable to result in performance degradation.
— end note]
I’m trying to understand what “...
3
votes
0
answers
134
views
How well can clang 20 infer the likelihood of branches without annotations?
I have a performance-critical C++ code base, and I want to improve (or at least measure if it's worth improving) the likelihood that clang assigns to branches, and in general understand what it's ...
0
votes
0
answers
52
views
How to build a gcc_tree_node from custom language Nodes
Nodes:
building a gcc_tree_node for a custom programming language
compile and base on C++26
the modules are available
the language using tab-block system
every keyword start with '/'
I want to ...
1
vote
1
answer
154
views
Can the compiler elide a const local copy of const& vector parameter?
Consider these two functions:
int foo(std::array<int, 10> const& v) {
auto const w = v;
int s{};
for (int i = 0; i < v.size(); ++i) {
s += (i % 2 == 0 ? v : w)[i];
...
5
votes
3
answers
380
views
How to make the optimiser treat a local function as a black box and not optimise based on its implementation?
I thought that the noinline function attribute would force the compiler to treat a local function as a black box:
__attribute__((noinline)) void touch_noinline(int&) {}
void touch_external(int&...
0
votes
1
answer
94
views
Why does MSVC AVX2 /FP:strict sometimes generate inferior (slower) code to SSE2?
I was testing various expressions of a sixth order polynomial to find the fastest possible throughput. I have stumbled upon a simple polynomial expression length 6 that provokes poor code generation ...
2
votes
0
answers
64
views
Why is sequential indexing with fixed length stride slower in Estrin's method?
Preparing to make Estrin's method vectorisable I changed from normal linear indexing of the coefficients to bitreversed and restricted it to strictly powers of 2. Neither MSVC nor ICX can see how to ...
1
vote
1
answer
134
views
How is rust able to optimize Option::is_some_and so effectively? [closed]
Looking at the codegen of a check inside a for-loop, I wanted to see if there is an optimization opportunity by outlining is_some_and, but both cases had the same codegen.
struct V {
len: Option<...
5
votes
1
answer
201
views
Why are [[no_unique_address]] members not transparently replaceable?
In the classic talk An (In-)Complete Guide to C++ Object Lifetimes by Jonathan Müller, there is a useful guideline as follows:
Q: When do I need to use std::launder?
A: When you want to re-use the ...
7
votes
1
answer
340
views
Does GCC optimize array access with __int128 indexes incorrectly?
When compiling the following code using GCC 9.3.0 with O2 optimization enabled and running it on Ubuntu 20.04 LTS, x86_64 architecture, unexpected output occurs.
#include <algorithm>
#include &...
1
vote
2
answers
201
views
GCC switch statements do not simplify on identical handling
The switch statements in the following two functions
int foo(int value) {
switch (value) {
case 0:
return 0;
case 1:
return 0;
case 2:
return 1;
}
}
int ...
1
vote
0
answers
101
views
Is a critical section protected by a semaphore, mutex, etc. implicitly volatile? [duplicate]
Say if I have an array of integers, int array[NUM_ELEMENTS];, access to it is encapsulated as setter and getter function well protected by synchronization such as semaphore, mutex, etc, do I need to ...
4
votes
1
answer
160
views
optimize computation of real part of complex product
I need (only) the real part of the product of two complex numbers. Naturally, I can code this as
real(x)*real(y) - imag(x)*imag(y);
or
real(x*y);
The latter, however, formally first computes the ...
29
votes
1
answer
4k
views
Why do C compilers still prefer PUSH over MOV for saving registers, even when MOV appears faster in llvm-mca?
I noticed that modern C compilers typically use push instructions to save caller-saved registers, rather than explicit mov + sub sequences. However, based on llvm-mca simulations, the mov approach ...
5
votes
1
answer
292
views
Why does the C compiler save registers in a noreturn function?
I mainly use clang, but I have also explored other compilers during my experiments, such as MinGW GCC and MSVC, but they all have this problem.
E:\code\test>clang -v
clang version 20.1.7
Target: ...
10
votes
3
answers
1k
views
Why doesn't the Windows C compiler reuse incoming shadow space in noreturn functions?
I mainly use Clang, but I have also explored other compilers during my experiments, such as MinGW GCC and MSVC, but they all have this problem.
cd C:\Users\Moi5t
clang -v
Output:
clang version 20.1.7
...
1
vote
2
answers
188
views
Can I tell my compiler that a floating-point value is not NaN nor +/-infinity?
I'm writing a C function float foo(float x) which manipulates a floating-point value. It so happens, that I can guarantee the function will only ever be called with finite values - neither NaN's, nor +...
0
votes
1
answer
133
views
Why does NVCC not optimize ldexpf with a constexpr power-of-two exponent into a simple fmul?
Consider the following CUDA code:
enum { p = 5 };
__device__ float adjust_mul(float x) { return x * (1 << p); }
__device__ float adjust_ldexpf(float x) { return ldexpf(x, p); }
I would expect ...
4
votes
0
answers
127
views
MSVC fixed short length polynomial evaluation curiosity - why does it not keep my coefficients array contiguous, and why subtract the absolute value?
MSVC seems to be taking the values from my array of coefficients and scattering them around in its .rdata section, not keeping them contiguous even though they're all used together. And it takes the ...
13
votes
1
answer
1k
views
Does C++23 guarantee std::launder can be omitted in placement new scenarios?
#include <iostream>
#include <new>
struct A {
int const n;
void f() {
new (this) A{2};
}
void g() {
std::cout << this->n;
}
void h() {
...
0
votes
0
answers
147
views
About JVM's tiered compilation sequence, does isolated method optimization occur before inlining?
The tiered steps provided by Oracle are:
It seems to me that...
It'd be a reasonable assumption to think that optimizations should occur with methods in isolation (detached from their call-site context),...
31
votes
2
answers
5k
views
Why do modern compilers assume malloc never fails?
In case of failure, malloc returns a null pointer.
In the following code the latest GCC and clang assume malloc never fails and simply remove the branch
#include <cstdlib>
int main() {
if (!...
0
votes
0
answers
81
views
MLIR: map an operation to an external function call
I am not sure whether this is the correct category/group to ask this question.
I see that there is a way to map an op to external function:
https://mlir.llvm.org/docs/Dialects/Linalg/
Property 5: May ...
2
votes
0
answers
104
views
FSharp inlining downcast optimization?
I am currently reading through the F# core library source code and stumbled upon a common pattern which made me wonder a little about the performance of it, and could not find anything about it by a ...
0
votes
1
answer
120
views
Code bloat after switching from Apple clang 9.1.0 to 12.0.0
I've recently switched from Apple clang 9.1.0 to 12.0.0 and I've noticed that the generated code is now somewhat bloated. Here's a little test project:
const char *s = "Hello World";
...
3
votes
2
answers
202
views
How to give C compiler freedom about return value
Say I have the following C function:
i64 my_comparator1(i64 x, i64 y)
{
if (x > y) { return 1; }
if (x < y) { return -1; }
return 0;
}
If I happen to know something about arguments that ...
1
vote
0
answers
127
views
How to make a custom LLVM analysis for custom passes in the new pass manager?
I'm trying to write a function analysis whose results are accessible by a few custom passes in LLVM that work on functions and loops in the new pass manager. Unfortunately documentation for the new ...