Newest 'amd-processor' Questions

Advice

0 votes

4 replies

76 views

How are cache line and next-page prefetchers made aware of page sizes?

I have done some tests and verified that on more recent Intel and AMD processors, the cache line prefetcher behaves differently when a line belongs to a base page vs a huge page. How is ...

Mani

110

asked Mar 27 at 21:06

0 votes

0 answers

36 views

Serialization with MSR 010B [Flush Command] in AMD CPUs for execution time measure

AMD Processor Programming Reference (PPR) for AMD Family 19h Model 70h, Revision A0 Processors says: MSR0000_010B [Flush Command] (Core::X86::Msr::FLUSH_CMD) Writes to this register do not execute ...

Akon

461

asked Jan 4 at 13:56

Advice

0 votes

1 reply

106 views

Branch predictor training depends on call site? (Spectre experiment)

While analyzing the Spectre vulnerability, I ran into a question about how branch prediction training works. My understanding is that the CPU accumulates prediction history for a specific conditional ...

Nikolay Isaev

36

asked Dec 16, 2025 at 17:33

3 votes

0 answers

176 views

The cost of non contiguous reads and writes (naive matrix transpose, power-of-2 and other sizes)

I was benchmarking a naive transposition and noticed a very large performance discrepancy between: a naive operation where we read data contiguously and write with a large stride; the ...

Etienne M

765

asked Nov 3, 2025 at 14:49

10 votes

1 answer

487 views

AVX-512 MD5 implementation: unexplained performance regression on Zen 4

I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...

fuz

95k

asked Oct 8, 2025 at 16:55

1 vote

0 answers

43 views

Tracking Per Channel Memory Traffic in AMD Zen 2 (Rome)

I am using perf to profile workloads on my system, and I need to track the memory traffic generated by my workload on each NUMA node. Currently, I only have perf results for LLC cache misses, which ...

smz

515

asked Aug 20, 2025 at 19:51

0 votes

1 answer

169 views

Why does PERF_COUNT_HW_REF_CPU_CYCLES have much higher variance on Zen5 cpus than PERF_COUNT_HW_CPU_CYCLES?

My understanding is that PERF_COUNT_HW_REF_CPU_CYCLES should map to some counter that counts at a constant rate, as opposed to PERF_COUNT_HW_CPU_CYCLES which is affected by frequency scaling. I'd ...

Joseph Garvin

22.5k

asked Jul 24, 2025 at 17:17

1 vote

1 answer

209 views

Cache line sizes for AMD Zen 3 Architecture

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...

Kush Jenamani

11

asked May 27, 2025 at 15:27

1 vote

0 answers

93 views

Why is the core-to-core-latency performance of EPYC 4 so poor in NUMA2 mode?

I tested the EPYC 9564 CPU (dual socket); the core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...

wang fuqiang

81

asked Apr 25, 2025 at 2:34

2 votes

0 answers

77 views

Why perf complains that it cannot open this L1 cache event on Zen 2?

I am trying to read cache events on an AMD Zen2: L1d all read accesses, L1d all write accesses, L1d read misses (not shown below), L1d write misses (not shown below). According to the perf_event_open(2) ...

onlycparra

845

asked Mar 20, 2025 at 5:02

1 vote

0 answers

134 views

Can Zen 4 run more than 1 branch per cycle

In "Performance optimization, and how to do it wrong" the author claims: the CPU can't predict more than one branch per cycle. A single if statement inside a loop is enough to stop any further ...

HesLg

58

asked Mar 8, 2025 at 0:54

1 vote

0 answers

106 views

What do the StD and IntD components mean in the Zen 5 CPU microarchitecture?

In the AMD Zen5 architecture block diagram, the FP/Vector execution unit has two components, StD and IntD with arrows connecting them to the "Load/Store Queue". What are the functions of ...

Frontier_Setter

839

asked Mar 4, 2025 at 3:48

2 votes

0 answers

92 views

How to verify the granularity of memory access interleaving across different channels?

According to AMD's material, access to contiguous physical addresses will be interleaved across all memory channels (if set to NPS1). When a machine has 8 memory channels and the size of memory ...

Frontier_Setter

839

asked Dec 17, 2024 at 5:31

1 vote

1 answer

203 views

Unable to cross-compile Rust project using tokio-udev

I want to cross-compile a minimal project which uses tokio-udev. The linker fails because of missing libudev: aarch64-linux-musl/bin/ld: cannot find -ludev I can cross-compile Rust projects which do ...

Twonky

814

asked Nov 11, 2024 at 13:25

2 votes

1 answer

256 views

Tracking DRAM traffic in AMD Zen 2 (Rome)

I want to track the number of read/write accesses at each of the Unified Memory Controllers (UMCs) in my AMD EPYC processor (family: 0x17 and model: 0x31). The AMDuProfPcm tool, when used with the -m ...

smz

515

asked Oct 4, 2024 at 18:37

0 votes

1 answer

293 views

Choose CPU processor (intel or AMD) of machine hosting action runner

Besides choosing between linux/windows/mac and 32/64 bit, is it possible to choose the processor of the machine where the action runner will be running? In my organization we have been using actions ...

Alberto Gascón

3

asked Sep 17, 2024 at 7:23

0 votes

1 answer

69 views

Problems opening FOC motor control app in Vivado 2023.2

I have bought the Kria KD240 Starter Kit to get used to working with drives applications and FOC control. I am following the steps mentioned here but I can't open the Vivado project correctly. When I ...

alagal

1

asked Sep 16, 2024 at 11:53

5 votes

0 answers

134 views

Repeated x87 fnstenv yields cleared instruction pointer after arbitrary time

I have a program that calls the x87 instruction fnstenv multiple times per second and with only the occasional floating point computation being executed (in periods of multiple seconds apart), I had ...

Thomas Reitmayr

51

asked Aug 24, 2024 at 13:44

2 votes

1 answer

112 views

Why does AMD processor use sub instruction instead of xor to verify the stack canary?

So I've been exploring chapter 12 of the picoCTF primer and noticed a difference between my assembly of the program and picoCTF's at the end of the main function, where the stack canary is being ...

digitale

23

asked Aug 7, 2024 at 18:16

3 votes

1 answer

196 views

Twice as slow SIMD performance without extra copy

I've been optimizing some code, and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...

Alex

592

asked Jul 19, 2024 at 8:54

2 votes

1 answer

212 views

SymFromAddr fails on AMD Machine with the error message "Attempt to access Invalid address"

struct StackFrame { DWORD64 address; std::string name; std::string module; std::string filename; int line_number; }; std::vector<StackFrame> GetStackTrace(CONTEXT context) { ...

Hari E

490

asked Mar 14, 2024 at 7:08

1 vote

0 answers

977 views

Cache inclusivity policy differences on x86 between Intel and AMD

(tldr: the question itself is at the bottom) I've read that on AMD family 17h processors (Zen-Zen2, although it might be the case with the following generations as well, but I am not familiar with ...

Andriy Sultanov

88

asked Feb 8, 2024 at 19:50

0 votes

0 answers

166 views

How to debug an HIP/HIPRT application on windows?

I'm writing a path tracer using HIPRT on Windows but I couldn't find anything to debug my application yet. I'd like to be able to execute my kernels line by line, watch kernel variables, print to ...

Tom Clabault

502

asked Feb 2, 2024 at 9:54

2 votes

0 answers

139 views

Why instructions after atomic operation make execution faster (on AMD CPU)?

I wrote the following test cases to bench some operations: #define BENCH_ROUNDS 1000000000 // 10**9 static volatile UINT64 _test_argument, _test_result; static _Atomic(UINT64) _test_atom; // For ...

Wilderness Ranger

310

asked Feb 2, 2024 at 9:29

1 vote

0 answers

296 views

Why polars on intel cpu is faster than on amd cpu?

I have two PCs: one has an Intel i7 13700KF with 64GB RAM and the other an AMD 3970X with the same RAM. Both PCs use SSD storage and both have Python 3.11 and polars 0.20.5. I run the code below: df = pl....

Hakase

341

asked Jan 30, 2024 at 2:56

9 votes

0 answers

304 views

Are there processors on which VPMASKMOVD generates faults for the masked-out elements?

Are there processors on which VPMASKMOVD generates faults for the masked-out elements? Going by the Intel Software Developer's Manual, the answer is plainly "no": Faults occur only due to ...

user555045

66k

asked Jan 28, 2024 at 15:16

0 votes

0 answers

179 views

What's the difference between those "cache_as_ram.S" in coreboot?

I want to learn how "cache as RAM" works, so I found some asm files in "/src/cpu/intel/car/" in coreboot. But there are four folders containing "cache_as_ram.S". What'...

50han Bill

1

asked Jan 13, 2024 at 9:38

0 votes

1 answer

202 views

Why amd_pmu_v2_handle_irq being called when not using perf?

amd_pmu_v2_handle_irq should be used to handle PMU overflow in AMD processor. When I use perf top -ag in the system, it is heavily called. But when I use the perf stat -a command, there are fewer ...

Frontier_Setter

839

asked Dec 22, 2023 at 11:21

-1 votes

1 answer

386 views

Why is the frequency of the CPU lower than the Max. Boost Clock?

I am using AMD's EPYC 7713 CPU. According to the specification, its maximum frequency is 3.675GHz. But when I run stress-ng (only running single threaded cpu loads), its frequency does not exceed 3....

Frontier_Setter

839

asked Dec 4, 2023 at 15:35

2 votes

2 answers

1k views

What x86 CPUs, if any, still have MOVDIRI or MOVDIR64b instructions?

I've recently been checking the Intel CPUs that I have access to. None of them (they're all Xeons) have the MOVDIRI or MOVDIR64b instructions, which are store instructions that bypass the caches. Are ...

user22797201

asked Nov 1, 2023 at 16:36

0 votes

1 answer

368 views

Illegal instruction (core dumped) in cv::findHomography

I am getting this error: Illegal instruction (core dumped) When calling: cv::findHomography(query_points, reference_points, cv::RANSAC, homography_ransac_threshold_, h_mask); This happens only on an AWS ...

Humam Helfawi

20.5k

asked Oct 27, 2023 at 20:36

0 votes

0 answers

2k views

What are the advantages of write-combine memory compared to write-back memory?

In Software Optimization Guide for the AMD Zen4 Microarchitecture, it is written that: Write-combining is the merging of multiple memory write cycles that target locations within the address range of ...

Frontier_Setter

839

asked Oct 15, 2023 at 4:35

2 votes

0 answers

504 views

What does the cache bank mean in AMD CPU?

In AMD's optimization manual, the L1 Data cache is described as follows: The L1 DC provides multiple access ports using a banked structure. The read ports are shared by three load pipes and victim ...

Frontier_Setter

839

asked Oct 13, 2023 at 11:25

2 votes

1 answer

1k views

What's the difference between dispatching and issuing in CPU pipeline

In Software Optimization Guide for the AMD Zen4 Microarchitecture, the terminology is explained as follows: Dispatching: Dispatching refers to the act of transferring macro ops from the front end of ...

Frontier_Setter

839

asked Oct 13, 2023 at 9:36

4 votes

1 answer

622 views

What does L2 poison mean in CPU?

I have encountered the same problem as this. What does L2 poison mean? I'm using AMD CPU.

Frontier_Setter

839

asked Oct 7, 2023 at 3:29

0 votes

1 answer

351 views

How to test the latency and throughput of an intrinsic function?

In Intel's Intrinsic guide, each function has its own latency and throughput. For example, _mm256_loadu_ps: Architecture, Latency, Throughput (CPI): Alderlake, 7, 0.333333333; Icelake Intel Core, 7, 0.5 ...

Frontier_Setter

839

asked Sep 26, 2023 at 13:20

0 votes

0 answers

108 views

model.fit() stopping halfway on 1 epoch using tensorflow-directml. What to do?

Currently using tensorflow-directml as I am training a model on AMD (RX 580). The problem is, upon model.fit() it seems to be stuck at epoch 1 with no progress. Here's my code and error: with ...

user21525821

45

asked Sep 20, 2023 at 8:14

7 votes

0 answers

4k views

Intel OneAPI MPI MKL with AMD, is there an AMD flavor?

I've always happened to use Intel CPUs in Intel-chipset-based servers, and as such have used Intel's MPI and MKL for the past 20 years; that's all I kinda know. With their OneAPI I only need and use MPI, ...

ron

1,035

asked Sep 19, 2023 at 13:59

1 vote

0 answers

115 views

How do different monitoring tools calculate memory bandwidth?

For monitoring memory bandwidth, there is pcm-memory on the Intel platform and AMDuProf on the AMD platform. How do they calculate memory bandwidth usage? Which PMUs were used? Is it using 1024 or ...

Frontier_Setter

839

asked Sep 8, 2023 at 13:19

2 votes

0 answers

1k views

Difference of floating arithmetics on AMD CPUs and Intel CPUs

I am a student majoring in computational science. When I deal with mixed-precision projects on AMD CPUs, I find that single precision data behaves similarly to double precision data. Sometimes, single-...

Singyuk Lau

21

asked Sep 6, 2023 at 3:53

4 votes

0 answers

101 views

Use perf to see if I'm write bound?

I have a loop that's running slower than I expected. I measure how long it takes per collection it processes and notice it takes twice as long when I use 8 cores (overall 4x faster). There's no data ...

David

41

asked Aug 28, 2023 at 19:04

1 vote

0 answers

131 views

Ryzen AMD x86_64 increment for 64 bits on memory runs 8 times faster than 8,16 or 32 bit increment

I wanted to benchmark the atomic instructions compared to the non-atomic ones, so I wrote the code that follows below. Besides benchmarking locked accesses I noticed a different aspect too that seems to ...

George Kourtis

2,634

asked Aug 25, 2023 at 16:20

0 votes

1 answer

64 views

How can I use kernel functions in SVM root(execute) mode?

I ran into the following problem: When I initialized the kernel hypervisor, for me it is SVM and I exit from vmrun and get into my SvmExitHandler (this is the dispatcher that manages exit codes), then ...

Barbosso

11

asked Aug 17, 2023 at 13:10

1 vote

1 answer

93 views

Assembly instructions showing how zenbleed was found

While looking at this zenbleed article, it was found that a randomly generated sequence of instructions and the same sequence but with randomized alignment, serialization and speculation fences added ...

vengy

2,477

asked Aug 13, 2023 at 17:30

2 votes

0 answers

489 views

Obtaining SMI_COUNT on amd cpu

I am trying to get familiar with AMD's interface of SMM. I want to implement a simple task: check SMI_COUNT, trigger an SMI, check SMI_COUNT after the trigger. The SMI interrupt is a rare thing (I believe), so ...

Rockrid3r

331

asked Aug 11, 2023 at 19:26

1 vote

0 answers

119 views

numpy built with locally built blis does not use multithreading

I'm looking for help with an issue I'm having building Numpy against locally built blis for zen3. I've configured blis to enable threading using openmp. (it is installed and working on my machine, ...

Crispy Holiday

472

asked Aug 10, 2023 at 15:06

-1 votes

1 answer

316 views

Windows 10 nested virtualization on AMD CPU

I work at a software company, mainly developing on Linux. For Windows development we have a couple of shared machines. However, a new project came up, and we need more resources on ...

wizard

155

asked Aug 10, 2023 at 13:46

1 vote

1 answer

236 views

Not getting any cache-pollution benefit from PREFETCHNTA on Zen 3

I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes) and first doing some artificial benchmarking to determine what prefetch distances work well. I've ...

Bruce Merry

790

asked Aug 7, 2023 at 9:35

2 votes

0 answers

79 views

Can rdpmc be used to read the fixed-function counters on AMD?

On Intel the fixed-function performance counters can be read by setting bit 30 of ecx as well as the index of the counter to read (0-4) in the bottom bits of that same register. Is something similar ...

BeeOnRope

66.6k

asked Aug 3, 2023 at 23:52

7 votes

1 answer

968 views

AMD DE_CFG[9] documentation

As a mitigation against the recent zenbleed vulnerability (https://lock.cmpxchg8b.com/zenbleed.html) it is advised to set DE_CFG[9] = 1. I have not managed to find anything on this MSR, except for Is ...

benjamin-lieser

1,888

asked Jul 25, 2023 at 13:10
