Advice
0 votes
4 replies
76 views

I have done some tests and verified that on more-"recent" Intel and AMD processors, the cache line prefetcher behaves differently when a line belongs to a base page vs a huge page. How is ...
Mani's user avatar
  • 110
0 votes
0 answers
36 views

AMD Processor Programming Reference (PPR) for AMD Family 19h Model 70h, Revision A0 Processors says: MSR0000_010B [Flush Command] (Core::X86::Msr::FLUSH_CMD) Writes to this register do not execute ...
Akon's user avatar
  • 461
Advice
0 votes
1 reply
106 views

While analyzing the Spectre vulnerability, I ran into a question about how branch prediction training works. My understanding is that the CPU accumulates prediction history for a specific conditional ...
Nikolay Isaev's user avatar
3 votes
0 answers
176 views

I was benchmarking a naive transposition and noticed a very large performance discrepancy between: a naive operation where we read data contiguously and write with a large stride; the ...
Etienne M's user avatar
  • 765
10 votes
1 answer
487 views

I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...
fuz's user avatar
  • 95k
1 vote
0 answers
43 views

I am using perf to profile workloads on my system, and I need to track the memory traffic generated by my workload on each NUMA node. Currently, I only have perf results for LLC cache misses, which ...
smz's user avatar
  • 515
0 votes
1 answer
169 views

My understanding is that PERF_COUNT_HW_REF_CPU_CYCLES should map to some counter that counts at a constant rate, as opposed to PERF_COUNT_HW_CPU_CYCLES which is affected by frequency scaling. I'd ...
Joseph Garvin's user avatar
1 vote
1 answer
209 views

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...
Kush Jenamani's user avatar
1 vote
0 answers
93 views

I tested the EPYC 9564 CPU (dual socket); the core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...
wang fuqiang's user avatar
2 votes
0 answers
77 views

I am trying to read cache events on an AMD Zen2: L1d all read accesses; L1d all write accesses; L1d read misses (not shown below); L1d write misses (not shown below). According to the perf_event_open(2) ...
onlycparra's user avatar
1 vote
0 answers
134 views

In "Performance optimization, and how to do it wrong", the author claims: the CPU can't predict more than one branch per cycle. A single if statement inside a loop is enough to stop any further ...
HesLg's user avatar
  • 58
1 vote
0 answers
106 views

In the AMD Zen5 architecture block diagram, the FP/Vector execution unit has two components, StD and IntD with arrows connecting them to the "Load/Store Queue". What are the functions of ...
Frontier_Setter's user avatar
2 votes
0 answers
92 views

According to AMD's material, access to contiguous physical addresses will be interleaved across all memory channels (if set to NPS1). When a machine has 8 memory channels and the size of memory ...
Frontier_Setter's user avatar
1 vote
1 answer
203 views

I want to cross-compile a minimal project which uses tokio-udev. The linker fails because of missing libudev: aarch64-linux-musl/bin/ld: cannot find -ludev I can cross-compile Rust projects which do ...
Twonky's user avatar
  • 814
2 votes
1 answer
256 views

I want to track the number of read/write accesses at each of the Unified Memory Controllers (UMCs) in my AMD EPYC processor (family: 0x17 and model: 0x31). The AMDuProfPcm tool, when used with the -m ...
smz's user avatar
  • 515
0 votes
1 answer
293 views

Besides choosing between linux/windows/mac and 32/64 bit, is it possible to choose the processor of the machine where the action runner will be running? In my organization we have been using actions ...
Alberto Gascón's user avatar
0 votes
1 answer
69 views

I have bought the Kria KD240 Starter Kit to get used to working with drives applications and FOC control. I am following the steps mentioned here but I can't open the Vivado project correctly. When I ...
alagal's user avatar
  • 1
5 votes
0 answers
134 views

I have a program that calls the x87 instruction fnstenv multiple times per second and with only the occasional floating point computation being executed (in periods of multiple seconds apart), I had ...
Thomas Reitmayr's user avatar
2 votes
1 answer
112 views

So I've been exploring chapter 12 of the picoCTF primer and noticed a difference between my assembly of the program and picoCTF's at the end of the main function, where the stack canary is being ...
digitale's user avatar
3 votes
1 answer
196 views

I've been optimizing some code and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...
Alex's user avatar
  • 592
2 votes
1 answer
212 views

struct StackFrame { DWORD64 address; std::string name; std::string module; std::string filename; int line_number; }; std::vector<StackFrame> GetStackTrace(CONTEXT context) { ...
Hari E's user avatar
  • 490
1 vote
0 answers
977 views

(tldr: the question itself is at the bottom) I've read that on AMD family 17h processors (Zen-Zen2, although it might be the case with the following generations as well, but I am not familiar with ...
Andriy Sultanov's user avatar
0 votes
0 answers
166 views

I'm writing a path tracer using HIPRT on Windows but I couldn't find anything to debug my application yet. I'd like to be able to execute my kernels line by line, watch kernel variables, print to ...
Tom Clabault's user avatar
2 votes
0 answers
139 views

I wrote the following test cases to bench some operations: #define BENCH_ROUNDS 1000000000 // 10**9 static volatile UINT64 _test_argument, _test_result; static _Atomic(UINT64) _test_atom; // For ...
Wilderness Ranger's user avatar
1 vote
0 answers
296 views

I have two PCs: one is an Intel i7 13700KF with 64GB RAM and the other is an AMD 3970X with the same RAM. Both PCs use an SSD as storage and both have Python 3.11 and polars 0.20.5. I run the code below: df = pl....
Hakase's user avatar
  • 341
9 votes
0 answers
304 views

Are there processors on which VPMASKMOVD generates faults for the masked-out elements? Going by the Intel Software Developer's Manual, the answer is plainly "no": Faults occur only due to ...
user555045's user avatar
0 votes
0 answers
179 views

I want to learn how "cache as RAM" works, so I found some asm files in "/src/cpu/intel/car/" from coreboot. But there are four folders containing "cache_as_ram.S". What'...
50han Bill's user avatar
0 votes
1 answer
202 views

amd_pmu_v2_handle_irq should be used to handle PMU overflow on AMD processors. When I use perf top -ag on the system, it is called heavily. But when I use the perf stat -a command, there are fewer ...
Frontier_Setter's user avatar
-1 votes
1 answer
386 views

I am using AMD's EPYC 7713 CPU. According to the specification, its maximum frequency is 3.675GHz. But when I run stress-ng (only running single threaded cpu loads), its frequency does not exceed 3....
Frontier_Setter's user avatar
2 votes
2 answers
1k views

I've recently been checking the Intel CPUs that I have access to. None of them (they're all Xeons) have the MOVDIRI or MOVDIR64B instructions, which are store instructions that bypass the caches. Are ...
user avatar
0 votes
1 answer
368 views

I am getting this error: Illegal instruction (core dumped) when calling: cv::findHomography(query_points, reference_points, cv::RANSAC, homography_ransac_threshold_, h_mask); This happens only on an AWS ...
Humam Helfawi's user avatar
0 votes
0 answers
2k views

In Software Optimization Guide for the AMD Zen4 Microarchitecture, it is written that: Write-combining is the merging of multiple memory write cycles that target locations within the address range of ...
Frontier_Setter's user avatar
2 votes
0 answers
504 views

In AMD's optimization manual, the L1 Data cache is described as follows: The L1 DC provides multiple access ports using a banked structure. The read ports are shared by three load pipes and victim ...
Frontier_Setter's user avatar
2 votes
1 answer
1k views

In Software Optimization Guide for the AMD Zen4 Microarchitecture, the terminology is explained as follows: Dispatching: Dispatching refers to the act of transferring macro ops from the front end of ...
Frontier_Setter's user avatar
4 votes
1 answer
622 views

I have encountered the same problem as this. What does L2 poison mean? I'm using AMD CPU.
Frontier_Setter's user avatar
0 votes
1 answer
351 views

In Intel's Intrinsic guide, each function has its own latency and throughput. For example, _mm256_loadu_ps: Architecture, Latency, Throughput (CPI) Alderlake, 7, 0.333333333 Icelake Intel Core, 7, 0.5 ...
Frontier_Setter's user avatar
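A rough sketch of how the two figures quoted in this excerpt can be read, assuming the "Throughput (CPI)" column is a reciprocal throughput (cycles per instruction), as its label suggests. The helper name `per_cycle` is made up for illustration and is not part of any Intel tool.

```python
def per_cycle(cpi: float) -> float:
    """Convert a reciprocal throughput (cycles per instruction) into
    instructions that can start per cycle, under the assumption above."""
    return 1.0 / cpi


# The two values quoted in the excerpt for _mm256_loadu_ps.
listed = {"Alderlake": 0.333333333, "Icelake Intel Core": 0.5}

for arch, cpi in listed.items():
    print(f"{arch}: about {per_cycle(cpi):.1f} per cycle (CPI {cpi})")
```

Under that reading, a smaller CPI value means more independent loads can be issued each cycle; the latency column is a separate figure and is not affected by this conversion.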
0 votes
0 answers
108 views

Currently using tensorflow-directml as I am training a model on AMD (RX 580). The problem is, upon model.fit() it seems to be stuck at epoch 1 with no progress. Here's my code and error: with ...
user21525821's user avatar
7 votes
0 answers
4k views

I've always happened to use Intel CPUs in Intel-chipset-based servers, and as such have used Intel's MPI and MKL for the past 20 years; that's all I kinda know. With their OneAPI I only need and use MPI, ...
ron's user avatar
  • 1,035
1 vote
0 answers
115 views

For monitoring memory bandwidth, there is pcm-memory on the Intel platform and AMDuProf on the AMD platform. How do they calculate memory bandwidth usage? Which PMUs were used? Is it using 1024 or ...
Frontier_Setter's user avatar
2 votes
0 answers
1k views

I am a student majoring in computational science. When I deal with mixed-precision projects on AMD CPUs, I find that single precision data behaves similarly to double precision data. Sometimes, single-...
Singyuk Lau's user avatar
4 votes
0 answers
101 views

I have a loop that's running slower than I expected. I measure how long it takes per collection it processes and notice it takes twice as long when I use 8 cores (overall 4x faster). There's no data ...
David's user avatar
  • 41
1 vote
0 answers
131 views

I wanted to benchmark the atomic instructions compared to the non-atomic ones, so I wrote the code that follows below. Besides benchmarking locked accesses I noticed a different aspect too that seems to ...
George Kourtis's user avatar
0 votes
1 answer
64 views

I ran into the following problem: When I initialized the kernel hypervisor, for me it is SVM and I exit from vmrun and get into my SvmExitHandler (this is the dispatcher that manages exit codes), then ...
Barbosso's user avatar
1 vote
1 answer
93 views

While looking at this zenbleed article, it was found that a randomly generated sequence of instructions and the same sequence but with randomized alignment, serialization and speculation fences added ...
vengy's user avatar
  • 2,477
2 votes
0 answers
489 views

I am trying to get familiar with AMD's SMM interface. I want to implement a simple task: check SMI_COUNT, trigger an SMI, check SMI_COUNT after the trigger. The SMI interrupt is a rare thing (I believe), so ...
Rockrid3r's user avatar
  • 331
1 vote
0 answers
119 views

I'm looking for help with an issue I'm having building Numpy against locally built blis for zen3. I've configured blis to enable threading using openmp. (it is installed and working on my machine, ...
Crispy Holiday's user avatar
-1 votes
1 answer
316 views

I work at a software company, mainly developing on Linux. For Windows development we have a couple of machines that are shared. However, a new project came up, and we need more resources on ...
wizard's user avatar
  • 155
1 vote
1 answer
236 views

I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes) and first doing some artificial benchmarking to determine what prefetch distances work well. I've ...
Bruce Merry's user avatar
2 votes
0 answers
79 views

On Intel the fixed-function performance counters can be read by setting bit 30 of ecx as well the index of the counter to read (0-4) in the bottom bits of that same register. Is something similar ...
BeeOnRope's user avatar
  • 66.6k
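A minimal sketch of the Intel selector encoding described in this excerpt: bit 30 of ecx set, with the fixed-counter index (0-4, per the question) in the low bits. The function name is hypothetical, and the sketch only builds the selector value; it does not execute RDPMC and says nothing about what AMD supports.

```python
FIXED_COUNTER_FLAG = 1 << 30  # bit 30 of ecx, as described in the question


def rdpmc_fixed_selector(index: int) -> int:
    """Build the ecx value that selects Intel fixed-function counter `index`.

    Hypothetical helper for illustration only; the 0-4 range is taken
    from the question text.
    """
    if not 0 <= index <= 4:
        raise ValueError("fixed counter index must be in 0..4")
    return FIXED_COUNTER_FLAG | index


for i in range(3):
    print(f"fixed counter {i}: ecx = {rdpmc_fixed_selector(i):#010x}")
```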
7 votes
1 answer
968 views

As a mitigation against the recent zenbleed vulnerability (https://lock.cmpxchg8b.com/zenbleed.html) it is advised to set DE_CFG[9] = 1. I have not managed to find anything on this MSR, except for Is ...
benjamin-lieser's user avatar
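For readers unfamiliar with the notation, DE_CFG[9] = 1 means setting bit 9 of the register value while leaving the other bits alone. A small sketch of just that bit arithmetic follows; the helper names are invented for illustration, and actually reading or writing the MSR (which needs privileged access) is deliberately left out.

```python
DE_CFG_BIT = 9  # the bit named in the question


def set_bit(value: int, bit: int) -> int:
    """Return `value` with `bit` set, all other bits unchanged."""
    return value | (1 << bit)


def bit_is_set(value: int, bit: int) -> bool:
    """Report whether `bit` is set in `value`."""
    return bool((value >> bit) & 1)


current = 0x5  # placeholder register contents, not a real DE_CFG reading
updated = set_bit(current, DE_CFG_BIT)
print(f"{current:#x} -> {updated:#x}, bit 9 set: {bit_is_set(updated, DE_CFG_BIT)}")
```

The OR is idempotent, so applying it to a value that already has bit 9 set changes nothing.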
