Newest 'cpu' Questions

1 vote

0 answers

24 views

IntelliJ Ultimate edition V2025.3: "Profiler" does not exist in settings

I have IntelliJ Ultimate edition V2025.3, but "Profiler" does not exist in Settings/Preferences > Build, Execution, Deployment > Java Profiler. I have tried the below option as well, no ...

Hoda Alemi

91

asked yesterday

Advice

1 vote

2 replies

139 views

How the Computer Handles Interrupts

What is the difference between an interrupt and a context switch? I understand the concept of an interrupt and how it occurs. However, I'm digging deeper into the topic. I studied Computer ...

Gabriele

11

asked Nov 8 at 19:25

3 votes

1 answer

154 views

How to catch EXCEPTION_PRIV_INSTRUCTION from RDPMC directly in Assembly (and without SEH)?

I'm experimenting with measuring CPU instruction latency and throughput on P and E cores using RDPMC on Win 11, something like this: MOV ECX, 0x40000000 ; Instructions Counter RDPMC ; Read ...

Andrey Dmitriev

179

asked Oct 21 at 18:37

0 votes

1 answer

71 views

Cache Allocation Technology in 13th Generation Core i9 13900E Intel CPU [closed]

I am trying to evaluate the impact of Cache Allocation Technology on my CPU. However, when I use either lscpu to see whether my CPU supports it, or cpuid -l 0x10, the output is false. How is this possible? How ...

Ali Hosseini

1

asked Oct 10 at 12:38

1 vote

1 answer

106 views

Randomness instructions vs syscalls [closed]

I've been digging into the idea of "true" randomness, and I've noticed that modern CPUs support instructions for generating randomness. x64 has the RDRAND instruction, while ARM has RNDR (I'm not ...

freakish

57k

asked Sep 29 at 8:00

1 vote

1 answer

108 views

Is CPU multithreading affected by divergence?

Building on this question here. The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...

bigcodeszzer

960

asked Sep 18 at 1:37

0 votes

1 answer

407 views

How to handle "Could not initialize NNPACK! Reason: Unsupported hardware" warning in PyTorch / Silero VAD on cloud CPU?

I’m running Silero VAD (via PyTorch + torchaudio) on a Linode cloud instance (2 dedicated CPUs, 4 GB RAM). When I process 10-minute audio chunks, I always get repeated warnings like this and it doesn'...

Uktamjon

11

asked Sep 15 at 14:16

7 votes

1 answer

228 views

Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled). My test loop is ...

Andrey Dmitriev

179

asked Sep 12 at 9:26

2 votes

0 answers

74 views

Need to do CPU profiling of JRuby application

I need to do CPU profiling for a JRuby application (JRuby version: 1.7.20.1-8), which uses Ruby version 1.9.3. I tried using the default profiler but I am getting the error below due to a version compatibility issue ...

maulik trapasiya

745

asked Sep 7 at 18:30

0 votes

1 answer

58 views

Fargate CloudWatch CPU Utilisation differs from docker stats

Looking at the CPUUtilized CloudWatch metric for my Fargate service, it's showing max CPU units used as 1040 over the past 4 weeks, using a sampling period of 1 minute. I have 4 vCPUs provisioned to ...

Seanf123

1

asked Sep 7 at 17:41

0 votes

1 answer

178 views

Performance regression in a Kubernetes deployment that does not occur locally [closed]

I have a docker image and an EC2. When I run this image on my EC2, it takes x seconds to finish. When I run the app natively, it also takes x seconds. But if I deploy the exact image in a container in ...

wildcat

81

asked Sep 1 at 17:50

2 votes

0 answers

220 views

Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?

I am measuring the latency of instructions. For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...

Zack Light

362

asked Aug 22 at 5:35

0 votes

0 answers

75 views

Why must memory addresses be aligned?

Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memories ...

LEE LUNA

1

asked Jul 8 at 9:39

-3 votes

1 answer

114 views

Understanding when a hazard in MIPS occurs

I have a question regarding these two instructions: lw r2, 10(r1) lw r1, 10(r2) Is there a hazard here? Do I need stalls between the two of them? I want to know if any kind of hazard happens here. I ...

mer mer

17

asked Jun 28 at 15:34

1 vote

0 answers

44 views

How to optimize CPU tensor slicing and asynchronous transfer to the GPU?

My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...

Ponytail

11

asked Jun 19 at 16:19

1 vote

0 answers

86 views

popcnt instruction not as fast as loop on Core Ultra 155H [duplicate]

I think the title says it all: I have implemented a popcnt function that counts bits as a loop with shifts, and one with inline asm using the actual CPU instruction. This is my C code: #define ...

newbee.a

10

asked Jun 17 at 10:25

2 votes

1 answer

135 views

CPU cache invalidation control from application - clear cache store queues (?) for x86/x64 architectures (Invalidate data after read, skip write-back)

We have some multimedia processing applications designed as a set of filters for processing data buffers. If temporal data in between filters is not very large and can fit in L1 or L2/L3 caches - the ...

DTL2020

101

asked May 22 at 10:37

1 vote

0 answers

77 views

How to analyze the microarchitecture resource requirements based on the trace generated by program execution?

I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...

Gerrie

455

asked May 19 at 12:26

0 votes

0 answers

104 views

Mutex Implementations and Memory Fences in C

I have been writing my own x86 32-bit operating system for the past month or so. My system uses just one core. Anyway, I have been reading a lot about memory fences, CPU optimizations, and compiler ...

c.abate

442

asked May 4 at 7:27

0 votes

0 answers

51 views

XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed

I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...

Mxneeb

19

asked May 2 at 16:17

1 vote

1 answer

283 views

Trying to get the CPU temperature using several libraries returns wrong results

I want to get the CPU temperature using Python code. I’m using Windows 11 24H2 and Python 3.10.6. I’ve already tried using WinTmp.CPU_Temp(): import WinTmp print(WinTmp.CPU_Temp()) >>> 0.0 ...

Tim Ryzikov

11

asked May 1 at 15:27

0 votes

1 answer

177 views

Linux UIO IRQ related periodic CPU usage

I have an Intel Arria 10 SoC FPGA system with 5.4.104-lts Linux built with Yocto 3.3.1 and Poky. The installed FPGA image is doing nothing more than generating interrupts to a UIO device, 50 times a second. ...

yepp

1

asked Apr 17 at 8:29

0 votes

0 answers

62 views

How to fix CPU feature error when running Next.js project on Ubuntu server?

The Docker Compose project only returns this error in the logs and no more details, and even the twa process stops and stays on the first page, which is the splash-screen, and the process does not ...

Ali Ghorbani

1

asked Apr 17 at 0:23

2 votes

1 answer

117 views

Why does VPERM2I128/_mm256_permute2x128_si256 (and also FP variants) not exist in AVX512 instruction set?

It could operate identically on both 256-bit halves of a 512-bit AVX512 register, like the identical operation on 128-bit lanes of 256-bit registers in AVX/AVX2. Are there any technical reasons?

Akon

481

asked Apr 13 at 5:03

1 vote

1 answer

289 views

To understand how multithreading works in a Kubernetes pod

I have a multithreaded Spring Boot microservice running in a Kubernetes pod with a CPU limit of 1 (1000m). Does this mean only one CPU core is used to run all my threads one by one, or can multiple ...

jashan khangura

43

asked Apr 7 at 10:40

0 votes

1 answer

115 views

Execution stages in a superscalar microarchitecture

In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated that (under Multiple issue - Superscalar) the fetch and decode/dispatch stages must be enhanced so they can decode ...

Rishi

41

asked Mar 27 at 9:33

1 vote

2 answers

429 views

How to get processor information with Delphi using no third party units?

I need to get the processor name using Delphi. Nothing fancy, I just need to retrieve what Windows System > About shows; in the example below, I want to retrieve the '13th Gen Intel(R) Core(TM) i9-...

delphirules

7,780

asked Mar 26 at 12:00

-4 votes

1 answer

152 views

How SIMD vs SIMT handle divergence [closed]

What exactly happens at the hardware level when a divergence occurs in SIMD and SIMT architectures, and how does each handle the execution of different instruction paths? I found this question, but ...

Rishi

41

asked Mar 24 at 4:29

2 votes

1 answer

124 views

Context switching in hardware threads

In Hyper-Threading (or SMT), when two threads of a CPU core get swapped in and out, does a context switch occur? Would it be called a context switch? If not, what is the terminology for it?

Rishi

41

asked Mar 23 at 4:47

1 vote

2 answers

133 views

Why does each DRAM chip have to contribute 8 bits to the 64-bit bus width in parallel, instead of a single chip contributing all 64 bits?

Okay, my question is probably dumb, but I can't find any answers that correct me. I learned that in DDR4 (let's say the stick has 8 chips) each chip contributes 8 bits in parallel to the 64-bit bus width. ...

Rishi

41

asked Mar 21 at 4:18

6 votes

1 answer

206 views

How does the latency of FP division and sqrt vary with input data, or is it just type?

I have recently been looking into the latency and throughput of CPU instructions and have even written some benchmarks to experiment. However, I am struggling to understand how to properly benchmark ...

mihai145

75

asked Mar 11 at 19:36

0 votes

2 answers

253 views

How to wait until the CPU usage drops below 60% in VBA?

The following code is used for measuring CPU % usage. Public Sub Macro1() Dim strComputer As String Dim objWMIService As Object Dim colItems As Object Dim objItem As Object strComputer = ".&...

Kram Kramer

121

asked Mar 10 at 7:44

0 votes

1 answer

118 views

Raspberry Pi 5 Automatically Adjust Virtual Environment & CPU Cores Without Rebooting

I'm configuring .bashrc on my Raspberry Pi 5 to automatically activate a virtual environment and limit the CPU cores from 4 to 1 when I navigate to a specific directory. When I move to a different ...

이정환

11

asked Mar 5 at 7:33

0 votes

1 answer

184 views

Get-Counter not working on certain servers to get average CPU Percent Utilization

This is my code: (Get-Counter '\Processor(_Total)\% Processor Time').CounterSamples.CookedValue I am trying to receive the average CPU Utilization with Get-Counter but every time I try I get this ...

mimi m

71

asked Feb 28 at 19:39

0 votes

0 answers

49 views

Using Jupyter notebook online doesn't use any CPU?

Apologies for the very primitive question. I am using the online version of Jupyter notebook for some programming assignments because I have only an old and slow Chromebook. I did not want to download ...

Meep

413

asked Feb 23 at 11:27

0 votes

1 answer

110 views

Running test on Rocket core CPU - global variable initialized to 0 is unsuccessful, output wrong value instead

While benchmarking my Rocket core CPU, I encountered a failed CoreMark run. After some debugging, I narrowed the issue down to unsuccessful global initialization to 0. In CoreMark, it will ...

Jasminy

119

asked Feb 21 at 10:06

1 vote

0 answers

151 views

Programmatically get CPU utilization of the process in % on macOS

I need to get % CPU currently used by my process on macOS. I expect it to be calculated this way, and it works on Windows: (currentAppTime - lastTrackedAppTime) * 100% / (currentSysTime - ...

Bibasmall

73

asked Feb 19 at 11:35

1 vote

1 answer

83 views

Cache Effects in Statically Compiled Binaries: Unexpected Cache Misses

I have a simple Hello World program written in C, which I statically compiled using: gcc -static -fno-pie -o hello{1|2} hello.c. I expected that executing these two binaries would exhibit cache ...

Khrn

354

asked Feb 5 at 7:43

1 vote

1 answer

128 views

What do the letters in port usage on uops.info mean?

What do the letters in the ports of the uops.info table mean? For example ADD (R64, R64) lists 1*p0156B at ports. The documentation says 1*p0156 means one microinstruction can be executed at ports 0, ...

asdfldsfdfjjfddjf

501

asked Jan 29 at 15:39

0 votes

0 answers

240 views

Created TensorFlow Lite XNNPACK delegate for CPU - ('--log-level=1') doesn't work

A simple Python script (Selenium + ChromeDriver): # import the By class, which allows you to choose how to search for an element from selenium.webdriver.common.by import By # initialize the browser ...

Sergey Saz

1

asked Jan 29 at 14:02

-1 votes

1 answer

143 views

If cache invalidation happens every time memory mappings change, why not opt for VIVT?

As far as I know, L1 is VIPT for at least Intel chips. VIVT caches don't depend on address translation, so they can fully operate in parallel with TLB lookup. VIPT can also achieve some parallelism by ...

Devashish

193

asked Jan 21 at 6:50

0 votes

0 answers

113 views

SDL CPU rendering project, rendering error when resizing window: Window surface is invalid

I was working on a CPU-only rendering project with SDL in C. I implemented very good error handling and I get this error when I try to resize the window, "ERROR: SDL Error in render thread: ...

Tejas Patil

11

asked Jan 20 at 12:25

1 vote

0 answers

126 views

How to increase the frequency of the CPU from C

I am writing C code for the Raspberry Pi 4 (ARM Cortex-A72), which relies on precise timing in periods of less than 1μs. To get precise timing, I use the following algorithm: clock_gettime(...

Pygmalion

921

asked Jan 18 at 16:59

-1 votes

1 answer

98 views

Pod restart issue in java based micro-service architecture

There were 2 pods running in my micro-service; both of them got restarted with the Kubernetes reason OOM killed. (The above dashboard uses the following query->sum(0,...

Yash Arora

1

asked Jan 18 at 16:21

0 votes

1 answer

76 views

Perform a benchmarking test on different cores on a VM Ubuntu system

I want to perform a benchmarking test (BPFM, IOR, FIO & Sysbench) on an Ubuntu VM. The benchmark should use the available amount of cores in steps of 2^2 (So 2, 4, 8, 16, ... up to the available ...

JulianW

1

asked Jan 18 at 16:10

0 votes

1 answer

118 views

Why is my AI training on GPU a lot slower than on CPU?

I'm currently training my simple prediction AI, but my GPU is training at 40 s per epoch while my CPU is training at 9 s per epoch. My CPU is an i7-4720HQ and my GPU is an Nvidia 950M. This is my code: `import ...

Vio Octavio

1

asked Jan 16 at 15:11

0 votes

1 answer

233 views

How to Update Clock Seconds in SwiftUI Without Re-rendering the Entire View?

I’m building a SwiftUI app where I display the current time with seconds. I use the .numericText transition for the text to add a smooth animation whenever the seconds change. However, I’ve noticed a ...

user1569766

25

asked Jan 15 at 0:41

0 votes

2 answers

94 views

platform-tools\adb.exe - High CPU usage on server (Windows)

Using ADB in a Java application to monitor Android device status every three seconds. Eight adb commands are used: adb shell settings get global airplane_mode_on adb shell settings get system ...

rejdrouin

101

asked Jan 1 at 21:41

2 votes

0 answers

129 views

Matrix multiply fastest with -O0 [duplicate]

I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function: void dgemm_naive(const int M, const int N, const int K, const double alpha, const double *A, const int lda, ...

ligro

29

asked Jan 1 at 18:31

2 votes

1 answer

129 views

Is there a way to get node level information in kubernetes pods?

I need low-level information about the node, like the number of cores, core ID and other things known to the kubelet, from within a pod running on the node. How do I get this?

imawful

135

asked Jan 1 at 15:19
