4,715 questions
1 vote, 0 answers, 24 views
IntelliJ Ultimate edition V2025.3: "Profiler" does not exist in Settings
I have IntelliJ Ultimate edition V2025.3, and "Profiler" does not exist in Settings/Preferences > Build, Execution, Deployment > Java Profiler.
I have tried the option below as well, no ...
Advice
1 vote, 2 replies, 139 views
How the Computer Handles Interrupts
What is the difference between an interrupt and a context switch?
I understand the concept of an interrupt and how it occurs. However, I'm digging deeper into the topic.
I studied Computer ...
3 votes, 1 answer, 154 views
How to catch EXCEPTION_PRIV_INSTRUCTION from RDPMC directly in Assembly (and without SEH)?
I'm experimenting with measuring CPU instruction latency and throughput on P and E cores using RDPMC on Win 11, something like this:
MOV ECX, 0x40000000 ; Instructions Counter
RDPMC ; Read ...
0 votes, 1 answer, 71 views
Cache Allocation Technology in 13th Generation Core i9 13900E Intel CPU [closed]
I am trying to apply Cache Allocation Technology on my CPU and measure its impact. However, when I use either lscpu or cpuid -l 0x10 to check whether my CPU supports it, the output is false.
How is this possible?
How ...
1 vote, 1 answer, 106 views
Randomness instructions vs syscalls [closed]
I've been digging into the idea of "true" randomness, and I've noticed that modern CPUs support instructions for generating random numbers. x86-64 has the RDRAND instruction, while ARM has RNDR (I'm not ...
1 vote, 1 answer, 108 views
Is CPU multithreading affected by divergence?
Building on this question here
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
0 votes, 1 answer, 407 views
How to handle "Could not initialize NNPACK! Reason: Unsupported hardware" warning in PyTorch / Silero VAD on cloud CPU?
I’m running Silero VAD (via PyTorch + torchaudio) on a Linode cloud instance (2 dedicated CPUs, 4 GB RAM). When I process 10-minute audio chunks, I always get repeated warnings like this and it doesn'...
7 votes, 1 answer, 228 views
Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?
I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper-Threading is enabled).
My test loop is ...
2 votes, 0 answers, 74 views
Need to do CPU profiling of a JRuby application
I need to do CPU profiling for a JRuby application (JRuby version 1.7.20.1-8) that uses Ruby version 1.9.3.
I tried using the default profiler but got the error below due to a version compatibility issue ...
0 votes, 1 answer, 58 views
Fargate CloudWatch CPU Utilisation differs from docker stats
Looking at the CPUUtilized CloudWatch metric for my Fargate service, it shows max CPU units used as 1040 over the past 4 weeks, using a sampling period of 1 minute. I have 4 vCPUs provisioned to ...
0 votes, 1 answer, 178 views
Performance regression in a Kubernetes deployment that does not occur locally [closed]
I have a Docker image and an EC2 instance. When I run this image on my EC2 instance, it takes x seconds to finish. When I run the app natively, it also takes x seconds.
But if I deploy the exact image in a container in ...
2 votes, 0 answers, 220 views
Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?
I am measuring the latency of instructions.
For 64-bit primitives, integer division usually takes about 25 cycles each on my 2.3 GHz DigitalOcean vCPU, while floating point division takes about 10 ...
0 votes, 0 answers, 75 views
Why must memory addresses be aligned?
Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memory ...
-3 votes, 1 answer, 114 views
Understanding when a hazard in MIPS occurs
I have a question regarding these two instructions:
lw r2, 10(r1)
lw r1, 10(r2)
Is there a hazard here? Do I need stalls between the two of them?
I want to know whether any kind of hazard occurs here. I ...
1 vote, 0 answers, 44 views
How to optimize CPU tensor slicing and asynchronous transfer to the GPU?
My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...
1 vote, 0 answers, 86 views
popcnt instruction not as fast as loop on Core Ultra 155H [duplicate]
I think the title says it all: I have implemented a popcnt function in two ways, as a loop with shifts and with inline asm using the actual CPU instruction.
This is my C code:
#define ...
2 votes, 1 answer, 135 views
CPU cache invalidation control from application - clear cache store queues (?) for x86/x64 architectures (Invalidate data after read, skip write-back)
We have some multimedia processing applications designed as a set of filters for processing data buffers. If temporal data in between filters is not very large and can fit in L1 or L2/L3 caches - the ...
1 vote, 0 answers, 77 views
How to analyze the microarchitecture resource requirements based on the trace generated by program execution?
I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...
0 votes, 0 answers, 104 views
Mutex Implementations and Memory Fences in C
I have been writing my own x86 32-bit operating system for the past month or so. My system uses just one core.
Anyway, I have been reading a lot about memory fences, CPU optimizations, and compiler ...
0 votes, 0 answers, 51 views
XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed
I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...
1 vote, 1 answer, 283 views
Trying to get the CPU temperature using several libraries returns wrong results
I want to get the CPU temperature using Python code. I’m using Windows 11 24H2 and Python 3.10.6. I’ve already tried using WinTmp.CPU_Temp():
import WinTmp
print(WinTmp.CPU_Temp())
>>> 0.0
...
0 votes, 1 answer, 177 views
Linux UIO IRQ related periodic CPU usage
I have an Intel Arria 10 SoC FPGA system with 5.4.104-lts Linux built with Yocto 3.3.1 and Poky.
The installed FPGA image does nothing more than generate interrupts to a UIO device, 50 times a second.
...
0 votes, 0 answers, 62 views
How to fix CPU feature error when running a Next.js project on an Ubuntu server?
The Docker Compose project returns only this error in the logs, with no further details. The twa process also stops and stays on the first page, which is the splash screen, and the process does not ...
2 votes, 1 answer, 117 views
Why does VPERM2I128/_mm256_permute2x128_si256 (and also FP variants) not exist in AVX512 instruction set?
It could operate identically on both 256-bit halves of a 512-bit AVX512 register, just as AVX/AVX2 instructions operate identically on the 128-bit lanes of 256-bit registers. Are there any technical reasons?
1 vote, 1 answer, 289 views
Understanding how multithreading works in a Kubernetes pod
I have a multithreaded Spring Boot microservice running in a Kubernetes pod with a CPU limit of 1 (1000m). Does this mean only one CPU core is used to run all my threads one by one, or can multiple ...
0 votes, 1 answer, 115 views
Execution stages in a superscalar microarchitecture
In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated that (under Multiple issue - Superscalar)
the fetch and decode/dispatch stages must be enhanced so they can decode ...
1 vote, 2 answers, 429 views
How to get processor information with Delphi using no third party units?
I need to get the processor name using Delphi. Nothing fancy, I just need to retrieve what Windows System > About shows; in the example below, I want to retrieve the '13th Gen Intel(R) Core(TM) i9-...
-4 votes, 1 answer, 152 views
How do SIMD and SIMT handle divergence? [closed]
What exactly happens at the hardware level when a divergence occurs in SIMD and SIMT architectures, and how does each handle the execution of different instruction paths?
I found this question, but ...
2 votes, 1 answer, 124 views
Context switching in hardware threads
In Hyper-Threading (or SMT), when the two threads of a CPU core get swapped in and out, does a context switch occur?
Would it be called a context switch? If not, what is the term for it?
1 vote, 2 answers, 133 views
Why does each DRAM chip have to contribute 8 bits to the 64-bit bus width in parallel, instead of a single chip contributing all 64 bits?
Okay, my question is probably dumb, but I can't find any answers that correct me.
I learned that in DDR4 (let's say the stick has 8 chips), each chip contributes 8 bits in parallel to the 64-bit bus width.
...
6 votes, 1 answer, 206 views
How does the latency of FP division and sqrt vary with input data, or does it depend only on the type?
I have recently been looking into the latency and throughput of CPU instructions and have even written some benchmarks to experiment.
However, I am struggling to understand how to properly benchmark ...
0 votes, 2 answers, 253 views
How to wait until the CPU usage drops below 60% in VBA?
The following code is used to measure CPU usage in percent.
Public Sub Macro1()
Dim strComputer As String
Dim objWMIService As Object
Dim colItems As Object
Dim objItem As Object
strComputer = ".&...
0 votes, 1 answer, 118 views
Raspberry Pi 5 Automatically Adjust Virtual Environment & CPU Cores Without Rebooting
I'm configuring .bashrc on my Raspberry Pi 5 to automatically activate a virtual environment and limit the CPU cores from 4 to 1 when I navigate to a specific directory. When I move to a different ...
0 votes, 1 answer, 184 views
Get-Counter not working on certain servers to get average CPU Percent Utilization
This is my code:
(Get-Counter '\Processor(_Total)\% Processor Time').CounterSamples.CookedValue
I am trying to get the average CPU utilization with Get-Counter, but every time I try I get this ...
0 votes, 0 answers, 49 views
Using Jupyter Notebook online doesn't use any CPU?
Apologies for the very primitive question. I am using the online version of Jupyter Notebook for some programming assignments because I only have an old and slow Chromebook; I did not want to download ...
0 votes, 1 answer, 110 views
Running test on Rocket core CPU: global variable initialized to 0 is unsuccessful, outputs wrong value instead
While benchmarking my Rocket core CPU, I encountered a failed CoreMark run. After some debugging, I narrowed the issue down to unsuccessful initialization of a global variable to 0. In CoreMark, it will ...
1 vote, 0 answers, 151 views
Programmatically get CPU utilization of the process in % on macOS
I need to get % CPU currently used by my process on macOS. I expect it to be calculated this way, and it works on Windows:
(currentAppTime - lastTrackedAppTime) * 100% / (currentSysTime - ...
1 vote, 1 answer, 83 views
Cache Effects in Statically Compiled Binaries: Unexpected Cache Misses
I have a simple Hello World program written in C, which I statically compiled using: gcc -static -fno-pie -o hello{1|2} hello.c.
I expected that executing these two binaries would exhibit cache ...
1 vote, 1 answer, 128 views
What do the letters in port usage on uops.info mean?
What do the letters in the ports of the uops.info table mean?
For example ADD (R64, R64) lists 1*p0156B at ports. The documentation says 1*p0156 means one microinstruction can be executed at ports 0, ...
0 votes, 0 answers, 240 views
Created TensorFlow Lite XNNPACK delegate for CPU - ('--log-level=1') doesn't work
A simple Python script (Selenium + ChromeDriver):
# import the By class, which allows you to choose how to search for an element
from selenium.webdriver.common.by import By
# initialize the browser ...
-1 votes, 1 answer, 143 views
If cache invalidation happens every time memory mappings change, why not opt for VIVT?
As far as I know, L1 is VIPT for at least Intel chips. VIVT caches don't depend on address translation, so they can fully operate in parallel with TLB lookup. VIPT can also achieve some parallelism by ...
0 votes, 0 answers, 113 views
SDL CPU rendering project, rendering error when resizing window: Window surface is invalid
I was working on a CPU-only rendering project with SDL in C.
I implemented very good error handling and I got this error when I try to resize the window, "ERROR: SDL Error in render thread: ...
1 vote, 0 answers, 126 views
How to increase the frequency of the CPU from C
I am writing C code for the Raspberry Pi 4 (ARM Cortex-A72), which relies on precise timing in periods of less than 1μs. To get precise timing, I use the following algorithm:
clock_gettime(...
-1 votes, 1 answer, 98 views
Pod restart issue in Java-based microservice architecture
Two pods were running in my microservice, and both of them were restarted by Kubernetes with the reason OOMKilled.
(The above dashboard uses the following query->sum(0,...
0 votes, 1 answer, 76 views
Perform a benchmarking test on different cores on an Ubuntu VM
I want to perform benchmarking tests (BPFM, IOR, FIO & Sysbench) on an Ubuntu VM. The benchmarks should use the available cores in powers of 2 (so 2, 4, 8, 16, ... up to the available ...
0 votes, 1 answer, 118 views
Why is my AI training on the GPU a lot slower than on the CPU?
I'm currently training my simple prediction AI, but my GPU trains at 40 s per epoch while my CPU trains at 9 s per epoch.
My CPU is an i7-4720HQ and my GPU is an NVIDIA 950M.
This is my code:
`import ...
0 votes, 1 answer, 233 views
How to Update Clock Seconds in SwiftUI Without Re-rendering the Entire View?
I’m building a SwiftUI app where I display the current time with seconds. I use the .numericText transition for the text to add a smooth animation whenever the seconds change. However, I’ve noticed a ...
0 votes, 2 answers, 94 views
platform-tools\adb.exe: high CPU usage on server (Windows)
I'm using ADB in a Java application to monitor Android device status every three seconds. Eight adb commands are used:
adb shell settings get global airplane_mode_on
adb shell settings get system ...
2 votes, 0 answers, 129 views
Matrix multiply fastest with -O0 [duplicate]
I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function:
void dgemm_naive(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, ...
2 votes, 1 answer, 129 views
Is there a way to get node-level information in Kubernetes pods?
I need low-level information about the node, such as the number of cores, core IDs, and other details known to the kubelet, from within a pod running on the node. How do I get this?