Newest 'nvidia' Questions

-5 votes

0 answers

33 views

Fail to install nvidia driver on Kali with run file but packages not present. Black screen with system drivers [closed]

I have some recurring issues with Kali (latest kernel) and my NVIDIA graphics card. For example, the title bar of every application window is not displayed. I can't close the windows (only with the ...

James

1,471

asked 11 hours ago

0 votes

0 answers

17 views

clone metadata in deepstream pipeline in python

I am currently implementing a DeepStream pipeline in Python with a tee split as follows: streammux -> tee -> queue1 -> detector1 -> tracker1 -> queue2 -> detector2 -> tracker2. The ...

user32306963

1

asked yesterday

1 vote

1 answer

211 views

Error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

I am trying to set up a docker container using the nvidia container toolkit on a remote server, so that I can run cuda programs developed with the Futhark Programming Language - however, the issue ...

Artemijo5

13

asked Jan 14 at 18:25

0 votes

2 answers

62 views

pycuda._driver.Error: cuInit failed: unknown error

I have a problem with pycuda. I use it in a Python script I am developing. I know the script works because I use it on another server, but on one specific server I get a problem: >>> import pycuda....

Julien

1

asked Jan 6 at 9:19

-5 votes

1 answer

73 views

Performance Degradation of LAMMPS with Increased MPI Ranks on an A100 GPU [closed]

I tested the performance of LAMMPS with DeepMD-kit for MD simulations on an HPC cluster. The job was allocated 8 CPUs, 64 GB of RAM, and one A100 GPU. I observed that when running with mpirun -np 1 ...

link89

2,017

asked Dec 24, 2025 at 1:01

Tooling

0 votes

1 reply

84 views

How to dynamically estimate maximum number of cameras my GPU can handle for YOLOv8 inference?

I’m trying to simulate multiple camera streams feeding into a YOLOv8l model on a single GPU and monitor real-time hardware utilization. My setup: Single GPU (48GB VRAM, CUDA-enabled) YOLOv8l model ...

Madesh Prasad

23

asked Dec 22, 2025 at 12:23

2 votes

1 answer

874 views

How to correctly install JAX with CUDA on Linux when `jax[cuda12_pip]` consistently falls back to the CPU version?

I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...

PowerPoint Trenton

115

asked Nov 12, 2025 at 9:36

4 votes

1 answer

191 views

Unable to run CUDA program in google colab

I am trying to run a basic CUDA program in Google Colab, but it's not giving any kernel output. Below are the steps I tried: Changed runtime type to T4 GPU. !pip install nvcc4jupyter %load_ext ...

Digvijay Singh Thakur

3,351

asked Nov 6, 2025 at 7:52

1 vote

1 answer

75 views

How to debug cuda in Visual Studio with "step over"

I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022. I want to debug code, but I can't debug with step over (F10). The debugger always stops at a location without a breakpoint....

Imagination Youth

11

asked Oct 31, 2025 at 2:36

1 vote

1 answer

601 views

TensorFlow not detecting NVIDIA GPU (RTX 3050, CUDA 12.7, TF 2.20.0) [duplicate]

I’ve been trying to get TensorFlow to use my GPU on Windows, and even though everything seems installed correctly, it shows 0 available GPUs. System setup Windows 11 RTX 3050 Laptop GPU NVIDIA driver ...

Houssem Eddine

27

asked Oct 26, 2025 at 22:21

1 vote

0 answers

310 views

Why does “Command Buffer Full” appear in PyTorch CUDA kernel launches?

I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline, some kernels show “Command Buffer Full”. This causes the cudaLaunchKernel time to become very long, as shown ...

plznobug

123

asked Oct 23, 2025 at 12:36

2 votes

0 answers

420 views

jax plugin configuration error: Exception when calling jax_plugins.xla_cuda12.initialize()

I am using WSL2 on Windows 10 and have an NVIDIA graphics card. I recently installed GPU jax using the command pip install -U "jax[cuda12]". This completed successfully, but when I run any jax ...

DrMittal

51

asked Oct 14, 2025 at 14:14

2 votes

1 answer

228 views

Executing a CUDA Graph from a CUDA kernel

I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch). From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...

Mohammad Siavashi

1,292

asked Oct 13, 2025 at 11:38

0 votes

1 answer

110 views

CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop

I am trying to implement the producer-consumer problem between GPU and CPU, which is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...

Chinmaya Bhat K K

1

asked Sep 30, 2025 at 18:38

3 votes

1 answer

127 views

TensorRT PWC-Net Causing 2.4km Trajectory Error in iSLAM - Original PyTorch Works Fine

Problem Statement My iSLAM system works correctly with the original PyTorch PWC-Net but produces catastrophic trajectory errors (2.4km ATE RMSE) when I replace it with a TensorRT-converted version. ...

Unknown

705

asked Sep 19, 2025 at 11:57

0 votes

0 answers

155 views

TensorRT DLA Engine Build Fails for PWC-Net on Jetson NX - Missing Layer Support?

I'm converting a PWC-Net optical flow model to run on Jetson NX DLA using the iSLAM framework, but the TensorRT engine build fails during DLA optimization. Environment Hardware: NVIDIA Jetson NX ...

Unknown

705

asked Sep 15, 2025 at 7:33

0 votes

0 answers

76 views

Using a scalar tensor as image source for a Holoscan HolovizOp

I am attempting to write my own holoscan::Operator for creating some images that should be displayed as a short video using a holoscan::ops::HolovizOp. So I compose()-d an application flow: add_flow(...

Markus-Hermann

1,061

asked Sep 12, 2025 at 6:28

0 votes

1 answer

216 views

How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...

plznobug

123

asked Sep 5, 2025 at 10:48

2 votes

1 answer

165 views

How to define "pool" for Nvidia holoscan::ops::FormatConverterOp

I am trying to get the holoscan example "bring your own model" https://docs.nvidia.com/holoscan/sdk-user-guide/examples/byom.html to run, translating it from Python into CPP. One necessary ...

Markus-Hermann

1,061

asked Aug 27, 2025 at 11:52

1 vote

1 answer

485 views

How are fp6 and fp4 supported on NVIDIA Tensor Core on Blackwell?

I am writing PTX assembly code in CUDA C++ for research. This is my setup: I downloaded the latest CUDA C++ toolkit (13.0) yesterday on WSL Linux. The local compilation environment does not ...

Junhao Liu

11

asked Aug 14, 2025 at 10:03

1 vote

1 answer

115 views

What is the actual maximum nesting depth of dynamic parallelism in CUDA?

Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...

StefanoTrv

308

asked Aug 12, 2025 at 13:29

2 votes

1 answer

77 views

How to correctly pass float4 vector to kernel using PyCUDA?

I am trying to pass a float4 as argument to my cuda kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...

Dodilei

308

asked Aug 7, 2025 at 19:49

1 vote

0 answers

49 views

Unresolved extern function '__write_pipe_2' when building an OpenCL program

I'm using the OpenCL clBuildProgram() API function on a program created from a source string. The source is: kernel void foo(int val, write_only pipe int outPipe) { write_pipe(outPipe, &val); }...

einpoklum

138k

asked Jul 27, 2025 at 19:20

2 votes

0 answers

41 views

What do shuffle instructions do on the hardware? [duplicate]

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-...

Tom Huntington

3,768

asked Jul 23, 2025 at 5:49

1 vote

1 answer

121 views

Doing inference for Nvidia's onnx model (PeopleNet)

I am implementing the PeopleNet ONNX model in my C++ application. The following are the preprocessing and postprocessing functions. void preprocessGpuBatch(const std::vector<cv::cuda::GpuMat>& ...

batuman

7,346

asked Jul 15, 2025 at 5:01

1 vote

1 answer

235 views

(NVIDIA/nv-embed-v2) ImportError: cannot import name 'MISTRAL_INPUTS_DOCSTRING' from 'transformers.models.mistral.modeling_mistral'

My code: from transformers import AutoTokenizer, AutoModel model_name = "NVIDIA/nv-embed-v2" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModel.from_pretrained(...

6zL

21

asked Jul 5, 2025 at 13:27

0 votes

1 answer

476 views

Distinction between CuTe and NVIDIA Cutlass

I'm confused about what exactly is handled by CuTe and what by Cutlass. From my understanding Cutlass handles the following: Gemm computation of CuTe Tensors Communication between CPU and GPU Abstract memory ...

jonithani123

254

asked Jul 2, 2025 at 14:23

0 votes

0 answers

51 views

SLURM service in invalid state

I have a single-GPU machine with this configuration: This is my slurm.conf: NodeName=TechdivAISLURM CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=32096 Gres=gpu:1 State=...

RobM

835

asked Jun 25, 2025 at 0:50

-4 votes

1 answer

53 views

For a large project, how should I search for a specific kernel in nsys?

I was analyzing a large project with many kernel files and wanted to find a specific kernel in the report file produced by nsys. How should I go about it?

rongtao zhou

1

asked Jun 23, 2025 at 8:44

0 votes

1 answer

287 views

How to use nsys when the program is running

I want to use nsys to profile a server. How can I use nsys while the server is running? Should I use nsys launch or nsys profile to restart the server? Is there any way to avoid restarting the service?

rongtao zhou

1

asked Jun 17, 2025 at 8:21

2 votes

0 answers

87 views

Reusing shared data between global functions

Is there an officially sanctioned way to reuse shared data between global functions? Consider the following code https://cuda.godbolt.org/z/KMj9EKKbf: #include <cuda.h> #include <stdio.h> ...

Johan

77.5k

asked Jun 3, 2025 at 9:42

1 vote

0 answers

44 views

DOCA Switch Application segmentation fault

Environment: Hardware: BlueField-2, model MBF2H516A-CEEOT OS: Linux version 5.15.0-1060-bluefield (buildd@bos03-arm64-114) DOCA SDK: 2.10.0087 Description: I'm trying to run the doca_switch sample ...

user24906747

11

asked Jun 3, 2025 at 6:16

-4 votes

1 answer

347 views

Is Bryce Lelbach's claim regarding progress guarantees on non-NVIDIA GPUs true?

In a talk on The C++ Execution Model, from the cppunderthesea 2024 conference, at around 44:50, NVIDIA's Bryce Adelstein Lelbach claims that non-NVIDIA GPUs give no guarantee of threads progressing (&...

einpoklum

138k

asked May 30, 2025 at 11:36

1 vote

1 answer

125 views

Unexpected result with cublasStrmm (cublas triangular matmul)

With the following test example, the output matrix doesn't give the desired output or maybe I'm misunderstanding certain parameters: #include <cstdio> #include <cublas_v2.h> #include <...

A. K.

39.7k

asked May 29, 2025 at 22:13

0 votes

0 answers

319 views

docker-compose.yaml: services.OD Additional property device_requests is not allowed

I am trying to run a docker compose and failing. I have here a minimal reproducible example. First with this docker-compose.yml services: hello-app: image: python:3.10-slim command: python ...

KansaiRobot

10.6k

asked May 20, 2025 at 2:58

0 votes

0 answers

68 views

Why is my OpenCL optimized convolution kernel slower than the naive version at higher workgroup sizes?

I'm working on a GPU-accelerated 2D convolution in OpenCL for a 2048x2048 image using a 3x3 Sobel filter. I implemented two versions of the kernel: A naive version that uses only global memory. An ...

Mxneeb

19

asked May 1, 2025 at 23:07

0 votes

0 answers

110 views

Trouble Detecting NIC Ports in DPDK Program with ConnectX-6

I'm encountering an issue while developing a DPDK-based program using a dual-port ConnectX-6 NIC on Ubuntu 24.04. Despite following the setup instructions, my program fails to detect the NIC ports. ...

Mohammad P

1

asked Apr 27, 2025 at 9:49

1 vote

0 answers

47 views

Why does ML.NET Image Classification with Ampere Gpu return fixed results when it otherwise works with CPU and Turing Gpu support?

We currently have a trained ResnetV250 image classification model that performs as expected on CPU and with GPU support on Turing based cards with CUDA 10.1 and cudnn 7.6.4. When transferring this to ...

Ross Halliday

935

asked Apr 25, 2025 at 11:51

2 votes

2 answers

99 views

Standard way of calling math functions in C when using OpenMP & its offloading feature(s)?

I am writing some code in C in which I want to add the optional ability to have certain sections of the code accelerated using OpenMP, and with an additional optional ability to have them accelerated ...

Matthew G.

124

asked Mar 30, 2025 at 19:01

0 votes

0 answers

16 views

CrashLoopBackOff for NVIDIA VSS Pod on MicroK8s – Troubleshooting Deployment Issues

I'm working on the NVIDIA VSS project as detailed in the official documentation: NVIDIA VSS Run Guide. I'm deploying the service on a MicroK8s cluster. However, one of the pods—named similar to vss-...

Cody Hubman

1

asked Mar 28, 2025 at 5:12

0 votes

1 answer

80 views

Why is tensorflow not recognizing my gpu after installing it with anaconda

I'm trying to use TensorFlow with my YOLOv8 project, but for some reason it is not recognizing my GPU. I had originally installed it using pip, but I was told I should use conda instead, so I switched ...

James Pelham-Burn

1

asked Mar 27, 2025 at 2:17

2 votes

0 answers

47 views

NVIDIA webinar on parallel reduction gridsize

In the last example of Mark Harris' webinar I don't understand the indexing before the parallel reduction part. In "Reduction #6" the gridSize/number of dispatches was ceil[N (the size of ...

Michael Bay

71

asked Mar 22, 2025 at 14:25

0 votes

0 answers

149 views

TensorRT Access Violation Error (0xC0000005) at nvinfer_10.dll - How to Resolve?

Environment: OS: Windows Operating System TensorRT Version: TensorRT-10.3.0.26 NVIDIA CUDA Version: 12.6 cuDNN Version: 9.8 GPU: RTX 3050ti laptop GPU Issue Description: I am encountering an "...

B.Uluer

11

asked Mar 21, 2025 at 13:31

0 votes

0 answers

121 views

Got Segmentation fault (core dumped) after running IExecutionContext.execute_async_v3()

I used the following commands to convert an ONNX model to a TRT engine, where the input.onnx file is the original model: polygraphy surgeon sanitize --fold-constants ./input.onnx -o output.onnx ...

simonzgx

31

asked Mar 21, 2025 at 7:51

1 vote

0 answers

71 views

DuplicateOutput Fails with NVIDIA set as Preferred graphic Adapter on Dual Graphics System

I have a laptop with an integrated Intel graphics card and an NVIDIA T1000 graphics card. I set the NVIDIA card as the preferred graphics processor under Manage 3D Settings in the NVIDIA Control Panel. However, ...

Martin121233

21

asked Mar 13, 2025 at 22:56

2 votes

1 answer

356 views

Inconsistent results when training models using different GPUs

I've been trying to train a language model (text classification). Our lab has two GPUs, a 4090 and a 3090. However, I encountered a perplexing phenomenon during training: the model's performance ...

tong

21

asked Mar 5, 2025 at 10:10

2 votes

0 answers

353 views

Not able to access GPU within the docker container

I am using Ubuntu 22.04. I have the nvidia-570 driver installed along with CUDA 12.4 on my host machine. However, I am not able to access the GPU in my container. This is my docker-compose file: version: '3.8' ...

prarthana sigedar

21

asked Mar 5, 2025 at 7:59

0 votes

1 answer

112 views

TAO Command Not found, whenever trying to run nvidia model inside TAO TOOLKIT TENSORFLOW docker container on WSL2

Even though I am inside the TAO Toolkit TensorFlow container, I still have an issue decrypting a model with tao: docker run -it --rm -v /home/models:/workspace/model_dir -p 8888:8888 --runtime=...

Novice

84

asked Mar 4, 2025 at 20:44

1 vote

0 answers

154 views

Error response from daemon: could not select device driver "nvidia" with capabilities: [[utility compute]]

I'm trying to build a Docker image for realtime-whiper. The build process finishes successfully, but at the end it gives this error: Error response from daemon: could not select device driver "nvidia"...

Ali Zekai Deveci

11

asked Mar 4, 2025 at 14:25

2 votes

1 answer

122 views

Deploy TPU TF Serving Model to AWS SageMaker

I have a couple of pre-trained and tested TensorFlow LSTM models, which have been trained on Google Colab. I want to deploy these models with AWS as our entire application is deployed there. I've ...

Manu Sisko

310

asked Feb 26, 2025 at 15:51
