14,752 questions
Best practices
1
vote
0
replies
28
views
How do you fully automate a Kaggle notebook that requires a kernel restart mid-run after installing a library?
I'm looking for advice on a workflow problem rather than a specific library issue.
The situation
I have a notebook that:
Installs packages via pip in Cell 1
# REVISED CELL 1
import subprocess, sys
# ...
0
votes
0
answers
63
views
How do I stop a CUDA-registered resource from removing the host-device mapping of an OpenGL shader storage buffer?
I'm trying to take an OpenGL storage buffer, load data into it from the CPU (persistently mapping it), process that data with CUDA, then use that same data again in further OpenGL ...
Best practices
0
votes
0
replies
34
views
CUDA Channelizer Implementation
A channelizer is, conceptually, a bunch of reduction operations (polyphase filter) followed by an equal number of transforms (FFT).
Though I have this "mostly working", there are still some ...
-2
votes
1
answer
67
views
What tearing guarantees are provided when reading/writing from global memory?
Let's say there are 10 threads writing and 10 threads reading from the same 32-bit integer stored in global memory, in device code, all at the same time. Are there any guarantees provided about the ...
2
votes
1
answer
147
views
Why does the order of defining a hidden friend binary operator- vs operator==, and using int vs concepts, matter to NVCC?
I'm trying to create a pointer-like type wrapper for CUDA device pointers, but ran into a problem I can't reproduce with other compilers, in regular MSVC or GCC. Basically, depending on the order I ...
1
vote
1
answer
103
views
Making non-CUDA code that uses std functions compatible with CUDA where cuda::std equivalents exist
I'm attempting to create a function that handles bit casting a float/double/float16 to an unsigned key of equivalent size. I can do this in host code easily with
template<std::floating_point T>
...
2
votes
2
answers
139
views
nvcc insists thrust::complex doesn't have real() or imag()
#include <stdio.h>
#include <thrust/complex.h>
#include <thrust/device_vector.h>
// #include <thrust/host_vector.h>
int main(int argc, char *argv[]) {
thrust::device_vector&...
0
votes
0
answers
51
views
How to use Cub::BlockHistogram when some of the values are not valid?
I'm using cub::BlockHistogram<std::uint16_t, 256, items_per_thread, 256, cub::BLOCK_HISTO_ATOMIC>; and I was expecting an API with similar utility to cub::Warp* functions, however there doesn't ...
Advice
0
votes
1
replies
45
views
Performing sorts on segments of images at the same time using CUB/CudaSDK without performance compromise
Background
I have an image that I'm splitting into tiles of some size, determined by parameters tile_x, and tile_y, so tile_x = 2 and tile_y = 2 would mean the image is split into a 2x2 grid of tiles (...
1
vote
2
answers
136
views
CUDA SETUP ERROR: Missing libnvJitLink.so.13
I am trying to load Qwen on Google Colab.
I concurrently asked this question on the bitsandbytes-foundation GitHub ->
https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1905#issuecomment-...
-2
votes
0
answers
135
views
ONNXRuntimeError: CUDA error: cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device
I'm trying to run a model on the GPU:
clf2 = PunctCapSegModelONNX.from_pretrained(
"1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase",
ort_provider=["CUDAExecutionProvider&...
0
votes
1
answer
104
views
What does the extra pointer parameter for CUDA's nan function actually do?
The CUDA function providing a quiet NaN is:
__device__ double nan(const char *tagp)
And the documentation says that
Argument tagp selects one of the possible representations.
My question: What does ...
2
votes
0
answers
116
views
__float128 addition: inconsistent result on CPU and CUDA
I'm testing __float128 on CUDA with a simple program:
// nvcc -arch=sm_100 test_fp128_add.cu -o test_fp128_add
#include <cstdio>
#include <cuda_runtime.h>
__global__ void add_fp128(...
4
votes
1
answer
125
views
Can DRAM and SMEM instructions be issued in a single cycle?
In the Ampere architecture, consider the following scenarios:
A single warp executes two load instructions: one from Shared Memory and one from DRAM.
Two warps within the same SM, each executing a ...
Advice
0
votes
0
replies
92
views
Learning Resource for TensorRT and PyCUDA (or other CUDA Python inferface library)
I'm trying to deploy a PyTorch model to an edge device (Jetson Orin Nano). Deploying the PyTorch model directly does not satisfy the requirements posed by the clients, so I'm looking into optimizing the ...
Advice
0
votes
0
replies
84
views
Why doesn’t pytorch LayerNorm implementation get close to A100 peak memory bandwidth?
I was trying to benchmark torch.nn.LayerNorm on an A100 to compare it against a custom CUDA kernel I made. I expected the PyTorch kernel to get much closer to the A100's peak memory bandwidth (which ...
Advice
1
vote
0
replies
64
views
What are the exact steps/commands to install CUDA using Conda?
I'm new to deep learning and I would like to install CUDA using Conda, but I'm confused about the correct order of the steps and which commands I should use. I’ve seen very different commands in ...
Advice
0
votes
2
replies
74
views
How does CUBLAS achieve 1000-fold reuse?
If I multiply two 8192 x 8192 matrices of float's with CUBLAS, ncu --metrics dram__bytes_read.sum tells me it reads 4.42 GB of data in total (on a 3070). One matrix is 0.268 GB, so we read each matrix ...
Best practices
0
votes
1
replies
115
views
How to reduce GPU memory usage when fine-tuning a large transformer model?
“I’m fine-tuning a transformer with batch size 8 and getting CUDA out-of-memory errors. Would gradient checkpointing or mixed precision help?”
0
votes
0
answers
90
views
Why am I getting an "out of memory" error from cudaSetDevice()?
Below is a small C++ program which queries the number of devices, gets the first device's properties, then tries to cudaSetDevice() to it.
I build this program on one machine, then copy it to a ...
0
votes
0
answers
44
views
Checking if a GPU function has been loaded (to execute kernel using CUDA Driver API) causes segmentation fault
For educational purposes, I am assembling my own CUDA fatbins by hand and executing them using the CUDA Driver API. The goal is to launch the kernel and display the result of the kernel on the CPU, ...
4
votes
0
answers
72
views
cudaMemcpyAsync (P2P D2D) serializes with kernel execution
Hi all — I’m debugging an unexpected ordering/progress issue with peer-to-peer device copies and I’d like help understanding what CUDA/driver/runtime behavior could explain it.
Setup
Single node, 2 ...
2
votes
1
answer
155
views
Struct with dynamically allocated array that works on host and device with Cuda
I would like to create a struct that can work on the host and device that stores an array that has been dynamically allocated (unknown size at compile time). This struct would be sent to a kernel so ...
0
votes
0
answers
66
views
cupy kernel template for elementwise reductions
I want to get the kernel template that cupy uses for its subtract function to compare with my (really simple) kernel:
subtract_kernel = cp.RawKernel(
"""
extern "C" __global__
...
0
votes
0
answers
122
views
CUDA - Make a specific memory access skip the cache
I have a kernel which first reads values from certain memory locations, then writes to those memory locations. I also have a lock which ensures that at any point in time, only one thread is writing to ...
2
votes
0
answers
71
views
16-way radix sort without threadgroup/warp shared memory prefixsums
I've been implementing radix sort for SPH simulation use for some time now, and initially I increased it to use a 4-bit radix, making it a 16-way radix sort, which I haven't seen documented too ...
3
votes
1
answer
609
views
CUDA_ARCHITECTURES is set to "native", but no NVIDIA GPU was detected
I am trying to install llama-cpp-python with GPU support. I installed Nvidia CUDA Toolkit v13.1, nvidia-smi shows that my graphics card - Geforce GTX 1050 Ti - supports CUDA v13, nvcc is installed ...
2
votes
1
answer
154
views
How can I determine if the memory at a certain address was allocated by / is managed by CUDA?
Suppose I get a pointer, and I want to determine whether it's "CUDA-associated", i.e. allocated by CUDA as pinned host-side memory, device-side memory, managed memory, array memory, etc. - ...
1
vote
1
answer
360
views
Error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
I am trying to set up a docker container using the nvidia container toolkit on a remote server, so that I can run cuda programs developed with the Futhark Programming Language - however, the issue ...
2
votes
1
answer
231
views
CMake Error: Error required internal CMake variable not set ... _CMAKE_CUDA_WHOLE_FLAG
I'm trying to configure a project of mine involving CUDA, like so:
cmake \
-DCMAKE_CUDA_ARCHITECTURES=61 \
-DCMAKE_BUILD_TYPE=Release \
-DCUDAToolkit_ROOT=/usr/local/cuda-11.6 \
-DCMAKE_CUDA_COMPILER=/...
Best practices
0
votes
2
replies
71
views
Designing a std::pmr-like vector for device memory using Thrust
I want to implement a container similar to std::pmr::vector, but backed by CUDA device memory, with usage semantics close to std::pmr::vector / thrust::device_vector.
Requirements:
PMR-style ...
0
votes
2
answers
75
views
pycuda._driver.Error: cuInit failed: unknown error
I have a problem with PyCUDA. I use it in a Python script I develop. I know the script works because I use it on other servers, but on one specific server I get a problem:
>>> import pycuda....
4
votes
1
answer
205
views
printf() not working on colab while running a CUDA c++ code
This is my first time working with CUDA programs. So I just wrote a simple hello world program.
#include <stdio.h>
__global__ void hello(){
printf("Hello block: %u and thread: %u\n"...
-5
votes
1
answer
86
views
Performance Degradation of LAMMPS with Increased MPI Ranks on a A100 GPU [closed]
I tested the performance of LAMMPS with DeepMD-kit for MD simulations on an HPC cluster.
The job was allocated 8 CPUs, 64 GB of RAM, and one A100 GPU.
I observed that when running with mpirun -np 1 ...
Advice
1
vote
6
replies
197
views
Cannot install Tensorflow GPU on Win 11
I was training AI models on my PC using TensorFlow on the CPU, since I had an AMD GPU. I then bought an Nvidia RTX 5060 Ti 16GB but couldn't make it work.
First, I made a new environment with Anaconda, installed ...
3
votes
0
answers
263
views
Why is my “Second naive” SGEMM kernel slower than the “global memory coalesced” version?
I am benchmarking several very simple CUDA SGEMM kernels on an NVIDIA Hopper GPU (H800, sm_90), and I observed something that I do not fully understand.
I have two kernels that, to my understanding, ...
1
vote
0
answers
143
views
Small error in CUDA stream mandelbrot kernel
I'm new to CUDA, and I can't see where the mistake in my kernel is. Upon comparing it with the result from my professor, the difference was extremely small, with the pixel average being off by 0.0039. ...
1
vote
0
answers
131
views
Thrust device allocator vs std allocator
I have a very simple class using the Thrust device allocator.
I also have a host-side class that uses std::allocator, and it works fine. But the Thrust one gives a segmentation fault.
I am not sure what is wrong here. How ...
3
votes
1
answer
141
views
What is the exact layout of CuTe's thread value layout?
I am learning CuTe's thread value layout, and I followed leimao's blog:
https://leimao.github.io/blog/CuTe-Thread-Value-Layout/
I want to figure out which thread reads which part of the data in the matrix.
So ...
-6
votes
1
answer
160
views
PyCharm and PyTorch - Not able to run CUDA [closed]
I have CUDA installed via the regular Windows downloadable installer via the official website, and am trying to use PyTorch in the PyCharm program using CUDA as kernel.
PyTorch now works fine, however ...
Advice
1
vote
5
replies
132
views
CUDA C: How to keep an entire, somewhat complex calculation on the GPU w/o bringing intermediate results back to host
So I'm trying to learn CUDA C. I had an idea for a simple code that could calculate the simple average of a float array. The idea is that main() will call a host function get_average(), which will ...
5
votes
2
answers
482
views
clangd in CUDA mode treats host-side C++ standard library as unavailable (std::format, chrono, iostream errors)
Problem
I'm trying to use clangd for LSP in Neovim with CUDA .cu files, but it fails to recognize standard C++ library features on the host side. Even simple host functions using std::format, std::...
3
votes
1
answer
112
views
Can I modify host data after cudaMemcpyAsync
Can I modify host data in host_data_ptr after the following ?
cudaMemcpyAsync(device_data_ptr,
host_data_ptr,
size,
cudaMemcpyHostToDevice,
...
3
votes
1
answer
2k
views
How to correctly install JAX with CUDA on Linux when `jax[cuda12_pip]` consistently falls back to the CPU version?
I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...
3
votes
1
answer
128
views
Deleted function compiler errors using thrust::remove in C++
I am currently attempting to use the thrust::remove function on a thrust::device_vector of structs in my main function, as shown below:
#include <iostream>
#include <thrust/device_vector.h>...
-2
votes
1
answer
132
views
Why Cuda threads are repeating same task?
I have coded a simple CUDA ZIP password cracker, but it prints the same password a number of times and I can't figure out why; this is weighing down my program.
Here is the full ...
0
votes
1
answer
380
views
Linking fails with: in function `main.cold': undefined reference to `__cxa_call_terminate'
I'm trying to build, using CMake, a program involving C++ and CUDA-C++ code. It used to build fine several months ago, but now I am getting a linker error I'm not familiar with:
in function `main....
4
votes
2
answers
237
views
Unable to run CUDA program in google colab
I am trying to run a basic CUDA program in Google Colab, but it's not giving kernel output.
Below are the steps what I tried:
Changed run type to T4 GPU.
!pip install nvcc4jupyter
%load_ext ...
1
vote
1
answer
128
views
Problem compiling a skeleton CUDA & C++ project using CMake
I want to create a skeleton for a project in which there are multiple CUDA and C++ files. They will be compiled individually and then linked together to form a single executable.
Currently I have the ...
1
vote
1
answer
91
views
How to debug cuda in Visual Studio with "step over"
I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022.
I want to debug code, but I can't debug with Step Over (F10). The debugger always stops at a location without a breakpoint....