14,752 questions
Best practices
1
vote
0
replies
28
views
How do you fully automate a Kaggle notebook that requires a kernel restart mid-run after installing a library?
I'm looking for advice on a workflow problem rather than a specific library issue.
The situation
I have a notebook that:
Installs packages via pip in Cell 1
# REVISED CELL 1
import subprocess, sys
# ...
0
votes
0
answers
63
views
How do I stop a CUDA-registered resource from removing the host-device mapping of an OpenGL shader storage buffer?
I'm trying to take an OpenGL storage buffer, load data into it from the CPU (persistently mapping it), process that data with CUDA, then use that same data again in further OpenGL ...
Best practices
0
votes
0
replies
34
views
CUDA Channelizer Implementation
A channelizer is, conceptually, a bunch of reduction operations (polyphase filter) followed by an equal number of transforms (FFT).
Though I have this "mostly working", there are still some ...
-2
votes
1
answer
67
views
What tearing guarantees are provided when reading/writing from global memory?
Let's say there are 10 threads writing and 10 threads reading from the same 32-bit integer stored in global memory, in device code, all at the same time. Are there any guarantees provided about the ...
2
votes
1
answer
147
views
Why does the order of defining a hidden friend binary operator- vs operator==, and using int vs concepts, matter to NVCC?
I'm trying to create a pointer-like type wrapper for CUDA device pointers, but ran into a problem I can't reproduce with other compilers, in regular MSVC or GCC. Basically, depending on the order I ...
1
vote
1
answer
103
views
Making non-CUDA code that uses std functions compatible with CUDA where cuda::std equivalents exist
I'm attempting to create a function that handles bit casting a float/double/float16 to an unsigned key of equivalent size. I can do this in host code easily with
template<std::floating_point T>
...
2
votes
2
answers
139
views
nvcc insists thrust::complex doesn't have real() or imag()
#include <stdio.h>
#include <thrust/complex.h>
#include <thrust/device_vector.h>
// #include <thrust/host_vector.h>
int main(int argc, char *argv[]) {
thrust::device_vector&...
0
votes
0
answers
51
views
How to use Cub::BlockHistogram when some of the values are not valid?
I'm using cub::BlockHistogram<std::uint16_t, 256, items_per_thread, 256, cub::BLOCK_HISTO_ATOMIC>; and I was expecting an API with similar utility to cub::Warp* functions, however there doesn't ...
Advice
0
votes
1
replies
45
views
Performing sorts on segments of images at the same time using CUB/CudaSDK without performance compromise
Background
I have an image that I'm splitting into tiles of some size, determined by parameters tile_x, and tile_y, so tile_x = 2 and tile_y = 2 would mean the image is split into a 2x2 grid of tiles (...
1
vote
2
answers
136
views
CUDA SETUP ERROR: Missing libnvJitLink.so.13
I am trying to load Qwen on Google Colab.
I concurrently asked this question on the bitsandbytes-foundation GitHub ->
https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1905#issuecomment-...
-2
votes
0
answers
135
views
ONNXRuntimeError: CUDA error: cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device
I'm trying to run a model on the GPU:
clf2 = PunctCapSegModelONNX.from_pretrained(
"1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase",
ort_provider=["CUDAExecutionProvider&...
0
votes
1
answer
104
views
What does the extra pointer parameter for CUDA's nan function actually do?
The CUDA function providing a quiet NaN is:
__device__ double nan(const char *tagp)
And the documentation says that
Argument tagp selects one of the possible representations.
My question: What does ...
2
votes
0
answers
116
views
__float128 addition: inconsistent result on CPU and CUDA
I'm testing __float128 on CUDA with a simple program:
// nvcc -arch=sm_100 test_fp128_add.cu -o test_fp128_add
#include <cstdio>
#include <cuda_runtime.h>
__global__ void add_fp128(...
4
votes
1
answer
125
views
Can DRAM and SMEM instructions be issued in a single cycle?
In the Ampere architecture, consider the following scenarios:
A single warp executes two load instructions: one from Shared Memory and one from DRAM.
Two warps within the same SM, each executing a ...
Advice
0
votes
0
replies
92
views
Learning Resource for TensorRT and PyCUDA (or other CUDA Python inferface library)
I'm trying to deploy a PyTorch model to an edge device (Jetson Orin Nano). Deploying the PyTorch model directly does not satisfy the requirements posed by the clients, so I'm looking into optimizing the ...
Advice
0
votes
0
replies
84
views
Why doesn’t pytorch LayerNorm implementation get close to A100 peak memory bandwidth?
I was trying to benchmark torch.nn.LayerNorm on an A100 to compare it against a custom CUDA kernel I made. I expected the PyTorch kernel to get much closer to the A100's peak memory bandwidth (which ...
Advice
1
vote
0
replies
64
views
What are the exact steps/commands to install CUDA using Conda?
I'm new to deep learning and I would like to install CUDA using Conda, but I'm confused about the correct order of the steps and which commands I should use. I’ve seen very different commands in ...
Advice
0
votes
2
replies
74
views
How does CUBLAS achieve 1000-fold reuse?
If I multiply two 8192 x 8192 matrices of float's with CUBLAS, ncu --metrics dram__bytes_read.sum tells me it reads 4.42 GB of data in total (on a 3070). One matrix is 0.268 GB, so we read each matrix ...
Best practices
0
votes
1
replies
115
views
How to reduce GPU memory usage when fine-tuning a large transformer model?
“I’m fine-tuning a transformer with batch size 8 and getting CUDA out-of-memory errors. Would gradient checkpointing or mixed precision help?”
0
votes
0
answers
90
views
Why am I getting an "out of memory" error from cudaSetDevice()?
Below is a small C++ program which queries the number of devices, gets the first device's properties, then tries to cudaSetDevice() to it.
I build this program on one machine, then copy it to a ...
0
votes
0
answers
44
views
Checking if a GPU function has been loaded (to execute kernel using CUDA Driver API) causes segmentation fault
For educational purposes, I am assembling my own CUDA fatbins by hand and executing them using the CUDA Driver API. The goal is to launch the kernel and display the result of the kernel on the CPU, ...
4
votes
0
answers
72
views
cudaMemcpyAsync (P2P D2D) serializes with kernel execution
Hi all — I’m debugging an unexpected ordering/progress issue with peer-to-peer device copies and I’d like help understanding what CUDA/driver/runtime behavior could explain it.
Setup
Single node, 2 ...
2
votes
1
answer
155
views
Struct with dynamically allocated array that works on host and device with Cuda
I would like to create a struct that can work on the host and device that stores an array that has been dynamically allocated (unknown size at compile time). This struct would be sent to a kernel so ...
0
votes
0
answers
66
views
cupy kernel template for elementwise reductions
I want to get the kernel template that cupy uses for its subtract function to compare with my (really simple) kernel:
subtract_kernel = cp.RawKernel(
"""
extern "C" __global__
...
0
votes
0
answers
122
views
CUDA - Make a specific memory access skip the cache
I have a kernel which first reads values from certain memory locations, then writes to those memory locations. I also have a lock which ensures that at any point in time, only one thread is writing to ...
2
votes
0
answers
71
views
16-way radix sort without threadgroup/warp shared memory prefixsums
I've been implementing radix sort for SPH simulation use for some time now, and initially I increased it to use a 4-bit radix, making it a 16-way radix sort, which I haven't seen documented too ...
3
votes
1
answer
609
views
CUDA_ARCHITECTURES is set to "native", but no NVIDIA GPU was detected
I am trying to install llama-cpp-python with GPU support. I installed Nvidia CUDA Toolkit v13.1, nvidia-smi shows that my graphics card - Geforce GTX 1050 Ti - supports CUDA v13, nvcc is installed ...
2
votes
1
answer
154
views
How can I determine if the memory at a certain address was allocated by / is managed by CUDA?
Suppose I get a pointer, and I want to determine whether it's "CUDA-associated", i.e. allocated by CUDA as pinned host-side memory, device-side memory, managed memory, array memory, etc. - ...
1
vote
1
answer
360
views
Error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
I am trying to set up a docker container using the nvidia container toolkit on a remote server, so that I can run cuda programs developed with the Futhark Programming Language - however, the issue ...
2
votes
1
answer
231
views
CMake Error: Error required internal CMake variable not set ... _CMAKE_CUDA_WHOLE_FLAG
I'm trying to configure a project of mine involving CUDA, like so:
cmake \
-DCMAKE_CUDA_ARCHITECTURES=61 \
-DCMAKE_BUILD_TYPE=Release \
-DCUDAToolkit_ROOT=/usr/local/cuda-11.6 \
-DCMAKE_CUDA_COMPILER=/...
Best practices
0
votes
2
replies
71
views
Designing a std::pmr-like vector for device memory using Thrust
I want to implement a container similar to std::pmr::vector, but backed by CUDA device memory, with usage semantics close to std::pmr::vector / thrust::device_vector.
Requirements:
PMR-style ...
0
votes
2
answers
75
views
pycuda._driver.Error: cuInit failed: unknown error
I have a problem with PyCUDA. I use it in a Python script I develop. I know the script works because I use it on other servers, but on one specific server I get a problem:
>>> import pycuda....
4
votes
1
answer
205
views
printf() not working on colab while running a CUDA c++ code
This is my first time working with CUDA programs. So I just wrote a simple hello world program.
#include <stdio.h>
__global__ void hello(){
printf("Hello block: %u and thread: %u\n"...
-5
votes
1
answer
86
views
Performance Degradation of LAMMPS with Increased MPI Ranks on a A100 GPU [closed]
I tested the performance of LAMMPS with DeepMD-kit for MD simulations on an HPC cluster.
The job was allocated 8 CPUs, 64 GB of RAM, and one A100 GPU.
I observed that when running with mpirun -np 1 ...
Advice
1
vote
6
replies
197
views
Cannot install Tensorflow GPU on Win 11
I was training AI models on my PC using TensorFlow on the CPU, since I had an AMD GPU. I then bought an Nvidia RTX 5060 Ti 16GB but couldn't make it work.
First, I made a new environment with Anaconda, installed ...
3
votes
0
answers
263
views
Why is my “Second naive” SGEMM kernel slower than the “global memory coalesced” version?
I am benchmarking several very simple CUDA SGEMM kernels on an NVIDIA Hopper GPU (H800, sm_90), and I observed something that I do not fully understand.
I have two kernels that, to my understanding, ...
1
vote
0
answers
143
views
Small error in CUDA stream mandelbrot kernel
I'm new to CUDA, and I can't see where the mistake in my kernel is. Upon comparing it with the result from my professor, the difference was extremely small, with the pixel average being off by 0.0039. ...
1
vote
0
answers
131
views
Thrust device allocator vs std allocator
I have a very simple class using the Thrust device allocator.
I also have a host-side class that uses std::allocator, and it works fine. But the Thrust one gives a segmentation fault.
I am not sure what is wrong here. How ...
3
votes
1
answer
141
views
What is the exact layout of CuTe's thread value layout?
I am learning CuTe's thread value layout, and I followed leimao's blog:
https://leimao.github.io/blog/CuTe-Thread-Value-Layout/
I want to figure out which thread reads which part of the data in the matrix.
So ...
-6
votes
1
answer
160
views
PyCharm and PyTorch - Not able to run CUDA [closed]
I have CUDA installed via the regular Windows downloadable installer via the official website, and am trying to use PyTorch in the PyCharm program using CUDA as kernel.
PyTorch now works fine, however ...
Advice
1
vote
5
replies
132
views
CUDA C: How to keep an entire, somewhat complex calculation on the GPU w/o bringing intermediate results back to host
So I'm trying to learn CUDA C. I had an idea for a simple code that could calculate the simple average of a float array. The idea is that main() will call a host function get_average(), which will ...
5
votes
2
answers
482
views
clangd in CUDA mode treats host-side C++ standard library as unavailable (std::format, chrono, iostream errors)
Problem
I'm trying to use clangd for LSP in Neovim with CUDA .cu files, but it fails to recognize standard C++ library features on the host side. Even simple host functions using std::format, std::...
3
votes
1
answer
112
views
Can I modify host data after cudaMemcpyAsync
Can I modify host data in host_data_ptr after the following ?
cudaMemcpyAsync(device_data_ptr,
host_data_ptr,
size,
cudaMemcpyHostToDevice,
...
3
votes
1
answer
2k
views
How to correctly install JAX with CUDA on Linux when `jax[cuda12_pip]` consistently falls back to the CPU version?
I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...
3
votes
1
answer
128
views
Deleted function compiler errors using thrust::remove in C++
I am currently attempting to use the thrust::remove function on a thrust::device_vector of structs in my main function, as shown below:
#include <iostream>
#include <thrust/device_vector.h>...
-2
votes
1
answer
132
views
Why Cuda threads are repeating same task?
I have coded a simple CUDA ZIP password cracker, but it prints the same password a number of times and I can't figure out why; this is weighing down my program.
Here is the full ...
0
votes
1
answer
380
views
Linking fails with: in function `main.cold': undefined reference to `__cxa_call_terminate'
I'm trying to build, using CMake, a program involving C++ and CUDA-C++ code. It used to build fine several months ago, but now I am getting a linker error I'm not familiar with:
in function `main....
4
votes
2
answers
237
views
Unable to run CUDA program in google colab
I am trying to run a basic CUDA program in Google Colab, but it's not giving kernel output.
Below are the steps what I tried:
Changed run type to T4 GPU.
!pip install nvcc4jupyter
%load_ext ...
1
vote
1
answer
128
views
Problem compiling a skeleton CUDA & C++ project using CMake
I want to create a skeleton for a project in which there are multiple CUDA and C++ files. They will be compiled individually and then linked together to form a single executable.
Currently I have the ...
1
vote
1
answer
91
views
How to debug cuda in Visual Studio with "step over"
I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022.
I want to debug code, but I can't debug with Step Over (F10). The debugger always stops at a location without a breakpoint....